summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2017-10-13lightnvm: pblk: refactor rqd alloc/freeJavier González5-23/+32
Refactor the rqd allocation and free functions so that all I/O types can use these helper functions. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: improve naming for internal req.Javier González5-36/+41
Each request type sent to the LightNVM subsystem requires different metadata. Until now, we have tailored this metadata based on write, read and erase commands. However, pblk uses different metadata for internal writes that do not hit the write buffer. Instead of abusing the metadata for reads, create a new request type - internal write to improve code readability. In the process, create internal values for each I/O type instead of abusing the READ/WRITE macros, as suggested by Christoph. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: allocate bio size more accuratelyJavier González3-14/+15
Wait until we know the exact number of ppas to be sent to the device, before allocating the bio. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: simplify path on REQ_PREFLUSHJavier González2-8/+4
On REQ_PREFLUSH, directly tag the I/O context flags to signal a flush in the write to cache path, instead of finding the correct entry context and imposing a memory barrier. This simplifies the code and might potentially prevent race conditions when adding functionality to the write path. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: put bio on bio completionJavier González4-12/+8
Simplify put bio by doing it on bio end_io instead of manually putting it on the completion path. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: refactor read path on GCJavier González1-55/+39
Simplify the part of the garbage collector where data is read from the line being recycled and moved into an internal queue before being copied to the memory buffer. This allows to get rid of a dedicated function, which introduces an unnecessary dependency on the code. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: simplify data validity check on GCJavier González6-106/+109
When a line is selected for recycling by the garbage collector (GC), the line state changes and the invalid bitmap is frozen, preventing invalidations from happening. Throughout the GC, the L2P map is checked to verify that not data being recycled has been updated. The last check is done before the new map is being stored on the L2P table. Though this algorithm works, it requires a number of corner cases to be checked each time the L2P table is being updated. This complicates readability and is error prone in case that the recycling algorithm is modified. Instead, this patch makes the invalid bitmap accessible even when the line is being recycled. When recycled data is being remapped, it is enough to check the invalid bitmap for the line before updating the L2P table. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: refactor read lba sanity checkJavier González1-19/+10
Refactor lba sanity check on read path to avoid code duplication. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: normalize ppa namingsJavier González1-23/+25
Normalize the way we name ppa variables to improve code readability. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: use constant for GC max inflightJavier González2-3/+3
Use a constant to set the maximum number of inflight GC requests allowed. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: remove checks on mempool alloc.Javier González4-59/+12
As part of the mempool audit on pblk, remove unnecessary mempool allocation checks on mempools. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: do not use a mempool for line bitmapsJavier González5-50/+14
pblk holds two sector bitmaps: one to keep track of the mapped sectors while the line is active and another one to keep track of the invalid sectors. The latter is kept during the whole live of the line, until it is recycled. Since we cannot guarantee forward progress for the mempool in this case, get rid of the mempool and simply allocate memory through kmalloc. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: decouple read/erase mempoolsJavier González3-13/+22
Since read and erase paths offer different guarantees for inflight I/Os, separate the mempools to set the right min_nr for each on creation. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: simplify work_queue mempoolJavier González5-45/+47
In pblk, we have a mempool to allocate a generic structure that we pass along workqueues. This is heavily used in the GC path in order to have enough inflight reads and fully utilize the GC bandwidth. However, the current GC path copies data to the host memory and puts it back into the write buffer. This requires a vmalloc allocation for the data and a memory copy. Thus, guaranteeing the allocation by using a mempool for the structure in itself does not give us much. Until we implement support for vector copy to avoid moving data through the host, just allocate the workqueue structure using kmalloc. This allows us to have a much smaller mempool. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: fix min size for page mempoolJavier González4-12/+13
pblk uses an internal page mempool for allocating pages on internal bios. The main two users of this memory pool are partial reads (reads with some sectors in cache and some on media) and padded writes, which need to add dummy pages to an existing bio already containing valid data (and with a large enough bioset allocated). In both cases, the maximum number of pages per bio is defined by the maximum number of physical sectors supported by the underlying device. This patch fixes a bad mempool allocation, where the min_nr of elements on the pool was fixed (to 16), which is lower than the maximum number of sectors supported by NVMe (as of the time for this patch). Instead, use the maximum number of allowed sectors reported by the device. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: avoid deadlock on low LUN configJavier González3-1/+9
On low LUN configurations, make sure not to send bios that are bigger than the buffer size. Fixes: a4bd217b4326 ("lightnvm: physical block device (pblk) target") Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: fix write I/O sync statJavier González1-1/+1
Fix stat counter to collect the right number of I/Os being synced on the completion path. Fixes: 0880a9aa2d91f ("lightnvm: pblk: delete redundant buffer pointer") Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: free padded entries in write bufferJavier González2-2/+6
When a REQ_FLUSH reaches pblk, the bio cannot be directly completed. Instead, data on the write buffer is flushed and the bio is completed on the completion pah. This might require some sectors to be padded in order to guarantee a successful write. This patch fixes a memory leak on the padded pages. A consequence of this bad free was that internal bios not containing data (only a flush) were not being completed. Fixes: a4bd217b4326 ("lightnvm: physical block device (pblk) target") Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: use right flag for GC allocationJavier González1-2/+5
The data buffer for the GC path allocates virtual memory through vmalloc. When this change was introduced, a flag signaling kmalloc'ed memory was wrongly introduced. Use the right flag when creating a bio from this buffer. Fixes: de54e703a422 ("lightnvm: pblk: use vmalloc for GC data buffer") Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: initialize debug stat counterJavier González1-0/+1
Initialize the stat counter for garbage collected reads. Fixes: a4bd217b43268 ("lightnvm: physical block device (pblk) target") Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: reuse pblk_gc_should_kickRakesh Pandit3-27/+9
This is a trivial change which reuses pblk_gc_should_kick instead of repeating it again in pblk_rl_free_lines_inc. Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Made it apply to the common case. Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: print incompatible line version correctlyRakesh Pandit3-3/+4
Correct it by converting little endian to cpu endian and also define a macro for line version so that maintenance is easy. Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: improve error message if down_timeout failsRakesh Pandit1-10/+2
The two pr_err messages are useless as they don't differentiate error code. Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: fix message if L2P MAP is in deviceRakesh Pandit1-1/+1
This usually happens if we are developing with qemu and ll2pmode has default value. Improve description. Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: protect line bitmap while submitting meta ioRakesh Pandit1-0/+2
It seems pblk_dealloc_page would race against pblk_alloc_pages for line bitmap for sector allocation.The chances are very low but might as well protect the bitmap properly. Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: include NVM Express driver if OCSSD is selected for buildRakesh Pandit1-1/+2
Because NVM needs BLK_DEV_NVME, select it automatically if we mark NVM in config file before building kernel. Also append PCI to depends as select doesn't automatically add dependencies. Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: pblk: fix error path in pblk_lines_alloc_metadataRakesh Pandit1-1/+4
Use appropriate memory free calls based on allocation type used and also fix number of times free is called if kmalloc fails. Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: remove already calculated nr_chnlsRakesh Pandit1-1/+0
Remove repeated calculation for number of channels while creating a target device. Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: protect target type list with correct locksRakesh Pandit1-4/+4
nvm_tgt_types list was protected by wrong lock for NVM_INFO ioctl call and can race with addition or removal of target types. Also unregistering target type was not protected correctly. Fixes: 5cd907853 ("lightnvm: remove nested lock conflict with mm") Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: prevent bd removal if busyRakesh Pandit1-0/+14
When a virtual block device is formatted and mounted after creating with "nvme lnvm create... -t pblk", a removal from "nvm lnvm remove" would result in this: 446416.309757] bdi-block not registered [446416.309773] ------------[ cut here ]------------ [446416.309780] WARNING: CPU: 3 PID: 4319 at fs/fs-writeback.c:2159 __mark_inode_dirty+0x268/0x340 Ideally removal should return -EBUSY as block device is mounted after formatting. This patch tries to address this checking if whole device or any partition of it already mounted or not before removal. Whole device is checked using "bd_super" member of block device. This member is always set once block device has been mounted using a filesystem. Another member "bd_part_count" takes care of checking any if any partitions are under use. "bd_part_count" is only updated under locks when partitions are opened or closed (first open and last release). This at least does take care sending -EBUSY if removal is being attempted while whole block device or any partition is mounted. Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-13lightnvm: prevent target type module removal when in useRakesh Pandit3-0/+6
If target type module e.g. pblk here is unloaded (rmmod) while module is in use (after creating target) system crashes. We fix this by using module API refcnt. Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-12mtip32xx: Clean up unused variablesChristos Gkekas1-7/+0
Remove variables that are set but never used. Signed-off-by: Christos Gkekas <chris.gekas@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-12fs/block_dev: remove vfs_msg() interfaceRakesh Pandit2-23/+0
Replaced by pr_err usage in commit ef51042472f5 ("block, dax: move "select DAX" from BLOCK to FS_DAX") Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10block: set request_list for requestShaohua Li2-1/+6
Legacy queue sets request's request_list, mq doesn't. This makes mq does the same thing, so we can find cgroup of a request. Note, we really only use blkg field of request_list, it's pointless to allocate mempool for request_list in mq case. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10blk-stat: delete useless codeShaohua Li2-41/+9
Fix two issues: - the per-cpu stat flush is unnecessary, nobody uses per-cpu stat except sum it to global stat. We can do the calculation there. The flush just wastes cpu time. - some fields are signed int/s64. I don't see the point. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10blk-throttle: fix null pointer dereference while throttling writeback IOsJiufei Xue1-2/+10
A null pointer dereference can occur when blkcg is removed manually with writeback IOs inflight. This is caused by the following case: Writeback kworker submit the bio and set bio->bi_cg_private to tg in blk_throtl_assoc_bio. Then we remove the block cgroup manually, the blkg and tg would be freed if there is no request inflight. When the submitted bio come back, blk_throtl_bio_endio() fetch the tg which was already freed. Fix this by increasing the refcount of blkg in funcion blk_throtl_assoc_bio() so that the blkg will not be freed until the bio_endio called. Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: Jiufei Xue <jiufei.xjf@alibaba-inc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10blkcg: check pol->cpd_free_fn before free cpdweiping zhang1-2/+2
check pol->cpd_free_fn() instead of pol->cpd_alloc_fn() when free cpd. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10writeback: merge try_to_writeback_inodes_sb_nr() into callerRakesh Pandit2-27/+7
Since commit 925a6efb8ff0c ("Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc") this function hasn't been used outside so stop exporting it. In addition we merge it into try_to_writeback_inodes_sb() which is the only caller. Also change return type of try_to_writeback_inodes_sb to void as the only user ext4 doesn't care. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-09block, bfq: fix unbalanced decrements of burst sizePaolo Valente1-2/+57
The commit "block, bfq: decrease burst size when queues in burst exit" introduced the decrement of burst_size on the removal of a bfq_queue from the burst list. Unfortunately, this decrement can happen to be performed even when burst size is already equal to 0, because of unbalanced decrements. A description follows of the cause of these unbalanced decrements, namely a wrong assumption, and of the way how this wrong assumption leads to unbalanced decrements. The wrong assumption is that a bfq_queue can exit only if the process associated with the bfq_queue has exited. This is false, because a bfq_queue, say Q, may exit also as a consequence of a merge with another bfq_queue. In this case, Q exits because the I/O of its associated process has been redirected to another bfq_queue. The decrement unbalance occurs because Q may then be re-created after a split, and added back to the current burst list, *without* incrementing burst_size. burst_size is not incremented because Q is not a new bfq_queue added to the burst list, but a bfq_queue only temporarily removed from the list, and, before the commit "bfq-sq, bfq-mq: decrease burst size when queues in burst exit", burst_size was not decremented when Q was removed. This commit addresses this issue by just checking whether the exiting bfq_queue is a merged bfq_queue, and, in that case, not decrementing burst_size. Unfortunately, this still leaves room for unbalanced decrements, in the following rarer case: on a split, the bfq_queue happens to be inserted into a different burst list than that it was removed from when merged. If this happens, the number of elements in the new burst list becomes higher than burst_size (by one). When the bfq_queue then exits, it is of course not in a merged state any longer, thus burst_size is decremented, which results in an unbalanced decrement. To handle this sporadic, unlucky case in a simple way, this commit also checks that burst_size is larger than 0 before decrementing it. Finally, this commit removes an useless, extra check: the check that the bfq_queue is sync, performed before checking whether the bfq_queue is in the burst list. This extra check is redundant, because only sync bfq_queues can be inserted into the burst list. Fixes: 7cb04004fa37 ("block, bfq: decrease burst size when queues in burst exit") Reported-by: Philip Müller <philm@manjaro.org> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com> Tested-by: Philip Müller <philm@manjaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-09block,bfq: Disable writeback throttlingLuca Miccio2-2/+3
Similarly to CFQ, BFQ has its write-throttling heuristics, and it is better not to combine them with further write-throttling heuristics of a different nature. So this commit disables write-back throttling for a device if BFQ is used as I/O scheduler for that device. Signed-off-by: Luca Miccio <lucmiccio@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-09writeback: schedule periodic writeback with sysctlYafang Shao1-2/+8
After disable periodic writeback by writing 0 to dirty_writeback_centisecs, the handler wb_workfn() will not be entered again until the dirty background limit reaches or sync syscall is executed or no enough free memory available or vmscan is triggered. So the periodic writeback can't be enabled by writing a non-zero value to dirty_writeback_centisecs. As it can be disabled by sysctl, it should be able to enable by sysctl as well. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-06block/bio: Remove null checks before mempool_destroy in bioset_freeTim Hansen1-5/+2
This patch removes redundant checks for null values on bio_pool and bvec_pool. Found using make coccicheck M=block/ on linux-net tree on the next-20170929 tag. Signed-off-by: Tim Hansen <devtimhansen@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-06block: remove unnecessary NULL checks in bioset_integrity_free()Tim Hansen1-5/+2
mempool_destroy() already checks for a NULL value being passed in, this eliminates duplicate checks. This was caught by running make coccicheck M=block/ on linus' tree on commit 77ede3a014a32746002f7889211f0cecf4803163 (current head as of this patch). Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Tim Hansen <devtimhansen@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-06backing-dev: kill unused pdflush_proc_obsolete()Jens Axboe2-22/+0
After commit b35bd0d9f8a8, pdflush_proc_obsolete() is no longer used. Kill the function and declaration. Reported-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-05block: remove QUEUE_FLAG_STACKABLEChristoph Hellwig6-33/+3
We already have a queue_is_rq_based helper to check if a request_queue is request based, so we can remove the flag for it. Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-04sysctl: remove /proc/sys/vm/nr_pdflush_threadsJens Axboe2-10/+0
This tunable has been obsolete since 2.6.32, and writes to the file have been failing and complaining in dmesg since then: nr_pdflush_threads exported in /proc is scheduled for removal That was 8 years ago. Remove the file ABI obsolete notice, and the sysfs file. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-04writeback: eliminate work item allocation in bd_start_writeback()Jens Axboe4-60/+57
Handle start-all writeback like we do periodic or kupdate style writeback - by marking the bdi_writeback as needing a full flush, and simply waking the thread. This eliminates the need to allocate and queue a specific work item just for this purpose. After this change, we truly only ever have one of them running at any point in time. We mark the need to start all flushes, and the writeback thread will clear it once it has processed the request. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-04blk-mq: document the need to have STARTED and COMPLETED share a byteJens Axboe2-0/+13
For memory ordering guarantees on stores, we need to ensure that these two bits share the same byte of storage in the unsigned long. Add a comment as to why, and a BUILD_BUG_ON() to ensure that we don't violate this requirement. Suggested-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-04blk-mq: attempt to fix atomic flag memory orderingPeter Zijlstra2-13/+41
Attempt to untangle the ordering in blk-mq. The patch introducing the single smp_mb__before_atomic() is obviously broken in that it doesn't clearly specify a pairing barrier and an obtained guarantee. The comment is further misleading in that it hints that the deadline store and the COMPLETE store also need to be ordered, but AFAICT there is no such dependency. However what does appear to be important is the clear happening _after_ the store, and that worked by pure accident. This clarifies blk_mq_start_request() -- we should not get there with STARTING set -- this simplifies the code and makes the barrier usage sane (the old code could be read to allow not having _any_ atomic after the barrier, in which case the barrier hasn't got anything to order). We then also introduce the missing pairing barrier for it. Also down-grade the barrier to smp_wmb(), this is cheaper for PowerPC/ARM and doesn't cost anything extra on x86. And it documents the STARTING vs COMPLETE ordering. Although I've not been entirely successful in reverse engineering the blk-mq state machine so there might still be more funnies around timeout vs requeue. If I got anything wrong, feel free to educate me by adding comments to clarify things ;-) Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Will Deacon <will.deacon@arm.com> Cc: Ming Lei <tom.leiming@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Fixes: 538b75341835 ("blk-mq: request deadline must be visible before marking rq as started") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-03block: move __elv_next_request to blk-core.cChristoph Hellwig2-41/+40
No need to have this helper inline in a header. Also drop the __ prefix. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>