summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2018-01-12blk-mq: simplify queue mapping & schedule with each possisble CPUChristoph Hellwig1-11/+8
The previous patch assigns interrupt vectors to all possible CPUs, so now hctx can be mapped to possible CPUs, this patch applies this fact to simplify queue mapping & schedule so that we don't need to handle CPU hotplug for dealing with physical CPU plug & unplug. With this simplication, we can work well on physical CPU plug & unplug, which is a normal use case for VM at least. Make sure we allocate blk_mq_ctx structures for all possible CPUs, and set hctx->numa_node for possible CPUs which are mapped to this hctx. And only choose the online CPUs for schedule. Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU") (merged the three into one because any single one may not work, and fix selecting online CPUs for scheduler) Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-12genirq/affinity: assign vectors to all possible CPUsChristoph Hellwig1-15/+15
Currently we assign managed interrupt vectors to all present CPUs. This works fine for systems were we only online/offline CPUs. But in case of systems that support physical CPU hotplug (or the virtualized version of it) this means the additional CPUs covered for in the ACPI tables or on the command line are not catered for. To fix this we'd either need to introduce new hotplug CPU states just for this case, or we can start assining vectors to possible but not present CPUs. Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com> Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU") Cc: linux-kernel@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-11blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()Bart Van Assche1-34/+35
This patch does not change any functionality but makes the blk_mq_mark_tag_wait() code slightly easier to read. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-11null_blk: remove explicit 'select FAULT_INJECTION'Arnd Bergmann2-1/+12
Selecting FAULT_INJECTION causes a Kconfig warning when CONFIG_DEBUG_KERNEL is not set: warning: (BLK_DEV_NULL_BLK && DRM_I915_SELFTEST) selects FAULT_INJECTION which has unmet direct dependencies (DEBUG_KERNEL) The other drivers that use FAULT_INJECTION tend to have a separate Kconfig symbol for turning on that feature, so let's do the same thing here. This may add a bit more complexity than we like, but it avoids the warning and is more consistent with the rest of the kernel. Fixes: 93b570464cce ("null_blk: add option for managing IO timeouts") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10blk-mq: Add locking annotations to hctx_lock() and hctx_unlock()Bart Van Assche1-0/+2
This patch avoids that sparse reports the following: block/blk-mq.c:637:33: warning: context imbalance in 'hctx_unlock' - unexpected unlock block/blk-mq.c:642:9: warning: context imbalance in 'hctx_lock' - wrong count at exit Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10block: silently forbid sending any ioctl to a partitionPaolo Bonzini1-29/+0
After the first few months, the message has not led to many bug reports. It's been almost five years now, and in practice the main source of it seems to be MTIOCGET that someone is using to detect tape devices. While we could whitelist it just like CDROM_GET_CAPABILITY, this patch just removes the message altogether. The patch also removes the "safe but not very useful" ioctl whitelist, as suggested by Christoph. I doubt anything is using most of those ioctls _in general_, let alone on a partition. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10block: rearrange a few request fields for better cache layoutJens Axboe2-22/+24
Move completion related items (like the call single data) near the end of the struct, instead of mixing them in with the initial queueing related fields. Move queuelist below the bio structures. Then we have all queueing related bits in the first cache line. This yields a 1.5-2% increase in IOPS for a null_blk test, both for sync and for high thread count access. Sync test goes form 975K to 992K, 32-thread case from 20.8M to 21.2M IOPS. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10block: convert REQ_ATOM_COMPLETE to stealing rq->__deadline bitJens Axboe5-22/+12
We only have one atomic flag left. Instead of using an entire unsigned long for that, steal the bottom bit of the deadline field that we already reserved. Remove ->atomic_flags, since it's now unused. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10block: add accessors for setting/querying request deadlineJens Axboe4-8/+27
We reduce the resolution of request expiry, but since we're already using jiffies for this where resolution depends on the kernel configuration and since the timeout resolution is coarse anyway, that should be fine. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10block: remove REQ_ATOM_POLL_SLEPTJens Axboe4-6/+5
We don't need this to be an atomic flag, it can be a regular flag. We either end up on the same CPU for the polling, in which case the state is sane, or we did the sleep which would imply the needed barrier to ensure we see the right state. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10blk-mq: add a few missing debugfs RQF_ flagsJens Axboe1-0/+2
We are missing ZONE_WRITE_LOCKED and MQ_TIMEOUT_EXPIRED, add them so the debugfs bits can decode them. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10dm mpath: Use blk_path_errorKeith Busch1-17/+2
Uses common code for determining if an error should be retried on alternate path. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10nvme/multipath: Use blk_path_errorKeith Busch1-13/+1
Uses common code for determining if an error should be retried on alternate path. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10block: Provide blk_status_t decoding for path errorsKeith Busch1-0/+28
This patch provides a common decoder for block status path related errors that may be retried so various entities wishing to consult this do not have to duplicate this decision. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10nvme/multipath: Consult blk_status_t for failoverKeith Busch3-42/+16
This removes nvme multipath's specific status decoding to see if failover is needed, using the generic blk_status_t that was decoded earlier. This abstraction from the raw NVMe status means all status decoding exists in one place. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10nvme: Add more command status translationKeith Busch1-0/+7
This adds more NVMe status code translations to blk_status_t values, and captures all the current status codes NVMe multipath uses. Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10blk-mq: Explain when 'active_queues' is decrementedBart Van Assche1-0/+6
It is nontrivial to derive from the blk-mq source code when blk_mq_tags.active_queues is decremented. Hence add a comment that explains this. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10partitions/msdos: Unable to mount UFS 44bsd partitionsRichard Narron1-1/+3
UFS partitions from newer versions of FreeBSD 10 and 11 use relative addressing for their subpartitions. But older versions of FreeBSD still use absolute addressing just like OpenBSD and NetBSD. Instead of simply testing for a FreeBSD partition, the code needs to also test if the starting offset of the C subpartition is zero. https://bugzilla.kernel.org/show_bug.cgi?id=197733 Signed-off-by: Richard Narron <comet.berkeley@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10null_blk: add option for managing IO timeoutsJens Axboe2-4/+43
Use the fault injection framework to provide a way for null_blk to configure timeouts. This only works for queue_mode 1 and 2, since the bio mode doesn't have code for tracking timeouts. Let's say you want to have a 10% chance of timing out every 100,000 requests, and for 5 total timeouts, you could do: modprobe null_blk timeout="100000,10,0,5" This is useful for adding blktests to test that IO timeouts are handled appropriately. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10block, bfq: fix occurrences of request finish method's old nameChiara Bruschi1-13/+13
Commit '7b9e93616399' ("blk-mq-sched: unify request finished methods") changed the old name of current bfq_finish_request method, but left it unchanged elsewhere in the code (related comments, part of function name bfq_put_rq_priv_body). This commit fixes all occurrences of the old name of this method by changing them into the current name. Fixes: 7b9e93616399 ("blk-mq-sched: unify request finished methods") Reviewed-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Federico Motta <federico@willer.it> Signed-off-by: Chiara Bruschi <bruschi.chiara@outlook.it> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09Revert "block: blk-merge: try to make front segments in full size"Ming Lei1-49/+5
This reverts commit a2d37968d784363842f87820a21e106741d28004. If max segment size isn't 512-aligned, this patch won't work well. Also once multipage bvec is enabled, adjacent bvecs won't be physically contiguous if page is added via bio_add_page(), so we don't need this kind of complicated logic. Reported-by: Dmitry Osipenko <digetx@gmail.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09null_blk: wire up timeoutsJens Axboe1-0/+16
This is needed to ensure that we actually handle timeouts. Without it, the queue_mode=1 path will never call blk_add_timer(), and the queue_mode=2 path will continually just return EH_RESET_TIMER and we never actually complete the offending request. This was used to test the new timeout code, and the changes around killing off REQ_ATOM_COMPLETE. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09bfq-iosched: don't call bfqg_and_blkg_put for !CONFIG_BFQ_GROUP_IOSCHEDJens Axboe1-1/+1
It's not available if we don't have group io scheduling set, and there's no need to call it. Fixes: 0d52af590552 ("block, bfq: release oom-queue ref to root group on exit") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09bcache: closures: move control bits one bit rightMichael Lyle1-4/+4
Otherwise, architectures that do negated adds of atomics (e.g. s390) to do atomic_sub fail in closure_set_stopped. Signed-off-by: Michael Lyle <mlyle@lyle.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09block: Fix kernel-doc warnings reported when building with W=1Bart Van Assche2-3/+5
Commit 3a025e1d1c2e ("Add optional check for bad kernel-doc comments") causes W=1 the kernel-doc script to be run and thereby causes several new warnings to appear when building the kernel with W=1. Fix the block layer kernel-doc headers such that the block layer again builds cleanly with W=1. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: Fix spelling in a source code commentBart Van Assche1-2/+2
Change "nedeing" into "needing" and "caes" into "cases". Fixes: commit f906a6a0f426 ("blk-mq: improve tag waiting setup for non-shared tags") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: silence false positive warnings in hctx_unlock()Jens Axboe1-2/+4
In some stupider versions of gcc, it complains: block/blk-mq.c: In function ‘blk_mq_complete_request’: ./include/linux/srcu.h:175:2: warning: ‘srcu_idx’ may be used uninitialized in this function [-Wmaybe-uninitialized] __srcu_read_unlock(sp, idx); ^ block/blk-mq.c:620:6: note: ‘srcu_idx’ was declared here int srcu_idx; ^ which is completely bogus, since we only use srcu_idx when hctx->flags & BLK_MQ_F_BLOCKING is set, and that's the case where hctx_lock() has initialized it. Just set it to '0' in the normal path in hctx_lock() to silence this annoying warning. Fixes: 04ced159cec8 ("blk-mq: move hctx lock/unlock into a helper") Fixes: 5197c05e16b4 ("blk-mq: protect completion path with RCU") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcuTejun Heo2-8/+8
The RCU protection has been expanded to cover both queueing and completion paths making ->queue_rq_srcu a misnomer. Rename it to ->srcu as suggested by Bart. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Bart Van Assche <Bart.VanAssche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: remove REQ_ATOM_STARTEDTejun Heo4-33/+10
After the recent updates to use generation number and state based synchronization, we can easily replace REQ_ATOM_STARTED usages by adding an extra state to distinguish completed but not yet freed state. Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with blk_mq_rq_state() tests. REQ_ATOM_STARTED no longer has any users left and is removed. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mqTejun Heo3-8/+10
After the recent updates to use generation number and state based synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except to avoid firing the same timeout multiple times. Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple times. This removes atomic bitops from hot paths too. v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out(). v3: Added RQF_MQ_TIMEOUT_EXPIRED flag. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "jianchao.wang" <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: make blk_abort_request() trigger timeout pathTejun Heo3-7/+10
With issue/complete and timeout paths now using the generation number and state based synchronization, blk_abort_request() is the only one which depends on REQ_ATOM_COMPLETE for arbitrating completion. There's no reason for blk_abort_request() to be a completely separate path. This patch makes blk_abort_request() piggyback on the timeout path instead of trying to terminate the request directly. This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq. Note that this makes blk_abort_request() asynchronous - it initiates abortion but the actual termination will happen after a short while, even when the caller owns the request. AFAICS, SCSI and ATA should be fine with that and I think mtip32xx and dasd should be safe but not completely sure. It'd be great if people who know the drivers take a look. v2: - Add comment explaining the lack of synchronization around ->deadline update as requested by Bart. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Asai Thambi SP <asamymuthupa@micron.com> Cc: Stefan Haberland <sth@linux.vnet.ibm.com> Cc: Jan Hoeppner <hoeppner@linux.vnet.ibm.com> Cc: Bart Van Assche <Bart.VanAssche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETETejun Heo1-3/+3
blk_mq_check_inflight() and blk_mq_poll_hybrid_sleep() test REQ_ATOM_COMPLETE to determine the request state. Both uses are speculative and we can test REQ_ATOM_STARTED and blk_mq_rq_state() for equivalent results. Replace the tests. This will allow removing REQ_ATOM_COMPLETE usages from blk-mq. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: replace timeout synchronization with a RCU and generation based schemeTejun Heo7-79/+230
Currently, blk-mq timeout path synchronizes against the usual issue/completion path using a complex scheme involving atomic bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence rules. Unfortunately, it contains quite a few holes. There's a complex dancing around REQ_ATOM_STARTED and REQ_ATOM_COMPLETE between issue/completion and timeout paths; however, they don't have a synchronization point across request recycle instances and it isn't clear what the barriers add. blk_mq_check_expired() can easily read STARTED from N-2'th iteration, deadline from N-1'th, blk_mark_rq_complete() against Nth instance. In fact, it's pretty easy to make blk_mq_check_expired() terminate a later instance of a request. If we induce 5 sec delay before time_after_eq() test in blk_mq_check_expired(), shorten the timeout to 2s, and issue back-to-back large IOs, blk-mq starts timing out requests spuriously pretty quickly. Nothing actually timed out. It just made the call on a recycle instance of a request and then terminated a later instance long after the original instance finished. The scenario isn't theoretical either. This patch replaces the broken synchronization mechanism with a RCU and generation number based one. 1. Each request has a u64 generation + state value, which can be updated only by the request owner. Whenever a request becomes in-flight, the generation number gets bumped up too. This provides the basis for the timeout path to distinguish different recycle instances of the request. Also, marking a request in-flight and setting its deadline are protected with a seqcount so that the timeout path can fetch both values coherently. 2. The timeout path fetches the generation, state and deadline. If the verdict is timeout, it records the generation into a dedicated request abortion field and does RCU wait. 3. The completion path is also protected by RCU (from the previous patch) and checks whether the current generation number and state match the abortion field. If so, it skips completion. 4. The timeout path, after RCU wait, scans requests again and terminates the ones whose generation and state still match the ones requested for abortion. By now, the timeout path knows that either the generation number and state changed if it lost the race or the completion will yield to it and can safely timeout the request. While it's more lines of code, it's conceptually simpler, doesn't depend on direct use of subtle memory ordering or coherence, and hopefully doesn't terminate the wrong instance. While this change makes REQ_ATOM_COMPLETE synchronization unnecessary between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't removed yet as it's still used in other places. Future patches will move all state tracking to the new mechanism and remove all bitops in the hot paths. Note that this patch adds a comment explaining a race condition in BLK_EH_RESET_TIMER path. The race has always been there and this patch doesn't change it. It's just documenting the existing race. v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao. - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter. - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter. v3: - Fixed possible extended seqcount / u64_stats_sync read looping spotted by Peter. - MQ_RQ_IDLE was incorrectly being set in complete_request instead of free_request. Fixed. v4: - Rebased on top of hctx_lock() refactoring patch. - Added comment explaining the use of hctx_lock() in completion path. v5: - Added comments requested by Bart. - Note the addition of BLK_EH_RESET_TIMER race condition in the commit message. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "jianchao.wang" <jianchao.w.wang@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <Bart.VanAssche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: protect completion path with RCUTejun Heo1-0/+5
Currently, blk-mq protects only the issue path with RCU. This patch puts the completion path under the same RCU protection. This will be used to synchronize issue/completion against timeout by later patches, which will also add the comments. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: move hctx lock/unlock into a helperJens Axboe1-34/+32
Move the RCU vs SRCU logic into lock/unlock helpers, which makes the actual functional bits within the locked region much easier to read. tj: Reordered in front of timeout revamp patches and added the missing blk_mq_run_hw_queue() conversion. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09block, bfq: release oom-queue ref to root group on exitPaolo Valente1-0/+3
On scheduler init, a reference to the root group, and a reference to its corresponding blkg are taken for the oom queue. Yet these references are not released on scheduler exit, which prevents these objects from be freed. This commit adds the missing reference releases. Reported-by: Davide Ferrari <davideferrari8@gmail.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09block, bfq: put async queues for root bfq groups tooPaolo Valente1-2/+5
For each pair [device for which bfq is selected as I/O scheduler, group in blkio/io], bfq maintains a corresponding bfq group. Each such bfq group contains a set of async queues, with each async queue created on demand, i.e., when some I/O request arrives for it. On creation, an async queue gets an extra reference, to make sure that the queue is not freed as long as its bfq group exists. Accordingly, to allow the queue to be freed after the group exited, this extra reference must released on group exit. The above holds also for a bfq root group, i.e., for the bfq group corresponding to the root blkio/io root for a given device. Yet, by mistake, the references to the existing async queues of a root group are not released when the latter exits. This causes a memory leak when the instance of bfq for a given device exits. In a similar vein, bfqg_stats_xfer_dead is not executed for a root group. This commit fixes bfq_pd_offline so that the latter executes the above missing operations for a root group too. Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reported-by: Guoqing Jiang <gqjiang@suse.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Davide Ferrari <davideferrari8@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-09blk-mq: fix kernel oops in blk_mq_tag_idle()Ming Lei1-1/+2
HW queues may be unmapped in some cases, such as blk_mq_update_nr_hw_queues(), then we need to check it before calling blk_mq_tag_idle(), otherwise the following kernel oops can be triggered, so fix it by checking if the hw queue is unmapped since it doesn't make sense to idle the tags any more after hw queues are unmapped. [ 440.771298] Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma] [ 440.779104] task: ffff894bae755ee0 ti: ffff893bf9bc8000 task.ti: ffff893bf9bc8000 [ 440.788359] RIP: 0010:[<ffffffffb730e2b4>] [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40 [ 440.798697] RSP: 0018:ffff893bf9bcbd10 EFLAGS: 00010286 [ 440.805538] RAX: 0000000000000000 RBX: ffff895bb131dc00 RCX: 000000000000011f [ 440.814426] RDX: 00000000ffffffff RSI: 0000000000000120 RDI: ffff895bb131dc00 [ 440.823301] RBP: ffff893bf9bcbd10 R08: 000000000001b860 R09: 4a51d361c00c0000 [ 440.832193] R10: b5907f32b4cc7003 R11: ffffd6cabfb57000 R12: ffff894bafd1e008 [ 440.841091] R13: 0000000000000001 R14: ffff895baf770000 R15: 0000000000000080 [ 440.849988] FS: 0000000000000000(0000) GS:ffff894bbdcc0000(0000) knlGS:0000000000000000 [ 440.859955] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 440.867274] CR2: 0000000000000008 CR3: 000000103d098000 CR4: 00000000001407e0 [ 440.876169] Call Trace: [ 440.879818] [<ffffffffb7309d68>] blk_mq_exit_hctx+0xd8/0xe0 [ 440.887051] [<ffffffffb730dc40>] blk_mq_free_queue+0xf0/0x160 [ 440.894465] [<ffffffffb72ff679>] blk_cleanup_queue+0xd9/0x150 [ 440.901881] [<ffffffffc08a802b>] nvme_ns_remove+0x5b/0xb0 [nvme_core] [ 440.910068] [<ffffffffc08a811b>] nvme_remove_namespaces+0x3b/0x60 [nvme_core] [ 440.919026] [<ffffffffc08b817b>] __nvme_rdma_remove_ctrl+0x2b/0xb0 [nvme_rdma] [ 440.928079] [<ffffffffc08b8237>] nvme_rdma_del_ctrl_work+0x17/0x20 [nvme_rdma] [ 440.937126] [<ffffffffb70ab58a>] process_one_work+0x17a/0x440 [ 440.944517] [<ffffffffb70ac3a8>] worker_thread+0x278/0x3c0 [ 440.951607] [<ffffffffb70ac130>] ? manage_workers.isra.24+0x2a0/0x2a0 [ 440.959760] [<ffffffffb70b352f>] kthread+0xcf/0xe0 [ 440.966055] [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40 [ 440.973715] [<ffffffffb76d8658>] ret_from_fork+0x58/0x90 [ 440.980586] [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40 [ 440.988229] Code: 5b 41 5c 5d c3 66 90 0f 1f 44 00 00 48 8b 87 20 01 00 00 f0 0f ba 77 40 01 19 d2 85 d2 75 08 c3 0f 1f 80 00 00 00 00 55 48 89 e5 <f0> ff 48 08 48 8d 78 10 e8 7f 0f 05 00 5d c3 0f 1f 00 66 2e 0f [ 441.011620] RIP [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40 [ 441.019301] RSP <ffff893bf9bcbd10> [ 441.024052] CR2: 0000000000000008 Reported-by: Zhang Yi <yizhan@redhat.com> Tested-by: Zhang Yi <yizhan@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: fix writeback target calc on large devicesMichael Lyle2-4/+34
Bcache needs to scale the dirty data in the cache over the multiple backing disks in order to calculate writeback rates for each. The previous code did this by multiplying the target number of dirty sectors by the backing device size, and expected it to fit into a uint64_t; this blows up on relatively small backing devices. The new approach figures out the bdev's share in 16384ths of the overall cached data. This is chosen to cope well when bdevs drastically vary in size and to ensure that bcache can cross the petabyte boundary for each backing device. This has been improved based on Tang Junhui's feedback to ensure that every device gets a share of dirty data, no matter how small it is compared to the total backing pool. The existing mechanism is very limited; this is purely a bug fix to remove limits on volume size. However, there still needs to be change to make this "fair" over many volumes where some are idle. Reported-by: Jack Douglas <jack@douglastechnology.co.uk> Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: fix misleading error message in bch_count_io_errors()Coly Li4-7/+16
Bcache only does recoverable I/O for read operations by calling cached_dev_read_error(). For write opertions there is no I/O recovery for failed requests. But in bch_count_io_errors() no matter read or write I/Os, before errors counter reaches io error limit, pr_err() always prints "IO error on %, recoverying". For write requests this information is misleading, because there is no I/O recovery at all. This patch adds a parameter 'is_read' to bch_count_io_errors(), and only prints "recovering" by pr_err() when the bio direction is READ. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: reduce cache_set devices iteration by devices_max_usedColy Li4-5/+9
Member devices of struct cache_set is used to reference all attached bcache devices to this cache set. If it is treated as array of pointers, size of devices[] is indicated by member nr_uuids of struct cache_set. nr_uuids is calculated in drivers/md/super.c:bch_cache_set_alloc(), bucket_bytes(c) / sizeof(struct uuid_entry) Bucket size is determined by user space tool "make-bcache", by default it is 1024 sectors (defined in bcache-tools/make-bcache.c:main()). So default nr_uuids value is 4096 from the above calculation. Every time when bcache code iterates bcache devices of a cache set, all the 4096 pointers are checked even only 1 bcache device is attached to the cache set, that's a wast of time and unncessary. This patch adds a member devices_max_used to struct cache_set. Its value is 1 + the maximum used index of devices[] in a cache set. When iterating all valid bcache devices of a cache set, use c->devices_max_used in for-loop may reduce a lot of useless checking. Personally, my motivation of this patch is not for performance, I use it in bcache debugging, which helps me to narrow down the scape to check valid bcached devices of a cache set. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: fix unmatched generic_end_io_acct() & generic_start_io_acct()Zhai Zhaoxuan1-2/+2
The function cached_dev_make_request() and flash_dev_make_request() call generic_start_io_acct() with (struct bcache_device)->disk when they start a closure. Then the function bio_complete() calls generic_end_io_acct() with (struct search)->orig_bio->bi_disk when the closure has done. Since the `bi_disk` is not the bcache device, the generic_end_io_acct() is called with a wrong device queue. It causes the "inflight" (in struct hd_struct) counter keep increasing without decreasing. This patch fix the problem by calling generic_end_io_acct() with (struct bcache_device)->disk. Signed-off-by: Zhai Zhaoxuan <kxuanobj@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: mark closure_sync() __schedKent Overstreet1-1/+2
[edit by mlyle: include sched/debug.h to get __sched] Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: Fix, improve efficiency of closure_sync()Kent Overstreet2-60/+46
Eliminates cases where sync can race and fail to complete / get stuck. Removes many status flags and simplifies entering-and-exiting closure sleeping behaviors. [mlyle: fixed conflicts due to changed return behavior in mainline. extended commit comment, and squashed down two commits that were mostly contradictory to get to this state. Changed __set_current_state to set_current_state per Jens review comment] Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: allow quick writeback when backing idleMichael Lyle3-0/+29
If the control system would wait for at least half a second, and there's been no reqs hitting the backing disk for awhile: use an alternate mode where we have at most one contiguous set of writebacks in flight at a time. (But don't otherwise delay). If front-end IO appears, it will still be quick, as it will only have to contend with one real operation in flight. But otherwise, we'll be sending data to the backing disk as quickly as it can accept it (with one op at a time). Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: writeback: properly order backing device IOMichael Lyle2-0/+37
Writeback keys are presently iterated and dispatched for writeback in order of the logical block address on the backing device. Multiple may be, in parallel, read from the cache device and then written back (especially when there are contiguous I/O). However-- there was no guarantee with the existing code that the writes would be issued in LBA order, as the reads from the cache device are often re-ordered. In turn, when writing back quickly, the backing disk often has to seek backwards-- this slows writeback and increases utilization. This patch introduces an ordering mechanism that guarantees that the original order of issue is maintained for the write portion of the I/O. Performance for writeback is significantly improved when there are multiple contiguous keys or high writeback rates. Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Tested-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: fix wrong return value in bch_debug_init()Tang Junhui4-47/+87
in bch_debug_init(), ret is always 0, and the return value is useless, change it to return 0 if be success after calling debugfs_create_dir(), else return a non-zero value. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: segregate flash only volume write streamsTang Junhui1-5/+14
In such scenario that there are some flash only volumes , and some cached devices, when many tasks request these devices in writeback mode, the write IOs may fall to the same bucket as bellow: | cached data | flash data | cached data | cached data| flash data| then after writeback of these cached devices, the bucket would be like bellow bucket: | free | flash data | free | free | flash data | So, there are many free space in this bucket, but since data of flash only volumes still exists, so this bucket cannot be reclaimable, which would cause waste of bucket space. In this patch, we segregate flash only volume write streams from cached devices, so data from flash only volumes and cached devices can store in different buckets. Compare to v1 patch, this patch do not add a additionally open bucket list, and it is try best to segregate flash only volume write streams from cached devices, sectors of flash only volumes may still be mixed with dirty sectors of cached device, but the number is very small. [mlyle: fixed commit log formatting, permissions, line endings] Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: Use PTR_ERR_OR_ZERO()Vasyl Gomonovych1-4/+1
Fix ptr_ret.cocci warnings: drivers/md/bcache/btree.c:1800:1-3: WARNING: PTR_ERR_OR_ZERO can be used Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR Generated by: scripts/coccinelle/api/ptr_ret.cocci Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-08bcache: stop writeback thread after detachingTang Junhui1-0/+6
Currently, when a cached device detaching from cache, writeback thread is not stopped, and writeback_rate_update work is not canceled. For example, after the following command: echo 1 >/sys/block/sdb/bcache/detach you can still see the writeback thread. Then you attach the device to the cache again, bcache will create another writeback thread, for example, after below command: echo ba0fb5cd-658a-4533-9806-6ce166d883b9 > /sys/block/sdb/bcache/attach then you will see 2 writeback threads. This patch stops writeback thread and cancels writeback_rate_update work when cached device detaching from cache. Compare with patch v1, this v2 patch moves code down into the register lock for safety in case of any future changes as Coly and Mike suggested. [edit by mlyle: commit log spelling/formatting] Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>