summaryrefslogtreecommitdiffstats
path: root/block
AgeCommit message (Collapse)AuthorFilesLines
2018-03-27blk-mq: Allow PCI vector offset for mapping queuesKeith Busch1-2/+4
The PCI interrupt vectors intended to be associated with a queue may not start at 0; a driver may allocate pre_vectors for special use. This patch adds an offset parameter so blk-mq may find the intended affinity mask and updates all drivers using this API accordingly. Cc: Don Brace <don.brace@microsemi.com> Cc: <qla2xxx-upstream@qlogic.com> Cc: <linux-scsi@vger.kernel.org> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26block, bfq: lower-bound the estimated peak rate to 1Paolo Valente2-2/+25
If a storage device handled by BFQ happens to be slower than 7.5 KB/s for a certain amount of time (in the order of a second), then the estimated peak rate of the device, maintained in BFQ, becomes equal to 0. The reason is the limited precision with which the rate is represented (details on the range of representable values in the comments introduced by this commit). This leads to a division-by-zero error where the estimated peak rate is used as divisor. Such a type of failure has been reported in [1]. This commit addresses this issue by: 1. Lower-bounding the estimated peak rate to 1 2. Adding and improving comments on the range of rates representable [1] https://www.spinics.net/lists/kernel/msg2739205.html Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-21Fix slab name "biovec-(1<<(21-12))"Mikulas Patocka1-2/+2
I'm getting a slab named "biovec-(1<<(21-12))". It is caused by unintended expansion of the macro BIO_MAX_PAGES. This patch renames it to biovec-max. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-19block: Change a rcu_read_{lock,unlock}_sched() pair into ↵Bart Van Assche1-2/+2
rcu_read_{lock,unlock}() scsi_device_quiesce() uses synchronize_rcu() to guarantee that the effect of blk_set_preempt_only() will be visible for percpu_ref_tryget() calls that occur after the queue unfreeze by using the approach explained in https://lwn.net/Articles/573497/. The rcu read lock and unlock calls in blk_queue_enter() form a pair with the synchronize_rcu() call in scsi_device_quiesce(). Both scsi_device_quiesce() and blk_queue_enter() must either use regular RCU or RCU-sched. Since neither the RCU-protected code in blk_queue_enter() nor blk_queue_usage_counter_release() sleeps, regular RCU protection is sufficient. Note: scsi_device_quiesce() does not have to be modified since it already uses synchronize_rcu(). Reported-by: Tejun Heo <tj@kernel.org> Fixes: 3a0a529971ec ("block, scsi: Make SCSI quiesce and resume work reliably") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: Martin Steigerwald <martin@lichtvoll.de> Cc: stable@vger.kernel.org # v4.15 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-17block: bio_check_eod() needs to consider partitionsChristoph Hellwig1-53/+40
bio_check_eod() should check partition size not the whole disk if bio->bi_partno is non-zero. Do this by moving the call to bio_check_eod() into blk_partition_remap(). Based on an earlier patch from Jiufei Xue. Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index") Reported-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-17blk-mq-debugfs: Show more request state informationBart Van Assche1-1/+15
Since commit 634f9e4631a8 ("blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq") blk_rq_is_complete() only reports whether or not a request has completed for legacy queues. Hence modify the blk-mq-debugfs code such that it shows the blk-mq request state again. Fixes: 634f9e4631a8 ("blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-16blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()Joseph Qi1-16/+62
We've triggered a WARNING in blk_throtl_bio() when throttling writeback io, which complains blkg->refcnt is already 0 when calling blkg_get(), and then kernel crashes with invalid page request. After investigating this issue, we've found it is caused by a race between blkcg_bio_issue_check() and cgroup_rmdir(), which is described below: writeback kworker cgroup_rmdir cgroup_destroy_locked kill_css css_killed_ref_fn css_killed_work_fn offline_css blkcg_css_offline blkcg_bio_issue_check rcu_read_lock blkg_lookup spin_trylock(q->queue_lock) blkg_destroy spin_unlock(q->queue_lock) blk_throtl_bio spin_lock_irq(q->queue_lock) ... spin_unlock_irq(q->queue_lock) rcu_read_unlock Since rcu can only prevent blkg from releasing when it is being used, the blkg->refcnt can be decreased to 0 during blkg_destroy() and schedule blkg release. Then trying to blkg_get() in blk_throtl_bio() will complains the WARNING. And then the corresponding blkg_put() will schedule blkg release again, which result in double free. This race is introduced by commit ae1188963611 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()"). Before this commit, it will lookup first and then try to lookup/create again with queue_lock. Since revive this logic is a bit drastic, so fix it by only offlining pd during blkcg_css_offline(), and move the rest destruction (especially blkg_put()) into blkcg_css_free(), which should be the right way as discussed. Fixes: ae1188963611 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()") Reported-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-16block: sed-opal: fix u64 short atom lengthJonas Rabenstein1-7/+4
The length must be given as bytes and not as 4 bit tuples. Reviewed-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-13bsg: split handling of SCSI CDBs vs transport requeuesChristoph Hellwig2-198/+222
The current BSG design tries to shoe-horn the transport-specific passthrough commands into the overall framework for SCSI passthrough requests. This has a couple problems: - each passthrough queue has to set the QUEUE_FLAG_SCSI_PASSTHROUGH flag despite not dealing with SCSI commands at all. Because of that these queues could also incorrectly accept SCSI commands from in-kernel users or through the legacy SCSI_IOCTL_SEND_COMMAND ioctl. - the real SCSI bsg queues also incorrectly accept bsg requests of the BSG_SUB_PROTOCOL_SCSI_TRANSPORT type - the bsg transport code is almost unredable because it tries to reuse different SCSI concepts for its own purpose. This patch instead adds a new bsg_ops structure to handle the two cases differently, and thus solves all of the above problems. Another side effect is that the bsg-lib queues also don't need to embedd a struct scsi_request anymore. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-13bsg-lib: remove bsg_job.reqChristoph Hellwig1-8/+6
Users of the bsg-lib interface should only use the bsg_job data structure and not know about implementation details of it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-13bsg-lib: introduce a timeout field in struct bsg_jobChristoph Hellwig1-0/+1
The zfcp driver wants to know the timeout for a bsg job, so add a field to struct bsg_job for it in preparation of not exposing the request to the bsg-lib users. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-09block: Suppress kernel-doc warnings triggered by blk-zoned.cBart Van Assche1-2/+2
Avoid that building with W=1 causes the kernel-doc tool to complain about undocumented function arguments for the blk-zoned.c source file. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-08block: Move the queue_flag_*() functions from a public into a private header ↵Bart Van Assche1-0/+69
file This patch helps to avoid that new code gets introduced in block drivers that manipulates queue flags without holding the queue lock when that lock should be held. Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-08block: Use blk_queue_flag_*() in drivers instead of queue_flag_*()Bart Van Assche1-2/+2
This patch has been generated as follows: for verb in set_unlocked clear_unlocked set clear; do replace-in-files queue_flag_${verb} blk_queue_flag_${verb%_unlocked} \ $(git grep -lw queue_flag_${verb} drivers block/bsg*) done Except for protecting all queue flag changes with the queue lock this patch does not change any functionality. Cc: Mike Snitzer <snitzer@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-08block: Protect queue flag changes with the queue lockBart Van Assche2-4/+4
Since the queue flags may be changed concurrently from multiple contexts after a queue becomes visible in sysfs, make these changes safe by protecting these with the queue lock. Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-08block: Introduce blk_queue_flag_{set,clear,test_and_{set,clear}}()Bart Van Assche5-49/+88
Introduce functions that modify the queue flags and that protect these modifications with the request queue lock. Except for moving one wake_up_all() call from inside to outside a critical section, this patch does not change any functionality. Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-08block: Use the queue_flag_*() functions instead of open-coding theseBart Van Assche4-7/+7
Except for changing the atomic queue flag manipulations that are protected by the queue lock into non-atomic manipulations, this patch does not change any functionality. Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-06block: sed-opal: fix response string extractionJonas Rabenstein1-3/+23
Tokens are prefixed by a variable length of bytes. If a bytestring is not stored in an tiny or short atom, we have to skip more than one byte in order to have the actual bytes not prefixed by the bytes describing the actual length of the string. Acked-by: Jonathan Derrick <jonathan.derrick@intel.com> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-28block: Fix a race between request queue removal and the block cgroup controllerBart Van Assche2-7/+31
Avoid that the following race can occur: blk_cleanup_queue() blkcg_print_blkgs() spin_lock_irq(lock) (1) spin_lock_irq(blkg->q->queue_lock) (2,5) q->queue_lock = &q->__queue_lock (3) spin_unlock_irq(lock) (4) spin_unlock_irq(blkg->q->queue_lock) (6) (1) take driver lock; (2) busy loop for driver lock; (3) override driver lock with internal lock; (4) unlock driver lock; (5) can take driver lock now; (6) but unlock internal lock. This change is safe because only the SCSI core and the NVME core keep a reference on a request queue after having called blk_cleanup_queue(). Neither driver accesses any of the removed data structures between its blk_cleanup_queue() and blk_put_queue() calls. Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Jan Kara <jack@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-28block: Fix a race between the cgroup code and request queue initializationBart Van Assche1-8/+16
Initialize the request queue lock earlier such that the following race can no longer occur: blk_init_queue_node() blkcg_print_blkgs() blk_alloc_queue_node (1) q->queue_lock = &q->__queue_lock (2) blkcg_init_queue(q) (3) spin_lock_irq(blkg->q->queue_lock) (4) q->queue_lock = lock (5) spin_unlock_irq(blkg->q->queue_lock) (6) (1) allocate an uninitialized queue; (2) initialize queue_lock to its default internal lock; (3) initialize blkcg part of request queue, which will create blkg and then insert it to blkg_list; (4) traverse blkg_list and find the created blkg, and then take its queue lock, here it is the default *internal lock*; (5) *race window*, now queue_lock is overridden with *driver specified lock*; (6) now unlock *driver specified lock*, not the locked *internal lock*, unlock balance breaks. The changes in this patch are as follows: - Move the .queue_lock initialization from blk_init_queue_node() into blk_alloc_queue_node(). - Only override the .queue_lock pointer for legacy queues because it is not useful for blk-mq queues to override this pointer. - For all all block drivers that initialize .queue_lock explicitly, change the blk_alloc_queue() call in the driver into a blk_alloc_queue_node() call and remove the explicit .queue_lock initialization. Additionally, initialize the spin lock that will be used as queue lock earlier if necessary. Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-28block: Add 'lock' as third argument to blk_alloc_queue_node()Bart Van Assche2-4/+5
This patch does not change any functionality. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-28block: clear ctx pending bit under ctx lockOmar Sandoval1-1/+1
When we insert a request, we set the software queue pending bit while holding the software queue lock. However, we clear it outside of the lock, so it's possible that a concurrent insert could reset the bit after we clear it but before we empty the request list. Afterwards, the bit would still be set but the software queue wouldn't have any requests in it, leading us to do a spurious run in the future. This is mostly a benign/theoretical issue, but it makes the following change easier to justify. Signed-off-by: Omar Sandoval <osandov@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-28blk-mq-debugfs: Show zone locking informationBart Van Assche1-0/+16
When debugging the ZBC code in the mq-deadline scheduler it is very important to know which zones are locked and which zones are not locked. Hence this patch that exports the zone locking information through debugfs. Cc: Omar Sandoval <osandov@fb.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Tested-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-28blk-mq-debugfs: Reorder queue show and store methodsBart Van Assche1-65/+65
Make sure that the queue show and store methods are contiguous and also that these appear in alphabetical order. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-26genhd: Fix BUG in blkdev_open()Jan Kara1-1/+20
When two blkdev_open() calls for a partition race with device removal and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in blkdev_open(). The race can happen as follows: CPU0 CPU1 CPU2 del_gendisk() bdev_unhash_inode(part1); blkdev_open(part1, O_EXCL) blkdev_open(part1, O_EXCL) bdev = bd_acquire() bdev = bd_acquire() blkdev_get(bdev) bd_start_claiming(bdev) - finds old inode 'whole' bd_prepare_to_claim() -> 0 bdev_unhash_inode(whole); <device removed> <new device under same number created> blkdev_get(bdev); bd_start_claiming(bdev) - finds new inode 'whole' bd_prepare_to_claim() - this also succeeds as we have different 'whole' here... - bad things happen now as we have two exclusive openers of the same bdev The problem here is that block device opens can see various intermediate states while gendisk is shutting down and then being recreated. We fix the problem by introducing new lookup_sem in gendisk that synchronizes gendisk deletion with get_gendisk() and furthermore by making sure that get_gendisk() does not return gendisk that is being (or has been) deleted. This makes sure that once we ever manage to look up newly created bdev inode, we are also guaranteed that following get_gendisk() will either return failure (and we fail open) or it returns gendisk for the new device and following bdget_disk() will return new bdev inode (i.e., blkdev_open() follows the path as if it is completely run after new device is created). Reported-and-analyzed-by: Hou Tao <houtao1@huawei.com> Tested-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-26genhd: Add helper put_disk_and_module()Jan Kara2-13/+18
Add a proper counterpart to get_disk_and_module() - put_disk_and_module(). Currently it is opencoded in several places. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-26genhd: Rename get_disk() to get_disk_and_module()Jan Kara1-6/+4
Rename get_disk() to get_disk_and_module() to make sure what the function does. It's not a great name but at least it is now clear that put_disk() is not it's counterpart. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-26genhd: Fix leaked module reference for NVME devicesJan Kara1-0/+3
Commit 8ddcd653257c "block: introduce GENHD_FL_HIDDEN" added handling of hidden devices to get_gendisk() but forgot to drop module reference which is also acquired by get_disk(). Drop the reference as necessary. Arguably the function naming here is misleading as put_disk() is *not* the counterpart of get_disk() but let's fix that in the follow up commit since that will be more intrusive. Fixes: 8ddcd653257c18a669fcb75ee42c37054908e0d6 CC: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-24block: kyber: fix domain token leak during requeueMing Lei1-0/+1
When requeuing request, the domain token should have been freed before re-inserting the request to io scheduler. Otherwise, the assigned domain token will be leaked, and IO hang can be caused. Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Omar Sandoval <osandov@fb.com> Cc: stable@vger.kernel.org Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-24blk-mq: don't call io sched's .requeue_request when requeueing rq to ->dispatchMing Lei1-1/+3
__blk_mq_requeue_request() covers two cases: - one is that the requeued request is added to hctx->dispatch, such as blk_mq_dispatch_rq_list() - another case is that the request is requeued to io scheduler, such as blk_mq_requeue_request(). We should call io sched's .requeue_request callback only for the 2nd case. Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Omar Sandoval <osandov@fb.com> Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers") Cc: stable@vger.kernel.org Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-23block: pass inclusive 'lend' parameter to truncate_inode_pages_rangeMing Lei1-1/+1
The 'lend' parameter of truncate_inode_pages_range is required to be inclusive, so follow the rule. This patch fixes one memory corruption triggered by discard. Cc: <stable@vger.kernel.org> Cc: Dmitry Monakhov <dmonakhov@openvz.org> Fixes: 351499a172c0 ("block: Invalidate cache on discard v2") Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-13blk: optimization for classic pollingNitesh Shetty1-0/+1
This removes the dependency on interrupts to wake up task. Set task state as TASK_RUNNING, if need_resched() returns true, while polling for IO completion. Earlier, polling task used to sleep, relying on interrupt to wake it up. This made some IO take very long when interrupt-coalescing is enabled in NVMe. Reference: http://lists.infradead.org/pipermail/linux-nvme/2018-February/015435.html Changes since v2->v3: -using __set_current_state() instead of set_current_state() Changes since v1->v2: -setting task state once in blk_poll, instead of multiple callers. Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-11vfs: do bulk POLL* -> EPOLL* replacementLinus Torvalds1-2/+2
This is the mindless scripted replacement of kernel use of POLL* variables as described by Al, done by this script: for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'` for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done done with de-mangling cleanups yet to come. NOTE! On almost all architectures, the EPOLL* constants have the same values as the POLL* constants do. But they keyword here is "almost". For various bad reasons they aren't the same, and epoll() doesn't actually work quite correctly in some cases due to this on Sparc et al. The next patch from Al will sort out the final differences, and we should be all done. Scripted-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-07block, bfq: add requeue-request hookPaolo Valente1-25/+82
Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device be re-inserted into the active I/O scheduler for that device. As a consequence, I/O schedulers may get the same request inserted again, even several times, without a finish_request invoked on that request before each re-insertion. This fact is the cause of the failure reported in [1]. For an I/O scheduler, every re-insertion of the same re-prepared request is equivalent to the insertion of a new request. For schedulers like mq-deadline or kyber, this fact causes no harm. In contrast, it confuses a stateful scheduler like BFQ, which keeps state for an I/O request, until the finish_request hook is invoked on the request. In particular, BFQ may get stuck, waiting forever for the number of request dispatches, of the same request, to be balanced by an equal number of request completions (while there will be one completion for that request). In this state, BFQ may refuse to serve I/O requests from other bfq_queues. The hang reported in [1] then follows. However, the above re-prepared requests undergo a requeue, thus the requeue_request hook of the active elevator is invoked for these requests, if set. This commit then addresses the above issue by properly implementing the hook requeue_request in BFQ. [1] https://marc.info/?l=linux-block&m=151211117608676 Reported-by: Ivan Kozik <ivan@ludios.org> Reported-by: Alban Browaeys <alban.browaeys@gmail.com> Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Serena Ziviani <ziviani.serena@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-06block: Add should_fail_bio() for bpf error injectionHoward McLauchlan1-1/+10
The classic error injection mechanism, should_fail_request() does not support use cases where more information is required (from the entire struct bio, for example). To that end, this patch introduces should_fail_bio(), which calls should_fail_request() under the hood but provides a convenient place for kprobes to hook into if they require the entire struct bio. This patch also replaces some existing calls to should_fail_request() with should_fail_bio() with no degradation in performance. Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-06blk-wbt: account flush requests correctlyJens Axboe1-1/+9
Mikulas reported a workload that saw bad performance, and figured out what it was due to various other types of requests being accounted as reads. Flush requests, for instance. Due to the high latency of those, we heavily throttle the writes to keep the latencies in balance. But they really should be accounted as writes. Fix this by checking the exact type of the request. If it's a read, account as a read, if it's a write or a flush, account as a write. Any other request we disregard. Previously everything would have been mistakenly accounted as reads. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-04Merge tag 'for-linus-20180204' of git://git.kernel.dk/linux-blockLinus Torvalds4-7/+47
Pull more block updates from Jens Axboe: "Most of this is fixes and not new code/features: - skd fix from Arnd, fixing a build error dependent on sla allocator type. - blk-mq scheduler discard merging fixes, one from me and one from Keith. This fixes a segment miscalculation for blk-mq-sched, where we mistakenly think two segments are physically contigious even though the request isn't carrying real data. Also fixes a bio-to-rq merge case. - Don't re-set a bit on the buffer_head flags, if it's already set. This can cause scalability concerns on bigger machines and workloads. From Kemi Wang. - Add BLK_STS_DEV_RESOURCE return value to blk-mq, allowing us to distuingish between a local (device related) resource starvation and a global one. The latter might happen without IO being in flight, so it has to be handled a bit differently. From Ming" * tag 'for-linus-20180204' of git://git.kernel.dk/linux-block: block: skd: fix incorrect linux/slab_def.h inclusion buffer: Avoid setting buffer bits that are already set blk-mq-sched: Enable merging discard bio into request blk-mq: fix discard merge with scheduler attached blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-02-01blk-mq-sched: Enable merging discard bio into requestKeith Busch1-0/+2
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-01blk-mq: fix discard merge with scheduler attachedJens Axboe2-3/+28
I ran into an issue on my laptop that triggered a bug on the discard path: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3/0x430 Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd intel_gtt CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G U 4.15.0+ #176 Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017 RIP: 0010:nvme_setup_cmd+0x3d3/0x430 RSP: 0018:ffff880423e9f838 EFLAGS: 00010217 RAX: 0000000000000000 RBX: ffff880423e9f8c8 RCX: 0000000000010000 RDX: ffff88022b200010 RSI: 0000000000000002 RDI: 00000000327f0000 RBP: ffff880421251400 R08: ffff88022b200000 R09: 0000000000000009 R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000ffff R13: ffff88042341e280 R14: 000000000000ffff R15: ffff880421251440 FS: 0000000000000000(0000) GS:ffff880441500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055b684795030 CR3: 0000000002e09006 CR4: 00000000001606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: nvme_queue_rq+0x40/0xa00 ? __sbitmap_queue_get+0x24/0x90 ? blk_mq_get_tag+0xa3/0x250 ? wait_woken+0x80/0x80 ? blk_mq_get_driver_tag+0x97/0xf0 blk_mq_dispatch_rq_list+0x7b/0x4a0 ? deadline_remove_request+0x49/0xb0 blk_mq_do_dispatch_sched+0x4f/0xc0 blk_mq_sched_dispatch_requests+0x106/0x170 __blk_mq_run_hw_queue+0x53/0xa0 __blk_mq_delay_run_hw_queue+0x83/0xa0 blk_mq_run_hw_queue+0x6c/0xd0 blk_mq_sched_insert_request+0x96/0x140 __blk_mq_try_issue_directly+0x3d/0x190 blk_mq_try_issue_directly+0x30/0x70 blk_mq_make_request+0x1a4/0x6a0 generic_make_request+0xfd/0x2f0 ? submit_bio+0x5c/0x110 submit_bio+0x5c/0x110 ? __blkdev_issue_discard+0x152/0x200 submit_bio_wait+0x43/0x60 ext4_process_freed_data+0x1cd/0x440 ? account_page_dirtied+0xe2/0x1a0 ext4_journal_commit_callback+0x4a/0xc0 jbd2_journal_commit_transaction+0x17e2/0x19e0 ? kjournald2+0xb0/0x250 kjournald2+0xb0/0x250 ? wait_woken+0x80/0x80 ? commit_timeout+0x10/0x10 kthread+0x111/0x130 ? kthread_create_worker_on_cpu+0x50/0x50 ? do_group_exit+0x3a/0xa0 ret_from_fork+0x1f/0x30 Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff ---[ end trace 50d361cc444506c8 ]--- print_req_error: I/O error, dev nvme0n1, sector 847167488 Decoding the assembly, the request claims to have 0xffff segments, while nvme counts two. This turns out to be because we don't check for a data carrying request on the mq scheduler path, and since blk_phys_contig_segment() returns true for a non-data request, we decrement the initial segment count of 0 and end up with 0xffff in the unsigned short. There are a few issues here: 1) We should initialize the segment count for a discard to 1. 2) The discard merging is currently using the data limits for segments and sectors. Fix this up by having attempt_merge() correctly identify the request, and by initializing the segment count correctly for discards. This can only be triggered with mq-deadline on discard capable devices right now, which isn't a common configuration. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-30blk-mq: introduce BLK_STS_DEV_RESOURCEMing Lei2-4/+17
This status is returned from driver to block layer if device related resource is unavailable, but driver can guarantee that IO dispatch will be triggered in future when the resource is available. Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if driver returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls. BLK_MQ_DELAY_QUEUE is 3 ms because both scsi-mq and nvmefc are using that magic value. If a driver can make sure there is in-flight IO, it is safe to return BLK_STS_DEV_RESOURCE because: 1) If all in-flight IOs complete before examining SCHED_RESTART in blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue is run immediately in this case by blk_mq_dispatch_rq_list(); 2) if there is any in-flight IO after/when examining SCHED_RESTART in blk_mq_dispatch_rq_list(): - if SCHED_RESTART isn't set, queue is run immediately as handled in 1) - otherwise, this request will be dispatched after any in-flight IO is completed via blk_mq_sched_restart() 3) if SCHED_RESTART is set concurently in context because of BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two cases and make sure IO hang can be avoided. One invariant is that queue will be rerun if SCHED_RESTART is set. Suggested-by: Jens Axboe <axboe@kernel.dk> Tested-by: Laurence Oberman <loberman@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-30Merge branch 'misc.poll' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull poll annotations from Al Viro: "This introduces a __bitwise type for POLL### bitmap, and propagates the annotations through the tree. Most of that stuff is as simple as 'make ->poll() instances return __poll_t and do the same to local variables used to hold the future return value'. Some of the obvious brainos found in process are fixed (e.g. POLLIN misspelled as POLL_IN). At that point the amount of sparse warnings is low and most of them are for genuine bugs - e.g. ->poll() instance deciding to return -EINVAL instead of a bitmap. I hadn't touched those in this series - it's large enough as it is. Another problem it has caught was eventpoll() ABI mess; select.c and eventpoll.c assumed that corresponding POLL### and EPOLL### were equal. That's true for some, but not all of them - EPOLL### are arch-independent, but POLL### are not. The last commit in this series separates userland POLL### values from the (now arch-independent) kernel-side ones, converting between them in the few places where they are copied to/from userland. AFAICS, this is the least disruptive fix preserving poll(2) ABI and making epoll() work on all architectures. As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and it will trigger only on what would've triggered EPOLLWRBAND on other architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered at all on sparc. With this patch they should work consistently on all architectures" * 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits) make kernel-side POLL... arch-independent eventpoll: no need to mask the result of epi_item_poll() again eventpoll: constify struct epoll_event pointers debugging printk in sg_poll() uses %x to print POLL... bitmap annotate poll(2) guts 9p: untangle ->poll() mess ->si_band gets POLL... bitmap stored into a user-visible long field ring_buffer_poll_wait() return value used as return value of ->poll() the rest of drivers/*: annotate ->poll() instances media: annotate ->poll() instances fs: annotate ->poll() instances ipc, kernel, mm: annotate ->poll() instances net: annotate ->poll() instances apparmor: annotate ->poll() instances tomoyo: annotate ->poll() instances sound: annotate ->poll() instances acpi: annotate ->poll() instances crypto: annotate ->poll() instances block: annotate ->poll() instances x86: annotate ->poll() instances ...
2018-01-29Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-blockLinus Torvalds32-684/+1506
Pull block updates from Jens Axboe: "This is the main pull request for block IO related changes for the 4.16 kernel. Nothing major in this pull request, but a good amount of improvements and fixes all over the map. This contains: - BFQ improvements, fixes, and cleanups from Angelo, Chiara, and Paolo. - Support for SMR zones for deadline and mq-deadline from Damien and Christoph. - Set of fixes for bcache by way of Michael Lyle, including fixes from himself, Kent, Rui, Tang, and Coly. - Series from Matias for lightnvm with fixes from Hans Holmberg, Javier, and Matias. Mostly centered around pblk, and the removing rrpc 1.2 in preparation for supporting 2.0. - A couple of NVMe pull requests from Christoph. Nothing major in here, just fixes and cleanups, and support for command tracing from Johannes. - Support for blk-throttle for tracking reads and writes separately. From Joseph Qi. A few cleanups/fixes also for blk-throttle from Weiping. - Series from Mike Snitzer that enables dm to register its queue more logically, something that's alwways been problematic on dm since it's a stacked device. - Series from Ming cleaning up some of the bio accessor use, in preparation for supporting multipage bvecs. - Various fixes from Ming closing up holes around queue mapping and quiescing. - BSD partition fix from Richard Narron, fixing a problem where we can't mount newer (10/11) FreeBSD partitions. - Series from Tejun reworking blk-mq timeout handling. The previous scheme relied on atomic bits, but it had races where we would think a request had timed out if it to reused at the wrong time. - null_blk now supports faking timeouts, to enable us to better exercise and test that functionality separately. From me. - Kill the separate atomic poll bit in the request struct. After this, we don't use the atomic bits on blk-mq anymore at all. From me. - sgl_alloc/free helpers from Bart. - Heavily contended tag case scalability improvement from me. - Various little fixes and cleanups from Arnd, Bart, Corentin, Douglas, Eryu, Goldwyn, and myself" * 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits) block: remove smart1,2.h nvme: add tracepoint for nvme_complete_rq nvme: add tracepoint for nvme_setup_cmd nvme-pci: introduce RECONNECTING state to mark initializing procedure nvme-rdma: remove redundant boolean for inline_data nvme: don't free uuid pointer before printing it nvme-pci: Suspend queues after deleting them bsg: use pr_debug instead of hand crafted macros blk-mq-debugfs: don't allow write on attributes with seq_operations set nvme-pci: Fix queue double allocations block: Set BIO_TRACE_COMPLETION on new bio during split blk-throttle: use queue_is_rq_based block: Remove kblockd_schedule_delayed_work{,_on}() blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly() lib/scatterlist: Fix chaining support in sgl_alloc_order() blk-throttle: track read and write request individually block: add bdev_read_only() checks to common helpers block: fail op_is_write() requests to read-only partitions blk-throttle: export io_serviced_recursive, io_service_bytes_recursive ...
2018-01-24bsg: use pr_debug instead of hand crafted macrosJohannes Thumshirn1-23/+17
Use pr_debug instead of hand crafted macros. This way it is not needed to re-compile the kernel to enable bsg debug outputs and it's possible to selectively enable specific prints. Cc: Joe Perches <joe@perches.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-24blk-mq-debugfs: don't allow write on attributes with seq_operations setEryu Guan1-1/+5
Attributes that only implement .seq_ops are read-only, any write to them should be rejected. But currently kernel would crash when writing to such debugfs entries, e.g. chmod +w /sys/kernel/debug/block/<dev>/requeue_list echo 0 > /sys/kernel/debug/block/<dev>/requeue_list chmod -w /sys/kernel/debug/block/<dev>/requeue_list Fix it by returning -EPERM in blk_mq_debugfs_write() when writing to such attributes. Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-23block: Set BIO_TRACE_COMPLETION on new bio during splitGoldwyn Rodrigues1-1/+1
We inadvertently set it again on the source bio, but we need to set it on the new split bio instead. Fixes: fbbaf700e7b1 ("block: trace completion of all bios.") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-19blk-throttle: use queue_is_rq_basedweiping zhang1-1/+1
use queue_is_rq_based instead of open code. Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-19block: Remove kblockd_schedule_delayed_work{,_on}()Bart Van Assche1-14/+0
The previous patch removed all users of these two functions. Hence also remove the functions themselves. Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-19blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delaysBart Van Assche1-4/+3
Make sure that calling blk_mq_run_hw_queue() or blk_mq_kick_requeue_list() triggers a queue run without delay even if blk_mq_delay_run_hw_queue() has been called recently and if its delay has not yet expired. Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-19blk-mq: Rename blk_mq_request_direct_issue() into ↵Bart Van Assche3-4/+4
blk_mq_request_issue_directly() Most blk-mq functions have a name that follows the pattern blk_mq_${action}. However, the function name blk_mq_request_direct_issue is an exception. Hence rename this function. This patch does not change any functionality. Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-18blk-throttle: track read and write request individuallyJoseph Qi1-55/+79
In mixed read/write workload on SSD, write latency is much lower than read. But now we only track and record read latency and then use it as threshold base for both read and write io latency accounting. As a result, write io latency will always be considered as good and bad_bio_cnt is much smaller than 20% of bio_cnt. That is to mean, the tg to be checked will be treated as idle most of the time and still let others dispatch more ios, even it is truly running under low limit and wants its low limit to be guaranteed, which is not we expected in fact. So track read and write request individually, which can bring more precise latency control for low limit idle detection. Signed-off-by: Joseph Qi <qijiang.qj@alibaba-inc.com> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>