summaryrefslogtreecommitdiffstats
path: root/block
AgeCommit message (Collapse)AuthorFilesLines
2021-08-23block: pass a request_queue to __blk_alloc_diskChristoph Hellwig2-5/+5
Pass in a request_queue and assign disk->queue in __blk_alloc_disk to ensure struct gendisk always has a valid ->queue pointer. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210816131910.615153-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23block: remove the minors argument to __alloc_disk_nodeChristoph Hellwig2-5/+3
This was a leftover from the legacy alloc_disk interface. Switch the scsi ULPs and dasd to set ->minors directly like all other drivers and remove the argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stefan Haberland <sth@linux.ibm.com> [dasd] Link: https://lore.kernel.org/r/20210816131910.615153-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23block: cleanup the lockdep handling in *alloc_diskChristoph Hellwig2-5/+8
Pass the lockdep name to the low-level __blk_alloc_disk helper and hardcode the name for it given that the number of minors or node_id are not very useful information. While this passes a pointless argument for non-lockdep builds that is not really an issue as disk allocation is a probe time only slow path. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210816131910.615153-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23block: fix argument type of bio_trim()Chaitanya Kulkarni1-5/+7
The function bio_trim has offset and size arguments that are declared as int. The callers of this function use sector_t type when passing the offset and size, e.g. drivers/md/raid1.c:narrow_write_error() and drivers/md/raid1.c:narrow_write_error(). Change offset and size arguments to sector_t type for bio_trim(). Also, add WARN_ON_ONCE() to catch their overflow. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-21Merge tag 'block-5.14-2021-08-20' of git://git.kernel.dk/linux-blockLinus Torvalds4-32/+20
Pull block fixes from Jens Axboe: "Three fixes from Ming Lei that should go into 5.14: - Fix for a kernel panic when iterating over tags for some cases where a flush request is present, a regression in this cycle. - Request timeout fix - Fix flush request checking" * tag 'block-5.14-2021-08-20' of git://git.kernel.dk/linux-block: blk-mq: fix is_flush_rq blk-mq: fix kernel panic during iterating over flush request blk-mq: don't grab rq's refcount in blk_mq_check_expired()
2021-08-20block: add back the bd_holder_dir reference in bd_link_disk_holderChristoph Hellwig1-0/+7
This essentially reverts "block: remove the extra kobject reference in bd_link_disk_holder". That commit dropped the extra reference because the condition in the comment can't be true. But it turns out that comment did not actually describe the problematic situation, so add back the extra reference and document it properly. Fixes: fbd9a39542ec ("block: remove the extra kobject reference in bd_link_disk_holder") Reported-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-18block: fix default IO priority handlingDamien Le Moal2-4/+4
The default IO priority is the best effort (BE) class with the normal priority level IOPRIO_NORM (4). However, get_task_ioprio() returns IOPRIO_CLASS_NONE/IOPRIO_NORM as the default priority and get_current_ioprio() returns IOPRIO_CLASS_NONE/0. Let's be consistent with the defined default and have both of these functions return the default priority IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM) when the user did not define another default IO priority for the task. In include/uapi/linux/ioprio.h, introduce the IOPRIO_BE_NORM macro as an alias to IOPRIO_NORM to clarify that this default level applies to the BE priotity class. In include/linux/ioprio.h, define the macro IOPRIO_DEFAULT as IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_BE_NORM) and use this new macro when setting a priority to the default. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Link: https://lore.kernel.org/r/20210811033702.368488-7-damien.lemoal@wdc.com [axboe: drop unnecessary lightnvm change] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-18block: Introduce IOPRIO_NR_LEVELSDamien Le Moal4-11/+10
The BFQ scheduler and ioprio_check_cap() both assume that the RT priority class (IOPRIO_CLASS_RT) can have up to 8 different priority levels, similarly to the BE class (IOPRIO_CLASS_iBE). This is controlled using the IOPRIO_BE_NR macro , which is badly named as the number of levels also applies to the RT class. Introduce the class independent IOPRIO_NR_LEVELS macro, defined to 8, to make things clear. Keep the old IOPRIO_BE_NR macro definition as an alias for IOPRIO_NR_LEVELS. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Link: https://lore.kernel.org/r/20210811033702.368488-6-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-18block: bfq: fix bfq_set_next_ioprio_data()Damien Le Moal1-1/+1
For a request that has a priority level equal to or larger than IOPRIO_BE_NR, bfq_set_next_ioprio_data() prints a critical warning but defaults to setting the request new_ioprio field to IOPRIO_BE_NR. This is not consistent with the warning and the allowed values for priority levels. Fix this by setting the request new_ioprio field to IOPRIO_BE_NR - 1, the lowest priority level allowed. Cc: <stable@vger.kernel.org> Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210811033702.368488-2-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-17blk-mq: fix is_flush_rqMing Lei3-6/+7
is_flush_rq() is called from bt_iter()/bt_tags_iter(), and runs the following check: hctx->fq->flush_rq == req but the passed hctx from bt_iter()/bt_tags_iter() may be NULL because: 1) memory re-order in blk_mq_rq_ctx_init(): rq->mq_hctx = data->hctx; ... refcount_set(&rq->ref, 1); OR 2) tag re-use and ->rqs[] isn't updated with new request. Fix the issue by re-writing is_flush_rq() as: return rq->end_io == flush_end_io; which turns out simpler to follow and immune to data race since we have ordered WRITE rq->end_io and refcount_set(&rq->ref, 1). Fixes: 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter") Cc: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de> Cc: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210818010925.607383-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-17blk-mq: fix kernel panic during iterating over flush requestMing Lei2-1/+8
For fixing use-after-free during iterating over requests, we grabbed request's refcount before calling ->fn in commit 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter"). Turns out this way may cause kernel panic when iterating over one flush request: 1) old flush request's tag is just released, and this tag is reused by one new request, but ->rqs[] isn't updated yet 2) the flush request can be re-used for submitting one new flush command, so blk_rq_init() is called at the same time 3) meantime blk_mq_queue_tag_busy_iter() is called, and old flush request is retrieved from ->rqs[tag]; when blk_mq_put_rq_ref() is called, flush_rq->end_io may not be updated yet, so NULL pointer dereference is triggered in blk_mq_put_rq_ref(). Fix the issue by calling refcount_set(&flush_rq->ref, 1) after flush_rq->end_io is set. So far the only other caller of blk_rq_init() is scsi_ioctl_reset() in which the request doesn't enter block IO stack and the request reference count isn't used, so the change is safe. Fixes: 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter") Reported-by: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de> Tested-by: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/20210811142624.618598-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-17blk-mq: don't grab rq's refcount in blk_mq_check_expired()Ming Lei1-25/+5
Inside blk_mq_queue_tag_busy_iter() we already grabbed request's refcount before calling ->fn(), so needn't to grab it one more time in blk_mq_check_expired(). Meantime remove extra request expire check in blk_mq_check_expired(). Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/20210811155202.629575-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-16block: unexport blk_register_queueChristoph Hellwig1-1/+0
Not actually used in any modular code. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210816123649.601591-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-16blk-cgroup: stop using seq_get_bufChristoph Hellwig4-62/+37
seq_get_buf is a crutch that undoes all the memory safety of the seq_file interface. Use the normal seq_printf interfaces instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210810152623.1796144-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-16blk-cgroup: refactor blkcg_print_statChristoph Hellwig1-74/+74
Factor out a helper to deal with a single blkcg_gq to make the code a little bit easier to follow. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210810152623.1796144-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-16block: use bvec_virt in bio_integrity_{process,free}Christoph Hellwig1-5/+2
Use the bvec_virt helper to clean up the bio integrity processing a little bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@kernel.org> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20210804095634.460779-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-16block: ensure the bdi is freed after inode_detach_wbChristoph Hellwig1-1/+0
inode_detach_wb references the "main" bdi of the inode. With the recent change to move the bdi from the request_queue to the gendisk this causes a guaranteed use after free when using certain cgroup configurations. The big itself is older through as any non-default inode reference (e.g. an open file descriptor) could have injected this use after free even before that. Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks") Reported-by: Qian Cai <quic_qiancai@quicinc.com> Reported-by: syzbot <syzbot+1fb38bb7d3ce0fa3e1c4@syzkaller.appspotmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210816122614.601358-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-16block: free the extended dev_t minor laterChristoph Hellwig2-4/+0
The dev_t is used as the inode hash, so we should only released it once then block device inode is gone from the inode cache. Move it to bdev_free_inode to ensure that. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210816122614.601358-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-14blk-throtl: optimize IOPS throttle for large IO scenariosChunguang Xu3-0/+36
After patch 54efd50 (block: make generic_make_request handle arbitrarily sized bios), the IO through io-throttle may be larger, and these IOs may be further split into more small IOs. However, IOPS throttle does not seem to be aware of this change, which makes the calculation of IOPS of large IOs incomplete, resulting in disk-side IOPS that does not meet expectations. Maybe we should fix this problem. We can reproduce it by set max_sectors_kb of disk to 128, set blkio.write_iops_throttle to 100, run a dd instance inside blkio and use iostat to watch IOPS: dd if=/dev/zero of=/dev/sdb bs=1M count=1000 oflag=direct As a result, without this change the average IOPS is 1995, with this change the IOPS is 98. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/65869aaad05475797d63b4c3fed4f529febe3c26.1627876014.git.brookxu@tencent.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-13Merge tag 'block-5.14-2021-08-13' of git://git.kernel.dk/linux-blockLinus Torvalds7-313/+22
Pull block fixes from Jens Axboe: "A few fixes for block that should go into 5.14: - Revert the mq-deadline cgroup addition. More work is needed on this front, let's revert it for now and get it right before having it in a released kernel (Tejun) - blk-iocost lockdep fix (Ming) - nbd double completion fix (Xie) - Fix for non-idling when clearing the shared tag flag (Yu)" * tag 'block-5.14-2021-08-13' of git://git.kernel.dk/linux-block: nbd: Aovid double completion of a request blk-mq: clear active_queues before clearing BLK_MQ_F_TAG_QUEUE_SHARED Revert "block/mq-deadline: Add cgroup support" blk-iocost: fix lockdep warning on blkcg->lock
2021-08-13blk-mq: clear active_queues before clearing BLK_MQ_F_TAG_QUEUE_SHAREDYu Kuai1-2/+4
We run a test that delete and recover devcies frequently(two devices on the same host), and we found that 'active_queues' is super big after a period of time. If device a and device b share a tag set, and a is deleted, then blk_mq_exit_queue() will clear BLK_MQ_F_TAG_QUEUE_SHARED because there is only one queue that are using the tag set. However, if b is still active, the active_queues of b might never be cleared even if b is deleted. Thus clear active_queues before BLK_MQ_F_TAG_QUEUE_SHARED is cleared. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210731062130.1533893-1-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-12block: pass a gendisk to bdev_resize_partitionChristoph Hellwig3-9/+9
bdev_resize_partition can only operate on the whole device. Make that clear by passing a gendisk instead of a block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210810154512.1809898-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-12block: pass a gendisk to bdev_del_partitionChristoph Hellwig3-6/+6
bdev_del_partition can only operate on the whole device. Make that clear by passing a gendisk instead of a block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210810154512.1809898-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-12block: pass a gendisk to bdev_add_partitionChristoph Hellwig3-6/+6
bdev_add_partition can only operate on the whole device. Make that clear by passing a gendisk instead of a block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210810154512.1809898-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-12block: store a gendisk in struct parsed_partitionsChristoph Hellwig14-73/+52
Partition scanning only happens on the whole device, so pass a struct gendisk instead of the whole device block_device to the scanners. This allows to simplify printing the device name in various places as the disk name is available in disk->name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stefan Haberland <sth@linux.ibm.com> Link: https://lore.kernel.org/r/20210810154512.1809898-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-12block: remove GENHD_FL_UPChristoph Hellwig2-6/+4
Just check inode_unhashed on the whole device bdev inode instead, and provide a helper to check for that information. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210809064028.1198327-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-11Revert "block/mq-deadline: Add cgroup support"Tejun Heo5-307/+14
This reverts commit 08a9ad8bf607 ("block/mq-deadline: Add cgroup support") and a follow-up commit c06bc5a3fb42 ("block/mq-deadline: Remove a WARN_ON_ONCE() call"). The added cgroup support has the following issues: * It breaks cgroup interface file format rule by adding custom elements to a nested key-value file. * It registers mq-deadline as a cgroup-aware policy even though all it's doing is collecting per-cgroup stats. Even if we need these stats, this isn't the right way to add them. * It hasn't been reviewed from cgroup side. Cc: Bart Van Assche <bvanassche@acm.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-10genirq: Change force_irqthreads to a static keyTanner Love1-1/+1
With CONFIG_IRQ_FORCED_THREADING=y, testing the boolean force_irqthreads could incur a cache line miss in invoke_softirq() and other places. Replace the test with a static key to avoid the potential cache miss. [ tglx: Dropped the IDE part, removed the export and updated blk-mq ] Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Tanner Love <tannerlove@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210602180338.3324213-1-tannerlove.kernel@gmail.com
2021-08-09blk-iocost: fix lockdep warning on blkcg->lockMing Lei1-4/+4
blkcg->lock depends on q->queue_lock which may depend on another driver lock required in irq context, one example is dm-thin: Chain exists of: &pool->lock#3 --> &q->queue_lock --> &blkcg->lock Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&blkcg->lock); local_irq_disable(); lock(&pool->lock#3); lock(&q->queue_lock); <Interrupt> lock(&pool->lock#3); Fix the issue by using spin_lock_irq(&blkcg->lock) in ioc_weight_write(). Cc: Tejun Heo <tj@kernel.org> Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Link: https://lore.kernel.org/linux-block/CA+QYu4rzz6079ighEanS3Qq_Dmnczcf45ZoJoHKVLVATTo1e4Q@mail.gmail.com/T/#u Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210803070608.1766400-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09Merge branch 'for-5.14-fixes' of ↵Linus Torvalds1-6/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "One commit to fix a possible A-A deadlock around u64_stats_sync on 32bit machines caused by updating it without disabling IRQ when it may be read from IRQ context" * 'for-5.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rstat: fix A-A deadlock on 32bit around u64_stats_sync
2021-08-09block: return ELEVATOR_DISCARD_MERGE if possibleMing Lei4-16/+8
When merging one bio to request, if they are discard IO and the queue supports multi-range discard, we need to return ELEVATOR_DISCARD_MERGE because both block core and related drivers(nvme, virtio-blk) doesn't handle mixed discard io merge(traditional IO merge together with discard merge) well. Fix the issue by returning ELEVATOR_DISCARD_MERGE in this situation, so both blk-mq and drivers just need to handle multi-range discard. Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 2705dfb20947 ("block: fix discard request merge") Link: https://lore.kernel.org/r/20210729034226.1591070-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09block: remove the bd_bdi in struct block_deviceChristoph Hellwig1-3/+4
Just retrieve the bdi from the disk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210809141744.1203023-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09block: move the bdi from the request_queue to the gendiskChristoph Hellwig8-50/+49
The backing device information only makes sense for file system I/O, and thus belongs into the gendisk and not the lower level request_queue structure. Move it there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210809141744.1203023-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09block: pass a gendisk to blk_queue_update_readaheadChristoph Hellwig2-4/+6
.. and rename the function to disk_update_readahead. This is in preparation for moving the BDI from the request_queue to the gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210809141744.1203023-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09mm: hide laptop_mode_wb_timer entirely behind the BDI APIChristoph Hellwig1-5/+0
Don't leak the detaіls of the timer into the block layer, instead initialize the timer in bdi_alloc and delete it in bdi_unregister. Note that this means the timer is initialized (but not armed) for non-block queues as well now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210809141744.1203023-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09block: remove support for delayed queue registrationsChristoph Hellwig2-23/+7
Now that device mapper has been changed to register the disk once it is fully ready all this code is unused. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20210804094147.459763-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09block: support delayed holder registrationChristoph Hellwig2-17/+61
device mapper needs to register holders before it is ready to do I/O. Currently it does so by registering the disk early, which can leave the disk and queue in a weird half state where the queue is registered with the disk, except for sysfs and the elevator. And this state has been a bit promlematic before, and will get more so when sorting out the responsibilities between the queue and the disk. Support registering holders on an initialized but not registered disk instead by delaying the sysfs registration until the disk is registered. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20210804094147.459763-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09block: look up holders by bdevChristoph Hellwig2-10/+12
Invert they way the holder relations are tracked. This very slightly reduces the memory overhead for partitioned devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210804094147.459763-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09block: remove the extra kobject reference in bd_link_disk_holderChristoph Hellwig1-6/+0
Since commit 0d02129e76ed ("block: merge struct block_device and struct hd_struct") there is no way for the bdev to go away as long as there is a holder, so remove the extra references. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20210804094147.459763-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09block: make the block holder code optionalChristoph Hellwig3-0/+144
Move the block holder code into a separate file as it is not in any way related to the other block_dev.c code, and add a new selectable config option for it so that we don't have to build it without any remapped drivers selected. The Kconfig symbol contains a _DEPRECATED suffix to match the comments added in commit 49731baa41df ("block: restore multiple bd_link_disk_holder() support"). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20210804094147.459763-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-07Merge tag 'block-5.14-2021-08-07' of git://git.kernel.dk/linux-blockLinus Torvalds3-3/+7
Pull block fixes from Jens Axboe: "A few minor fixes: - Fix ldm kernel-doc warning (Bart) - Fix adding offset twice for DMA address in n64cart (Christoph) - Fix use-after-free in dasd path handling (Stefan) - Order kyber insert trace correctly (Vincent) - raid1 errored write handling fix (Wei) - Fix blk-iolatency queue get failure handling (Yu)" * tag 'block-5.14-2021-08-07' of git://git.kernel.dk/linux-block: kyber: make trace_block_rq call consistent with documentation block/partitions/ldm.c: Fix a kernel-doc warning blk-iolatency: error out if blk_get_queue() failed in iolatency_set_limit() n64cart: fix the dma address in n64cart_do_bvec s390/dasd: fix use after free in dasd path handling md/raid10: properly indicate failure when ending a failed write request
2021-08-06kyber: make trace_block_rq call consistent with documentationVincent Fu1-1/+1
The kyber ioscheduler calls trace_block_rq_insert() *after* the request is added to the queue but the documentation for trace_block_rq_insert() says that the call should be made *before* the request is added to the queue. Move the tracepoint for the kyber ioscheduler so that it is consistent with the documentation. Signed-off-by: Vincent Fu <vincent.fu@samsung.com> Link: https://lore.kernel.org/r/20210804194913.10497-1-vincent.fu@samsung.com Reviewed by: Adam Manzanares <a.manzanares@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-05blk-mq: Introduce the BLK_MQ_F_NO_SCHED_BY_DEFAULT flagBart Van Assche1-0/+3
elevator_get_default() uses the following algorithm to select an I/O scheduler from inside add_disk(): - In case of a single hardware queue or if sharing hardware queues across multiple request queues (BLK_MQ_F_TAG_HCTX_SHARED), use mq-deadline. - Otherwise, use 'none'. This is a good choice for most but not for all block drivers. Make it possible to override the selection of mq-deadline with a new flag, namely BLK_MQ_F_NO_SCHED_BY_DEFAULT. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Martijn Coenen <maco@android.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210805174200.3250718-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-05block/partitions/ldm.c: Fix a kernel-doc warningBart Van Assche1-1/+1
Fix the following kernel-doc warning that appears when building with W=1: block/partitions/ldm.c:31: warning: expecting prototype for ldm(). Prototype was for ldm_debug() instead Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210805173447.3249906-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-05blk-iolatency: error out if blk_get_queue() failed in iolatency_set_limit()Yu Kuai1-1/+5
If queue is dying while iolatency_set_limit() is in progress, blk_get_queue() won't increment the refcount of the queue. However, blk_put_queue() will still decrement the refcount later, which will cause the refcout to be unbalanced. Thus error out in such case to fix the problem. Fixes: 8c772a9bfc7c ("blk-iolatency: fix IO hang due to negative inflight counter") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210805124645.543797-1-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02block: remove blk-mq-sysfs dead codeDamien Le Moal1-55/+0
In block/blk-mq-sysfs.c, struct blk_mq_ctx_sysfs_entry is not used to define any attribute since the "mq" sysfs directory contains only sub-directories (no attribute files). As a result, blk_mq_sysfs_show(), blk_mq_sysfs_store(), and struct sysfs_ops blk_mq_sysfs_ops are all unused and unnecessary. Remove all this unused code. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Link: https://lore.kernel.org/r/20210713081837.524422-1-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02block: add a helper to raise a media changed eventMatteo Croce1-15/+46
Refactor disk_check_events() and move some code into disk_event_uevent(). Then add disk_force_media_change(), a helper which will be used by devices to force issuing a DISK_EVENT_MEDIA_CHANGE event. Co-developed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Matteo Croce <mcroce@microsoft.com> Tested-by: Luca Boccassi <bluca@debian.org> Link: https://lore.kernel.org/r/20210712230530.29323-6-mcroce@linux.microsoft.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02block: export diskseq in sysfsMatteo Croce1-0/+10
Add a new sysfs handle to export the new diskseq value. Place it in <sysfs>/block/<disk>/diskseq and document it. $ grep . /sys/class/block/*/diskseq /sys/class/block/loop0/diskseq:13 /sys/class/block/loop1/diskseq:14 /sys/class/block/loop2/diskseq:5 /sys/class/block/loop3/diskseq:6 /sys/class/block/ram0/diskseq:1 /sys/class/block/ram1/diskseq:2 /sys/class/block/vda/diskseq:7 Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Matteo Croce <mcroce@microsoft.com> Tested-by: Luca Boccassi <bluca@debian.org> Link: https://lore.kernel.org/r/20210712230530.29323-5-mcroce@linux.microsoft.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02block: add ioctl to read the disk sequence numberMatteo Croce1-0/+2
Add a new BLKGETDISKSEQ ioctl which retrieves the disk sequence number from the genhd structure. # ./getdiskseq /dev/loop* /dev/loop0: 13 /dev/loop0p1: 13 /dev/loop0p2: 13 /dev/loop0p3: 13 /dev/loop1: 14 /dev/loop1p1: 14 /dev/loop1p2: 14 /dev/loop2: 5 /dev/loop3: 6 Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Matteo Croce <mcroce@microsoft.com> Tested-by: Luca Boccassi <bluca@debian.org> Link: https://lore.kernel.org/r/20210712230530.29323-4-mcroce@linux.microsoft.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02block: export the diskseq in ueventsMatteo Croce1-0/+9
Export the newly introduced diskseq in uevents: $ udevadm info /sys/class/block/* |grep -e DEVNAME -e DISKSEQ E: DEVNAME=/dev/loop0 E: DISKSEQ=1 E: DEVNAME=/dev/loop1 E: DISKSEQ=2 E: DEVNAME=/dev/loop2 E: DISKSEQ=3 E: DEVNAME=/dev/loop3 E: DISKSEQ=4 E: DEVNAME=/dev/loop4 E: DISKSEQ=5 E: DEVNAME=/dev/loop5 E: DISKSEQ=6 E: DEVNAME=/dev/loop6 E: DISKSEQ=7 E: DEVNAME=/dev/loop7 E: DISKSEQ=8 E: DEVNAME=/dev/nvme0n1 E: DISKSEQ=9 E: DEVNAME=/dev/nvme0n1p1 E: DISKSEQ=9 E: DEVNAME=/dev/nvme0n1p2 E: DISKSEQ=9 E: DEVNAME=/dev/nvme0n1p3 E: DISKSEQ=9 E: DEVNAME=/dev/nvme0n1p4 E: DISKSEQ=9 E: DEVNAME=/dev/nvme0n1p5 E: DISKSEQ=9 E: DEVNAME=/dev/sda E: DISKSEQ=10 E: DEVNAME=/dev/sda1 E: DISKSEQ=10 E: DEVNAME=/dev/sda2 E: DISKSEQ=10 Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Matteo Croce <mcroce@microsoft.com> Tested-by: Luca Boccassi <bluca@debian.org> Link: https://lore.kernel.org/r/20210712230530.29323-3-mcroce@linux.microsoft.com Signed-off-by: Jens Axboe <axboe@kernel.dk>