Age | Commit message | Author | Files | Lines
2020-08-28 | nvme-rdma: serialize controller teardown sequences | Sagi Grimberg | 1 | -0/+6
In the timeout handler we may need to complete a request because the request that timed out may be an I/O that is a part of a serial sequence of controller teardown or initialization. In order to complete the request, we need to fence any other context that may compete with us and complete the request that is timing out. In this case, we could have a potential double completion in case a hard-irq or a different competing context triggered error recovery and is running inflight request cancellation concurrently with the timeout handler. Protect using a ctrl teardown_lock to serialize contexts that may complete a cancelled request due to error recovery or a reset. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
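For illustration, a minimal sketch of the serialization idea (simplified; lock placement per the description above, function body elided):

	/* both teardown/reset and the timeout handler take the same lock,
	 * so only one context can complete a cancelled request */
	mutex_lock(&ctrl->teardown_lock);
	/* ... stop queues and cancel/complete inflight requests ... */
	mutex_unlock(&ctrl->teardown_lock);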
2020-08-28 | nvme-tcp: fix reset hang if controller died in the middle of a reset | Sagi Grimberg | 1 | -1/+12
If the controller becomes unresponsive in the middle of a reset, we will hang because we are waiting for the freeze to complete, but that cannot happen since we have commands inflight holding the q_usage_counter, and we can't blindly fail requests that time out. So wait for the queue freeze with a timeout; if it does not complete in time, fail and have the error handling take care of how to proceed (either schedule a reconnect or remove the controller). Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
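A sketch of the resulting pattern in the reset path, assuming the timed wait helper introduced below (error label hypothetical):

	if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
		/*
		 * If we timed out waiting for freeze we are likely to
		 * be stuck; fail here and let error handling reconnect
		 * or remove the controller.
		 */
		ret = -ENODEV;
		goto out_cleanup_queues;
	}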
2020-08-28 | nvme-tcp: fix timeout handler | Sagi Grimberg | 1 | -20/+36
When a request times out in a LIVE state, we simply trigger error recovery and let the error recovery handle the request cancellation; however, when a request times out in a non-LIVE state, we make sure to complete it immediately as it might block controller setup or teardown and prevent forward progress. However, tearing down the entire set of I/O and admin queues causes a freeze/unfreeze imbalance (q->mq_freeze_depth) and is really overkill for what we actually need, which is just to fence controller teardown that may be running, stop the queue, and cancel the request if it is not already completed. Now that we have the controller teardown_lock, we can safely serialize request cancellation. This addresses a hang caused by calling an extra queue freeze on controller namespaces, causing unfreeze to not complete correctly. Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28 | nvme-tcp: serialize controller teardown sequences | Sagi Grimberg | 1 | -2/+9
In the timeout handler we may need to complete a request because the request that timed out may be an I/O that is a part of a serial sequence of controller teardown or initialization. In order to complete the request, we need to fence any other context that may compete with us and complete the request that is timing out. In this case, we could have a potential double completion in case a hard-irq or a different competing context triggered error recovery and is running inflight request cancellation concurrently with the timeout handler. Protect using a ctrl teardown_lock to serialize contexts that may complete a cancelled request due to error recovery or a reset. Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28 | nvme: have nvme_wait_freeze_timeout return if it timed out | Sagi Grimberg | 2 | -2/+3
Users can detect if the wait has completed or not and take appropriate actions based on this information (e.g. whether to continue initialization or rather fail and schedule another initialization attempt). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28 | nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance | Sagi Grimberg | 1 | -1/+0
NVME_CTRL_NEW should never see any I/O, because in order to start initialization it has to transition to NVME_CTRL_CONNECTING and from there it will never return to this state. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28 | nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu | Ziye Yang | 1 | -1/+9
When handling commands without in-capsule data, we assign the ttag assuming we already have the queue commands array allocated (based on the queue size information in the connect data payload). However, if the connect itself did not send the connect data in-capsule, we have yet to allocate the queue commands, and we will assign a bogus ttag and suffer a NULL dereference when we receive the corresponding h2cdata pdu. Fix this by checking whether we have already allocated commands before dereferencing the array when handling h2cdata; if we haven't, it is for sure a connect and we should use the preallocated connect command. Signed-off-by: Ziye Yang <ziye.yang@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
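Roughly, the guard looks like this (a sketch; names follow the surrounding driver, with nr_cmds zero until the command array is allocated):

	if (unlikely(!queue->nr_cmds)) {
		/* the array isn't allocated yet, so this can only be
		 * the (preallocated) connect command */
		cmd = &queue->connect;
	} else {
		cmd = &queue->cmds[data->ttag];
	}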
2020-08-28 | Merge branch 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-5.9 | Jens Axboe | 1 | -2/+5
Pull MD fix from Song.

* 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md/raid5: make sure stripe_size as power of two
2020-08-27 | md/raid5: make sure stripe_size as power of two | Yufen Yu | 1 | -2/+5
Commit 3b5408b98e4d ("md/raid5: support config stripe_size by sysfs entry") made stripe_size a configurable value, but it only requires stripe_size to be a multiple of 4KB. In fact, we should make sure stripe_size is a power of two. Otherwise stripe_shift, which is the result of ilog2, cannot represent the real stripe_size, and stripe_hash() and stripe_hash_locks_hash() may get unexpected values. Fixes: 3b5408b98e4d ("md/raid5: support config stripe_size by sysfs entry") Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
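A minimal sketch of the strengthened sysfs check, assuming is_power_of_2() from <linux/log2.h>:

	/* stripe_size must be a 4KB multiple and a power of two */
	if (new % DEFAULT_STRIPE_SIZE != 0 || !is_power_of_2(new))
		return -EINVAL;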
2020-08-26 | loop: Set correct device size when using LOOP_CONFIGURE | Martijn Coenen | 1 | -2/+2
The device size calculation was done before processing the loop configuration, which meant that we set the size on the underlying block device incorrectly in case lo_offset/lo_sizelimit were set in the configuration. Delay computing the size until we've set up the device parameters correctly. Fixes: 3448914e8cc5 ("loop: Add LOOP_CONFIGURE ioctl") Reported-by: Lennart Poettering <mzxreary@0pointer.de> Tested-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com> Signed-off-by: Martijn Coenen <maco@android.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-26 | nbd: restore default timeout when setting it to zero | Hou Pu | 1 | -0/+2
Suppose we configured the io timeout of nbd0 to 100s. Later, after we finished using it, we configured nbd0 again and set the io timeout to 0. We expect it to time out after 30 seconds and keep retrying. But in fact we could not change the timeout by setting it to 0; the timeout remained the original 100s. So change the timeout to the default 30s when it is set to zero. This also behaves the same as commit 2da22da57348 ("nbd: fix zero cmd timeout handling v2"). It becomes more important when reconfiguring an nbd device with the io timeout set to zero, because detecting the new socket could take 30s instead of 100s, and thus io can complete more quickly. Signed-off-by: Hou Pu <houpu@bytedance.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
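A sketch of the zero-timeout fallback (helper shape assumed, not verbatim from the patch):

	static void nbd_set_cmd_timeout(struct nbd_device *nbd, u64 timeout)
	{
		nbd->tag_set.timeout = timeout * HZ;
		if (timeout)
			blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ);
		else
			blk_queue_rq_timeout(nbd->disk->queue, 30 * HZ);
	}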
2020-08-21 | null_blk: fix passing of REQ_FUA flag in null_handle_rq | Hou Pu | 1 | -1/+1
REQ_FUA should be checked using rq->cmd_flags instead of req_op(). Fixes: deb78b419dfda ("nullb: emulate cache") Signed-off-by: Hou Pu <houpu@bytedance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
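Illustration (a sketch; null_transfer() arguments abbreviated): req_op() extracts the REQ_OP_* operation value, so masking it with a flag tests the wrong bits:

	/* wrong: the operation enum doesn't carry FUA */
	null_transfer(nullb, bvec.bv_page, len, bvec.bv_offset,
		      op_is_write(req_op(rq)), sector, req_op(rq) & REQ_FUA);
	/* right: FUA travels in the request's cmd_flags */
	null_transfer(nullb, bvec.bv_page, len, bvec.bv_offset,
		      op_is_write(req_op(rq)), sector, rq->cmd_flags & REQ_FUA);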
2020-08-21 | nvmet: Disable keep-alive timer when kato is cleared to 0h | Amit Engel | 1 | -0/+6
Per the NVMe spec, when the keep-alive timeout is set to zero, the keep-alive timer should be disabled. Signed-off-by: Amit Engel <amit.engel@dell.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
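A sketch of the guard in the keep-alive timer start path (names follow the nvmet code; simplified):

	static void nvmet_start_keep_alive_timer(struct nvmet_ctrl *ctrl)
	{
		if (unlikely(ctrl->kato == 0))
			return;	/* kato == 0h: keep-alive disabled per spec */
		schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ);
	}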
2020-08-21 | nvme: redirect commands on dying queue | Chao Leng | 1 | -4/+5
If a command sent through nvme-multipath fails on a dying queue, resend it on another path. Signed-off-by: Chao Leng <lengchao@huawei.com> [hch: rebased on top of the completion refactoring] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 | nvme: just check the status code type in nvme_is_path_error | Christoph Hellwig | 1 | -10/+2
Check the SCT sub-field for a path-related status instead of enumerating individual status codes. As of NVMe 1.4 this adds "Internal Path Error" and "Controller Pathing Error" to the list, but it also future-proofs against additional status codes added to the category. Suggested-by: Chao Leng <lengchao@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
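A sketch of the SCT-based test (assuming the driver's status layout, where the Status Code Type occupies bits 8..10 and SCT 0x3 is "path related status"):

	static inline bool nvme_is_path_error(u16 status)
	{
		/* any status in the path-related category counts */
		return (status & 0x700) == 0x300;
	}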
2020-08-21 | nvme: refactor command completion | Christoph Hellwig | 3 | -64/+89
Lift all the code to decide the disposition of a completed command from nvme_complete_rq and nvme_failover_req into a new helper, which returns an enum of the potential actions. nvme_complete_rq then just switches on those and calls the proper helper for the action. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
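The rough shape of the refactoring (a sketch; enum and helper names assumed from the description):

	enum nvme_disposition {
		COMPLETE,
		RETRY,
		FAILOVER,
	};

	void nvme_complete_rq(struct request *req)
	{
		switch (nvme_decide_disposition(req)) {
		case COMPLETE:
			nvme_end_req(req);
			return;
		case RETRY:
			nvme_retry_req(req);
			return;
		case FAILOVER:
			nvme_failover_req(req);
			return;
		}
	}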
2020-08-21 | nvme: rename and document nvme_end_request | Christoph Hellwig | 7 | -8/+14
nvme_end_request is a bit misnamed, as it wraps around the blk_mq_complete_* API. Its semantics are also non-trivial, so give it a more descriptive name and add a comment explaining the semantics. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 | nvme: skip noiob for zoned devices | Keith Busch | 1 | -1/+1
Zoned block devices reuse the chunk_sectors queue limit to define zone boundaries. If such a device happens to also report an optimal boundary, do not use that to define the chunk_sectors, as it may intermittently interfere with io splitting and zone size queries. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 | nvme-pci: fix PRP pool size | Christoph Hellwig | 1 | -1/+2
All operations are based on the controller, not the host page size. Switch the dma pool to use the controller page size as well to avoid massive overallocations on large page size systems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 | nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth | John Garry | 1 | -6/+6
Recently nvme_dev.q_depth was changed from an int to u16 type. This falls over for the queue depth calculation in nvme_pci_enable(), where NVME_CAP_MQES(dev->ctrl.cap) + 1 may overflow as a u16, as NVME_CAP_MQES() is a 16b number also. That happens for me, and this is the result:

root@ubuntu:/home/john# [148.272996] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
Mem abort info:
  ESR = 0x96000004
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
Data abort info:
  ISV = 0, ISS = 0x00000004
  CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=00000a27bf3c9000
[0000000000000010] pgd=0000000000000000, p4d=0000000000000000
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Modules linked in: nvme nvme_core
CPU: 56 PID: 256 Comm: kworker/u195:0 Not tainted 5.8.0-next-20200812 #27
Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.16.01 03/15/2019
Workqueue: nvme-reset-wq nvme_reset_work [nvme]
pstate: 80c00009 (Nzcv daif +PAN +UAO BTYPE=--)
pc : __sg_alloc_table_from_pages+0xec/0x238
lr : __sg_alloc_table_from_pages+0xc8/0x238
sp : ffff800013ccbad0
x29: ffff800013ccbad0 x28: ffff0a27b3d380a8
x27: 0000000000000000 x26: 0000000000002dc2
x25: 0000000000000dc0 x24: 0000000000000000
x23: 0000000000000000 x22: ffff800013ccbbe8
x21: 0000000000000010 x20: 0000000000000000
x19: 00000000fffff000 x18: ffffffffffffffff
x17: 00000000000000c0 x16: fffffe289eaf6380
x15: ffff800011b59948 x14: ffff002bc8fe98f8
x13: ff00000000000000 x12: ffff8000114ca000
x11: 0000000000000000 x10: ffffffffffffffff
x9 : ffffffffffffffc0 x8 : ffff0a27b5f9b6a0
x7 : 0000000000000000 x6 : 0000000000000001
x5 : ffff0a27b5f9b680 x4 : 0000000000000000
x3 : ffff0a27b5f9b680 x2 : 0000000000000000
x1 : 0000000000000001 x0 : 0000000000000000
Call trace:
 __sg_alloc_table_from_pages+0xec/0x238
 sg_alloc_table_from_pages+0x18/0x28
 iommu_dma_alloc+0x474/0x678
 dma_alloc_attrs+0xd8/0xf0
 nvme_alloc_queue+0x114/0x160 [nvme]
 nvme_reset_work+0xb34/0x14b4 [nvme]
 process_one_work+0x1e8/0x360
 worker_thread+0x44/0x478
 kthread+0x150/0x158
 ret_from_fork+0x10/0x34
Code: f94002c3 6b01017f 540007c2 11000486 (f8645aa5)
---[ end trace 89bb2b72d59bf925 ]---

Fix by making it a u32. Also use u32 for nvme_queue.q_depth, as we assign that value from nvme_dev.q_depth, and nvme_dev.q_depth may now hold 65536 - this avoids the same crash as above. Fixes: 61f3b8963097 ("nvme-pci: use unsigned for io queue depth") Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
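To illustrate the overflow (MQES is a 16-bit field, so the int result 65536 is truncated on assignment to a u16):

	u16 bad = NVME_CAP_MQES(dev->ctrl.cap) + 1;	/* 0xffff + 1 -> 0 */
	u32 ok  = NVME_CAP_MQES(dev->ctrl.cap) + 1;	/* 0xffff + 1 -> 65536 */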
2020-08-21 | nvme: Use spin_lock_irq() when taking the ctrl->lock | Logan Gunthorpe | 1 | -4/+4
When locking the ctrl->lock spinlock, IRQs need to be disabled to avoid a deadlock. The new spin_lock() calls recently added produce the following lockdep warning when running the blktest nvme/003:

================================
WARNING: inconsistent lock state
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
ksoftirqd/2/22 [HC0[0]:SC1[1]:HE0:SE0] takes:
ffff888276a8c4c0 (&ctrl->lock){+.?.}-{2:2}, at: nvme_keep_alive_end_io+0x50/0xc0
{SOFTIRQ-ON-W} state was registered at:
  lock_acquire+0x164/0x500
  _raw_spin_lock+0x28/0x40
  nvme_get_effects_log+0x37/0x1c0
  nvme_init_identify+0x9e4/0x14f0
  nvme_reset_work+0xadd/0x2360
  process_one_work+0x66b/0xb70
  worker_thread+0x6e/0x6c0
  kthread+0x1e7/0x210
  ret_from_fork+0x22/0x30
irq event stamp: 1449221
hardirqs last enabled at (1449220): [<ffffffff81c58e69>] ktime_get+0xf9/0x140
hardirqs last disabled at (1449221): [<ffffffff83129665>] _raw_spin_lock_irqsave+0x25/0x60
softirqs last enabled at (1449210): [<ffffffff83400447>] __do_softirq+0x447/0x595
softirqs last disabled at (1449215): [<ffffffff81b489b5>] run_ksoftirqd+0x35/0x50

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&ctrl->lock);
  <Interrupt>
    lock(&ctrl->lock);

 *** DEADLOCK ***

no locks held by ksoftirqd/2/22.

stack backtrace:
CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 5.8.0-rc4-eid-vmlocalyes-dbg-00157-g7236657c6b3a #1450
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014
Call Trace:
 dump_stack+0xc8/0x11a
 print_usage_bug.cold.63+0x235/0x23e
 mark_lock+0xa9c/0xcf0
 __lock_acquire+0xd9a/0x2b50
 lock_acquire+0x164/0x500
 _raw_spin_lock_irqsave+0x40/0x60
 nvme_keep_alive_end_io+0x50/0xc0
 blk_mq_end_request+0x158/0x210
 nvme_complete_rq+0x146/0x500
 nvme_loop_complete_rq+0x26/0x30 [nvme_loop]
 blk_done_softirq+0x187/0x1e0
 __do_softirq+0x118/0x595
 run_ksoftirqd+0x35/0x50
 smpboot_thread_fn+0x1d3/0x310
 kthread+0x1e7/0x210
 ret_from_fork+0x22/0x30

Fixes: be93e87e7802 ("nvme: support for multiple Command Sets Supported and Effects log pages") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Tested-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
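The shape of the fix at the identify-time call sites (a sketch): since the lock is also taken from completion (softirq) context, process context must disable interrupts while holding it:

	-	spin_lock(&ctrl->lock);
	+	spin_lock_irq(&ctrl->lock);
		/* ... read/update effects log state ... */
	-	spin_unlock(&ctrl->lock);
	+	spin_unlock_irq(&ctrl->lock);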
2020-08-21 | nvmet: call blk_mq_free_request() directly | Chaitanya Kulkarni | 1 | -3/+3
Instead of calling blk_put_request(), which calls blk_mq_free_request(), call blk_mq_free_request() directly for NVMeOF passthru. This mainly avoids an extra function call in the completion path nvmet_passthru_req_done(). Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 | nvmet: fix oops in pt cmd execution | Chaitanya Kulkarni | 1 | -3/+3
In the existing NVMeOF Passthru core command handling, on failure of nvme_alloc_request() we error out with rq set to NULL. In the error handling path we call blk_put_request() without checking whether rq is NULL, which produces the following Oops:

[ 1457.346861] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1457.347838] #PF: supervisor read access in kernel mode
[ 1457.348464] #PF: error_code(0x0000) - not-present page
[ 1457.349085] PGD 0 P4D 0
[ 1457.349402] Oops: 0000 [#1] SMP NOPTI
[ 1457.349851] CPU: 18 PID: 10782 Comm: kworker/18:2 Tainted: G OE 5.8.0-rc4nvme-5.9+ #35
[ 1457.350951] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e3214
[ 1457.352347] Workqueue: events nvme_loop_execute_work [nvme_loop]
[ 1457.353062] RIP: 0010:blk_mq_free_request+0xe/0x110
[ 1457.353651] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8
[ 1457.355975] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282
[ 1457.356636] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
[ 1457.357526] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000
[ 1457.358416] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000
[ 1457.359317] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000
[ 1457.360424] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8
[ 1457.361322] FS: 0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000
[ 1457.362337] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1457.363058] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0
[ 1457.363973] Call Trace:
[ 1457.364296]  nvmet_passthru_execute_cmd+0x150/0x2c0 [nvmet]
[ 1457.364990]  process_one_work+0x24e/0x5a0
[ 1457.365493]  ? __schedule+0x353/0x840
[ 1457.365957]  worker_thread+0x3c/0x380
[ 1457.366426]  ? process_one_work+0x5a0/0x5a0
[ 1457.366948]  kthread+0x135/0x150
[ 1457.367362]  ? kthread_create_on_node+0x60/0x60
[ 1457.367934]  ret_from_fork+0x22/0x30
[ 1457.368388] Modules linked in: nvme_loop(OE) nvmet(OE) nvme_fabrics(OE) null_blk nvme(OE) nvme_corer
[ 1457.368414]  ata_piix crc32c_intel virtio_pci libata virtio_ring serio_raw t10_pi virtio floppy dm_]
[ 1457.380849] CR2: 0000000000000000
[ 1457.381288] ---[ end trace c6cab61bfd1f68fd ]---
[ 1457.381861] RIP: 0010:blk_mq_free_request+0xe/0x110
[ 1457.382469] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8
[ 1457.384749] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282
[ 1457.385393] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
[ 1457.386264] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000
[ 1457.387142] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000
[ 1457.388029] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000
[ 1457.388914] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8
[ 1457.389798] FS: 0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000
[ 1457.390796] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1457.391508] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0
[ 1457.392525] Kernel panic - not syncing: Fatal exception
[ 1457.394138] Kernel Offset: disabled
[ 1457.394677] ---[ end Kernel panic - not syncing: Fatal exception ]---

We fix this Oops by adding a new goto label out_put_req and reordering the blk_put_request() call so that it is never reached with rq set to NULL. Here we also update the rest of the code accordingly. Fixes: 06b7164dfdc0 ("nvmet: add passthru code to process commands") Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 | nvmet: add ns tear down label for pt-cmd handling | Chaitanya Kulkarni | 1 | -4/+5
In the current implementation, before submitting the passthru cmd we may encounter errors, e.g. when getting the ns from the passthru controller, allocating a request from the passthru controller, etc. For all the failure cases it only uses a single goto label, fail_out. In the target code we follow the pattern of having a separate label for each error-out case when setting up multiple things before the actual action. This patch follows the same pattern, renames the generic fail_out label to out_put_ns, and updates the error-out cases in nvmet_passthru_execute_cmd() where needed. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
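A sketch of the resulting unwinding order (details elided; label names per the patch description, the rest assumed):

	rq = nvme_alloc_request(q, req->cmd, 0, NVME_QID_ANY);
	if (IS_ERR(rq)) {
		status = NVME_SC_INTERNAL;
		goto out_put_ns;	/* rq is invalid here: don't free it */
	}
	/* ... map data and submit ... */
	return;
out_put_req:
	blk_mq_free_request(rq);
out_put_ns:
	if (ns)
		nvme_put_ns(ns);
	nvmet_req_complete(req, status);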
2020-08-21 | nvme: multipath: round-robin: eliminate "fallback" variable | Martin Wilck | 1 | -5/+4
If we find an optimized path, we quit the loop immediately. Thus we can use just one variable for the next path, slightly simplifying the code. Signed-off-by: Martin Wilck <mwilck@suse.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 | nvme: multipath: round-robin: fix single non-optimized path case | Martin Wilck | 1 | -5/+10
If there's only one usable, non-optimized path, nvme_round_robin_path() returns NULL, which is wrong. Fix it by falling back to "old", like in the single optimized path case. Also, if the active path isn't changed, there's no need to re-assign the pointer. Fixes: 3f6e3246db0e ("nvme-multipath: fix logic for non-optimized paths") Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Martin George <marting@netapp.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 | nvme-fc: Fix wrong return value in __nvme_fc_init_request() | Tianjia Zhang | 1 | -2/+2
On an error exit path, a negative error code should be returned instead of a positive return value. Fixes: e399441de9115 ("nvme-fabrics: Add host support for FC transport") Cc: James Smart <jsmart2021@gmail.com> Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 | nvmet-passthru: Reject commands with non-sgl flags set | Logan Gunthorpe | 1 | -0/+8
Any command with a non-SGL flag set (like fuse flags) should be rejected. Fixes: c1fef73f793b ("nvmet: add passthru code to process commands") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
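A sketch of the rejection, assuming NVME_CMD_SGL_ALL masks the two SGL descriptor bits in the command flags:

	/* fused operations and reserved flag bits are not supported */
	if (req->cmd->common.flags & ~NVME_CMD_SGL_ALL)
		return NVME_SC_INVALID_FIELD | NVME_SC_DNR;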
2020-08-21 | nvmet: fix a memory leak | Sagi Grimberg | 1 | -0/+1
We forgot to free new_model_number. Fixes: 013b7ebe5a0d ("nvmet: make ctrl model configurable") Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 | blkcg: fix memleak for iolatency | Yufen Yu | 1 | -3/+5
Normally, blkcg_iolatency_exit() frees the related iolatency memory when the queue is cleaned up. But if blk_throtl_init() returns an error and queue init fails, blkcg_iolatency_exit() will not do that for us, which causes a memory leak. Fixes: d70675121546 ("block: introduce blk-iolatency io controller") Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 | MAINTAINERS: Add missing header files to BLOCK LAYER section | Geert Uytterhoeven | 1 | -0/+1
The various <linux/blk*.h> header files are part of the Block Layer. Add them to the corresponding section in the MAINTAINERS file, so scripts/get_maintainer.pl will pick them up. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 | block: fix get_max_io_size() | Keith Busch | 1 | -1/+1
A previous commit aligning splits to physical block sizes inadvertently modified one return case such that it now returns 0-length splits when the number of sectors doesn't exceed the physical offset. This later hits a BUG in bio_split(). Restore the previous working behavior. Fixes: 9cc5169cd478b ("block: Improve physical block alignment of split bios") Reported-by: Eric Deal <eric.deal@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Cc: Bart Van Assche <bvanassche@acm.org> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 | blk-mq: insert request not through ->queue_rq into sw/scheduler queue | Ming Lei | 1 | -1/+2
Commit c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list") was supposed to add only requests which have been through ->queue_rq() to the hw queue dispatch list; however, it adds requests that ran out of budget or driver tag to the hw queue too. This basically bypasses request merging, causes too many requests to be dispatched to the LLD, and unnecessarily increases %system. Fix this issue by adding requests that have not been through ->queue_rq() to the sw/scheduler queue; this is safe because no ->queue_rq has been called on such a request yet. High %system, and even soft lockups, can be observed on Azure storvsc devices. This patch reduces %system during heavy sequential IO and decreases the soft-lockup risk. Fixes: c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list") Signed-off-by: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 | block/rnbd: Ensure err is always initialized in process_rdma | Nathan Chancellor | 1 | -1/+2
Clang warns:

drivers/block/rnbd/rnbd-srv.c:150:6: warning: variable 'err' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
        if (IS_ERR(bio)) {
            ^~~~~~~~~~~
drivers/block/rnbd/rnbd-srv.c:177:9: note: uninitialized use occurs here
        return err;
               ^~~
drivers/block/rnbd/rnbd-srv.c:150:2: note: remove the 'if' if its condition is always false
        if (IS_ERR(bio)) {
        ^~~~~~~~~~~~~~~~~~
drivers/block/rnbd/rnbd-srv.c:126:9: note: initialize the variable 'err' to silence this warning
        int err;
            ^
             = 0
1 warning generated.

err is indeed uninitialized when this statement is taken. Ensure that it is assigned the error value of bio before jumping to the error handling label. Fixes: 735d77d4fd28 ("rnbd: remove rnbd_dev_submit_io") Reported-by: Brooke Basile <brookebasile@gmail.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com> Link: https://github.com/ClangBuiltLinux/linux/issues/1134 Signed-off-by: Jens Axboe <axboe@kernel.dk>
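The fix itself is one line (shown with assumed surrounding context and a hypothetical error label):

	if (IS_ERR(bio)) {
		rnbd_srv_err(sess_dev, "Failed to generate bio, err: %ld\n",
			     PTR_ERR(bio));
		err = PTR_ERR(bio);	/* was left uninitialized before */
		goto sess_dev_put;
	}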
2020-08-18 | bfq: fix blkio cgroup leakage v4 | Dmitry Monakhov | 3 | -12/+3
Changes from v1:
- update commit description with proper ref-accounting justification

Commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree") introduces a leak of bfq_group and blkcg_gq objects because of a get/put imbalance. In fact the whole idea of the original commit is wrong, because a bfq_group entity cannot disappear under us: it is referenced by the child bfq_queue's entities from here:

 -> bfq_init_entity()
    ->bfqg_and_blkg_get(bfqg);
    ->entity->parent = bfqg->my_entity

 -> bfq_put_queue(bfqq)
    FINAL_PUT
    ->bfqg_and_blkg_put(bfqq_group(bfqq))
    ->kmem_cache_free(bfq_pool, bfqq);

So the parent entity cannot disappear while the child entity is in the tree, and child entities already have proper protection. This patch reverts commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree").

bfq_group leak trace caused by the bad commit:

-> blkg_alloc
   -> bfq_pq_alloc
     -> bfqg_get (+1)
->bfq_activate_bfqq
  ->bfq_activate_requeue_entity
    -> __bfq_activate_entity
       ->bfq_get_entity
         ->bfqg_and_blkg_get (+1)  <==== : Note1
->bfq_del_bfqq_busy
  ->bfq_deactivate_entity+0x53/0xc0 [bfq]
    ->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
      -> bfq_forget_entity(is_in_service = true)
         entity->on_st_or_in_serv = false  <==== :Note2
         if (is_in_service)
             return;  ==> do not touch reference
-> blkcg_css_offline
 -> blkcg_destroy_blkgs
  -> blkg_destroy
   -> bfq_pd_offline
    -> __bfq_deactivate_entity
         if (!entity->on_st_or_in_serv) /* true, because (Note2) */
             return false;
 -> bfq_pd_free
    -> bfqg_put() (-1, but bfqg->ref == 2) because of (Note2)

So bfq_group and blkcg_gq will leak forever; see the test case below.

##TESTCASE_BEGIN:
#!/bin/bash

max_iters=${1:-100}
#prep cgroup mounts
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

# Prepare blkdev
grep blkio /proc/cgroups
truncate -s 1M img
losetup /dev/loop0 img
echo bfq > /sys/block/loop0/queue/scheduler

grep blkio /proc/cgroups
for ((i=0;i<max_iters;i++))
do
    mkdir -p /sys/fs/cgroup/blkio/a
    echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
    dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
    echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
    rmdir /sys/fs/cgroup/blkio/a
    grep blkio /proc/cgroups
done
##TESTCASE_END:

Fixes: db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree") Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-17 | block: Fix page_is_mergeable() for compound pages | Matthew Wilcox (Oracle) | 1 | -5/+5
If we pass in an offset which is larger than PAGE_SIZE, then page_is_mergeable() thinks it's not mergeable with the previous bio_vec, leading to a large number of bio_vecs being used. Use a slightly more obvious test that the two pages are compatible with each other. Fixes: 52d52d1c98a9 ("block: only allow contiguous page structs in a bio_vec") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-17 | block: virtio_blk: fix handling single range discard request | Ming Lei | 1 | -8/+23
Commit 1f23816b8eb8 ("virtio_blk: add discard and write zeroes support") started to support multi-range discard for virtio-blk. However, the virtio-blk disk may report the max discard segment count as 1; at least that is exactly what qemu does. So far, the block layer switches to normal request merging if the max discard segment limit is 1, and multiple bios can be merged into a single segment. This may cause memory corruption in virtblk_setup_discard_write_zeroes(). Fix the issue by handling a single max discard segment in a straightforward way. Fixes: 1f23816b8eb8 ("virtio_blk: add discard and write zeroes support") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Changpeng Liu <changpeng.liu@intel.com> Cc: Daniel Verkamp <dverkamp@chromium.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-17 | block: respect queue limit of max discard segment | Ming Lei | 1 | -2/+9
When queue_max_discard_segments(q) is 1, blk_discard_mergable() returns false for discard requests, and normal request merging is applied. However, only queue_max_segments() is checked, so the max discard segment limit isn't respected. Check the max discard segment limit in the request merge code to fix the issue. This fixes discard request failures on virtio_blk. Fixes: 69840466086d ("block: fix the DISCARD request merge") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-17 | block: loop: set discard granularity and alignment for block device backed loop | Ming Lei | 1 | -15/+18
In the case of a block device backend, if the backend supports write zeroes, the loop device sets the queue flag QUEUE_FLAG_DISCARD. However, limits.discard_granularity isn't set up, which is wrong; see the following description in Documentation/ABI/testing/sysfs-block: "A discard_granularity of 0 means that the device does not support discard functionality." Especially since 9b15d109a6b2 ("block: improve discard bio alignment in __blkdev_issue_discard()") started to take q->limits.discard_granularity into account when computing max discard sectors, a zero discard granularity may cause a kernel oops, or fail discard requests even though the loop queue claims discard support via QUEUE_FLAG_DISCARD. Fix the issue by setting up the discard granularity and alignment. Fixes: c52abf563049 ("loop: Better discard support for block devices") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@suse.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Xiao Ni <xni@redhat.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Evan Green <evgreen@chromium.org> Cc: Gwendal Grignou <gwendal@chromium.org> Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Cc: Andrzej Pietrasiewicz <andrzej.p@collabora.com> Cc: Christoph Hellwig <hch@lst.de> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-17 | blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTART | Ming Lei | 2 | -0/+18
The SCHED_RESTART code path is relied on to re-run the queue for requests left in hctx->dispatch. Meanwhile, the SCHED_RESTART flag is checked when adding requests to hctx->dispatch. Memory barriers have to be used for ordering the following two pairs of operations:

1) adding requests to hctx->dispatch and checking SCHED_RESTART in blk_mq_dispatch_rq_list()

2) clearing SCHED_RESTART and checking if there is a request in hctx->dispatch in blk_mq_sched_restart().

Without the added memory barrier, either:

1) blk_mq_sched_restart() may miss requests added to hctx->dispatch while blk_mq_dispatch_rq_list() observes SCHED_RESTART and does not run the queue on the dispatch side, or

2) blk_mq_dispatch_rq_list() still sees SCHED_RESTART and does not run the queue on the dispatch side, while the check for a request in hctx->dispatch from blk_mq_sched_restart() is missed.

An IO hang in the ltp/fs_fill test was reported by the kernel test robot: https://lkml.org/lkml/2020/7/26/77 It turns out to be caused by the above out-of-order operations, and the IO hang can't be observed any more after applying this patch. Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers") Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: David Jeffery <djeffery@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
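A sketch of the dispatch side of the pairing (simplified; the restart side clears SCHED_RESTART, issues the mirror-image barrier, then re-checks hctx->dispatch):

	spin_lock(&hctx->lock);
	list_splice_tail_init(&list, &hctx->dispatch);	/* 1) add requests */
	spin_unlock(&hctx->lock);

	smp_mb();	/* order the list store vs. the flag load below */

	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
		blk_mq_run_hw_queue(hctx, true);	/* restart won't re-run for us */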
2020-08-16 | bsg-lib: convert comma to semicolon | Xu Wang | 1 | -1/+1
Replace a comma between expression statements by a semicolon. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-16 | block: blk-mq.c: fix @at_head kernel-doc warning | Randy Dunlap | 1 | -0/+1
Fix a kernel-doc warning in block/blk-mq.c: ../block/blk-mq.c:1844: warning: Function parameter or member 'at_head' not described in 'blk_mq_request_bypass_insert' Fixes: 01e99aeca397 ("blk-mq: insert passthrough request into hctx->dispatch directly") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: André Almeida <andrealmeid@collabora.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <ming.lei@redhat.com> Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-16 | Linux 5.9-rc1 (tag: v5.9-rc1) | Linus Torvalds | 1 | -2/+2
2020-08-16 | Merge tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block | Linus Torvalds | 4 | -156/+409
Pull io_uring fixes from Jens Axboe:
 "A few different things in here. Seems like syzbot got some more io_uring bits wired up, and we got a handful of reports and the associated fixes are in here. General fixes too, and a lot of them marked for stable.

  Lastly, a bit of fallout from the async buffered reads, where we now more easily trigger short reads. Some applications don't really like that, so the io_read() code now handles short reads internally, and got a cleanup along the way so that it's now easier to read (and documented). We're now passing tests that failed before"

* tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block:
  io_uring: short circuit -EAGAIN for blocking read attempt
  io_uring: sanitize double poll handling
  io_uring: internally retry short reads
  io_uring: retain iov_iter state over io_read/io_write calls
  task_work: only grab task signal lock when needed
  io_uring: enable lookup of links holding inflight files
  io_uring: fail poll arm on queue proc failure
  io_uring: hold 'ctx' reference around task_work queue + execute
  fs: RWF_NOWAIT should imply IOCB_NOIO
  io_uring: defer file table grabbing request cleanup for locked requests
  io_uring: add missing REQ_F_COMP_LOCKED for nested requests
  io_uring: fix recursive completion locking on oveflow flush
  io_uring: use TWA_SIGNAL for task_work uncondtionally
  io_uring: account locked memory before potential error case
  io_uring: set ctx sq/cq entry count earlier
  io_uring: Fix NULL pointer dereference in loop_rw_iter()
  io_uring: add comments on how the async buffered read retry works
  io_uring: io_async_buf_func() need not test page bit
2020-08-16 | parisc: fix PMD pages allocation by restoring pmd_alloc_one() | Mike Rapoport | 1 | -0/+6
Commit 1355c31eeb7e ("asm-generic: pgalloc: provide generic pmd_alloc_one() and pmd_free_one()") converted parisc to use generic version of pmd_alloc_one() but it missed the fact that parisc uses order-1 pages for PMD. Restore the original version of pmd_alloc_one() for parisc, just use GFP_PGTABLE_KERNEL that implies __GFP_ZERO instead of GFP_KERNEL and memset. Fixes: 1355c31eeb7e ("asm-generic: pgalloc: provide generic pmd_alloc_one() and pmd_free_one()") Reported-by: Meelis Roos <mroos@linux.ee> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Tested-by: Meelis Roos <mroos@linux.ee> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lkml.kernel.org/r/9f2b5ebd-e4a4-0fa1-6cd3-4b9f6892d1ad@linux.ee Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-15 | Merge tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block | Linus Torvalds | 8 | -66/+62
Pull block fixes from Jens Axboe:
 "A few fixes on the block side of things:

  - Discard granularity fix (Coly)

  - rnbd cleanups (Guoqing)

  - md error handling fix (Dan)

  - md sysfs fix (Junxiao)

  - Fix flush request accounting, which caused an IO slowdown for some configurations (Ming)

  - Properly propagate loop flag for partition scanning (Lennart)"

* tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block:
  block: fix double account of flush request's driver tag
  loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE
  rnbd: no need to set bi_end_io in rnbd_bio_map_kern
  rnbd: remove rnbd_dev_submit_io
  md-cluster: Fix potential error pointer dereference in resize_bitmaps()
  block: check queue's limits.discard_granularity in __blkdev_issue_discard()
  md: get sysfs entry after redundancy attr group create
2020-08-15 | Merge tag 'riscv-for-linus-5.9-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux | Linus Torvalds | 1 | -8/+17
Pull RISC-V fix from Palmer Dabbelt:
 "I collected a single fix during the merge window: we managed to break the early trap setup on !MMU, this fixes it"

* tag 'riscv-for-linus-5.9-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Setup exception vector for nommu platform
2020-08-15 | Merge tag 'sh-for-5.9' of git://git.libc.org/linux-sh | Linus Torvalds | 69 | -998/+370
Pull arch/sh updates from Rich Felker:
 "Cleanup, SECCOMP_FILTER support, message printing fixes, and other changes to arch/sh"

* tag 'sh-for-5.9' of git://git.libc.org/linux-sh: (34 commits)
  sh: landisk: Add missing initialization of sh_io_port_base
  sh: bring syscall_set_return_value in line with other architectures
  sh: Add SECCOMP_FILTER
  sh: Rearrange blocks in entry-common.S
  sh: switch to copy_thread_tls()
  sh: use the generic dma coherent remap allocator
  sh: don't allow non-coherent DMA for NOMMU
  dma-mapping: consolidate the NO_DMA definition in kernel/dma/Kconfig
  sh: unexport register_trapped_io and match_trapped_io_handler
  sh: don't include <asm/io_trapped.h> in <asm/io.h>
  sh: move the ioremap implementation out of line
  sh: move ioremap_fixed details out of <asm/io.h>
  sh: remove __KERNEL__ ifdefs from non-UAPI headers
  sh: sort the selects for SUPERH alphabetically
  sh: remove -Werror from Makefiles
  sh: Replace HTTP links with HTTPS ones
  arch/sh/configs: remove obsolete CONFIG_SOC_CAMERA*
  sh: stacktrace: Remove stacktrace_ops.stack()
  sh: machvec: Modernize printing of kernel messages
  sh: pci: Modernize printing of kernel messages
  ...
2020-08-15 | io_uring: short circuit -EAGAIN for blocking read attempt | Jens Axboe | 1 | -1/+4
One case was missed in the short IO retry handling, and that's hitting -EAGAIN on a blocking read attempt (e.g. from io-wq context). This is a problem with sockets that are marked as non-blocking when created: they don't carry any REQ_F_NOWAIT information to help us terminate them instead of perpetually retrying. Fixes: 227c0c9673d8 ("io_uring: internally retry short reads") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-15 | io_uring: sanitize double poll handling | Jens Axboe | 1 | -9/+25
There's a bit of confusion on the matching pairs of poll vs double poll, depending on if the request is a pure poll (IORING_OP_POLL_ADD) or poll driven retry. Add io_poll_get_double() that returns the double poll waitqueue, if any, and io_poll_get_single() that returns the original poll waitqueue. With that, remove the argument to io_poll_remove_double(). Finally ensure that wait->private is cleared once the double poll handler has run, so that remove knows it's already been seen. Cc: stable@vger.kernel.org # v5.8 Reported-by: syzbot+7f617d4a9369028b8a2c@syzkaller.appspotmail.com Fixes: 18bceab101ad ("io_uring: allow POLL_ADD with double poll_wait() users") Signed-off-by: Jens Axboe <axboe@kernel.dk>