summaryrefslogtreecommitdiffstats
path: root/drivers/nvme/host
AgeCommit message (Collapse)AuthorFilesLines
2022-10-07Merge tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linuxLinus Torvalds3-87/+158
Pull passthrough updates from Jens Axboe: "With these changes, passthrough NVMe support over io_uring now performs at the same level as block device O_DIRECT, and in many cases 6-8% better. This contains: - Add support for fixed buffers for passthrough (Anuj, Kanchan) - Enable batched allocations and freeing on passthrough, similarly to what we support on the normal storage path (me) - Fix from Geert fixing an issue with !CONFIG_IO_URING" * tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linux: io_uring: Add missing inline to io_uring_cmd_import_fixed() dummy nvme: wire up fixed buffer support for nvme passthrough nvme: pass ubuffer as an integer block: extend functionality to map bvec iterator block: factor out blk_rq_map_bio_alloc helper block: rename bio_map_put to blk_mq_map_bio_put nvme: refactor nvme_alloc_request nvme: refactor nvme_add_user_metadata nvme: Use blk_rq_map_user_io helper scsi: Use blk_rq_map_user_io helper block: add blk_rq_map_user_io io_uring: introduce fixed buffer support for io_uring_cmd io_uring: add io_uring_cmd_import_fixed nvme: enable batched completions of passthrough IO nvme: split out metadata vs non metadata end_io uring_cmd completions block: allow end_io based requests in the completion batch handling block: change request end_io handler to pass back a return value block: enable batched allocation for blk_mq_alloc_request() block: kill deprecated BUG_ON() in the flush handling
2022-10-07Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linuxLinus Torvalds8-383/+383
Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - handle number of queue changes in the TCP and RDMA drivers (Daniel Wagner) - allow changing the number of queues in nvmet (Daniel Wagner) - also consider host_iface when checking ip options (Daniel Wagner) - don't map pages which can't come from HIGHMEM (Fabio M. De Francesco) - avoid unnecessary flush bios in nvmet (Guixin Liu) - shrink and better pack the nvme_iod structure (Keith Busch) - add comment for unaligned "fake" nqn (Linjun Bao) - print actual source IP address through sysfs "address" attr (Martin Belanger) - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang) - handle effects after freeing the request (Keith Busch) - copy firmware_rev on each init (Keith Busch) - restrict management ioctls to admin (Keith Busch) - ensure subsystem reset is single threaded (Keith Busch) - report the actual number of tagset maps in nvme-pci (Keith Busch) - small fabrics authentication fixups (Christoph Hellwig) - add common code for tagset allocation and freeing (Christoph Hellwig) - stop using the request_queue in nvmet (Christoph Hellwig) - set min_align_mask before calculating max_hw_sectors (Rishabh Bhatnagar) - send a rediscover uevent when a persistent discovery controller reconnects (Sagi Grimberg) - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi) - MD pull request via Song: - Various raid5 fix and clean up, by Logan Gunthorpe and David Sloan. - Raid10 performance optimization, by Yu Kuai. - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu) - IO scheduler switching quisce fix (Keith) - s390/dasd block driver updates (Stefan) - support for recovery for the ublk driver (ZiyangZhang) - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph) - blk-mq and null_blk map fixes (Bart) - various bcache fixes (Coly, Jilin, Jules) - nbd signal hang fix (Shigeru) - block writeback throttling fix (Yu) - optimize the passthrough mapping handling (me) - prepare block cgroups to being gendisk based (Christoph) - get rid of an old PSI hack in the block layer, moving it to the callers instead where it belongs (Christoph) - blk-throttle fixes and cleanups (Yu) - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj, Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming, Miaohe, Bart, Coly, Gaosheng * tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ...
2022-10-07Merge tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linuxLinus Torvalds4-5/+78
Pull io_uring updates from Jens Axboe: - Add supported for more directly managed task_work running. This is beneficial for real world applications that end up issuing lots of system calls as part of handling work. Normal task_work will always execute as we transition in and out of the kernel, even for "unrelated" system calls. It's more efficient to defer the handling of io_uring's deferred work until the application wants it to be run, generally in batches. As part of ongoing work to write an io_uring network backend for Thrift, this has been shown to greatly improve performance. (Dylan) - Add IOPOLL support for passthrough (Kanchan) - Improvements and fixes to the send zero-copy support (Pavel) - Partial IO handling fixes (Pavel) - CQE ordering fixes around CQ ring overflow (Pavel) - Support sendto() for non-zc as well (Pavel) - Support sendmsg for zerocopy (Pavel) - Networking iov_iter fix (Stefan) - Misc fixes and cleanups (Pavel, me) * tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux: (56 commits) io_uring/net: fix notif cqe reordering io_uring/net: don't update msg_name if not provided io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL io_uring/rw: defer fsnotify calls to task context io_uring/net: fix fast_iov assignment in io_setup_async_msg() io_uring/net: fix non-zc send with address io_uring/net: don't skip notifs for failed requests io_uring/rw: don't lose short results on io_setup_async_rw() io_uring/rw: fix unexpected link breakage io_uring/net: fix cleanup double free free_iov init io_uring: fix CQE reordering io_uring/net: fix UAF in io_sendrecv_fail() selftest/net: adjust io_uring sendzc notif handling io_uring: ensure local task_work marks task as running io_uring/net: zerocopy sendmsg io_uring/net: combine fail handlers io_uring/net: rename io_sendzc() io_uring/net: support non-zerocopy sendto io_uring/net: refactor io_setup_async_addr io_uring/net: don't lose partial send_zc on fail ...
2022-09-30Merge tag 'block-6.0-2022-09-29' of git://git.kernel.dk/linuxLinus Torvalds2-3/+7
Pull block fixes from Jens Axboe: "A single NVMe pull request via Christoph with a few fixes that should go into the 6.0 release: - Fix IOC_PR_CLEAR and IOC_PR_RELEASE ioctls for nvme devices (Michael Kelley) - Disable Write Zeroes on Phison E3C/E4C (Tina Hsu)" * tag 'block-6.0-2022-09-29' of git://git.kernel.dk/linux: nvme-pci: disable Write Zeroes on Phison E3C/E4C nvme: Fix IOC_PR_CLEAR and IOC_PR_RELEASE ioctls for nvme devices
2022-09-30nvme: wire up fixed buffer support for nvme passthroughKanchan Joshi1-2/+16
if io_uring sends passthrough command with IORING_URING_CMD_FIXED flag, use the pre-registered buffer for IO (non-vectored variant). Pass the buffer/length to io_uring and get the bvec iterator for the range. Next, pass this bvec to block-layer and obtain a bio/request for subsequent processing. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220930062749.152261-13-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30nvme: pass ubuffer as an integerKanchan Joshi1-11/+12
This is a prep patch. Modify nvme_submit_user_cmd and nvme_map_user_request to take ubuffer as plain integer argument, and do away with nvme_to_user_ptr conversion in callers. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220930062749.152261-12-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30nvme: refactor nvme_alloc_requestKanchan Joshi1-37/+53
nvme_alloc_request expects a large number of parameters. Split this out into two functions to reduce number of parameters. First one retains the name nvme_alloc_request, while second one is named nvme_map_user_request. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220930062749.152261-8-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30nvme: refactor nvme_add_user_metadataKanchan Joshi1-10/+13
Pass struct request rather than bio. It helps to kill a parameter, and some processing clean-up too. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220930062749.152261-7-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30nvme: Use blk_rq_map_user_io helperAnuj Gupta1-16/+2
User blk_rq_map_user_io instead of duplicating the same code at different places Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/20220930062749.152261-6-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30nvme: enable batched completions of passthrough IOJens Axboe1-2/+1
Now that the normal passthrough end_io path doesn't need the request anymore, we can kill the explicit blk_mq_free_request() and just pass back RQ_END_IO_FREE instead. This enables the batched completion from freeing batches of requests at the time. This brings passthrough IO performance at least on par with bdev based O_DIRECT with io_uring. With this and batche allocations, peak performance goes from 110M IOPS to 122M IOPS. For IRQ based, passthrough is now also about 10% faster than previously, going from ~61M to ~67M IOPS. Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <kbusch@kernel.org> Co-developed-by: Stefan Roesch <shr@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30nvme: split out metadata vs non metadata end_io uring_cmd completionsJens Axboe1-18/+61
By splitting up the metadata and non-metadata end_io handling, we can remove any request dependencies on the normal non-metadata IO path. This is in preparation for enabling the normal IO passthrough path to pass the ownership of the request back to the block layer. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <kbusch@kernel.org> Co-developed-by: Stefan Roesch <shr@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30block: change request end_io handler to pass back a return valueJens Axboe3-7/+16
Everything is just converted to returning RQ_END_IO_NONE, and there should be no functional changes with this patch. In preparation for allowing the end_io handler to pass ownership back to the block layer, rather than retain ownership of the request. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30Merge branch 'for-6.1/io_uring' into for-6.1/passthroughJens Axboe4-5/+78
* for-6.1/io_uring: (56 commits) io_uring/net: fix notif cqe reordering io_uring/net: don't update msg_name if not provided io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL io_uring/rw: defer fsnotify calls to task context io_uring/net: fix fast_iov assignment in io_setup_async_msg() io_uring/net: fix non-zc send with address io_uring/net: don't skip notifs for failed requests io_uring/rw: don't lose short results on io_setup_async_rw() io_uring/rw: fix unexpected link breakage io_uring/net: fix cleanup double free free_iov init io_uring: fix CQE reordering io_uring/net: fix UAF in io_sendrecv_fail() selftest/net: adjust io_uring sendzc notif handling io_uring: ensure local task_work marks task as running io_uring/net: zerocopy sendmsg io_uring/net: combine fail handlers io_uring/net: rename io_sendzc() io_uring/net: support non-zerocopy sendto io_uring/net: refactor io_setup_async_addr io_uring/net: don't lose partial send_zc on fail ...
2022-09-30Merge branch 'for-6.1/block' into for-6.1/passthroughJens Axboe8-383/+383
* for-6.1/block: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ... Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-27nvme: remove nvme_ctrl_init_connect_qChristoph Hellwig1-8/+0
Unused now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvme-fc: use the tagset alloc/free helpersChristoph Hellwig1-66/+17
Use the common helpers to allocate and free the tagsets. To make this work the generic nvme_ctrl now needs to be stored in the hctx private data instead of the nvme_fc_ctrl. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <jsmart2021@gmail.com>
2022-09-27nvme-fc: store the generic nvme_ctrl in set->driver_dataChristoph Hellwig1-20/+12
Point the private data to the generic controller structure in preparation of using the common tagset init/exit code and use the chance the cleanup the init_hctx methods a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <jsmart2021@gmail.com>
2022-09-27nvme-fc: keep ctrl->sqsize in sync with opts->queue_sizeChristoph Hellwig1-9/+1
Also update the sqsize field when capping the queue size, and remove the check a queue size that is larger than sqsize given that sqsize is only initialized from opts->queue_size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <jsmart2021@gmail.com>
2022-09-27nvme-rdma: use the tagset alloc/free helpersChristoph Hellwig1-99/+34
Use the common helpers to allocate and free the tagsets. To make this work the generic nvme_ctrl now needs to be stored in the hctx private data instead of the nvme_rdma_ctrl. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvme-rdma: store the generic nvme_ctrl in set->driver_dataChristoph Hellwig1-6/+6
Point the private data to the generic controller structure in preparation of using the common tagset init/exit code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvme-tcp: use the tagset alloc/free helpersChristoph Hellwig1-85/+16
Use the common helpers to allocate and free the tagsets. To make this work the generic nvme_ctrl now needs to be stored in the hctx private data instead of the nvme_tcp_ctrl. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvme-tcp: store the generic nvme_ctrl in set->driver_dataChristoph Hellwig1-6/+6
Point the private data to the generic controller structure in preparation of using the common tagset init/exit code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvme-tcp: remove the unused queue_size member in nvme_tcp_queueChristoph Hellwig1-6/+3
->nvme_tcp_queue is not used anywhere, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvme: add common helpers to allocate and free tagsetsChristoph Hellwig2-0/+110
Add common helpers to allocate and tear down the admin and I/O tag sets, including the special queues allocated with them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2022-09-27nvme-pci: report the actual number of tagset mapsKeith Busch1-2/+4
We've been reporting 2 maps regardless of whether the module parameter asked for anything beyond the default queues. A consequence of this means that blk-mq will reinitialize the all the hardware contexts and io schedulers on every controller reset when the mapping is exactly the same as before. This unnecessary overhead is adding several milliseconds on a reset for environments that don't need it. Report the actual number of mappings in use. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme-pci: set min_align_mask before calculating max_hw_sectorsRishabh Bhatnagar1-1/+2
If swiotlb is force enabled dma_max_mapping_size ends up calling swiotlb_max_mapping_size which takes into account the min align mask for the device. Set the min align mask for nvme driver before calling dma_max_mapping_size while calculating max hw sectors. Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme: send a rediscover uevent when a persistent discovery controller reconnectsSagi Grimberg2-0/+11
When a discovery controller is disconnected, no AENs will arrive to notify the host about discovery log change events. In order to solve this, send a uevent notification when a persistent discovery controller reconnects. We add a new ctrl flag NVME_CTRL_STARTED_ONCE that will be set on the first start, and consecutive calls will find it set, and send the event to userspace if the controller is a discovery controller. Upon the event reception, userspace will re-read the discovery log page and will act upon changes as it sees fit. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme: enumerate controller flagsSagi Grimberg1-2/+5
We expect to grow a few of these flags for various purposes so make them a proper enumeration. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme-pci: disable Write Zeroes on Phison E3C/E4CTina Hsu1-0/+4
E3C/E4C SSDs do support the Write Zeroes command in theory, but have very bad performance when using it. As the firmware has been frozen for these products we can not expect firmware improvements for it, so disable Write Zeroes. Signed-off-by: Tina Hsu <tina_hsu@phison.corp-partner.google.com> [hch: update the commit message] Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme: Fix IOC_PR_CLEAR and IOC_PR_RELEASE ioctls for nvme devicesMichael Kelley1-3/+3
The IOC_PR_CLEAR and IOC_PR_RELEASE ioctls are non-functional on NVMe devices because the nvme_pr_clear() and nvme_pr_release() functions set the IEKEY field incorrectly. The IEKEY field should be set only when the key is zero (i.e, not specified). The current code does it backwards. Furthermore, the NVMe spec describes the persistent reservation "clear" function as an option on the reservation release command. The current implementation of nvme_pr_clear() erroneously uses the reservation register command. Fix these errors. Note that NVMe version 1.3 and later specify that setting the IEKEY field will return an error of Invalid Field in Command. The fix will set IEKEY when the key is zero, which is appropriate as these ioctls consider a zero key to be "unspecified", and the intention of the spec change is to require a valid key. Tested on a version 1.4 PCI NVMe device in an Azure VM. Fixes: 1673f1f08c88 ("nvme: move block_device_operations and ns/ctrl freeing to common code") Fixes: 1d277a637a71 ("NVMe: Add persistent reservation ops") Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme: ensure subsystem reset is single threadedKeith Busch1-3/+13
The subsystem reset writes to a register, so we have to ensure the device state is capable of handling that otherwise the driver may access unmapped registers. Use the state machine to ensure the subsystem reset doesn't try to write registers on a device already undergoing this type of reset. Link: https://bugzilla.kernel.org/show_bug.cgi?id=214771 Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme: restrict management ioctls to adminKeith Busch1-0/+6
The passthrough commands already have this restriction, but the other operations do not. Require the same capabilities for all users as all of these operations, which include resets and rescans, can be disruptive. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme: copy firmware_rev on each initKeith Busch1-1/+2
The firmware revision can change on after a reset so copy the most recent info each time instead of just the first time, otherwise the sysfs firmware_rev entry may contain stale data. Reported-by: Jeff Lien <jeff.lien@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Chao Leng <lengchao@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-27nvme: handle effects after freeing the requestKeith Busch3-13/+17
If a reset occurs after the scan work attempts to issue a command, the reset may quisce the admin queue, which blocks the scan work's command from dispatching. The scan work will not be able to complete while the queue is quiesced. Meanwhile, the reset work will cancel all outstanding admin tags and wait until all requests have transitioned to idle, which includes the passthrough request. But the passthrough request won't be set to idle until after the scan_work flushes, so we're deadlocked. Fix this by handling the end effects after the request has been freed. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216354 Reported-by: Jonathan Derrick <Jonathan.Derrick@solidigm.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chao Leng <lengchao@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-21fs: add batch and poll flags to the uring_cmd_iopoll() handlerJens Axboe2-6/+12
We need the poll_flags to know how to poll for the IO, and we should have the batch structure in preparation for supporting batched completions with iopoll. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21nvme: wire up async polling for io passthrough commandsKanchan Joshi4-5/+72
Store a cookie during submission, and use that to implement completion-polling inside the ->uring_cmd_iopoll handler. This handler makes use of existing bio poll facility. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/20220823161443.49436-5-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-19nvme-tcp: print actual source IP address through sysfs "address" attrMartin Belanger1-1/+20
TCP transport relies on the routing table to determine which source address and interface to use when making a connection. Currently, there is no way to tell from userspace where a connection was made. This patch exposes the actual source address using a new field named "src_addr=" in the "address" attribute. This is needed to diagnose and identify connectivity issues. With the source address we can infer the interface associated with each connection. This was tested with nvme-cli 2.0 to verify it does not have any adverse effect. The new "src_addr=" field will simply be displayed in the output of the "list-subsys" or "list -v" commands as shown here. $ nvme list-subsys nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress.discovery \ +- nvme0 tcp traddr=192.168.56.1,trsvcid=8009,src_addr=192.168.56.101 live Signed-off-by: Martin Belanger <martin.belanger@dell.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19nvme-pci: move iod dma_len fill gapsKeith Busch1-1/+1
The 32-bit field, dma_len, packs better in the iod struct above the dma_addr_t on 64-bit systems. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19nvme-pci: iod npages fits in s8Keith Busch1-13/+16
The largest allowed transfer is 4MB, which can use at most 1025 PRPs. Each PRP is 8 bytes, so the maximum number of 4k nvme pages needed for the iod_list is 3, which fits in an 's8' type. While modifying this field, change the name to "nr_allocations" to better represent that this is referring to the number of units allocated from a dma_pool. Also introduce a BUILD_BUG_ON to ensure we never accidently increase the largest transfer limit beyond 127 chained prp lists. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19nvme-pci: iod's 'aborted' is a boolKeith Busch1-3/+3
It's only true or false, so make this a bool to reflect that and save some space in nvme_iod. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19nvme-pci: remove nvme_queue from nvme_iodKeith Busch1-15/+13
We can get the nvme_queue from the req just as easily, so remove the duplicate path to the same structure to save some space. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19nvme: consider also host_iface when checking ip optionsDaniel Wagner1-5/+18
It's perfectly fine to use the same traddr and trsvcid more than once as long we use different host interface. This is used in setups where the host has more than one interface but the target exposes only one traddr/trsvcid combination. Use the same acceptance rules for host_iface as we have for host_traddr. Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Chao Leng <lengchao@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19nvme-rdma: handle number of queue changesDaniel Wagner1-5/+21
On reconnect, the number of queues might have changed. In the case where we have more queues available than previously we try to access queues which are not initialized yet. The other case where we have less queues than previously, the connection attempt will fail because the target doesn't support the old number of queues and we end up in a reconnect loop. Thus, only start queues which are currently present in the tagset limited by the number of available queues. Then we update the tagset and we can start any new queue. Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19nvme-tcp: handle number of queue changesDaniel Wagner1-5/+21
On reconnect, the number of queues might have changed. In the case where we have more queues available than previously we try to access queues which are not initialized yet. The other case where we have less queues than previously, the connection attempt will fail because the target doesn't support the old number of queues and we end up in a reconnect loop. Thus, only start queues which are currently present in the tagset limited by the number of available queues. Then we update the tagset and we can start any new queue. Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19nvme: move from strlcpy with unused retval to strscpyWolfram Sang2-2/+2
Follow the advice of the below link and prefer 'strscpy' in this subsystem. Conversion is 1:1 because the return value is not used. Generated by a coccinelle script. Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/ Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19nvme: add comment for unaligned "fake" nqnLinjun Bao1-1/+5
Current "fake" nqn field is "nqn.2014.08.org.nvmexpress:", it is not aligned with the canonical version for history reasons. Signed-off-by: Linjun Bao <meljbao@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-09Merge tag 'block-6.0-2022-09-09' of git://git.kernel.dk/linux-blockLinus Torvalds2-8/+13
Pull block fixes from Jens Axboe: - NVMe pull via Christoph: - fix a use after free in nvmet (Bart Van Assche) - fix a use after free when detecting digest errors (Sagi Grimberg) - fix regression that causes sporadic TCP requests to time out (Sagi Grimberg) - fix two off by ones errors in the nvmet ZNS support (Dennis Maisenbacher) - requeue aen after firmware activation (Keith Busch) - Fix missing request flags in debugfs code (me) - Partition scan fix (Ming) * tag 'block-6.0-2022-09-09' of git://git.kernel.dk/linux-block: block: add missing request flags to debugfs code nvme: requeue aen after firmware activation nvmet: fix mar and mor off-by-one errors nvme-tcp: fix regression that causes sporadic requests to time out nvme-tcp: fix UAF when detecting digest errors nvmet: fix a use-after-free block: don't add partitions if GD_SUPPRESS_PART_SCAN is set
2022-09-07nvme: requeue aen after firmware activationKeith Busch1-3/+11
The driver prevents async event work while handling a processing paused event, but someone needs to restart it after the controller returns to a live state. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216400 Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-06nvme-tcp: fix regression that causes sporadic requests to time outSagi Grimberg1-4/+1
When we queue requests, we strive to batch as much as possible and also signal the network stack that more data is about to be sent over a socket with MSG_SENDPAGE_NOTLAST. This flag looks at the pending requests queued as well as queue->more_requests that is derived from the block layer last-in-batch indication. We set more_request=true when we flush the request directly from .queue_rq submission context (in nvme_tcp_send_all), however this is wrongly assuming that no other requests may be queued during the execution of nvme_tcp_send_all. Due to this, a race condition may happen where: 1. request X is queued as !last-in-batch 2. request X submission context calls nvme_tcp_send_all directly 3. nvme_tcp_send_all is preempted and schedules to a different cpu 4. request Y is queued as last-in-batch 5. nvme_tcp_send_all context sends request X+Y, however signals for both MSG_SENDPAGE_NOTLAST because queue->more_requests=true. ==> none of the requests is pushed down to the wire as the network stack is waiting for more data, both requests timeout. To fix this, we eliminate queue->more_requests and only rely on the queue req_list and send_list to be not-empty. Fixes: 122e5b9f3d37 ("nvme-tcp: optimize network stack with setting msg flags according to batch size") Reported-by: Jonathan Nicklin <jnicklin@blockbridge.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Jonathan Nicklin <jnicklin@blockbridge.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-06nvme-tcp: fix UAF when detecting digest errorsSagi Grimberg1-1/+1
We should also bail from the io_work loop when we set rd_enabled to true, so we don't attempt to read data from the socket when the TCP stream is already out-of-sync or corrupted. Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Reported-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>