summaryrefslogtreecommitdiffstats
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2022-08-02md/raid5: Refactor add_stripe_bio()Logan Gunthorpe1-30/+56
Factor out two helper functions from add_stripe_bio(): one to check for overlap (stripe_bio_overlaps()), and one to actually add the bio to the stripe (__add_stripe_bio()). The latter function will always succeed. This will be useful in the next patch so that overlap can be checked for multiple disks before adding any Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Keep a reference to last stripe_head for batchLogan Gunthorpe1-12/+40
When batching, every stripe head has to find the previous stripe head to add to the batch list. This involves taking the hash lock which is highly contended during IO. Instead of finding the previous stripe_head each time, store a reference to the previous stripe_head in a pointer so that it doesn't require taking the contended lock another time. The reference to the previous stripe must be released before scheduling and waiting for work to get done. Otherwise, it can hold up raid5_activate_delayed() and deadlock. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Refactor for loop in raid5_make_request() into while loopLogan Gunthorpe1-4/+5
The for loop with retry label can be more cleanly expressed as a while loop by moving the logical_sector increment into the success path. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Move read_seqcount_begin() into make_stripe_request()Logan Gunthorpe1-7/+5
Now that prepare_to_wait() isn't in the way, move read_sequcount_begin() into make_stripe_request(). No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Drop the do_prepare flag in raid5_make_request()Logan Gunthorpe1-6/+2
prepare_to_wait() can be reasonably called after schedule instead of setting a flag and preparing in the next loop iteration. This means that prepare_to_wait() will be called before read_seqcount_begin(), but there shouldn't be any reason that the order matters here. On the first iteration of the loop prepare_to_wait() is already called first. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Factor out helper from raid5_make_request() loopLogan Gunthorpe1-98/+133
Factor out the inner loop of raid5_make_request() into it's own helper called make_stripe_request(). The helper returns a number of statuses: SUCCESS, RETRY, SCHEDULE_AND_RETRY and FAIL. This makes the code a bit easier to understand and allows the SCHEDULE_AND_RETRY path to be made common. A context structure is added to contain do_flush. It will be used more in subsequent patches for state that needs to be kept outside the loop. No functional changes intended. This will be cleaned up further in subsequent patches to untangle the gen_lock and do_prepare logic further. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Move common stripe get code into new find_get_stripe() helperLogan Gunthorpe1-67/+64
Both uses of find_stripe() require a fairly complicated dance to increment the reference count. Move this into a common find_get_stripe() helper. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()Logan Gunthorpe1-2/+3
stripe_add_to_batch_list() is better done in the loop in make_request instead of inside add_stripe_bio(). This is clearer and allows for storing the batch_head state outside the loop in a subsequent patch. The call to add_stripe_bio() in retry_aligned_read() is for read and batching only applies to write. So it's impossible for batching to happen at that call site. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Refactor raid5_make_request loopLogan Gunthorpe1-54/+55
Break immediately if raid5_get_active_stripe() returns NULL and deindent the rest of the loop. Annotate this check with an unlikely(). This makes the code easier to read and reduces the indentation level. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Factor out ahead_of_reshape() functionLogan Gunthorpe1-12/+15
There are a few uses of an ugly ternary operator in raid5_make_request() to check if a sector is a head of a reshape sector. Factor this out into a simple helper called ahead_of_reshape(). No functional changes intended. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Make logic blocking check consistent with logic that blocksLogan Gunthorpe1-1/+1
The check in raid5_make_request differs very slightly from the logic that causes it to block lower down. This likely does not cause a bug as the check is fuzzy anyway (as reshape may move on between the first check and the subsequent check). However, make it consistent so it can be cleaned up in a subsequent patch. The condition which causes the schedule is: !(mddev->reshape_backwards ? logical_sector < conf->reshape_progress : logical_sector >= conf->reshape_progress) && (mddev->reshape_backwards ? logical_sector < conf->reshape_safe : logical_sector >= conf->reshape_safe) The condition that causes the early bailout is made to match this. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md: unlock mddev before reap sync_thread in action_storeGuoqing Jiang2-2/+18
Since the bug which commit 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held") fixed is related with action_store path, other callers which reap sync_thread didn't need to be changed. Let's pull md_unregister_thread from md_reap_sync_thread, then fix previous bug with belows. 1. unlock mddev before md_reap_sync_thread in action_store. 2. save reshape_position before unlock, then restore it to ensure position not changed accidentally by others. Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md: Explicitly create command-line configured devicesChris Webb3-1/+3
Boot-time assembly of arrays with md= command-line arguments breaks when CONFIG_BLOCK_LEGACY_AUTOLOAD is unset. md_setup_drive() in md-autodetect.c calls blkdev_get_by_dev(), assuming this implicitly creates the block device. Fix this by attempting to md_alloc() the array first. As in the probe path, ignore any error as failure is caught by blkdev_get_by_dev() anyway. Signed-off-by: Chris Webb <chris@arachsys.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md: Notify sysfs sync_completed in md_reap_sync_thread()Logan Gunthorpe1-0/+1
The mdadm test 07layouts randomly produces a kernel hung task deadlock. The deadlock is caused by the suspend_lo/suspend_hi files being set by the mdadm background process during reshape and not being cleared because the process hangs. (Leaving aside the issue of the fragility of freezing kernel tasks by buggy userspace processes...) When the background mdadm process hangs it, is waiting (without a timeout) on a change to the sync_completed file signalling that the reshape has completed. The process is woken up a couple times when the reshape finishes but it is woken up before MD_RECOVERY_RUNNING is cleared so sync_completed_show() reports 0 instead of "none". To fix this, notify the sysfs file in md_reap_sync_thread() after MD_RECOVERY_RUNNING has been cleared. This wakes up mdadm and causes it to continue and write to suspend_lo/suspend_hi to allow IO to continue. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md: Ensure resync is reported after it startsLogan Gunthorpe1-2/+12
The 07layouts test in mdadm fails on some systems. The failure presents itself as the backup file not being removed before the next layout is grown into: mdadm: /dev/md0: cannot create backup file /tmp/md-test-backup: File exists This is because the background mdadm process, which is responsible for cleaning up this backup file gets into an infinite loop waiting for the reshape to start. mdadm checks the mdstat file if a reshape is going and, if it is not, it waits for an event on the file or times out in 5 seconds. On faster machines, the reshape may complete before the 5 seconds times out, and thus the background mdadm process loops waiting for a reshape to start that has already occurred. mdadm reads the mdstat file to start, but mdstat does not report that the reshape has begun, even though it has indeed begun. So the mdstat_wait() call (in mdadm) which polls on the mdstat file won't ever return until timing out. The reason mdstat reports the reshape has started is due to an issue in status_resync(). recovery_active is subtracted from curr_resync which will result in a value of zero for the first chunk of reshaped data, and the resulting read will report no reshape in progress. To fix this, if "resync - recovery_active" is an overloaded value, force the value to be MD_RESYNC_ACTIVE so the code reports a resync in progress. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md: Use enum for overloaded magic numbers used by mddev->curr_resyncLogan Gunthorpe2-22/+33
Comments in the code document special values used for mddev->curr_resync. Make this clearer by using an enum to label these values. The only functional change is a couple places use the wrong comparison operator that implied 3 is another special value. They are all fixed to imply that 3 or greater is an active resync. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5-cache: Annotate pslot with __rcu notationLogan Gunthorpe1-2/+2
radix_tree_lookup_slot() and radix_tree_replace_slot() API expect the slot returned and looked up to be marked with __rcu. Otherwise sparse warnings are generated: drivers/md/raid5-cache.c:2939:23: warning: incorrect type in assignment (different address spaces) drivers/md/raid5-cache.c:2939:23: expected void **pslot drivers/md/raid5-cache.c:2939:23: got void [noderef] __rcu ** Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5-cache: Clear conf->log after finishing workLogan Gunthorpe1-2/+3
A NULL pointer dereferlence on conf->log is seen randomly with the mdadm test 21raid5cache. Kasan reporst: BUG: KASAN: null-ptr-deref in r5l_reclaimable_space+0xf5/0x140 Read of size 8 at addr 0000000000000860 by task md0_reclaim/3086 Call Trace: dump_stack_lvl+0x5a/0x74 kasan_report.cold+0x5f/0x1a9 __asan_load8+0x69/0x90 r5l_reclaimable_space+0xf5/0x140 r5l_do_reclaim+0xf4/0x5e0 r5l_reclaim_thread+0x69/0x3b0 md_thread+0x1a2/0x2c0 kthread+0x177/0x1b0 ret_from_fork+0x22/0x30 This is caused by conf->log being cleared in r5l_exit_log() before stopping the reclaim thread. To fix this, clear conf->log after the reclaim_thread is unregistered and after flushing disable_writeback_work. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5-cache: Drop RCU usage of conf->logLogan Gunthorpe1-11/+5
The only place that uses RCU to access conf->log is in r5l_log_disk_error(). This function is mostly used in the IO path and once with mddev_lock() held in raid5_change_consistency_policy(). It is known that the IO will be suspended before the log is freed and r5l_log_exit() is called with the mddev_lock() held. This should mean that conf->log can not be freed while the function is being called, so the RCU protection is not necessary. Drop the rcu_read_lock() as well as the synchronize_rcu() and rcu_assign_pointer() usage. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5-cache: Take mddev_lock in r5c_journal_mode_show()Logan Gunthorpe1-6/+9
The mddev->lock spinlock doesn't protect against the removal of conf->log in r5l_exit_log() so conf->log may be freed before it is used. To fix this, take the mddev_lock() insteaad of the mddev->lock spinlock. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: suspend the array for calls to log_exit()Logan Gunthorpe1-1/+4
The raid5-cache code relies on there being no IO in flight when log_exit() is called. There are two places where this is not guaranteed so add mddev_suspend() and mddev_resume() calls to these sites. The site in raid5_change_consistency_policy() is in the error path, and another similar call site already has suspend/resume calls just below it; so it should be equally safe to make that change here. There is one remaining site in raid5_remove_disk() that we call log_exit() without suspending the array. Unfortunately, as the comment stated, we cannot call mddev_suspend from raid5d. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5-ppl: Drop unused argument from ppl_handle_flush_request()Logan Gunthorpe2-3/+3
ppl_handle_flush_request() takes an struct r5log argument but doesn't use it. It has no buisiness taking this argument as it is only used by raid5-cache and has no way to derference it anyway. Remove the argument. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5-log: Drop extern decorators for function prototypesLogan Gunthorpe1-39/+36
extern is not necessary and recommended against when defining prototype functions in headers. checkpatch.pl complains about these. So remove them. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02Merge tag 'hardening-v5.20-rc1' of ↵Linus Torvalds4-1/+112
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: - Fix Sparse warnings with randomizd kstack (GONG, Ruiqi) - Replace uintptr_t with unsigned long in usercopy (Jason A. Donenfeld) - Fix Clang -Wforward warning in LKDTM (Justin Stitt) - Fix comment to correctly refer to STRICT_DEVMEM (Lukas Bulwahn) - Introduce dm-verity binding logic to LoadPin LSM (Matthias Kaehlcke) - Clean up warnings and overflow and KASAN tests (Kees Cook) * tag 'hardening-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: dm: verity-loadpin: Drop use of dm_table_get_num_targets() kasan: test: Silence GCC 12 warnings drivers: lkdtm: fix clang -Wformat warning x86: mm: refer to the intended config STRICT_DEVMEM in a comment dm: verity-loadpin: Use CONFIG_SECURITY_LOADPIN_VERITY for conditional compilation LoadPin: Enable loading from trusted dm-verity devices dm: Add verity helpers for LoadPin stack: Declare {randomize_,}kstack_offset to fix Sparse warnings lib: overflow: Do not define 64-bit tests on 32-bit MAINTAINERS: Add a general "kernel hardening" section usercopy: use unsigned long instead of uintptr_t
2022-08-02Merge tag 'for-6.0/dm-changes' of ↵Linus Torvalds19-394/+676
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Refactor DM core's mempool allocation so that it clearer by not being split acorss files. - Improve DM core's BLK_STS_DM_REQUEUE and BLK_STS_AGAIN handling. - Optimize DM core's more common bio splitting by eliminating the use of bio cloning with bio_split+bio_chain. Shift that cloning cost to the relatively unlikely dm_io requeue case that only occurs during error handling. Introduces dm_io_rewind() that will clone a bio that reflects the subset of the original bio that must be requeued. - Remove DM core's dm_table_get_num_targets() wrapper and audit all dm_table_get_target() callers. - Fix potential for OOM with DM writecache target by setting a default MAX_WRITEBACK_JOBS (set to 256MiB or 1/16 of total system memory, whichever is smaller). - Fix DM writecache target's stats that are reported through DM-specific table info. - Fix use-after-free crash in dm_sm_register_threshold_callback(). - Refine DM core's Persistent Reservation handling in preparation for broader work Mike Christie is doing to add compatibility with Microsoft Windows Failover Cluster. - Fix various KASAN reported bugs in the DM raid target. - Fix DM raid target crash due to md_handle_request() bio splitting that recurses to block core without properly initializing the bio's bi_dev. - Fix some code comment typos and fix some Documentation formatting. * tag 'for-6.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits) dm: fix dm-raid crash if md_handle_request() splits bio dm raid: fix address sanitizer warning in raid_resume dm raid: fix address sanitizer warning in raid_status dm: Start pr_preempt from the same starting path dm: Fix PR release handling for non All Registrants dm: Start pr_reserve from the same starting path dm: Allow dm_call_pr to be used for path searches dm: return early from dm_pr_call() if DM device is suspended dm thin: fix use-after-free crash in dm_sm_register_threshold_callback dm writecache: count number of blocks discarded, not number of discard bios dm writecache: count number of blocks written, not number of write bios dm writecache: count number of blocks read, not number of read bios dm writecache: return void from functions dm kcopyd: use __GFP_HIGHMEM when allocating pages dm writecache: set a default MAX_WRITEBACK_JOBS Documentation: dm writecache: Render status list as list Documentation: dm writecache: add blank line before optional parameters dm snapshot: fix typo in snapshot_map() comment dm raid: remove redundant "the" in parse_raid_params() comment dm cache: fix typo in 2 comment blocks ...
2022-08-02Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-blockLinus Torvalds26-268/+254
Pull block updates from Jens Axboe: - Improve the type checking of request flags (Bart) - Ensure queue mapping for a single queues always picks the right queue (Bart) - Sanitize the io priority handling (Jan) - rq-qos race fix (Jinke) - Reserved tags handling improvements (John) - Separate memory alignment from file/disk offset aligment for O_DIRECT (Keith) - Add new ublk driver, userspace block driver using io_uring for communication with the userspace backend (Ming) - Use try_cmpxchg() to cleanup the code in various spots (Uros) - Finally remove bdevname() (Christoph) - Clean up the zoned device handling (Christoph) - Clean up independent access range support (Christoph) - Clean up and improve block sysfs handling (Christoph) - Clean up and improve teardown of block devices. This turns the usual two step process into something that is simpler to implement and handle in block drivers (Christoph) - Clean up chunk size handling (Christoph) - Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu, Ming, Sebastian, Yang, Ying) * tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits) ublk_drv: fix double shift bug ublk_drv: make sure that correct flags(features) returned to userspace ublk_drv: fix error handling of ublk_add_dev ublk_drv: fix lockdep warning block: remove __blk_get_queue block: call blk_mq_exit_queue from disk_release for never added disks blk-mq: fix error handling in __blk_mq_alloc_disk ublk: defer disk allocation ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask ublk: fold __ublk_create_dev into ublk_ctrl_add_dev ublk: cleanup ublk_ctrl_uring_cmd ublk: simplify ublk_ch_open and ublk_ch_release ublk: remove the empty open and release block device operations ublk: remove UBLK_IO_F_PREFLUSH ublk: add a MAINTAINERS entry block: don't allow the same type rq_qos add more than once mmc: fix disk/queue leak in case of adding disk failure ublk_drv: fix an IS_ERR() vs NULL check ublk: remove UBLK_IO_F_INTEGRITY ublk_drv: remove unneeded semicolon ...
2022-07-28dm: verity-loadpin: Drop use of dm_table_get_num_targets()Matthias Kaehlcke1-1/+2
Commit 2aec377a2925 ("dm table: remove dm_table_get_num_targets() wrapper") in linux-dm/for-next removed the function dm_table_get_num_targets() which is used by verity-loadpin. Access table->num_targets directly instead of using the defunct wrapper. Fixes: b6c1c5745ccc ("dm: Add verity helpers for LoadPin") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220728085412.1.I242d21b378410eb6f9897a3160efb56e5608c59d@changeid
2022-07-28dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flagNathan Huckleberry1-3/+19
Add an optional flag that ensures dm_bufio_client does not sleep (primary focus is to service dm_bufio_get without sleeping). This allows the dm-bufio cache to be queried from interrupt context. To ensure that dm-bufio does not sleep, dm-bufio must use a spinlock instead of a mutex. Additionally, to avoid deadlocks, special care must be taken so that dm-bufio does not sleep while holding the spinlock. But again: the scope of this no_sleep is initially confined to dm_bufio_get, so __alloc_buffer_wait_no_callback is _not_ changed to avoid sleeping because __bufio_new avoids allocation for NF_GET. Signed-off-by: Nathan Huckleberry <nhuck@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28dm bufio: Add flags argument to dm_bufio_client_createNathan Huckleberry7-8/+11
Add a flags argument to dm_bufio_client_create and update all the callers. This is in preparation to add the DM_BUFIO_NO_SLEEP flag. Signed-off-by: Nathan Huckleberry <nhuck@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28dm: fix dm-raid crash if md_handle_request() splits bioMike Snitzer2-7/+7
Commit ca522482e3eaf ("dm: pass NULL bdev to bio_alloc_clone") introduced the optimization to _not_ perform bio_associate_blkg()'s relatively costly work when DM core clones its bio. But in doing so it exposed the possibility for DM's cloned bio to alter DM target behavior (e.g. crash) if a target were to issue IO without first calling bio_set_dev(). The DM raid target can trigger an MD crash due to its need to split the DM bio that is passed to md_handle_request(). The split will recurse to submit_bio_noacct() using a bio with an uninitialized ->bi_blkg. This NULL bio->bi_blkg causes blk_throtl_bio() to dereference a NULL blkg_to_tg(bio->bi_blkg). Fix this in DM core by adding a new 'needs_bio_set_dev' target flag that will make alloc_tio() call bio_set_dev() on behalf of the target. dm-raid is the only target that requires this flag. bio_set_dev() initializes the DM cloned bio's ->bi_blkg, using bio_associate_blkg, before passing the bio to md_handle_request(). Long-term fix would be to audit and refactor MD code to rely on DM to split its bio, using dm_accept_partial_bio(), but there are MD raid personalities (e.g. raid1 and raid10) whose implementation are tightly coupled to handling the bio splitting inline. Fixes: ca522482e3eaf ("dm: pass NULL bdev to bio_alloc_clone") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28dm raid: fix address sanitizer warning in raid_resumeMikulas Patocka1-1/+1
There is a KASAN warning in raid_resume when running the lvm test lvconvert-raid.sh. The reason for the warning is that mddev->raid_disks is greater than rs->raid_disks, so the loop touches one entry beyond the allocated length. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28dm raid: fix address sanitizer warning in raid_statusMikulas Patocka1-1/+1
There is this warning when using a kernel with the address sanitizer and running this testsuite: https://gitlab.com/cki-project/kernel-tests/-/tree/main/storage/swraid/scsi_raid ================================================================== BUG: KASAN: slab-out-of-bounds in raid_status+0x1747/0x2820 [dm_raid] Read of size 4 at addr ffff888079d2c7e8 by task lvcreate/13319 CPU: 0 PID: 13319 Comm: lvcreate Not tainted 5.18.0-0.rc3.<snip> #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <TASK> dump_stack_lvl+0x6a/0x9c print_address_description.constprop.0+0x1f/0x1e0 print_report.cold+0x55/0x244 kasan_report+0xc9/0x100 raid_status+0x1747/0x2820 [dm_raid] dm_ima_measure_on_table_load+0x4b8/0xca0 [dm_mod] table_load+0x35c/0x630 [dm_mod] ctl_ioctl+0x411/0x630 [dm_mod] dm_ctl_ioctl+0xa/0x10 [dm_mod] __x64_sys_ioctl+0x12a/0x1a0 do_syscall_64+0x5b/0x80 The warning is caused by reading conf->max_nr_stripes in raid_status. The code in raid_status reads mddev->private, casts it to struct r5conf and reads the entry max_nr_stripes. However, if we have different raid type than 4/5/6, mddev->private doesn't point to struct r5conf; it may point to struct r0conf, struct r1conf, struct r10conf or struct mpconf. If we cast a pointer to one of these structs to struct r5conf, we will be reading invalid memory and KASAN warns about it. Fix this bug by reading struct r5conf only if raid type is 4, 5 or 6. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28dm: Start pr_preempt from the same starting pathMike Christie1-14/+31
pr_preempt has a similar issue as reserve where for all the reservation types except the All Registrants ones the preempt can create a reservation. And a follow up reservation or release needs to go down the same path the preempt did. This has the pr_preempt work like reserve and release where we always start from the first path in the first group. This commit has been tested with windows failover clustering's validation test and libiscsi's PGR tests to check for regressions. They both don't have tests to verify this case, so I tested it manually. Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28dm: Fix PR release handling for non All RegistrantsMike Christie1-14/+34
This commit fixes a bug where we are leaving the reservation in place even though pr_release has run and returned success. If we have a Write Exclusive, Exclusive Access, or Write/Exclusive Registrants only reservation, the release must be sent down the path that is the reservation holder. The problem is multipath_prepare_ioctl most likely selected path N for the reservation, then later when we do the release multipath_prepare_ioctl will select a completely different path. The device will then return success becuase the nvme and scsi specs say to return success if there is no reservation or if the release is sent down from a path that is not the holder. We then think we have released the reservation. This commit has us loop over each path and send a release so we can make sure the release is executed on the correct path. It has been tested with windows failover clustering's validation test which checks this case, and it has been tested manually (the libiscsi PGR tests don't have a test case for this yet, but I will be adding one). Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28dm: Start pr_reserve from the same starting pathMike Christie1-14/+32
When an app does a pr_reserve it will go to whatever path we happen to be using at the time. This can result in errors when the app does a second pr_reserve call and expects success but gets a failure because the reserve is not done on the holder's path. This commit has us always start trying to do reserves from the first path in the first group. Windows failover clustering will produce the type of pattern above. With this commit, we will now pass its validation test for this case. Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28dm: Allow dm_call_pr to be used for path searchesMike Christie1-12/+38
The specs state that if you send a reserve down a path that is already the holder success must be returned and if it goes down a path that is not the holder reservation conflict must be returned. Windows failover clustering will send a second reservation and expects that a device returns success. The problem for multipathing is that for an All Registrants reservation, we can send the reserve down any path but for all other reservation types there is one path that is the holder. To handle this we could add PR state to dm but that can get nasty. Look at target_core_pr.c for an example of the type of things we'd have to track. It will also get more complicated because other initiators can change the state so we will have to add in async event/sense handling. This commit, and the 3 commits that follow, tries to keep dm simple and keep just doing passthrough. This commit modifies dm_call_pr to be able to find the first usable path that can execute our pr_op then return. When dm_pr_reserve is converted to dm_call_pr in the next commit for the normal case we will use the same path for every reserve. Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28dm: return early from dm_pr_call() if DM device is suspendedMike Snitzer1-0/+5
Otherwise PR ops may be issued while the broader DM device is being reconfigured, etc. Fixes: 9c72bad1f31a ("dm: call PR reserve/unreserve on each underlying device") Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-22Merge tag 'block-5.19-2022-07-21' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+3
Pull block fix from Jens Axboe: "Just a single fix for missing error propagation for an allocation failure in raid5" * tag 'block-5.19-2022-07-21' of git://git.kernel.dk/linux-block: md/raid5: missing error code in setup_conf()
2022-07-19md/raid5: missing error code in setup_conf()Dan Carpenter1-1/+3
Return -ENOMEM if the allocation fails. Don't return success. Fixes: 8fbcba6b999b ("md/raid5: Cleanup setup_conf() error returns") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
2022-07-17dax: introduce holder for dax_deviceShiyang Ruan1-1/+1
Patch series "v14 fsdax-rmap + v11 fsdax-reflink", v2. The patchset fsdax-rmap is aimed to support shared pages tracking for fsdax. It moves owner tracking from dax_assocaite_entry() to pmem device driver, by introducing an interface ->memory_failure() for struct pagemap. This interface is called by memory_failure() in mm, and implemented by pmem device. Then call holder operations to find the filesystem which the corrupted data located in, and call filesystem handler to track files or metadata associated with this page. Finally we are able to try to fix the corrupted data in filesystem and do other necessary processing, such as killing processes who are using the files affected. The call trace is like this: memory_failure() |* fsdax case |------------ |pgmap->ops->memory_failure() => pmem_pgmap_memory_failure() | dax_holder_notify_failure() => | dax_device->holder_ops->notify_failure() => | - xfs_dax_notify_failure() | |* xfs_dax_notify_failure() | |-------------------------- | | xfs_rmap_query_range() | | xfs_dax_failure_fn() | | * corrupted on metadata | | try to recover data, call xfs_force_shutdown() | | * corrupted on file data | | try to recover data, call mf_dax_kill_procs() |* normal case |------------- |mf_generic_kill_procs() The patchset fsdax-reflink attempts to add CoW support for fsdax, and takes XFS, which has both reflink and fsdax features, as an example. One of the key mechanisms needed to be implemented in fsdax is CoW. Copy the data from srcmap before we actually write data to the destination iomap. And we just copy range in which data won't be changed. Another mechanism is range comparison. In page cache case, readpage() is used to load data on disk to page cache in order to be able to compare data. In fsdax case, readpage() does not work. So, we need another compare data with direct access support. With the two mechanisms implemented in fsdax, we are able to make reflink and fsdax work together in XFS. This patch (of 14): To easily track filesystem from a pmem device, we introduce a holder for dax_device structure, and also its operation. This holder is used to remember who is using this dax_device: - When it is the backend of a filesystem, the holder will be the instance of this filesystem. - When this pmem device is one of the targets in a mapped device, the holder will be this mapped device. In this case, the mapped device has its own dax_device and it will follow the first rule. So that we can finally track to the filesystem we needed. The holder and holder_ops will be set when filesystem is being mounted, or an target device is being activated. Link: https://lkml.kernel.org/r/20220603053738.1218681-1-ruansy.fnst@fujitsu.com Link: https://lkml.kernel.org/r/20220603053738.1218681-2-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.wiliams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-15dm thin: fix use-after-free crash in dm_sm_register_threshold_callbackLuo Meng2-3/+8
Fault inject on pool metadata device reports: BUG: KASAN: use-after-free in dm_pool_register_metadata_threshold+0x40/0x80 Read of size 8 at addr ffff8881b9d50068 by task dmsetup/950 CPU: 7 PID: 950 Comm: dmsetup Tainted: G W 5.19.0-rc6 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 print_address_description.constprop.0.cold+0xeb/0x3f4 kasan_report.cold+0xe6/0x147 dm_pool_register_metadata_threshold+0x40/0x80 pool_ctr+0xa0a/0x1150 dm_table_add_target+0x2c8/0x640 table_load+0x1fd/0x430 ctl_ioctl+0x2c4/0x5a0 dm_ctl_ioctl+0xa/0x10 __x64_sys_ioctl+0xb3/0xd0 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 This can be easily reproduced using: echo offline > /sys/block/sda/device/state dd if=/dev/zero of=/dev/mapper/thin bs=4k count=10 dmsetup load pool --table "0 20971520 thin-pool /dev/sda /dev/sdb 128 0 0" If a metadata commit fails, the transaction will be aborted and the metadata space maps will be destroyed. If a DM table reload then happens for this failed thin-pool, a use-after-free will occur in dm_sm_register_threshold_callback (called from dm_pool_register_metadata_threshold). Fix this by in dm_pool_register_metadata_threshold() by returning the -EINVAL error if the thin-pool is in fail mode. Also fail pool_ctr() with a new error message: "Error registering metadata threshold". Fixes: ac8c3f3df65e4 ("dm thin: generate event when metadata threshold passed") Cc: stable@vger.kernel.org Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14dm writecache: count number of blocks discarded, not number of discard biosMikulas Patocka1-1/+1
Change dm-writecache, so that it counts the number of blocks discarded instead of the number of discard bios. Make it consistent with the read and write statistics counters that were changed to count the number of blocks instead of bios. Fixes: e3a35d03407c ("dm writecache: add event counters") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14dm writecache: count number of blocks written, not number of write biosMikulas Patocka1-3/+9
Change dm-writecache, so that it counts the number of blocks written instead of the number of write bios. Bios can be split and requeued using the dm_accept_partial_bio function, so counting bios caused inaccurate results. Fixes: e3a35d03407c ("dm writecache: add event counters") Reported-by: Yu Kuai <yukuai1@huaweicloud.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14dm writecache: count number of blocks read, not number of read biosMikulas Patocka1-0/+1
Change dm-writecache, so that it counts the number of blocks read instead of the number of read bios. Bios can be split and requeued using the dm_accept_partial_bio function, so counting bios caused inaccurate results. Fixes: e3a35d03407c ("dm writecache: add event counters") Reported-by: Yu Kuai <yukuai1@huaweicloud.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14dm writecache: return void from functionsMikulas Patocka1-13/+13
The functions writecache_map_remap_origin and writecache_bio_copy_ssd only return a single value, thus they can be made to return void. This helps simplify the following IO accounting changes. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14dm kcopyd: use __GFP_HIGHMEM when allocating pagesMikulas Patocka1-1/+1
dm-kcopyd doesn't access the allocated pages directly, it only passes them to dm-io which adds them to a bio list - thus, we can allocate the pages from high memory. This will reduce pressure on the low memory when there are a large number of kcopyd jobs in progress. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14dm writecache: set a default MAX_WRITEBACK_JOBSMikulas Patocka1-1/+1
dm-writecache has the capability to limit the number of writeback jobs in progress. However, this feature was off by default. As such there were some out-of-memory crashes observed when lowering the low watermark while the cache is full. This commit enables writeback limit by default. It is set to 256MiB or 1/16 of total system memory, whichever is smaller. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14fs/buffer: Combine two submit_bh() and ll_rw_block() argumentsBart Van Assche1-2/+2
Both submit_bh() and ll_rw_block() accept a request operation type and request flags as their first two arguments. Micro-optimize these two functions by combining these first two arguments into a single argument. This patch does not change the behavior of any of the modified code. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jan Kara <jack@suse.cz> Acked-by: Song Liu <song@kernel.org> (for the md changes) Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-48-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14md/raid5: Use the enum req_op and blk_opf_t typesBart Van Assche1-1/+2
Improve static type checking by using the enum req_op type for variables that represent a request operation and the new blk_opf_t type for variables that represent request flags. Acked-by: Song Liu <song@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-37-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14md/raid10: Use the new blk_opf_t typeBart Van Assche1-5/+5
Improve static type checking by using the new blk_opf_t type for variables that represent a request flags. Acked-by: Song Liu <song@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-36-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>