path: root/drivers/md/dm-raid.c
2022-10-18  dm raid: fix typo in analyse_superblocks code comment  (Jiangshan Yi; 1 file, -1/+1)
Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

2022-10-18  dm raid: delete the redundant word 'that' in comment  (Jilin Yuan; 1 file, -1/+1)
Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

2022-08-02  md: unlock mddev before reap sync_thread in action_store  (Guoqing Jiang; 1 file, -0/+1)
Since the bug fixed by commit 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held") is related to the action_store path, other callers which reap sync_thread don't need to be changed. Pull md_unregister_thread out of md_reap_sync_thread, then fix the previous bug as follows:

1. unlock mddev before md_reap_sync_thread in action_store.
2. save reshape_position before unlocking, then restore it to ensure the position is not changed accidentally by others.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

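A userspace pthread sketch of the locking pattern described above (illustrative stand-ins, not MD code; `worker` and `position` are hypothetical names): reap the worker only after dropping the lock it needs, saving and restoring the state that others may change in between.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long position = 42;              /* stands in for mddev->reshape_position */

static void *worker(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);      /* the worker needs the lock to finish */
        position = 0;                   /* ... and may change shared state     */
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t thr;

        pthread_mutex_lock(&lock);
        pthread_create(&thr, NULL, worker, NULL);

        long saved = position;          /* 1. save state while still locked      */
        pthread_mutex_unlock(&lock);    /* 2. unlock before reaping              */
        pthread_join(thr, NULL);        /*    joining with `lock` held deadlocks */

        pthread_mutex_lock(&lock);
        position = saved;               /* 3. restore state changed meanwhile    */
        pthread_mutex_unlock(&lock);

        printf("position restored to %ld\n", position);
        return 0;
}
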
2022-08-02  Merge tag 'for-6.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm  (Linus Torvalds; 1 file, -3/+4)
Pull device mapper updates from Mike Snitzer:
- Refactor DM core's mempool allocation so that it is clearer by not being split across files.
- Improve DM core's BLK_STS_DM_REQUEUE and BLK_STS_AGAIN handling.
- Optimize DM core's more common bio splitting by eliminating the use of bio cloning with bio_split+bio_chain. Shift that cloning cost to the relatively unlikely dm_io requeue case that only occurs during error handling. Introduces dm_io_rewind() that will clone a bio that reflects the subset of the original bio that must be requeued.
- Remove DM core's dm_table_get_num_targets() wrapper and audit all dm_table_get_target() callers.
- Fix potential for OOM with DM writecache target by setting a default MAX_WRITEBACK_JOBS (set to 256MiB or 1/16 of total system memory, whichever is smaller).
- Fix DM writecache target's stats that are reported through DM-specific table info.
- Fix use-after-free crash in dm_sm_register_threshold_callback().
- Refine DM core's Persistent Reservation handling in preparation for broader work Mike Christie is doing to add compatibility with Microsoft Windows Failover Cluster.
- Fix various KASAN reported bugs in the DM raid target.
- Fix DM raid target crash due to md_handle_request() bio splitting that recurses to block core without properly initializing the bio's bi_dev.
- Fix some code comment typos and fix some Documentation formatting.

* tag 'for-6.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits)
  dm: fix dm-raid crash if md_handle_request() splits bio
  dm raid: fix address sanitizer warning in raid_resume
  dm raid: fix address sanitizer warning in raid_status
  dm: Start pr_preempt from the same starting path
  dm: Fix PR release handling for non All Registrants
  dm: Start pr_reserve from the same starting path
  dm: Allow dm_call_pr to be used for path searches
  dm: return early from dm_pr_call() if DM device is suspended
  dm thin: fix use-after-free crash in dm_sm_register_threshold_callback
  dm writecache: count number of blocks discarded, not number of discard bios
  dm writecache: count number of blocks written, not number of write bios
  dm writecache: count number of blocks read, not number of read bios
  dm writecache: return void from functions
  dm kcopyd: use __GFP_HIGHMEM when allocating pages
  dm writecache: set a default MAX_WRITEBACK_JOBS
  Documentation: dm writecache: Render status list as list
  Documentation: dm writecache: add blank line before optional parameters
  dm snapshot: fix typo in snapshot_map() comment
  dm raid: remove redundant "the" in parse_raid_params() comment
  dm cache: fix typo in 2 comment blocks
  ...

2022-08-02  Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block  (Linus Torvalds; 1 file, -1/+1)
Pull block updates from Jens Axboe:
- Improve the type checking of request flags (Bart)
- Ensure queue mapping for a single queue always picks the right queue (Bart)
- Sanitize the io priority handling (Jan)
- rq-qos race fix (Jinke)
- Reserved tags handling improvements (John)
- Separate memory alignment from file/disk offset alignment for O_DIRECT (Keith)
- Add new ublk driver, userspace block driver using io_uring for communication with the userspace backend (Ming)
- Use try_cmpxchg() to cleanup the code in various spots (Uros)
- Finally remove bdevname() (Christoph)
- Clean up the zoned device handling (Christoph)
- Clean up independent access range support (Christoph)
- Clean up and improve block sysfs handling (Christoph)
- Clean up and improve teardown of block devices. This turns the usual two step process into something that is simpler to implement and handle in block drivers (Christoph)
- Clean up chunk size handling (Christoph)
- Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu, Ming, Sebastian, Yang, Ying)

* tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits)
  ublk_drv: fix double shift bug
  ublk_drv: make sure that correct flags(features) returned to userspace
  ublk_drv: fix error handling of ublk_add_dev
  ublk_drv: fix lockdep warning
  block: remove __blk_get_queue
  block: call blk_mq_exit_queue from disk_release for never added disks
  blk-mq: fix error handling in __blk_mq_alloc_disk
  ublk: defer disk allocation
  ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask
  ublk: fold __ublk_create_dev into ublk_ctrl_add_dev
  ublk: cleanup ublk_ctrl_uring_cmd
  ublk: simplify ublk_ch_open and ublk_ch_release
  ublk: remove the empty open and release block device operations
  ublk: remove UBLK_IO_F_PREFLUSH
  ublk: add a MAINTAINERS entry
  block: don't allow the same type rq_qos add more than once
  mmc: fix disk/queue leak in case of adding disk failure
  ublk_drv: fix an IS_ERR() vs NULL check
  ublk: remove UBLK_IO_F_INTEGRITY
  ublk_drv: remove unneeded semicolon
  ...

2022-07-28  dm: fix dm-raid crash if md_handle_request() splits bio  (Mike Snitzer; 1 file, -0/+1)
Commit ca522482e3eaf ("dm: pass NULL bdev to bio_alloc_clone") introduced the optimization to _not_ perform bio_associate_blkg()'s relatively costly work when DM core clones its bio. But in doing so it exposed the possibility for DM's cloned bio to alter DM target behavior (e.g. crash) if a target were to issue IO without first calling bio_set_dev().

The DM raid target can trigger an MD crash due to its need to split the DM bio that is passed to md_handle_request(). The split will recurse to submit_bio_noacct() using a bio with an uninitialized ->bi_blkg. This NULL bio->bi_blkg causes blk_throtl_bio() to dereference a NULL blkg_to_tg(bio->bi_blkg).

Fix this in DM core by adding a new 'needs_bio_set_dev' target flag that will make alloc_tio() call bio_set_dev() on behalf of the target. dm-raid is the only target that requires this flag. bio_set_dev() initializes the DM cloned bio's ->bi_blkg, using bio_associate_blkg, before passing the bio to md_handle_request().

A long-term fix would be to audit and refactor MD code to rely on DM to split its bio, using dm_accept_partial_bio(), but there are MD raid personalities (e.g. raid1 and raid10) whose implementations are tightly coupled to handling the bio splitting inline.

Fixes: ca522482e3eaf ("dm: pass NULL bdev to bio_alloc_clone")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

2022-07-28  dm raid: fix address sanitizer warning in raid_resume  (Mikulas Patocka; 1 file, -1/+1)
There is a KASAN warning in raid_resume when running the lvm test lvconvert-raid.sh. The reason for the warning is that mddev->raid_disks is greater than rs->raid_disks, so the loop touches one entry beyond the allocated length.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

2022-07-28  dm raid: fix address sanitizer warning in raid_status  (Mikulas Patocka; 1 file, -1/+1)
There is this warning when using a kernel with the address sanitizer and running this testsuite: https://gitlab.com/cki-project/kernel-tests/-/tree/main/storage/swraid/scsi_raid

==================================================================
BUG: KASAN: slab-out-of-bounds in raid_status+0x1747/0x2820 [dm_raid]
Read of size 4 at addr ffff888079d2c7e8 by task lvcreate/13319
CPU: 0 PID: 13319 Comm: lvcreate Not tainted 5.18.0-0.rc3.<snip> #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
 <TASK>
 dump_stack_lvl+0x6a/0x9c
 print_address_description.constprop.0+0x1f/0x1e0
 print_report.cold+0x55/0x244
 kasan_report+0xc9/0x100
 raid_status+0x1747/0x2820 [dm_raid]
 dm_ima_measure_on_table_load+0x4b8/0xca0 [dm_mod]
 table_load+0x35c/0x630 [dm_mod]
 ctl_ioctl+0x411/0x630 [dm_mod]
 dm_ctl_ioctl+0xa/0x10 [dm_mod]
 __x64_sys_ioctl+0x12a/0x1a0
 do_syscall_64+0x5b/0x80

The warning is caused by reading conf->max_nr_stripes in raid_status. The code in raid_status reads mddev->private, casts it to struct r5conf and reads the entry max_nr_stripes. However, if we have a different raid type than 4/5/6, mddev->private doesn't point to struct r5conf; it may point to struct r0conf, struct r1conf, struct r10conf or struct mpconf. If we cast a pointer to one of these structs to struct r5conf, we will be reading invalid memory and KASAN warns about it.

Fix this bug by reading struct r5conf only if the raid type is 4, 5 or 6.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

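A self-contained illustration of this bug class and the shape of the fix (the struct and helper names below are stand-ins, not the MD/dm-raid types): only interpret a type-dependent "private" pointer after checking the level it was allocated for.

#include <stdio.h>

struct r5_like { int max_nr_stripes; };   /* stands in for struct r5conf */
struct r1_like { int something_else; };   /* stands in for struct r1conf */

struct dev {
        int   level;        /* 0, 1, 4, 5, 6, 10 ... */
        void *private;      /* layout depends on level */
};

static int stripe_cache_size(const struct dev *d)
{
        /* The fix: only treat ->private as the raid4/5/6 variant when the
         * level says so; report 0 for every other personality. */
        if (d->level == 4 || d->level == 5 || d->level == 6) {
                const struct r5_like *conf = d->private;
                return conf ? conf->max_nr_stripes : 0;
        }
        return 0;
}

int main(void)
{
        struct r1_like r1 = { .something_else = 7 };
        struct dev d = { .level = 1, .private = &r1 };

        /* Without the level check this would read r1 as if it were r5_like. */
        printf("stripe cache: %d\n", stripe_cache_size(&d));
        return 0;
}
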
2022-07-14  md/core: Combine two sync_page_io() arguments  (Bart Van Assche; 1 file, -1/+1)
Improve uniformity in the kernel's handling of request operation and flags by passing these as a single argument.

Cc: Song Liu <song@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-32-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-07-07  dm raid: remove redundant "the" in parse_raid_params() comment  (Jiang Jian; 1 file, -1/+1)
Signed-off-by: Jiang Jian <jiangjian@cdjrlc.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

2022-06-27  dm raid: fix accesses beyond end of raid member array  (Heinz Mauelshagen; 1 file, -16/+18)
On dm-raid table load (using raid_ctr), dm-raid allocates an array rs->devs[rs->raid_disks] for the raid device members. rs->raid_disks is defined by the number of raid metadata and image tuples passed into the target's constructor.

In the case of RAID layout changes being requested, that number can be different from the current number of members for existing raid sets as defined in their superblocks. Example RAID layout changes include:
- raid1 legs being added/removed
- raid4/5/6/10 number of stripes changed (stripe reshaping)
- takeover to higher raid level (e.g. raid5 -> raid6)

When accessing array members, rs->raid_disks must be used in control loops instead of the potentially larger value in rs->md.raid_disks. Otherwise it will cause memory access beyond the end of the rs->devs array.

Fix this by changing code that is prone to out-of-bounds access. Also fix validate_raid_redundancy() to validate all devices that are added. Also, use braces to help clean up raid_iterate_devices().

The out-of-bounds memory accesses were discovered using KASAN.

This commit was verified to pass all LVM2 RAID tests (with KASAN enabled).

Cc: stable@vger.kernel.org
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

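A standalone sketch of the out-of-bounds pattern this commit fixes; the two counters mirror rs->raid_disks vs rs->md.raid_disks conceptually, but the code below is illustrative userspace C, not dm-raid.

#include <stdio.h>
#include <stdlib.h>

struct member { int present; };

int main(void)
{
        int allocated_disks  = 2;   /* like rs->raid_disks: size of the allocation   */
        int superblock_disks = 3;   /* like rs->md.raid_disks: may be larger during
                                       takeover/reshape                              */
        struct member *devs = calloc(allocated_disks, sizeof(*devs));
        if (!devs)
                return 1;

        /* Wrong: for (i = 0; i < superblock_disks; i++) would touch devs[2],
         * one entry past the allocation (the access KASAN reported).
         * Right: bound the loop by the counter the allocation was made with. */
        for (int i = 0; i < allocated_disks; i++)
                devs[i].present = 1;

        printf("initialised %d of %d superblock members\n",
               allocated_disks, superblock_disks);
        free(devs);
        return 0;
}
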
2022-06-15  Revert "md: don't unregister sync_thread with reconfig_mutex held"  (Guoqing Jiang; 1 file, -1/+1)
The 07reshape5intr test is broken because of the path below:

md_reap_sync_thread
  -> mddev_unlock
  -> md_unregister_thread(&mddev->sync_thread)

And md_check_recovery is triggered by:

mddev_unlock -> md_wakeup_thread(mddev->thread)

Then mddev->reshape_position is set to MaxSector in raid5_finish_reshape since MD_RECOVERY_INTR is cleared in md_check_recovery, which means feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and the superblock's reshape_position can't be updated accordingly.

Fixes: 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held")
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>

2022-05-22  md: don't unregister sync_thread with reconfig_mutex held  (Guoqing Jiang; 1 file, -1/+1)
Unregistering sync_thread doesn't need to hold reconfig_mutex since it doesn't reconfigure the array. And it could cause a deadlock problem for raid5 as follows:

1. process A tried to reap the sync thread with reconfig_mutex held after echoing idle to sync_action.
2. the raid5 sync thread was blocked because there were too many active stripes.
3. SB_CHANGE_PENDING was set (because of write IO coming from the upper layer), which means the number of active stripes can't be decreased.
4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able to hold reconfig_mutex.

More details in the link: https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t

Also add one parameter to md_reap_sync_thread since it could be called by dm-raid, which doesn't hold reconfig_mutex.

Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <song@kernel.org>

2022-04-17  block: remove QUEUE_FLAG_DISCARD  (Christoph Hellwig; 1 file, -7/+2)
Just use a non-zero max_discard_sectors as an indicator for discard support, similar to what is done for write zeroes.

The only place that needs special attention is the RAID5 driver, which must clear discard support for security reasons by default, even if the default stacking rules would allow for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390]
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2021-10-18  dm: use bdev_nr_sectors and bdev_nr_bytes instead of open coding them  (Christoph Hellwig; 1 file, -3/+3)
Use the proper helpers to read the block device size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20211018101130.1838532-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2021-08-10  dm: update target status functions to support IMA measurement  (Tushar Sugandhi; 1 file, -0/+39)
For device mapper targets to take advantage of IMA's measurement capabilities, the status functions for the individual targets need to be updated to handle the status_type_t case for value STATUSTYPE_IMA.

Update status functions for the following target types, to log their respective attributes to be measured using IMA:
01. cache
02. crypt
03. integrity
04. linear
05. mirror
06. multipath
07. raid
08. snapshot
09. striped
10. verity

For the rest of the targets, handle the STATUSTYPE_IMA case by setting the measurement buffer to NULL.

For IMA to measure the data on a given system, the IMA policy on the system needs to be updated to have the following line, and the system needs to be restarted for the measurements to take effect:

/etc/ima/ima-policy
    measure func=CRITICAL_DATA label=device-mapper template=ima-buf

The measurements will be reflected in the IMA logs, which are located at:

/sys/kernel/security/integrity/ima/ascii_runtime_measurements
/sys/kernel/security/integrity/ima/binary_runtime_measurements

These IMA logs can later be consumed by various attestation clients running on the system and sent to external services for attesting the system.

The DM target data measured by the IMA subsystem can alternatively be queried from userspace by setting DM_IMA_MEASUREMENT_FLAG with DM_TABLE_STATUS_CMD.

Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2021-04-30  dm raid: remove unnecessary discard limits for raid0 and raid10  (Mike Snitzer; 1 file, -9/+0)
Commit 29efc390b946 ("md/md0: optimize raid0 discard handling") and commit d30588b2731f ("md/raid10: improve raid10 discard request") remove MD raid0's and raid10's inability to properly handle large discards. So eliminate associated constraints from dm-raid's support.

Depends-on: 29efc390b946 ("md/md0: optimize raid0 discard handling")
Depends-on: d30588b2731f ("md/raid10: improve raid10 discard request")
Reported-by: Matthew Ruffell <matthew.ruffell@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2021-04-21  dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload sequences  (Heinz Mauelshagen; 1 file, -6/+28)
If fast table reloads occur during an ongoing reshape of raid4/5/6 devices, the target may race reading a superblock vs the MD resync thread, causing an inconclusive reshape state to be read in its constructor.

lvm2 test lvconvert-raid-reshape-stripes-load-reload.sh can cause a BUG_ON() to trigger in md_run(), e.g.: "kernel BUG at drivers/md/raid5.c:7567!".

Scenario triggering the bug:

1. the MD sync thread calls end_reshape() from raid5_sync_request() when done reshaping. However end_reshape() _only_ updates the reshape position to MaxSector, keeping the changed layout configuration though (i.e. any delta disks, chunk sector or RAID algorithm changes). That inconclusive configuration is stored in the superblock.

2. dm-raid constructs a mapping, loading the named inconsistent superblock as of step 1 before step 3 is able to finish resetting the reshape state completely, and calls md_run(), which leads to the mentioned bug in raid5.c.

3. the MD RAID personality's finish_reshape() is called, which resets the reshape information on chunk sectors, delta disks, etc.

This explains why the bug is rarely seen on multi-core machines, as MD's finish_reshape() superblock update races with the dm-raid constructor's superblock load in step 2.

The fix identifies inconclusive superblock content in the dm-raid constructor and resets it before calling md_run(), factoring the identifying checks out into rs_is_layout_change() to share in existing rs_reshape_requested() and new rs_reset_inconclusive_reshape(). Also enhance a comment and remove an empty line.

Cc: stable@vger.kernel.org
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2021-04-20  dm raid: fix fall-through warning in rs_check_takeover() for Clang  (Gustavo A. R. Silva; 1 file, -0/+1)
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

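A minimal illustration of the kind of change involved (generic code, not the actual rs_check_takeover() switch): when a case neither returns nor intentionally falls through, an explicit break keeps -Wimplicit-fallthrough quiet.

#include <stdio.h>

static const char *takeover(int level)
{
        switch (level) {
        case 0:
                return "raid0";
        case 10:
                printf("checking raid10 takeover\n");
                break;          /* explicit break: the case ends here instead of
                                   silently falling through to default */
        default:
                break;
        }
        return "unsupported";
}

int main(void)
{
        printf("%s\n", takeover(10));
        return 0;
}
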
2021-01-04  dm raid: fix discard limits for raid1  (Mike Snitzer; 1 file, -3/+3)
Block core warned that discard_granularity was 0 for dm-raid with personality of raid1. Reason is that raid_io_hints() was incorrectly special-casing raid1 rather than raid0.

Fix raid_io_hints() by removing discard limits settings for raid1. Check for raid0 instead.

Fixes: 61697a6abd24a ("dm: eliminate 'split_discard_bios' flag from DM target interface")
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Stephan Bärwolf <stephan@matrixstorm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2020-12-16  Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block  (Linus Torvalds; 1 file, -2/+1)
Pull block updates from Jens Axboe:
"Another series of killing more code than what is being added, again thanks to Christoph's relentless cleanups and tech debt tackling.

This contains:
- blk-iocost improvements (Baolin Wang)
- part0 iostat fix (Jeffle Xu)
- Disable iopoll for split bios (Jeffle Xu)
- block tracepoint cleanups (Christoph Hellwig)
- Merging of struct block_device and hd_struct (Christoph Hellwig)
- Rework/cleanup of how block device sizes are updated (Christoph Hellwig)
- Simplification of gendisk lookup and removal of block device aliasing (Christoph Hellwig)
- Block device ioctl cleanups (Christoph Hellwig)
- Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig)
- Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig)
- sbitmap improvements (Pavel Begunkov)
- Hybrid polling fix (Pavel Begunkov)
- bvec iteration improvements (Pavel Begunkov)
- Zone revalidation fixes (Damien Le Moal)
- blk-throttle limit fix (Yu Kuai)
- Various little fixes"

* tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits)
  blk-mq: fix msec comment from micro to milli seconds
  blk-mq: update arg in comment of blk_mq_map_queue
  blk-mq: add helper allocating tagset->tags
  Revert "block: Fix a lockdep complaint triggered by request queue flushing"
  nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class
  blk-mq: add new API of blk_mq_hctx_set_fq_lock_class
  block: disable iopoll for split bio
  block: Improve blk_revalidate_disk_zones() checks
  sbitmap: simplify wrap check
  sbitmap: replace CAS with atomic and
  sbitmap: remove swap_lock
  sbitmap: optimise sbitmap_deferred_clear()
  blk-mq: skip hybrid polling if iopoll doesn't spin
  blk-iocost: Factor out the base vrate change into a separate function
  blk-iocost: Factor out the active iocgs' state check into a separate function
  blk-iocost: Move the usage ratio calculation to the correct place
  blk-iocost: Remove unnecessary advance declaration
  blk-iocost: Fix some typos in comments
  blktrace: fix up a kerneldoc comment
  block: remove the request_queue to argument request based tracepoints
  ...

2020-12-14  Revert "dm raid: fix discard limits for raid1 and raid10"  (Mike Snitzer; 1 file, -7/+5)
This reverts commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512.

Reverting 6ffeb1c3f822 ("md: change mddev 'chunk_sectors' from int to unsigned") exposes dm-raid.c compiler warnings detailed in that commit's header. Clearly this more conservative fix, of simply reverting e0910c8e4f8, would've been more prudent given how late we were in the v5.10 release. Lessons have been learned.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2020-12-09  Revert "dm raid: remove unnecessary discard limits for raid10"  (Song Liu; 1 file, -0/+11)
This reverts commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28.

Matthew Ruffell reported data corruption in raid10 due to the changes in discard handling [1]. Revert these changes before we find a proper fix.

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/

Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>

2020-11-16  dm-raid: use set_capacity_and_notify  (Christoph Hellwig; 1 file, -2/+1)
Use set_capacity_and_notify to set the size of both the disk and block device. This also gets the uevent notifications for the resize for free.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2020-09-29  dm raid: remove unnecessary discard limits for raid10  (Mike Snitzer; 1 file, -11/+0)
Commit bcc90d280465e ("md/raid10: improve raid10 discard request") removes raid10's inability to properly handle large discards. So eliminate associated constraint from dm-raid's raid10 support.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2020-09-29  dm raid: fix discard limits for raid1 and raid10  (Mike Snitzer; 1 file, -5/+7)
Block core warned that discard_granularity was 0 for dm-raid with personality of raid1. Reason is that raid_io_hints() was incorrectly special-casing raid1 rather than raid0. But since commit 29efc390b9462 ("md/md0: optimize raid0 discard handling") even raid0 properly handles large discards.

Fix raid_io_hints() by removing discard limits settings for raid1. Also, fix limits for raid10 by properly stacking underlying limits as done in blk_stack_limits().

Depends-on: 29efc390b9462 ("md/md0: optimize raid0 discard handling")
Fixes: 61697a6abd24a ("dm: eliminate 'split_discard_bios' flag from DM target interface")
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2020-09-02  block: add a new revalidate_disk_size helper  (Christoph Hellwig; 1 file, -1/+1)
revalidate_disk is a relatively awkward helper for driver use, as it first calls an optional driver method and then updates the block device size, while most callers either don't need the method call at all, or want to keep state between the caller and the called method.

Add a revalidate_disk_size helper that just performs the update of the block device size from the gendisk one, and switch all drivers that do not implement ->revalidate_disk to use the new helper instead of revalidate_disk().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2020-08-07  Merge tag 'for-5.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm  (Linus Torvalds; 1 file, -2/+0)
Pull device mapper updates from Mike Snitzer:
- DM multipath locking fixes around m->flags tests and improvements to bio-based code so that it follows patterns established by request-based code.
- Request-based DM core improvement to eliminate unnecessary call to blk_mq_queue_stopped().
- Add "panic_on_corruption" error handling mode to DM verity target.
- DM bufio fix to perform buffer cleanup from a workqueue rather than wait for IO in reclaim context from shrinker.
- DM crypt improvement to optionally avoid async processing via workqueues for reads and/or writes -- via "no_read_workqueue" and "no_write_workqueue" features. This more direct IO processing improves latency and throughput with faster storage. Avoiding workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is a requirement for adding zoned block device support to DM crypt.
- Add zoned block device support to DM crypt. Makes use of DM_CRYPT_NO_WRITE_WORKQUEUE and a new optional feature (DM_CRYPT_WRITE_INLINE) that allows write completion to wait for encryption to complete. This allows write ordering to be preserved, which is needed for zoned block devices.
- Fix DM ebs target's check for REQ_OP_FLUSH.
- Fix DM core's report zones support to not report more zones than were requested.
- A few small compiler warning fixes.
- DM dust improvements to return output directly to the user rather than require they scrape the system log for output.

* tag 'for-5.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: don't call report zones for more than the user requested
  dm ebs: Fix incorrect checking for REQ_OP_FLUSH
  dm init: Set file local variable static
  dm ioctl: Fix compilation warning
  dm raid: Remove empty if statement
  dm verity: Fix compilation warning
  dm crypt: Enable zoned block device support
  dm crypt: add flags to optionally bypass kcryptd workqueues
  dm bufio: do buffer cleanup from a workqueue
  dm rq: don't call blk_mq_queue_stopped() in dm_stop_queue()
  dm dust: add interface to list all badblocks
  dm dust: report some message results directly back to user
  dm verity: add "panic_on_corruption" error handling mode
  dm mpath: use double checked locking in fast path
  dm mpath: rename current_pgpath to pgpath in multipath_prepare_ioctl
  dm mpath: rework __map_bio()
  dm mpath: factor out multipath_queue_bio
  dm mpath: push locking down to must_push_back_rq()
  dm mpath: take m->lock spinlock when testing QUEUE_IF_NO_PATH
  dm mpath: changes from initial m->flags locking audit

2020-08-04  dm raid: Remove empty if statement  (Damien Le Moal; 1 file, -2/+0)
In super_init_validation(), remove a body-less if statement testing only variables to avoid a compilation warning when compiling with W=1.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2020-07-08  writeback: remove bdi->congested_fn  (Christoph Hellwig; 1 file, -12/+0)
Except for pktdvd, the only places setting congested bits are file systems that allocate their own backing_dev_info structures. And pktdvd is a deprecated driver that isn't useful in stack setup either. So remove the dead congested_fn stacking infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
[axboe: fixup unused variables in bcache/request.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2020-05-20  dm: replace zero-length array with flexible-array  (Gustavo A. R. Silva; 1 file, -1/+1)
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

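For context, a minimal runnable example of how a flexible array member is allocated and used; it mirrors the struct shown in the message above but is otherwise illustrative.

#include <stdio.h>
#include <stdlib.h>

struct foo {
        int stuff;
        int array[];            /* flexible array member: must come last */
};

int main(void)
{
        size_t n = 4;
        /* sizeof(struct foo) does not include the array; add it explicitly. */
        struct foo *f = malloc(sizeof(*f) + n * sizeof(f->array[0]));
        if (!f)
                return 1;

        f->stuff = 1;
        for (size_t i = 0; i < n; i++)
                f->array[i] = (int)i;

        printf("last element: %d\n", f->array[n - 1]);
        free(f);
        return 0;
}
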
2020-01-07  dm raid: table line rebuild status fixes  (Heinz Mauelshagen; 1 file, -21/+22)
raid_status() wasn't emitting rebuild flags on the table line properly because the rdev number was not yet set properly; index the raid component devices array directly to solve this.

Also fix the wrong argument count on the emitted table line caused by one too many rebuild/write_mostly arguments, and consider any journal_(dev|mode) pairs.

Link: https://bugzilla.redhat.com/1782045
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2019-11-07  dm raid: Remove unnecessary negation of a shift in raid10_format_to_md_layout  (Nathan Chancellor; 1 file, -1/+0)
When building with Clang + -Wtautological-constant-compare:

drivers/md/dm-raid.c:619:8: warning: converting the result of '<<' to a boolean always evaluates to true [-Wtautological-constant-compare]
        r = !RAID10_OFFSET;
            ^
drivers/md/dm-raid.c:517:28: note: expanded from macro 'RAID10_OFFSET'
#define RAID10_OFFSET (1 << 16) /* stripes with data copies area adjacent on devices */
                       ^
1 warning generated.

Negating a non-zero number will always make it zero, which is the default value of r in this function, so this statement is unnecessary; remove it so that clang no longer warns.

Link: https://github.com/ClangBuiltLinux/linux/issues/753
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

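A tiny standalone program showing why the flagged statement is dead code: logically negating a non-zero constant such as (1 << 16) always yields 0, the value r already started with.

#include <stdio.h>

#define RAID10_OFFSET (1 << 16)

int main(void)
{
        int r = 0;              /* default value in the original function */
        r = !RAID10_OFFSET;     /* always 0, so the assignment changes nothing */
        printf("r = %d\n", r);  /* prints 0 */
        return 0;
}
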
2019-11-05  dm raid: streamline rs_get_progress() and its raid_status() caller side  (Heinz Mauelshagen; 1 file, -27/+20)
Pass already deciphered state into rs_get_progress, simplify recovery offset definition and combine two st_resync, st_reshape conditionals into one as is already the case with st_check and st_repair.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2019-11-05  dm raid: simplify rs_setup_recovery call chain  (Heinz Mauelshagen; 1 file, -21/+6)
rs_setup_recovery() sets the starting recovery offset. Drop superfluous rs_setup_recovery() and replace with __rs_setup_recovery().

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2019-11-05  dm raid: to ensure resynchronization, perform raid set grow in preresume  (Heinz Mauelshagen; 1 file, -21/+60)
This fixes a flaw causing raid set extensions not to be synchronized in case the MD bitmap resize required additional pages to be allocated. Also share resize code in the raid constructor between new size changes and those occurring during recovery.

Bump the target version to define the change and document it in Documentation/admin-guide/device-mapper/dm-raid.rst.

Reported-by: Steve D <steved424@gmail.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2019-11-05  dm raid: change rs_set_dev_and_array_sectors API and callers  (Heinz Mauelshagen; 1 file, -9/+5)
Add a size argument to rs_set_dev_and_array_sectors as a prerequisite to fixing, in a follow-up patch, grown device resynchronization not occurring when new MD bitmap pages have to be allocated as a result of the extension.

Also avoid code duplication by using rs_set_rdev_sectors in the aforementioned function.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2019-09-11  dm raid: fix updating of max_discard_sectors limit  (Ming Lei; 1 file, -5/+5)
The unit of 'chunk_size' is bytes, not sectors, so fix it by setting the queue_limits' max_discard_sectors to rs->md.chunk_sectors. Also, rename chunk_size to chunk_size_bytes.

Without this fix, too big a max_discard_sectors is applied on the request queue of dm-raid, and finally the raid code has to split the bio again. This re-split done by raid causes the following nested clone_endio:

1) one big bio 'A' is submitted to the dm queue, and served as the original bio
2) one new bio 'B' is cloned from the original bio 'A', and .map() is run on this bio of 'B', and B's original bio points to 'A'
3) raid code sees that 'B' is too big, and splits 'B' and re-submits the remaining part of 'B' to the dm-raid queue via generic_make_request().
4) now dm will handle 'B' as a new original bio, then allocate a new clone bio of 'C' and run .map() on 'C'. Meantime C's original bio points to 'B'.
5) suppose now 'C' is completed by raid directly, then the following clone_endio() is called recursively:

clone_endio(C)
  -> clone_endio(B)     # B is original bio of 'C'
    -> bio_endio(A)

'A' can be big enough to make hundreds of nested clone_endio() calls, then the stack can be corrupted easily.

Fixes: 61697a6abd24a ("dm: eliminate 'split_discard_bios' flag from DM target interface")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

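A small, self-contained illustration of the unit mix-up being fixed (variable names follow the commit, but this is illustrative userspace C, not the kernel code): a limit counted in 512-byte sectors must be derived from the byte-sized chunk value, not assigned from it directly.

#include <stdio.h>

#define SECTOR_SHIFT 9          /* 512-byte sectors */

int main(void)
{
        unsigned int chunk_size_bytes = 64 * 1024;                  /* 64 KiB chunk */
        unsigned int chunk_sectors    = chunk_size_bytes >> SECTOR_SHIFT;

        /* Bug: max_discard_sectors = chunk_size_bytes  (512x too large)
         * Fix: max_discard_sectors = chunk_sectors                        */
        unsigned int max_discard_sectors = chunk_sectors;

        printf("chunk: %u bytes = %u sectors, max_discard_sectors = %u\n",
               chunk_size_bytes, chunk_sectors, max_discard_sectors);
        return 0;
}
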
2019-08-21  dm raid: add missing cleanup in raid_ctr()  (Wenwen Wang; 1 file, -1/+1)
If rs_prepare_reshape() fails, no cleanup is executed, leading to a leak of the raid_set structure allocated at the beginning of raid_ctr(). To fix this issue, go to the label 'bad' if the error occurs.

Fixes: 11e4723206683 ("dm raid: stop keeping raid set frozen altogether")
Cc: stable@vger.kernel.org
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

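A generic sketch of the error-path rule this fix enforces (the names are stand-ins, not the actual raid_ctr() code): once the structure is allocated, every later failure must branch to the shared cleanup label rather than returning directly.

#include <stdio.h>
#include <stdlib.h>

struct raid_set_like { int dummy; };

static int prepare_reshape(void) { return -1; }     /* stand-in that fails */

static struct raid_set_like *ctr(void)
{
        struct raid_set_like *rs = malloc(sizeof(*rs));
        if (!rs)
                return NULL;

        if (prepare_reshape())
                goto bad;       /* the fix: don't return here and leak rs */

        /* ... more setup ... */
        return rs;
bad:
        free(rs);
        return NULL;
}

int main(void)
{
        struct raid_set_like *rs = ctr();
        printf("ctr %s\n", rs ? "succeeded" : "failed and cleaned up");
        free(rs);
        return 0;
}
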
2019-07-15  docs: device-mapper: move it to the admin-guide  (Mauro Carvalho Chehab; 1 file, -1/+1)
The DM support describes lots of aspects related to mapped disk partitions from the userspace PoV.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

2019-06-14  docs: convert docs to ReST and rename to *.rst  (Mauro Carvalho Chehab; 1 file, -1/+1)
The conversion is actually:
- add blank lines and indentation in order to identify paragraphs;
- fix tables markups;
- add some lists markups;
- mark literal blocks;
- adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

2019-02-20  dm: eliminate 'split_discard_bios' flag from DM target interface  (Mike Snitzer; 1 file, -5/+9)
There is no need to have DM core split discards on behalf of a DM target now that blk_queue_split() handles splitting discards based on the queue_limits. A DM target just needs to set max_discard_sectors, discard_granularity, etc, in queue_limits.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2018-12-18  dm raid: fix false -EBUSY when handling check/repair message  (Heinz Mauelshagen; 1 file, -2/+1)
Sending a check/repair message infrequently leads to -EBUSY instead of properly identifying an active resync. This occurs because raid_message() is testing recovery bits in a racy way. Fix by calling decipher_sync_action() from raid_message() to properly identify the idle state of the RAID device.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2018-10-18  dm raid: avoid bitmap with raid4/5/6 journal device  (Heinz Mauelshagen; 1 file, -1/+1)
With raid4/5/6, a journal device and a write intent bitmap are mutually exclusive.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2018-09-17  dm raid: remove bogus const from decipher_sync_action() return type  (Geert Uytterhoeven; 1 file, -1/+1)
With gcc-4.1.2:

drivers/md/dm-raid.c:3357: warning: type qualifiers ignored on function return type

Remove the "const" keyword to fix this.

Fixes: 36a240a706d43383 ("dm raid: fix RAID leg rebuild errors")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

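A minimal reproduction of this class of warning (illustrative, not the dm-raid declaration): a const qualifier on a by-value return type is meaningless, and some compilers warn about it.

#include <stdio.h>

enum sync_state { st_idle, st_resync };

/* Warns ("type qualifiers ignored on function return type"):
 *     static const enum sync_state decipher(void);
 * Fixed by dropping the const: */
static enum sync_state decipher(void)
{
        return st_idle;
}

int main(void)
{
        printf("state = %d\n", decipher());
        return 0;
}
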
2018-09-06  dm raid: bump target version, update comments and documentation  (Heinz Mauelshagen; 1 file, -4/+6)
Bump target version to reflect the documented fixes are available. Also fix some code comments (typos and clarity).

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2018-09-06  dm raid: fix RAID leg rebuild errors  (Heinz Mauelshagen; 1 file, -34/+46)
On fast devices such as NVMe, a flaw in rs_get_progress() results in false target status output when userspace lvm2 requests leg rebuilds (a symptom of the failure is device health chars 'aaaaaaaa' instead of the expected 'aAaAAAAA', causing lvm2 to fail).

The correct sync action state definitions already exist in decipher_sync_action(), so fix rs_get_progress() to use it. Change decipher_sync_action() to return an enum rather than a string for the sync states and call it from rs_get_progress(). Introduce sync_str() to translate from enum to the string that is needed by raid_status().

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

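A standalone sketch of the enum-plus-translation refactor described above; the function and state names follow the commit's description, but the code is illustrative, not the dm-raid implementation.

#include <stdio.h>

enum sync_state { st_frozen, st_reshape, st_resync, st_check, st_repair, st_recover, st_idle };

static const char *sync_str(enum sync_state s)
{
        static const char *const strs[] = {
                "frozen", "reshape", "resync", "check", "repair", "recover", "idle",
        };
        return strs[s];
}

/* One place decides what the state is ... */
static enum sync_state decipher_sync_action(int recovering, int checking)
{
        if (checking)
                return st_check;
        return recovering ? st_recover : st_idle;
}

int main(void)
{
        /* ... and both status and progress reporting reuse that decision,
         * converting to a string only where text is actually emitted. */
        enum sync_state s = decipher_sync_action(1, 0);
        printf("sync action: %s\n", sync_str(s));
        return 0;
}
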
2018-09-06  dm raid: fix rebuild of specific devices by updating superblock  (Heinz Mauelshagen; 1 file, -0/+5)
Update superblock when particular devices are requested via rebuild (e.g. lvconvert --replace ...) to avoid spurious failure with the "New device injected into existing raid set without 'delta_disks' or 'rebuild' parameter specified" error message.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2018-09-06  dm raid: fix stripe adding reshape deadlock  (Heinz Mauelshagen; 1 file, -8/+3)
When initiating a stripe adding reshape, a deadlock between md_stop_writes() waiting for the sync thread to stop and the running sync thread waiting for inactive stripes occurs (this frequently happens on single-core but rarely on multi-core systems).

Fix this deadlock by setting MD_RECOVERY_WAIT to have the main MD resynchronization thread worker (md_do_sync()) bail out when initiating the reshape via constructor arguments.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

2018-09-06  dm raid: fix reshape race on small devices  (Heinz Mauelshagen; 1 file, -47/+1)
When loading a new mapping table, the dm-raid target's constructor retrieves the volatile reshaping state from the raid superblocks. When the new table is activated in a following resume, the actual reshape position is retrieved.

The reshape driven by the previous mapping can already have finished on small and/or fast devices, thus updating the raid superblocks about the new raid layout. This causes the actual array state (e.g. stripe size reshape finished) to be inconsistent with the one in the new mapping, causing hangs with left-behind devices.

This race does not occur with usual raid device sizes but with small ones (e.g. those created by the lvm2 test suite).

Fix by no longer transferring stale/inconsistent raid_set state during preresume.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>