From 9ae6e8b1c9bbf6874163d1243e393137313762b7 Mon Sep 17 00:00:00 2001 From: Nikos Tsironis Date: Tue, 21 Jun 2022 15:24:03 +0300 Subject: dm era: commit metadata in postsuspend after worker stops During postsuspend dm-era does the following: 1. Archives the current era 2. Commits the metadata, as part of the RPC call for archiving the current era 3. Stops the worker Until the worker stops, it might write to the metadata again. Moreover, these writes are not flushed to disk immediately, but are cached by the dm-bufio client, which writes them back asynchronously. As a result, the committed metadata of a suspended dm-era device might not be consistent with the in-core metadata. In some cases, this can result in the corruption of the on-disk metadata. Suppose the following sequence of events: 1. Load a new table, e.g. a snapshot-origin table, to a device with a dm-era table 2. Suspend the device 3. dm-era commits its metadata, but the worker does a few more metadata writes until it stops, as part of digesting an archived writeset 4. These writes are cached by the dm-bufio client 5. Load the dm-era table to another device. 6. The new instance of the dm-era target loads the committed, on-disk metadata, which don't include the extra writes done by the worker after the metadata commit. 7. Resume the new device 8. The new dm-era target instance starts using the metadata 9. Resume the original device 10. The destructor of the old dm-era target instance is called and destroys the dm-bufio client, which results in flushing the cached writes to disk 11. These writes might overwrite the writes done by the new dm-era instance, hence corrupting its metadata. Fix this by committing the metadata after the worker stops running. stop_worker uses flush_workqueue to flush the current work. However, the work item may re-queue itself and flush_workqueue doesn't wait for re-queued works to finish. This could result in the worker changing the metadata after they have been committed, or writing to the metadata concurrently with the commit in the postsuspend thread. Use drain_workqueue instead, which waits until the work and all re-queued works finish. Fixes: eec40579d8487 ("dm: add era target") Cc: stable@vger.kernel.org # v3.15+ Signed-off-by: Nikos Tsironis Signed-off-by: Mike Snitzer --- drivers/md/dm-era-target.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c index 1f6bf152b3c7..e92c1afc3677 100644 --- a/drivers/md/dm-era-target.c +++ b/drivers/md/dm-era-target.c @@ -1400,7 +1400,7 @@ static void start_worker(struct era *era) static void stop_worker(struct era *era) { atomic_set(&era->suspended, 1); - flush_workqueue(era->wq); + drain_workqueue(era->wq); } /*---------------------------------------------------------------- @@ -1570,6 +1570,12 @@ static void era_postsuspend(struct dm_target *ti) } stop_worker(era); + + r = metadata_commit(era->md); + if (r) { + DMERR("%s: metadata_commit failed", __func__); + /* FIXME: fail mode */ + } } static int era_preresume(struct dm_target *ti) -- cgit v1.2.3 From 78ccef91234ba331c04d71f3ecb1377451d21056 Mon Sep 17 00:00:00 2001 From: Mike Snitzer Date: Tue, 21 Jun 2022 13:37:06 -0400 Subject: dm: do not return early from dm_io_complete if BLK_STS_AGAIN without polling Commit 5291984004edf ("dm: fix bio polling to handle possibile BLK_STS_AGAIN") inadvertently introduced an early return from dm_io_complete() without first queueing the bio to DM if BLK_STS_AGAIN occurs and bio-polling is _not_ being used. Fix this by only returning early from dm_io_complete() if the bio has first been properly queued to DM. Otherwise, the bio will never finish via bio_endio. Fixes: 5291984004edf ("dm: fix bio polling to handle possibile BLK_STS_AGAIN") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer --- drivers/md/dm.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index b6b25d319ef7..9ede55278eec 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -939,9 +939,11 @@ static void dm_io_complete(struct dm_io *io) if (io_error == BLK_STS_AGAIN) { /* io_uring doesn't handle BLK_STS_AGAIN (yet) */ queue_io(md, bio); + return; } } - return; + if (io_error == BLK_STS_DM_REQUEUE) + return; } if (bio_is_flush_with_data(bio)) { -- cgit v1.2.3 From 61b6e2e5321da281ab3c0c04e1962b3d000f6248 Mon Sep 17 00:00:00 2001 From: Ming Lei Date: Thu, 23 Jun 2022 21:20:05 +0800 Subject: dm: fix BLK_STS_DM_REQUEUE handling when dm_io represents split bio Commit 7dd76d1feec7 ("dm: improve bio splitting and associated IO accounting") removed using cloned bio when dm io splitting is needed. Using bio_trim()+bio_inc_remaining() rather than bio_split()+bio_chain() causes multiple dm_io instances to share the same original bio, and it works fine if IOs are completed successfully. But a regression was caused for the case when BLK_STS_DM_REQUEUE is returned from any one of DM's cloned bios (whose dm_io share the same orig_bio). In this BLK_STS_DM_REQUEUE case only the mapped subset of the original bio for the current exact dm_io needs to be re-submitted. However, since the original bio is shared among all dm_io instances, the ->orig_bio actually only represents the last dm_io instance, so requeue can't work as expected. Also when more than one dm_io is requeued, the same original bio is requeued from all dm_io's completion handler, then race is caused. Fix this issue by still allocating one clone bio for completing io only, then io accounting can rely on ->orig_bio being unmodified. This is needed because the dm_io's sector_offset and sectors members are recorded relative to an unmodified ->orig_bio. In the future, we can go back to using bio_trim()+bio_inc_remaining() for dm's io splitting but then delay needing a bio clone only when handling BLK_STS_DM_REQUEUE, but that approach is a bit complicated (so it needs a development cycle): 1) bio clone needs to be done in task context 2) a block interface for unwinding bio is required Fixes: 7dd76d1feec7 ("dm: improve bio splitting and associated IO accounting") Reported-by: Benjamin Marzinski Signed-off-by: Ming Lei Signed-off-by: Mike Snitzer --- drivers/md/dm-core.h | 1 + drivers/md/dm.c | 11 +++++++---- 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h index 54c0473a51dd..c954ff91870e 100644 --- a/drivers/md/dm-core.h +++ b/drivers/md/dm-core.h @@ -272,6 +272,7 @@ struct dm_io { atomic_t io_count; struct mapped_device *md; + struct bio *split_bio; /* The three fields represent mapped part of original bio */ struct bio *orig_bio; unsigned int sector_offset; /* offset to end of orig_bio */ diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 9ede55278eec..2b75f1ef7386 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -594,6 +594,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio) atomic_set(&io->io_count, 2); this_cpu_inc(*md->pending_io); io->orig_bio = bio; + io->split_bio = NULL; io->md = md; spin_lock_init(&io->lock); io->start_time = jiffies; @@ -887,7 +888,7 @@ static void dm_io_complete(struct dm_io *io) { blk_status_t io_error; struct mapped_device *md = io->md; - struct bio *bio = io->orig_bio; + struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio; if (io->status == BLK_STS_DM_REQUEUE) { unsigned long flags; @@ -1693,9 +1694,11 @@ static void dm_split_and_process_bio(struct mapped_device *md, * Remainder must be passed to submit_bio_noacct() so it gets handled * *after* bios already submitted have been completely processed. */ - bio_trim(bio, io->sectors, ci.sector_count); - trace_block_split(bio, bio->bi_iter.bi_sector); - bio_inc_remaining(bio); + WARN_ON_ONCE(!dm_io_flagged(io, DM_IO_WAS_SPLIT)); + io->split_bio = bio_split(bio, io->sectors, GFP_NOIO, + &md->queue->bio_split); + bio_chain(io->split_bio, bio); + trace_block_split(io->split_bio, bio->bi_iter.bi_sector); submit_bio_noacct(bio); out: /* -- cgit v1.2.3 From 90736eb3232d208ee048493f371075e4272e0944 Mon Sep 17 00:00:00 2001 From: Mikulas Patocka Date: Thu, 23 Jun 2022 14:53:25 -0400 Subject: dm mirror log: clear log bits up to BITS_PER_LONG boundary Commit 85e123c27d5c ("dm mirror log: round up region bitmap size to BITS_PER_LONG") introduced a regression on 64-bit architectures in the lvm testsuite tests: lvcreate-mirror, mirror-names and vgsplit-operation. If the device is shrunk, we need to clear log bits beyond the end of the device. The code clears bits up to a 32-bit boundary and then calculates lc->sync_count by summing set bits up to a 64-bit boundary (the commit changed that; previously, this boundary was 32-bit too). So, it was using some non-zeroed bits in the calculation and this caused misbehavior. Fix this regression by clearing bits up to BITS_PER_LONG boundary. Fixes: 85e123c27d5c ("dm mirror log: round up region bitmap size to BITS_PER_LONG") Cc: stable@vger.kernel.org Reported-by: Benjamin Marzinski Signed-off-by: Mikulas Patocka Signed-off-by: Mike Snitzer --- drivers/md/dm-log.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/dm-log.c b/drivers/md/dm-log.c index 2dda05aada23..0c6620e7b7bf 100644 --- a/drivers/md/dm-log.c +++ b/drivers/md/dm-log.c @@ -615,7 +615,7 @@ static int disk_resume(struct dm_dirty_log *log) log_clear_bit(lc, lc->clean_bits, i); /* clear any old bits -- device has shrunk */ - for (i = lc->region_count; i % (sizeof(*lc->clean_bits) << BYTE_SHIFT); i++) + for (i = lc->region_count; i % BITS_PER_LONG; i++) log_clear_bit(lc, lc->clean_bits, i); /* copy clean across to sync */ -- cgit v1.2.3