From 9ae6e8b1c9bbf6874163d1243e393137313762b7 Mon Sep 17 00:00:00 2001
From: Nikos Tsironis <ntsironis@arrikto.com>
Date: Tue, 21 Jun 2022 15:24:03 +0300
Subject: dm era: commit metadata in postsuspend after worker stops

During postsuspend dm-era does the following:

1. Archives the current era
2. Commits the metadata, as part of the RPC call for archiving the
   current era
3. Stops the worker

Until the worker stops, it might write to the metadata again. Moreover,
these writes are not flushed to disk immediately, but are cached by the
dm-bufio client, which writes them back asynchronously.

As a result, the committed metadata of a suspended dm-era device might
not be consistent with the in-core metadata.

In some cases, this can result in the corruption of the on-disk
metadata. Suppose the following sequence of events:

1. Load a new table, e.g. a snapshot-origin table, to a device with a
   dm-era table
2. Suspend the device
3. dm-era commits its metadata, but the worker does a few more metadata
   writes until it stops, as part of digesting an archived writeset
4. These writes are cached by the dm-bufio client
5. Load the dm-era table to another device.
6. The new instance of the dm-era target loads the committed, on-disk
   metadata, which don't include the extra writes done by the worker
   after the metadata commit.
7. Resume the new device
8. The new dm-era target instance starts using the metadata
9. Resume the original device
10. The destructor of the old dm-era target instance is called and
    destroys the dm-bufio client, which results in flushing the cached
    writes to disk
11. These writes might overwrite the writes done by the new dm-era
    instance, hence corrupting its metadata.

Fix this by committing the metadata after the worker stops running.

stop_worker uses flush_workqueue to flush the current work. However, the
work item may re-queue itself and flush_workqueue doesn't wait for
re-queued works to finish.

This could result in the worker changing the metadata after they have
been committed, or writing to the metadata concurrently with the commit
in the postsuspend thread.

Use drain_workqueue instead, which waits until the work and all
re-queued works finish.

Fixes: eec40579d8487 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 drivers/md/dm-era-target.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index 1f6bf152b3c7..e92c1afc3677 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -1400,7 +1400,7 @@ static void start_worker(struct era *era)
 static void stop_worker(struct era *era)
 {
 	atomic_set(&era->suspended, 1);
-	flush_workqueue(era->wq);
+	drain_workqueue(era->wq);
 }
 
 /*----------------------------------------------------------------
@@ -1570,6 +1570,12 @@ static void era_postsuspend(struct dm_target *ti)
 	}
 
 	stop_worker(era);
+
+	r = metadata_commit(era->md);
+	if (r) {
+		DMERR("%s: metadata_commit failed", __func__);
+		/* FIXME: fail mode */
+	}
 }
 
 static int era_preresume(struct dm_target *ti)
-- 
cgit v1.2.3


From 78ccef91234ba331c04d71f3ecb1377451d21056 Mon Sep 17 00:00:00 2001
From: Mike Snitzer <snitzer@kernel.org>
Date: Tue, 21 Jun 2022 13:37:06 -0400
Subject: dm: do not return early from dm_io_complete if BLK_STS_AGAIN without
 polling

Commit 5291984004edf ("dm: fix bio polling to handle possibile
BLK_STS_AGAIN") inadvertently introduced an early return from
dm_io_complete() without first queueing the bio to DM if BLK_STS_AGAIN
occurs and bio-polling is _not_ being used.

Fix this by only returning early from dm_io_complete() if the bio has
first been properly queued to DM. Otherwise, the bio will never finish
via bio_endio.

Fixes: 5291984004edf ("dm: fix bio polling to handle possibile BLK_STS_AGAIN")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 drivers/md/dm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b6b25d319ef7..9ede55278eec 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -939,9 +939,11 @@ static void dm_io_complete(struct dm_io *io)
 			if (io_error == BLK_STS_AGAIN) {
 				/* io_uring doesn't handle BLK_STS_AGAIN (yet) */
 				queue_io(md, bio);
+				return;
 			}
 		}
-		return;
+		if (io_error == BLK_STS_DM_REQUEUE)
+			return;
 	}
 
 	if (bio_is_flush_with_data(bio)) {
-- 
cgit v1.2.3


From 61b6e2e5321da281ab3c0c04e1962b3d000f6248 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Thu, 23 Jun 2022 21:20:05 +0800
Subject: dm: fix BLK_STS_DM_REQUEUE handling when dm_io represents split bio

Commit 7dd76d1feec7 ("dm: improve bio splitting and associated IO
accounting") removed using cloned bio when dm io splitting is needed.
Using bio_trim()+bio_inc_remaining() rather than bio_split()+bio_chain()
causes multiple dm_io instances to share the same original bio, and it
works fine if IOs are completed successfully.

But a regression was caused for the case when BLK_STS_DM_REQUEUE is
returned from any one of DM's cloned bios (whose dm_io share the same
orig_bio). In this BLK_STS_DM_REQUEUE case only the mapped subset of
the original bio for the current exact dm_io needs to be re-submitted.
However, since the original bio is shared among all dm_io instances,
the ->orig_bio actually only represents the last dm_io instance, so
requeue can't work as expected. Also when more than one dm_io is
requeued, the same original bio is requeued from all dm_io's
completion handler, then race is caused.

Fix this issue by still allocating one clone bio for completing io
only, then io accounting can rely on ->orig_bio being unmodified. This
is needed because the dm_io's sector_offset and sectors members are
recorded relative to an unmodified ->orig_bio.

In the future, we can go back to using bio_trim()+bio_inc_remaining()
for dm's io splitting but then delay needing a bio clone only when
handling BLK_STS_DM_REQUEUE, but that approach is a bit complicated
(so it needs a development cycle):
1) bio clone needs to be done in task context
2) a block interface for unwinding bio is required

Fixes: 7dd76d1feec7 ("dm: improve bio splitting and associated IO accounting")
Reported-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 drivers/md/dm-core.h |  1 +
 drivers/md/dm.c      | 11 +++++++----
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 54c0473a51dd..c954ff91870e 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -272,6 +272,7 @@ struct dm_io {
 	atomic_t io_count;
 	struct mapped_device *md;
 
+	struct bio *split_bio;
 	/* The three fields represent mapped part of original bio */
 	struct bio *orig_bio;
 	unsigned int sector_offset; /* offset to end of orig_bio */
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 9ede55278eec..2b75f1ef7386 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -594,6 +594,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
 	atomic_set(&io->io_count, 2);
 	this_cpu_inc(*md->pending_io);
 	io->orig_bio = bio;
+	io->split_bio = NULL;
 	io->md = md;
 	spin_lock_init(&io->lock);
 	io->start_time = jiffies;
@@ -887,7 +888,7 @@ static void dm_io_complete(struct dm_io *io)
 {
 	blk_status_t io_error;
 	struct mapped_device *md = io->md;
-	struct bio *bio = io->orig_bio;
+	struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
 
 	if (io->status == BLK_STS_DM_REQUEUE) {
 		unsigned long flags;
@@ -1693,9 +1694,11 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 	 * Remainder must be passed to submit_bio_noacct() so it gets handled
 	 * *after* bios already submitted have been completely processed.
 	 */
-	bio_trim(bio, io->sectors, ci.sector_count);
-	trace_block_split(bio, bio->bi_iter.bi_sector);
-	bio_inc_remaining(bio);
+	WARN_ON_ONCE(!dm_io_flagged(io, DM_IO_WAS_SPLIT));
+	io->split_bio = bio_split(bio, io->sectors, GFP_NOIO,
+				  &md->queue->bio_split);
+	bio_chain(io->split_bio, bio);
+	trace_block_split(io->split_bio, bio->bi_iter.bi_sector);
 	submit_bio_noacct(bio);
 out:
 	/*
-- 
cgit v1.2.3


From 90736eb3232d208ee048493f371075e4272e0944 Mon Sep 17 00:00:00 2001
From: Mikulas Patocka <mpatocka@redhat.com>
Date: Thu, 23 Jun 2022 14:53:25 -0400
Subject: dm mirror log: clear log bits up to BITS_PER_LONG boundary

Commit 85e123c27d5c ("dm mirror log: round up region bitmap size to
BITS_PER_LONG") introduced a regression on 64-bit architectures in the
lvm testsuite tests: lvcreate-mirror, mirror-names and vgsplit-operation.

If the device is shrunk, we need to clear log bits beyond the end of the
device. The code clears bits up to a 32-bit boundary and then calculates
lc->sync_count by summing set bits up to a 64-bit boundary (the commit
changed that; previously, this boundary was 32-bit too). So, it was using
some non-zeroed bits in the calculation and this caused misbehavior.

Fix this regression by clearing bits up to BITS_PER_LONG boundary.

Fixes: 85e123c27d5c ("dm mirror log: round up region bitmap size to BITS_PER_LONG")
Cc: stable@vger.kernel.org
Reported-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 drivers/md/dm-log.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm-log.c b/drivers/md/dm-log.c
index 2dda05aada23..0c6620e7b7bf 100644
--- a/drivers/md/dm-log.c
+++ b/drivers/md/dm-log.c
@@ -615,7 +615,7 @@ static int disk_resume(struct dm_dirty_log *log)
 			log_clear_bit(lc, lc->clean_bits, i);
 
 	/* clear any old bits -- device has shrunk */
-	for (i = lc->region_count; i % (sizeof(*lc->clean_bits) << BYTE_SHIFT); i++)
+	for (i = lc->region_count; i % BITS_PER_LONG; i++)
 		log_clear_bit(lc, lc->clean_bits, i);
 
 	/* copy clean across to sync */
-- 
cgit v1.2.3