From 751f188dd5ab95b3f2b5f2f467c38aae5a2877eb Mon Sep 17 00:00:00 2001 From: Mikulas Patocka Date: Fri, 20 Jul 2012 14:25:03 +0100 Subject: dm raid1: fix crash with mirror recovery and discard This patch fixes a crash when a discard request is sent during mirror recovery. Firstly, some background. Generally, the following sequence happens during mirror synchronization: - function do_recovery is called - do_recovery calls dm_rh_recovery_prepare - dm_rh_recovery_prepare uses a semaphore to limit the number simultaneously recovered regions (by default the semaphore value is 1, so only one region at a time is recovered) - dm_rh_recovery_prepare calls __rh_recovery_prepare, __rh_recovery_prepare asks the log driver for the next region to recover. Then, it sets the region state to DM_RH_RECOVERING. If there are no pending I/Os on this region, the region is added to quiesced_regions list. If there are pending I/Os, the region is not added to any list. It is added to the quiesced_regions list later (by dm_rh_dec function) when all I/Os finish. - when the region is on quiesced_regions list, there are no I/Os in flight on this region. The region is popped from the list in dm_rh_recovery_start function. Then, a kcopyd job is started in the recover function. - when the kcopyd job finishes, recovery_complete is called. It calls dm_rh_recovery_end. dm_rh_recovery_end adds the region to recovered_regions or failed_recovered_regions list (depending on whether the copy operation was successful or not). The above mechanism assumes that if the region is in DM_RH_RECOVERING state, no new I/Os are started on this region. When I/O is started, dm_rh_inc_pending is called, which increases reg->pending count. When I/O is finished, dm_rh_dec is called. It decreases reg->pending count. If the count is zero and the region was in DM_RH_RECOVERING state, dm_rh_dec adds it to the quiesced_regions list. Consequently, if we call dm_rh_inc_pending/dm_rh_dec while the region is in DM_RH_RECOVERING state, it could be added to quiesced_regions list multiple times or it could be added to this list when kcopyd is copying data (it is assumed that the region is not on any list while kcopyd does its jobs). This results in memory corruption and crash. There already exist bypasses for REQ_FLUSH requests: REQ_FLUSH requests do not belong to any region, so they are always added to the sync list in do_writes. dm_rh_inc_pending does not increase count for REQ_FLUSH requests. In mirror_end_io, dm_rh_dec is never called for REQ_FLUSH requests. These bypasses avoid the crash possibility described above. These bypasses were improperly implemented for REQ_DISCARD when the mirror target gained discard support in commit 5fc2ffeabb9ee0fc0e71ff16b49f34f0ed3d05b4 (dm raid1: support discard). In do_writes, REQ_DISCARD requests is always added to the sync queue and immediately dispatched (even if the region is in DM_RH_RECOVERING). However, dm_rh_inc and dm_rh_dec is called for REQ_DISCARD resusts. So it violates the rule that no I/Os are started on DM_RH_RECOVERING regions, and causes the list corruption described above. This patch changes it so that REQ_DISCARD requests follow the same path as REQ_FLUSH. This avoids the crash. Reference: https://bugzilla.redhat.com/837607 Signed-off-by: Mikulas Patocka Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon --- drivers/md/dm-raid1.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'drivers/md/dm-raid1.c') diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c index d039de8322f0..ea16984c0f08 100644 --- a/drivers/md/dm-raid1.c +++ b/drivers/md/dm-raid1.c @@ -1214,7 +1214,7 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio, * We need to dec pending if this was a write. */ if (rw == WRITE) { - if (!(bio->bi_rw & REQ_FLUSH)) + if (!(bio->bi_rw & (REQ_FLUSH | REQ_DISCARD))) dm_rh_dec(ms->rh, map_context->ll); return error; } -- cgit v1.2.3 From 7c8d3a42fe1c58a7e8fd3f6a013e7d7b474ff931 Mon Sep 17 00:00:00 2001 From: Mikulas Patocka Date: Fri, 20 Jul 2012 14:25:07 +0100 Subject: dm raid1: set discard_zeroes_data_unsupported We can't guarantee that REQ_DISCARD on dm-mirror zeroes the data even if the underlying disks support zero on discard. So this patch sets ti->discard_zeroes_data_unsupported. For example, if the mirror is in the process of resynchronizing, it may happen that kcopyd reads a piece of data, then discard is sent on the same area and then kcopyd writes the piece of data to another leg. Consequently, the data is not zeroed. The flag was made available by commit 983c7db347db8ce2d8453fd1d89b7a4bb6920d56 (dm crypt: always disable discard_zeroes_data). Signed-off-by: Mikulas Patocka Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon --- drivers/md/dm-raid1.c | 1 + 1 file changed, 1 insertion(+) (limited to 'drivers/md/dm-raid1.c') diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c index ea16984c0f08..b58b7a33914a 100644 --- a/drivers/md/dm-raid1.c +++ b/drivers/md/dm-raid1.c @@ -1084,6 +1084,7 @@ static int mirror_ctr(struct dm_target *ti, unsigned int argc, char **argv) ti->split_io = dm_rh_get_region_size(ms->rh); ti->num_flush_requests = 1; ti->num_discard_requests = 1; + ti->discard_zeroes_data_unsupported = 1; ms->kmirrord_wq = alloc_workqueue("kmirrord", WQ_NON_REENTRANT | WQ_MEM_RECLAIM, 0); -- cgit v1.2.3 From 542f90381422676544382d4071ba44a2de90a0c1 Mon Sep 17 00:00:00 2001 From: Mike Snitzer Date: Fri, 27 Jul 2012 15:08:00 +0100 Subject: dm: support non power of two target max_io_len Remove the restriction that limits a target's specified maximum incoming I/O size to be a power of 2. Rename this setting from 'split_io' to the less-ambiguous 'max_io_len'. Change it from sector_t to uint32_t, which is plenty big enough, and introduce a wrapper function dm_set_target_max_io_len() to set it. Use sector_div() to process it now that it is not necessarily a power of 2. Signed-off-by: Mike Snitzer Signed-off-by: Alasdair G Kergon --- drivers/md/dm-raid.c | 11 +++++------ drivers/md/dm-raid1.c | 6 +++++- drivers/md/dm-snap.c | 27 +++++++++++++++------------ drivers/md/dm-stripe.c | 5 ++++- drivers/md/dm-thin.c | 5 ++++- drivers/md/dm.c | 35 +++++++++++++++++++++++++++-------- include/linux/device-mapper.h | 9 +++++++-- include/linux/dm-ioctl.h | 4 ++-- 8 files changed, 69 insertions(+), 33 deletions(-) (limited to 'drivers/md/dm-raid1.c') diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c index 017c34d78d61..858a8b70811c 100644 --- a/drivers/md/dm-raid.c +++ b/drivers/md/dm-raid.c @@ -353,6 +353,7 @@ static int parse_raid_params(struct raid_set *rs, char **argv, { unsigned i, rebuild_cnt = 0; unsigned long value, region_size = 0; + sector_t max_io_len; char *key; /* @@ -522,14 +523,12 @@ static int parse_raid_params(struct raid_set *rs, char **argv, return -EINVAL; if (rs->md.chunk_sectors) - rs->ti->split_io = rs->md.chunk_sectors; + max_io_len = rs->md.chunk_sectors; else - rs->ti->split_io = region_size; + max_io_len = region_size; - if (rs->md.chunk_sectors) - rs->ti->split_io = rs->md.chunk_sectors; - else - rs->ti->split_io = region_size; + if (dm_set_target_max_io_len(rs->ti, max_io_len)) + return -EINVAL; /* Assume there are no metadata devices until the drives are parsed */ rs->md.persistent = 0; diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c index b58b7a33914a..819ccba65912 100644 --- a/drivers/md/dm-raid1.c +++ b/drivers/md/dm-raid1.c @@ -1081,7 +1081,11 @@ static int mirror_ctr(struct dm_target *ti, unsigned int argc, char **argv) } ti->private = ms; - ti->split_io = dm_rh_get_region_size(ms->rh); + + r = dm_set_target_max_io_len(ti, dm_rh_get_region_size(ms->rh)); + if (r) + goto err_free_context; + ti->num_flush_requests = 1; ti->num_discard_requests = 1; ti->discard_zeroes_data_unsupported = 1; diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c index a228c20e40b3..6c0f3e33923a 100644 --- a/drivers/md/dm-snap.c +++ b/drivers/md/dm-snap.c @@ -691,7 +691,7 @@ static int dm_add_exception(void *context, chunk_t old, chunk_t new) * Return a minimum chunk size of all snapshots that have the specified origin. * Return zero if the origin has no snapshots. */ -static sector_t __minimum_chunk_size(struct origin *o) +static uint32_t __minimum_chunk_size(struct origin *o) { struct dm_snapshot *snap; unsigned chunk_size = 0; @@ -701,7 +701,7 @@ static sector_t __minimum_chunk_size(struct origin *o) chunk_size = min_not_zero(chunk_size, snap->store->chunk_size); - return chunk_size; + return (uint32_t) chunk_size; } /* @@ -1172,7 +1172,10 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv) ti->error = "Chunk size not set"; goto bad_read_metadata; } - ti->split_io = s->store->chunk_size; + + r = dm_set_target_max_io_len(ti, s->store->chunk_size); + if (r) + goto bad_read_metadata; return 0; @@ -1239,7 +1242,7 @@ static void __handover_exceptions(struct dm_snapshot *snap_src, snap_dest->store->snap = snap_dest; snap_src->store->snap = snap_src; - snap_dest->ti->split_io = snap_dest->store->chunk_size; + snap_dest->ti->max_io_len = snap_dest->store->chunk_size; snap_dest->valid = snap_src->valid; /* @@ -1817,9 +1820,9 @@ static void snapshot_resume(struct dm_target *ti) up_write(&s->lock); } -static sector_t get_origin_minimum_chunksize(struct block_device *bdev) +static uint32_t get_origin_minimum_chunksize(struct block_device *bdev) { - sector_t min_chunksize; + uint32_t min_chunksize; down_read(&_origins_lock); min_chunksize = __minimum_chunk_size(__lookup_origin(bdev)); @@ -1838,9 +1841,9 @@ static void snapshot_merge_resume(struct dm_target *ti) snapshot_resume(ti); /* - * snapshot-merge acts as an origin, so set ti->split_io + * snapshot-merge acts as an origin, so set ti->max_io_len */ - ti->split_io = get_origin_minimum_chunksize(s->origin->bdev); + ti->max_io_len = get_origin_minimum_chunksize(s->origin->bdev); start_merge(s); } @@ -2073,12 +2076,12 @@ static int origin_write_extent(struct dm_snapshot *merging_snap, struct origin *o; /* - * The origin's __minimum_chunk_size() got stored in split_io + * The origin's __minimum_chunk_size() got stored in max_io_len * by snapshot_merge_resume(). */ down_read(&_origins_lock); o = __lookup_origin(merging_snap->origin->bdev); - for (n = 0; n < size; n += merging_snap->ti->split_io) + for (n = 0; n < size; n += merging_snap->ti->max_io_len) if (__origin_write(&o->snapshots, sector + n, NULL) == DM_MAPIO_SUBMITTED) must_wait = 1; @@ -2138,14 +2141,14 @@ static int origin_map(struct dm_target *ti, struct bio *bio, } /* - * Set the target "split_io" field to the minimum of all the snapshots' + * Set the target "max_io_len" field to the minimum of all the snapshots' * chunk sizes. */ static void origin_resume(struct dm_target *ti) { struct dm_dev *dev = ti->private; - ti->split_io = get_origin_minimum_chunksize(dev->bdev); + ti->max_io_len = get_origin_minimum_chunksize(dev->bdev); } static int origin_status(struct dm_target *ti, status_type_t type, char *result, diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c index 6931bd18b615..992c9d4c3bd9 100644 --- a/drivers/md/dm-stripe.c +++ b/drivers/md/dm-stripe.c @@ -165,7 +165,10 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv) else sc->stripes_shift = __ffs(stripes); - ti->split_io = chunk_size; + r = dm_set_target_max_io_len(ti, chunk_size); + if (r) + return r; + ti->num_flush_requests = stripes; ti->num_discard_requests = stripes; diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c index e89f8e7d8a33..350bcf40485e 100644 --- a/drivers/md/dm-thin.c +++ b/drivers/md/dm-thin.c @@ -2628,7 +2628,10 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv) goto bad_thin_open; } - ti->split_io = tc->pool->sectors_per_block; + r = dm_set_target_max_io_len(ti, tc->pool->sectors_per_block); + if (r) + goto bad_thin_open; + ti->num_flush_requests = 1; /* In case the pool supports discards, pass them on. */ diff --git a/drivers/md/dm.c b/drivers/md/dm.c index e24143cc2040..415c2803c0c9 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -968,22 +968,41 @@ static sector_t max_io_len_target_boundary(sector_t sector, struct dm_target *ti static sector_t max_io_len(sector_t sector, struct dm_target *ti) { sector_t len = max_io_len_target_boundary(sector, ti); + sector_t offset, max_len; /* - * Does the target need to split even further ? + * Does the target need to split even further? */ - if (ti->split_io) { - sector_t boundary; - sector_t offset = dm_target_offset(ti, sector); - boundary = ((offset + ti->split_io) & ~(ti->split_io - 1)) - - offset; - if (len > boundary) - len = boundary; + if (ti->max_io_len) { + offset = dm_target_offset(ti, sector); + if (unlikely(ti->max_io_len & (ti->max_io_len - 1))) + max_len = sector_div(offset, ti->max_io_len); + else + max_len = offset & (ti->max_io_len - 1); + max_len = ti->max_io_len - max_len; + + if (len > max_len) + len = max_len; } return len; } +int dm_set_target_max_io_len(struct dm_target *ti, sector_t len) +{ + if (len > UINT_MAX) { + DMERR("Specified maximum size of target IO (%llu) exceeds limit (%u)", + (unsigned long long)len, UINT_MAX); + ti->error = "Maximum size of target IO is too large"; + return -EINVAL; + } + + ti->max_io_len = (uint32_t) len; + + return 0; +} +EXPORT_SYMBOL_GPL(dm_set_target_max_io_len); + static void __map_bio(struct dm_target *ti, struct bio *clone, struct dm_target_io *tio) { diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h index d70cbb2ada25..b19c1e189a68 100644 --- a/include/linux/device-mapper.h +++ b/include/linux/device-mapper.h @@ -186,8 +186,8 @@ struct dm_target { sector_t begin; sector_t len; - /* Always a power of 2 */ - sector_t split_io; + /* If non-zero, maximum size of I/O submitted to a target. */ + uint32_t max_io_len; /* * A number of zero-length barrier requests that will be submitted @@ -357,6 +357,11 @@ void dm_table_add_target_callbacks(struct dm_table *t, struct dm_target_callback */ int dm_table_complete(struct dm_table *t); +/* + * Target may require that it is never sent I/O larger than len. + */ +int __must_check dm_set_target_max_io_len(struct dm_target *ti, sector_t len); + /* * Table reference counting. */ diff --git a/include/linux/dm-ioctl.h b/include/linux/dm-ioctl.h index 75fd5573516e..3ece4eee84cb 100644 --- a/include/linux/dm-ioctl.h +++ b/include/linux/dm-ioctl.h @@ -268,8 +268,8 @@ enum { #define DM_VERSION_MAJOR 4 #define DM_VERSION_MINOR 22 -#define DM_VERSION_PATCHLEVEL 0 -#define DM_VERSION_EXTRA "-ioctl (2011-10-19)" +#define DM_VERSION_PATCHLEVEL 1 +#define DM_VERSION_EXTRA "-ioctl (2012-06-01)" /* Status bits */ #define DM_READONLY_FLAG (1 << 0) /* In/Out */ -- cgit v1.2.3