summaryrefslogtreecommitdiffstats
path: root/drivers/md/raid5-cache.c
AgeCommit message (Collapse)AuthorFilesLines
2016-11-29raid5-cache: don't set STRIPE_R5C_PARTIAL_STRIPE flag while load stripe into ↵Zhengyuan Liu1-3/+1
cache r5c_recovery_load_one_stripe should not set STRIPE_R5C_PARTIAL_STRIPE flag,as the data-only stripe may be STRIPE_R5C_FULL_STRIPE stripe. The state machine would release the stripe later and add it into neither r5c_cached_full_stripes list or r5c_cached_partial_stripes list and set correct flag. Reviewed-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-29raid5-cache: add another check conditon before replaying one stripeZhengyuan Liu1-2/+2
New stripe that was just allocated has no STRIPE_R5C_CACHING state too, add this check condition could avoid unnecessary replaying for empty stripe. r5l_recovery_replay_one_stripe would reset stripe for any case, delete it to make code more clean. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-27md/r5cache: enable IRQs on error pathDan Carpenter1-1/+1
We need to re-enable the IRQs here before returning. Fixes: a39f7afde358 ("md/r5cache: write-out phase and reclaim support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-27md/r5cache: handle alloc_page failureSong Liu1-1/+26
RMW of r5c write back cache uses an extra page to store old data for prexor. handle_stripe_dirtying() allocates this page by calling alloc_page(). However, alloc_page() may fail. To handle alloc_page() failures, this patch adds an extra page to disk_info. When alloc_page fails, handle_stripe() trys to use these pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE), the stripe is added to delayed_list. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-23raid5-cache: suspend reclaim thread instead of shutdownShaohua Li1-13/+5
There is mechanism to suspend a kernel thread. Use it instead of playing create/destroy game. Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: NeilBrown <neilb@suse.de> Cc: Song Liu <songliubraving@fb.com>
2016-11-22block: bio: pass bvec table to bio_init()Ming Lei1-1/+1
Some drivers often use external bvec table, so introduce this helper for this case. It is always safe to access the bio->bi_io_vec in this way for this case. After converting to this usage, it will becomes a bit easier to evaluate the remaining direct access to bio->bi_io_vec, so it can help to prepare for the following multipage bvec support. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixed up the new O_DIRECT cases. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-18md/r5cache: handle FLUSH and FUASong Liu1-18/+144
With raid5 cache, we committing data from journal device. When there is flush request, we need to flush journal device's cache. This was not needed in raid5 journal, because we will flush the journal before committing data to raid disks. This is similar to FUA, except that we also need flush journal for FUA. Otherwise, corruptions in earlier meta data will stop recovery from reaching FUA data. slightly changed the code by Shaohua Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18md/r5cache: r5cache recovery: part 2Song Liu1-186/+24
1. In previous patch, we: - add new data to r5l_recovery_ctx - add new functions to recovery write-back cache The new functions are not used in this patch, so this patch does not change the behavior of recovery. 2. In this patchpatch, we: - modify main recovery procedure r5l_recovery_log() to call new functions - remove old functions Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18md/r5cache: r5cache recovery: part 1Song Liu1-0/+602
Recovery of write-back cache has different logic to write-through only cache. Specifically, for write-back cache, the recovery need to scan through all active journal entries before flushing data out. Therefore, large portion of the recovery logic is rewritten here. To make the diffs cleaner, we split the rewrite as follows: 1. In this patch, we: - add new data to r5l_recovery_ctx - add new functions to recovery write-back cache The new functions are not used in this patch, so this patch does not change the behavior of recovery. 2. In next patch, we: - modify main recovery procedure r5l_recovery_log() to call new functions - remove old functions With cache feature, there are 2 different scenarios of recovery: 1. Data-Parity stripe: a stripe with complete parity in journal. 2. Data-Only stripe: a stripe with only data in journal (or partial parity). The code differentiate Data-Parity stripe from Data-Only stripe with flag STRIPE_R5C_CACHING. For Data-Parity stripes, we use the same procedure as raid5 journal, where all the data and parity are replayed to the RAID devices. For Data-Only strips, we need to finish complete calculate parity and finish the full reconstruct write or RMW write. For simplicity, in the recovery, we load the stripe to stripe cache. Once the array is started, the stripe cache state machine will handle these stripes through normal write path. r5c_recovery_flush_log contains the main procedure of recovery. The recovery code first scans through the journal and loads data to stripe cache. The code keeps tracks of all these stripes in a list (use sh->lru and ctx->cached_list), stripes in the list are organized in the order of its first appearance on the journal. During the scan, the recovery code assesses each stripe as Data-Parity or Data-Only. During scan, the array may run out of stripe cache. In these cases, the recovery code will also call raid5_set_cache_size to increase stripe cache size. If the array still runs out of stripe cache because there isn't enough memory, the array will not assemble. At the end of scan, the recovery code replays all Data-Parity stripes, and sets proper states for Data-Only stripes. The recovery code also increases seq number by 10 and rewrites all Data-Only stripes to journal. This is to avoid confusion after repeated crashes. More details is explained in raid5-cache.c before r5c_recovery_rewrite_data_only_stripes(). Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18md/r5cache: refactoring journal recovery codeSong Liu1-9/+18
1. rename r5l_read_meta_block() as r5l_recovery_read_meta_block(); 2. pull the code that initialize r5l_meta_block from r5l_log_write_empty_meta_block() to a separate function r5l_recovery_create_empty_meta_block(), so that we can reuse this piece of code. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18md/r5cache: sysfs entry journal_modeSong Liu1-0/+65
With write cache, journal_mode is the knob to switch between write-back and write-through. Below is an example: root@virt-test:~/# cat /sys/block/md0/md/journal_mode [write-through] write-back root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode root@virt-test:~/# cat /sys/block/md0/md/journal_mode write-through [write-back] Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18md/r5cache: write-out phase and reclaim supportSong Liu1-30/+381
There are two limited resources, stripe cache and journal disk space. For better performance, we priotize reclaim of full stripe writes. To free up more journal space, we free earliest data on the journal. In current implementation, reclaim happens when: 1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim if there is no reclaim in the past 5 seconds. 2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes, or cached stripes is enough for a full stripe (chunk size / 4k) (r5c_check_cached_full_stripe) 3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage) 4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data) r5c_do_reclaim() contains new logic of reclaim. For stripe cache: When stripe cache pressure is high (more than 3/4 stripes are cached, or there is empty inactive lists), flush all full stripe. If fewer than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes are flushed, flush some paritial stripes. When stripe cache pressure is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes. For log space: To avoid deadlock due to log space, we need to reserve enough space to flush cached data. The size of required log space depends on total number of cached stripes (stripe_in_journal_count). In current implementation, the writing-out phase automatically include pending data writes with parity writes (similar to write through case). Therefore, we need up to (conf->raid_disks + 1) pages for each cached stripe (1 page for meta data, raid_disks pages for all data and parity). r5c_log_required_to_flush_cache() calculates log space required to flush cache. In the following, we refer to the space calculated by r5c_log_required_to_flush_cache() as reclaim_required_space. Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL is set when free space on the log device is less than 2x of reclaim_required_space. r5c_cache keeps all data in cache (not fully committed to RAID) in a list (stripe_in_journal_list). These stripes are in the order of their first appearance on the journal. So the log tail (last_checkpoint) should point to the journal_start of the first item in the list. When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is set, the state machine only writes data that are already in the log device (in stripe_in_journal_list). This patch includes a fix to improve performance by Shaohua Li <shli@fb.com>. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18md/r5cache: caching phase of r5cacheSong Liu1-9/+233
As described in previous patch, write back cache operates in two phases: caching and writing-out. The caching phase works as: 1. write data to journal (r5c_handle_stripe_dirtying, r5c_cache_data) 2. call bio_endio (r5c_handle_data_cached, r5c_return_dev_pending_writes). Then the writing-out phase is as: 1. Mark the stripe as write-out (r5c_make_stripe_write_out) 2. Calcualte parity (reconstruct or RMW) 3. Write parity (and maybe some other data) to journal device 4. Write data and parity to RAID disks This patch implements caching phase. The cache is integrated with stripe cache of raid456. It leverages code of r5l_log to write data to journal device. Writing-out phase of the cache is implemented in the next patch. With r5cache, write operation does not wait for parity calculation and write out, so the write latency is lower (1 write to journal device vs. read and then write to raid disks). Also, r5cache will reduce RAID overhead (multipile IO due to read-modify-write of parity) and provide more opportunities of full stripe writes. This patch adds 2 flags to stripe_head.state: - STRIPE_R5C_PARTIAL_STRIPE, - STRIPE_R5C_FULL_STRIPE, Instead of inactive_list, stripes with cached data are tracked in r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list. STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for stripes in these lists. Note: stripes in r5c_full/partial_stripe_list are not considered as "active". For RMW, the code allocates an extra page for each data block being updated. This is stored in r5dev->orig_page and the old data is read into it. Then the prexor calculation subtracts ->orig_page from the parity block, and the reconstruct calculation adds the ->page data back into the parity block. r5cache naturally excludes SkipCopy. When the array has write back cache, async_copy_data() will not skip copy. There are some known limitations of the cache implementation: 1. Write cache only covers full page writes (R5_OVERWRITE). Writes of smaller granularity are write through. 2. Only one log io (sh->log_io) for each stripe at anytime. Later writes for the same stripe have to wait. This can be improved by moving log_io to r5dev. 3. With writeback cache, read path must enter state machine, which is a significant bottleneck for some workloads. 4. There is no per stripe checkpoint (with r5l_payload_flush) in the log, so recovery code has to replay more than necessary data (sometimes all the log from last_checkpoint). This reduces availability of the array. This patch includes a fix proposed by ZhengYuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18md/r5cache: State machine for raid5-cache write back modeSong Liu1-3/+140
This patch adds state machine for raid5-cache. With log device, the raid456 array could operate in two different modes (r5c_journal_mode): - write-back (R5C_MODE_WRITE_BACK) - write-through (R5C_MODE_WRITE_THROUGH) Existing code of raid5-cache only has write-through mode. For write-back cache, it is necessary to extend the state machine. With write-back cache, every stripe could operate in two different phases: - caching - writing-out In caching phase, the stripe handles writes as: - write to journal - return IO In writing-out phase, the stripe behaviors as a stripe in write through mode R5C_MODE_WRITE_THROUGH. STRIPE_R5C_CACHING is added to sh->state to differentiate caching and writing-out phase. Please note: this is a "no-op" patch for raid5-cache write-through mode. The following detailed explanation is copied from the raid5-cache.c: /* * raid5 cache state machine * * With rhe RAID cache, each stripe works in two phases: * - caching phase * - writing-out phase * * These two phases are controlled by bit STRIPE_R5C_CACHING: * if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase * if STRIPE_R5C_CACHING == 1, the stripe is in caching phase * * When there is no journal, or the journal is in write-through mode, * the stripe is always in writing-out phase. * * For write-back journal, the stripe is sent to caching phase on write * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off * the write-out phase by clearing STRIPE_R5C_CACHING. * * Stripes in caching phase do not write the raid disks. Instead, all * writes are committed from the log device. Therefore, a stripe in * caching phase handles writes as: * - write to log device * - return IO * * Stripes in writing-out phase handle writes as: * - calculate parity * - write pending data and parity to journal * - write data and parity to raid disks * - return IO for pending writes */ Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-18md/r5cache: Check array size in r5l_init_logSong Liu1-10/+16
Currently, r5l_write_stripe checks meta size for each stripe write, which is not necessary. With this patch, r5l_init_log checks maximal meta size of the array, which is (r5l_meta_block + raid_disks x r5l_payload_data_parity). If this is too big to fit in one page, r5l_init_log aborts. With current meta data, r5l_log support raid_disks up to 203. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-17raid5-cache: fix lockdep warningShaohua Li1-2/+14
lockdep reports warning of the rcu_dereference usage. Using normal rdev access pattern to avoid the warning. Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07raid5-cache: restrict the use area of the log_offset variableJackieLiu1-11/+8
We can calculate this offset by using ctx->meta_total_blocks, without passing in from the function Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-01block,fs: use REQ_* flags directlyChristoph Hellwig1-2/+2
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags directly. Where applicable this also drops usage of the bio_set_op_attrs wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28raid5-cache: correct condition for empty metadata writeShaohua Li1-1/+1
As long as we recover one metadata block, we should write the empty metadata write. The original code could make recovery corrupted if only one meta is valid. Reported-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24md/raid5: write an empty meta-block when creating log super-blockZhengyuan Liu1-0/+1
If superblock points to an invalid meta block, r5l_load_log will set create_super with true and create an new superblock, this runtime path would always happen if we do no writing I/O to this array since it was created. Writing an empty meta block could avoid this unnecessary action at the first time we created log superblock. Another reason is for the corretness of log recovery. Currently we have bellow code to guarantee log revocery to be correct. if (ctx.seq > log->last_cp_seq + 1) { int ret; ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10); if (ret) return ret; log->seq = ctx.seq + 11; log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS); r5l_write_super(log, ctx.pos); } else { log->log_start = ctx.pos; log->seq = ctx.seq; } If we just created a array with a journal device, log->log_start and log->last_checkpoint should all be 0, then we write three meta block which are valid except mid one and supposed crash happened. The ctx.seq would equal to log->last_cp_seq + 1 and log->log_start would be set to position of mid invalid meta block after we did a recovery, this will lead to problems which could be avoided with this patch. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24md/raid5: initialize next_checkpoint field before useZhengyuan Liu1-0/+3
No initial operation was done to this field when we load/recovery the log, it got assignment only when IO to raid disk was finished. So r5l_quiesce may use wrong next_checkpoint to reclaim log space, that would make reclaimable space calculation confused. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-31raid5-cache: fix a deadlock in superblock writeShaohua Li1-31/+15
There is a potential deadlock in superblock write. Discard could zero data, so before discard we must make sure superblock is updated to new log tail. Updating superblock (either directly call md_update_sb() or depend on md thread) must hold reconfig mutex. On the other hand, raid5_quiesce is called with reconfig_mutex hold. The first step of raid5_quiesce() is waitting for all IO finish, hence waitting for reclaim thread, while reclaim thread is calling this function and waitting for reconfig mutex. So there is a deadlock. We workaround this issue with a trylock. The downside of the solution is we could miss discard if we can't take reconfig mutex. But this should happen rarely (mainly in raid array stop), so miss discard shouldn't be a big problem. Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-08-07block: rename bio bi_rw to bi_opfJens Axboe1-1/+1
Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSHMike Christie1-1/+1
To avoid confusion between REQ_OP_FLUSH, which is handled by request_fn drivers, and upper layers requesting the block layer perform a flush sequence along with possibly a WRITE, this patch renames REQ_FLUSH to REQ_PREFLUSH. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07md: use bio op accessorsMike Christie1-10/+16
Separate the op from the rq_flag_bits and have md set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block/fs/drivers: remove rw argument from submit_bioMike Christie1-3/+4
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw instead of passing it in. This makes that use the same as generic_make_request and how we set the other bio fields. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixed up fs/ext4/crypto.c Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-19Merge tag 'md/4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds1-2/+2
Pull MD updates from Shaohua Li: "Several patches from Guoqing fixing md-cluster bugs and several patches from Heinz fixing dm-raid bugs" * tag 'md/4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md-cluster: check the return value of process_recvd_msg md-cluster: gather resync infos and enable recv_thread after bitmap is ready md: set MD_CHANGE_PENDING in a atomic region md: raid5: add prerequisite to run underneath dm-raid md: raid10: add prerequisite to run underneath dm-raid md: md.c: fix oops in mddev_suspend for raid0 md-cluster: fix ifnullfree.cocci warnings md-cluster/bitmap: unplug bitmap to sync dirty pages to disk md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and bitmap_file_set_bit md-cluster/bitmap: fix wrong calcuation of offset md-cluster: sync bitmap when node received RESYNCING msg md-cluster: always setup in-memory bitmap md-cluster: wakeup thread if activated a spare disk md-cluster: change array_sectors and update size are not supported md-cluster: fix locking when node joins cluster during message broadcast md-cluster: unregister thread if err happened md-cluster: wake up thread to continue recovery md-cluser: make resync_finish only called after pers->sync_request md-cluster: change resync lock from asynchronous to synchronous
2016-05-09md: set MD_CHANGE_PENDING in a atomic regionGuoqing Jiang1-2/+2
Some code waits for a metadata update by: 1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN) 2. setting MD_CHANGE_PENDING and waking the management thread 3. waiting for MD_CHANGE_PENDING to be cleared If the first two are done without locking, the code in md_update_sb() which checks if it needs to repeat might test if an update is needed before step 1, then clear MD_CHANGE_PENDING after step 2, resulting in the wait returning early. So make sure all places that set MD_CHANGE_PENDING are atomicial, and bit_clear_unless (suggested by Neil) is introduced for the purpose. Cc: Martin Kepplinger <martink@posteo.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: <linux-kernel@vger.kernel.org> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-04-13block: kill off q->flush_flagsJens Axboe1-1/+2
Now that we converted everything to the newer block write cache interface, kill off the queue flush_flags and queueable flush entries. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-01-14raid5-cache: handle journal hotadd in quiesceShaohua Li1-0/+7
Handle journal hotadd in quiesce to avoid creating duplicated threads. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-14md: set MD_HAS_JOURNAL in correct placesShaohua Li1-0/+1
Set MD_HAS_JOURNAL when a array is loaded or journal is initialized. This is to avoid the flags set too early in journal disk hotadd. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06raid5: allow r5l_io_unit allocations to failChristoph Hellwig1-10/+57
And propagate the error up the stack so we can add the stripe to no_stripes_list and retry our log operation later. This avoids blocking raid5d due to reclaim, an it allows to get rid of the deadlock-prone GFP_NOFAIL allocation. shli: add missing mempool_destroy() Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06raid5-cache: use a mempool for the metadata blockChristoph Hellwig1-2/+12
We only have a limited number in flight, so use a page based mempool. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06raid5-cache: use a bio_setChristoph Hellwig1-1/+15
This allows us to make guaranteed forward progress. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06raid5-cache: add journal hot add/remove supportShaohua Li1-4/+12
Add support for journal disk hot add/remove. Mostly trival checks in md part. The raid5 part is a little tricky. For hot-remove, we can't wait pending write as it's called from raid5d. The wait will cause deadlock. We simplily fail the hot-remove. A hot-remove retry can success eventually since if journal disk is faulty all pending write will be failed and finish. For hot-add, since an array supporting journal but without journal disk will be marked read-only, we are safe to hot add journal without stopping IO (should be read IO, while journal only handles write IO). Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06raid5-cache: free meta_page earlierChristoph Hellwig1-7/+2
Once the I/O completed we don't need the meta page anymore. As the iounits can live on for a long time this reduces memory pressure a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06raid5-cache: simplify r5l_move_io_unit_listChristoph Hellwig1-17/+15
It's only used for one kind of move, so make that explicit. Also clean up the code a bit by using list_for_each_safe. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: start raid5 readonly if journal is missingShaohua Li1-1/+2
If raid array is expected to have journal (eg, journal is set in MD superblock feature map) and the array is started without journal disk, start the array readonly. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: IO error handlingShaohua Li1-1/+14
There are 3 places the raid5-cache dispatches IO. The discard IO error doesn't matter, so we ignore it. The superblock write IO error can be handled in MD core. The remaining are log write and flush. When the IO error happens, we mark log disk faulty and fail all write IO. Read IO is still allowed to run. Userspace will get a notification too and corresponding daemon can choose setting raid array readonly for example. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: add trim support for logShaohua Li1-1/+62
Since superblock is updated infrequently, we do a simple trim of log disk (a synchronous trim) Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: use bio chainingChristoph Hellwig1-22/+16
Simplify the bio completion handler by using bio chaining and submitting bios as soon as they are full. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: small log->seq cleanupChristoph Hellwig1-2/+1
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: new helper: r5_reserve_log_entryChristoph Hellwig1-11/+19
Factor out code to reserve log space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: inline r5l_alloc_io_unit into r5l_new_metaChristoph Hellwig1-18/+8
This is the only user, and keeping all code initializing the io_unit structure together improves readbility. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: take rdev->data_offset into account early onChristoph Hellwig1-5/+2
Set up bi_sector properly when we allocate an bio instead of updating it at submission time. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: refactor bio allocationChristoph Hellwig1-25/+20
Split out a helper to allocate a bio for log writes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: clean up r5l_get_metaChristoph Hellwig1-8/+4
Remove the only partially used local 'io' variable to simplify the code flow. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: simplify state machine when caches flushes are not neededChristoph Hellwig1-4/+29
For devices without a volatile write cache we don't need to send a FLUSH command to ensure writes are stable on disk, and thus can avoid the whole step of batching up bios for processing by the MD thread. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: factor out a helper to run all stripes for an I/O unitChristoph Hellwig1-10/+13
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01raid5-cache: rename flushed_ios to finished_iosChristoph Hellwig1-7/+7
After this series we won't nessecarily have flushed the cache for these I/Os, so give the list a more neutral name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>