summaryrefslogtreecommitdiffstats
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2014-01-15Merge tag 'md/3.13-fixes' of git://neil.brown.name/mdLinus Torvalds5-14/+29
Pull late md fixes from Neil Brown: "Half a dozen md bug fixes. All of these fix real bugs the people have hit, and are tagged for -stable. Sorry they are late .... Christmas holidays and all that. Hopefully they can still squeak into 3.13" * tag 'md/3.13-fixes' of git://neil.brown.name/md: md: fix problem when adding device to read-only array with bitmap. md/raid10: fix bug when raid10 recovery fails to recover a block. md/raid5: fix a recently broken BUG_ON(). md/raid1: fix request counting bug in new 'barrier' code. md/raid10: fix two bugs in handling of known-bad-blocks. md/raid5: Fix possible confusion when multiple write errors occur.
2014-01-14dm sysfs: fix a module unload raceMikulas Patocka6-21/+74
This reverts commit be35f48610 ("dm: wait until embedded kobject is released before destroying a device") and provides an improved fix. The kobject release code that calls the completion must be placed in a non-module file, otherwise there is a module unload race (if the process calling dm_kobject_release is preempted and the DM module unloaded after the completion is triggered, but before dm_kobject_release returns). To fix this race, this patch moves the completion code to dm-builtin.c which is always compiled directly into the kernel if BLK_DEV_DM is selected. The patch introduces a new dm_kobject_holder structure, its purpose is to keep the completion and kobject in one place, so that it can be accessed from non-module code without the need to export the layout of struct mapped_device to that code. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-01-14dm snapshot: use dm-bufio prefetchMikulas Patocka3-3/+41
This patch modifies dm-snapshot so that it prefetches the buffers when loading the exceptions. The number of buffers read ahead is specified in the DM_PREFETCH_CHUNKS macro. The current value for DM_PREFETCH_CHUNKS (12) was found to provide the best performance on a single 15k SCSI spindle. In the future we may modify this default or make it configurable. Also, introduce the function dm_bufio_set_minimum_buffers to setup bufio's number of internal buffers before freeing happens. dm-bufio may hold more buffers if enough memory is available. There is no guarantee that the specified number of buffers will be available - if you need a guarantee, use the argument reserved_buffers for dm_bufio_client_create. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-14dm snapshot: use dm-bufioMikulas Patocka4-7/+62
Use dm-bufio for initial loading of the exceptions. Introduce a new function dm_bufio_forget that frees the given buffer. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-14dm snapshot: prepare for switch to using dm-bufioMikulas Patocka1-12/+14
Change the functions get_exception, read_exception and insert_exceptions so that ps->area is passed as an argument. This patch doesn't change any functionality, but it refactors the code to allow for a cleaner switch over to using dm-bufio. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-14dm snapshot: use GFP_KERNEL when initializing exceptionsMikulas Patocka1-5/+5
The list of initial exceptions is loaded in the target constructor. We are allowed to allocate memory with GFP_KERNEL at this point. So, change alloc_completed_exception to use GFP_KERNEL when being called from the constructor. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-14md: ensure metadata is writen after raid level change.NeilBrown1-0/+2
level_store() currently does not make sure the metadata is updates to reflect the new raid level. It simply sets MD_CHANGE_DEVS. Any level with a ->thread will quickly notice this and update the metadata. However RAID0 and Linear do not have a thread so no metadata update happens until the array is stopped. At that point the metadata is written. This is later that we would like. While the delay doesn't risk any data it can cause confusion. So if there is no md thread, immediately update the metadata after a level change. Reported-by: Richard Michael <rmichael@edgeofthenet.org> Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md/raid10: avoid fullsync when not necessary.NeilBrown1-1/+2
This is the raid10 equivalent of commit 4f0a5e012cf41321d611e7cad63e1017d143d138 MD RAID1: Further conditionalize 'fullsync' If a device in a newly assembled array is not fully recovered we currently do a fully resync by don't need to. Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md: allow a partially recovered device to be hot-added to an array.NeilBrown1-1/+2
When adding a new device into an array it is normally important to clear any stale data from ->recovery_offset else the new device may not be recovered properly. However when re-adding a device which is known to be nearly in-sync, this is not needed and can be detrimental. The (bitmap-based) resync will still happen, and further recovery is only needed from where-ever it was already up to. So if save_raid_disk is set, signifying a re-add, don't clear ->recovery_offset. Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md: Change handling of save_raid_disk and metadata update during recovery.NeilBrown1-19/+23
Since commit d70ed2e4fafdbef0800e739 MD: Allow restarting an interrupted incremental recovery. we don't write out the metadata to devices while they are recovering. This had a good reason, but has unfortunate consequences. This patch changes things to make them work better. At issue is what happens if the array is shut down while a recovery is happening, particularly a bitmap-guided recovery. Ideally the recovery should pick up where it left off. However the metadata cannot represent the state "A recovery is in process which is guided by the bitmap". Before the above mentioned commit, we wrote metadata to the device which said "this is being recovered and it is up to <here>". So after a restart, a full recovery (not bitmap-guided) would happen from where-ever it was up to. After the commit the metadata wasn't updated so it still said "This device is fully in sync with <this> event count". That leads to a bitmap-based recovery following the whole bitmap, which should be a lot less work than a full recovery from some starting point. So this was an improvement. However updates some metadata but not all leads to other problems. In particular, the metadata written to the fully-up-to-date device record that the array has all devices present (even though some are recovering). So on restart, mdadm wants to find all devices and expects them to have current event counts. Obviously it doesn't (some have old event counts) so (when assembling with --incremental) it waits indefinitely for the rest of the expected devices. It really is wrong to not update all the metadata together. Do that is bound to cause confusion. Instead, we should make it possible to record the truth in the metadata. i.e. we need to be able to record that a device is being recovered based on the bitmap. We already have a Feature flag to say that recovery is happening. We now add another one to say that it is a bitmap-based recovery. With this we can remove the code that disables the write-out of metadata on some devices. So this patch: - moves the setting of 'saved_raid_disk' from add_new_disk to the validate_super methods. This makes sure it is always set properly, both when adding a new device to an array, and when assembling an array from a collection of devices. - Adds a metadata flag MD_FEATURE_RECOVERY_BITMAP which is only used if MD_FEATURE_RECOVERY_OFFSET is set, and record that a bitmap-based recovery is allowed. This is only present in v1.x metadata. v0.90 doesn't support devices which are in the middle of recovery at all. - Only skips writing metadata to Faulty devices. - Also allows rdev state to be set to "-insync" via sysfs. This can be used for external-metadata arrays. When the 'role' is set the device is assumed to be in-sync. If, after setting the role, we set the state to "-insync", the role is moved to saved_raid_disk which effectively says the device is partly in-sync with that slot and needs a bitmap recovery. Cc: Andrei Warkentin <andreiw@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md: fix problem when adding device to read-only array with bitmap.NeilBrown2-3/+18
If an array is started degraded, and then the missing device is found it can be re-added and a minimal bitmap-based recovery will bring it fully up-to-date. If the array is read-only a recovery would not be allowed. But also if the array is read-only and the missing device was present very recently, then there could be no need for any recovery at all, so we simply include the device in the read-only array without any recovery. However... if the missing device was removed a little longer ago it could be missing some updates, but if a bitmap is present it will be conditionally accepted pending a bitmap-based update. We don't currently detect this case properly and will include that old device into the read-only array with no recovery even though it really needs a recovery. This patch keeps track of whether a bitmap-based-recovery is really needed or not in the new Bitmap_sync rdev flag. If that is set, then the device will not be added to a read-only array. Cc: Andrei Warkentin <andreiw@vmware.com> Fixes: d70ed2e4fafdbef0800e73942482bb075c21578b Cc: stable@vger.kernel.org (3.2+) Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md/raid10: fix bug when raid10 recovery fails to recover a block.NeilBrown1-4/+4
commit e875ecea266a543e643b19e44cf472f1412708f9 md/raid10 record bad blocks as needed during recovery. added code to the "cannot recover this block" path to record a bad block rather than fail the whole recovery. Unfortunately this new case was placed *after* r10bio was freed rather than *before*, yet it still uses r10bio. This is will crash with a null dereference. So move the freeing of r10bio down where it is safe. Cc: stable@vger.kernel.org (v3.1+) Fixes: e875ecea266a543e643b19e44cf472f1412708f9 Reported-by: Damian Nowak <spam@nowaker.net> URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181 Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md/raid5: fix a recently broken BUG_ON().NeilBrown1-1/+2
commit 6d183de4077191d1201283a9035ce57a9b05254d md/raid5: fix newly-broken locking in get_active_stripe. simplified a BUG_ON, but removed too much so now it sometimes fires when it shouldn't. When the STRIPE_EXPANDING flag is set, the stripe_head might be on a special list while multiple stripe_heads are collected, or it might not be on any list, even a 'free' list when the refcount is zero. As long as STRIPE_EXPANDING is set, it will be found and added back to a list eventually. So both of the BUG_ONs which test for the ->lru being empty or not need to avoid the case where STRIPE_EXPANDING is set. The patch which broke this was marked for -stable, so this patch needs to be applied to any branch that received 6d183de4 Fixes: 6d183de4077191d1201283a9035ce57a9b05254d Cc: stable@vger.kernel.org (any release to which above was applied) Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md/raid1: fix request counting bug in new 'barrier' code.NeilBrown1-2/+1
The new iobarrier implementation in raid1 (which keeps normal writes and resync activity separate) counts every request what is not before the current resync point in either next_window_requests or current_window_requests. It flags that the request is counted by setting ->start_next_window. allow_barrier follows this model exactly and decrements one of the *_window_requests if and only if ->start_next_window is set. However wait_barrier(), which increments *_window_requests uses a slightly different test for setting -.start_next_window (which is set from the return value of this function). So there is a possibility of the counts getting out of sync, and this leads to the resync hanging. So change wait_barrier() to return a non-zero value in exactly the same cases that it increments *_window_requests. But was introduced in 3.13-rc1. Reported-by: Bruno Wolff III <bruno@wolff.to> URL: https://bugzilla.kernel.org/show_bug.cgi?id=68061 Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761 Cc: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md/raid10: fix two bugs in handling of known-bad-blocks.NeilBrown1-2/+2
If we discover a bad block when reading we split the request and potentially read some of it from a different device. The code path of this has two bugs in RAID10. 1/ we get a spin_lock with _irq, but unlock without _irq!! 2/ The calculation of 'sectors_handled' is wrong, as can be clearly seen by comparison with raid1.c This leads to at least 2 warnings and a probable crash is a RAID10 ever had known bad blocks. Cc: stable@vger.kernel.org (v3.1+) Fixes: 856e08e23762dfb92ffc68fd0a8d228f9e152160 Reported-by: Damian Nowak <spam@nowaker.net> URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181 Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md/raid5: Fix possible confusion when multiple write errors occur.NeilBrown1-2/+2
commit 5d8c71f9e5fbdd95650be00294d238e27a363b5c md: raid5 crash during degradation Fixed a crash in an overly simplistic way which could leave R5_WriteError or R5_MadeGood set in the stripe cache for devices for which it is no longer relevant. When those devices are removed and spares added the flags are still set and can cause incorrect behaviour. commit 14a75d3e07c784c004b4b44b34af996b8e4ac453 md/raid5: preferentially read from replacement device if possible. Fixed the same bug if a more effective way, so we can now revert the original commit. Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com> Cc: stable@vger.kernel.org (3.2+ - 3.2 will need a different fix though) Fixes: 5d8c71f9e5fbdd95650be00294d238e27a363b5c Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-13cgroup: remove stray references to css_idHugh Dickins1-1/+0
Trivial: remove the few stray references to css_id, which itself was removed in v3.13's 2ff2a7d03bbe "cgroup: kill css_id". Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-01-10dm cache: add block sizes and total cache blocks to status outputMike Snitzer1-6/+10
Improve cache_status to emit: <metadata block size> <#used metadata blocks>/<#total metadata blocks> <cache block size> <#used cache blocks>/<#total cache blocks> ... Adding the block sizes allows for easier calculation of the overall size of both the metadata and cache devices. Adding <#total cache blocks> provides useful context for how much of the cache is used. Unfortunately these additions to the status will require updates to users' scripts that monitor the cache status. But these changes help provide more comprehensive information about the cache device and will simplify tools that are being developed to manage dm-cache devices -- because they won't need to issue 3 operations to cobble together the information that we can easily provide via a single status ioctl. While updating the status documentation in cache.txt spaces were tabify'd. Requested-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
2014-01-09dm btree: add dm_btree_find_lowest_keyJoe Thornber2-7/+34
dm_btree_find_lowest_key is the reciprocal of dm_btree_find_highest_key. Factor out common code for dm_btree_find_{highest,lowest}_key. dm_btree_find_lowest_key is needed for an upcoming DM target, as such it is best to get this interface in place. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-08bcache: Fix auxiliary search trees for key size > cacheline sizeKent Overstreet1-14/+14
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Don't return -EINTR when insert finishedKent Overstreet1-2/+4
We need to return -EINTR after a split because we invalidated iterators (and freed the btree node) - but if we were finished inserting, we don't want to redo the traversal. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Improve bucket_prio() calculationKent Overstreet2-3/+16
When deciding what order to reuse buckets we take into account both the bucket's priority (which indicates lru order) and also the amount of live data in that bucket. The way they were scaled together wasn't as correct as it could be... this patch improves and documents it. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Add bch_bkey_equal_header()Nicholas Swenson3-8/+11
Checks if two keys have equivalent header fields. (good enough for replacement or merging) Used in bch_bkey_try_merge, and replacing a key in the btree. Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: update bch_bkey_try_mergeNicholas Swenson3-16/+28
Added generic header checks to bch_bkey_try_merge, which then calls the bkey specific function Removed extraneous checks from bch_extent_merge Signed-off-by: Nicholas Swenson <nks@daterainc.com>
2014-01-08bcache: Move insert_fixup() to btree_keys_opsKent Overstreet4-229/+257
Now handling overlapping extents/keys is a method that's specific to what the btree node contains. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Convert sorting to btree_keysKent Overstreet3-36/+33
More work to disentangle various code from struct btree Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Convert debug code to btree_keysKent Overstreet9-217/+264
More work to disentangle various code from struct btree Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Convert btree_iter to struct btree_keysKent Overstreet6-38/+41
More work to disentangle bset.c from struct btree Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Refactor bset_tree sysfs statsKent Overstreet3-47/+54
We're in the process of turning bset.c into library code, so none of the code in that file should know about struct cache_set or struct btree - so, move the btree traversal part of the stats code to sysfs.c. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Add bch_btree_keys_u64s_remaining()Kent Overstreet2-13/+30
Helper function to explicitly check how much space is free in a btree node Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Add struct btree_keysKent Overstreet8-263/+322
Soon, bset.c won't need to depend on struct btree. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Abstract out stuff needed for sortingKent Overstreet9-289/+423
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Rename/shuffle various code aroundKent Overstreet8-276/+341
More work to disentangle bset.c from the rest of the code: Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Add struct bset_sort_stateKent Overstreet6-49/+87
More disentangling bset.c from the rest of the bcache code - soon, the sorting routines won't have any dependencies on any outside structs. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Split out sort_extent_cmp()Kent Overstreet4-32/+73
Only use extent comparison for comparing extents, so we're not using START_KEY() on other key types (i.e. btree pointers) Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Bkey indexing renamingKent Overstreet6-52/+62
More refactoring: node() -> bset_bkey_idx() end() -> bset_bkey_last() Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Make bch_keylist_realloc() take u64s, not nptrsKent Overstreet4-16/+26
Getting away from KEY_PTRS and moving toward KEY_U64s - and getting rid of magic 2s Also - split out the part that checks against journal entry size so as to avoid a dependancy on struct cache_set in bset.c Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Remove/fix some header dependenciesKent Overstreet3-24/+26
In the process of disentagling/libraryizing bset.c from the rest of the bcache code. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Use a mempool for mergesort temporary spaceKent Overstreet3-16/+8
It was a single element mempool before, it's slightly cleaner to just use a real mempool. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Btree verify code improvementsKent Overstreet6-40/+83
Used this fixed code to find and fix the bug fixed by a4d885097b0ac0cd1337f171f2d4b83e946094d4. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: kill index()Kent Overstreet4-8/+24
That was a terrible name for a macro, add some better helpers to replace it. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Trivial error handling fixKent Overstreet1-1/+2
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache/md: Use raid stripe sizeKent Overstreet2-0/+7
Now that we've got code for raid5/6 stripe awareness, bcache just needs to know about the stripes and when writing partial stripes is expensive - we probably don't want to enable this optimization for raid1 or 10, even though they have stripes. So add a flag to queue_limits. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Do bkey_put() in btree_split() error pathKent Overstreet1-1/+4
This error path shouldn't have been hit in practice.. and we've got reworked reserve code coming soon so that it shouldn't _ever_ be bit... but if we've got code for this error path it should be correct. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Rework allocator reservesKent Overstreet7-79/+101
We need a reserve for allocating buckets for new btree nodes - and now that we've got multiple btrees, it really needs to be per btree. This reworks the reserves so we've got separate freelists for each reserve instead of watermarks, which seems to make things a bit cleaner, and it adds some code so that btree_split() can make sure the reserve is available before it starts. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: kill closure locking codeKent Overstreet2-313/+123
Also flesh out the documentation a bit Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: kill closure locking usageKent Overstreet7-55/+98
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Zero less memoryKent Overstreet3-40/+41
Another minor performance optimization Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Don't touch bucket gen for dirty ptrsKent Overstreet2-2/+7
Unnecessary since a bucket that has dirty pointers pointing to it can never be invalidated - and skipping it is a measurable performance boost, since the bucket gen will usually be a cache miss. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
2014-01-08bcache: Minor btree cache fixKent Overstreet1-7/+3
Signed-off-by: Kent Overstreet <kmo@daterainc.com>