summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2019-12-13btrfs: return error pointer from alloc_test_extent_bufferDan Carpenter3-6/+8
Callers of alloc_test_extent_buffer have not correctly interpreted the return value as error pointer, as alloc_test_extent_buffer should behave as alloc_extent_buffer. The self-tests were unaffected but btrfs_find_create_tree_block could call both functions and that would cause problems up in the call chain. Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: fix devs_max constraints for raid1c3 and raid1c4David Sterba1-2/+2
The value 0 for devs_max means to spread the allocated chunks over all available devices, eg. stripe for RAID0 or RAID5. This got mistakenly copied to the RAID1C3/4 profiles. The intention is to have exactly 3 and 4 copies respectively. Fixes: 47e6f7423b91 ("btrfs: add support for 3-copy replication (raid1c3)") Fixes: 8d6fac0087e5 ("btrfs: add support for 4-copy replication (raid1c4)") Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: tree-checker: Fix error format string for size_tAndreas Färber1-1/+1
Argument BTRFS_FILE_EXTENT_INLINE_DATA_START is defined as offsetof(), which returns type size_t, so we need %zu instead of %lu. This fixes a build warning on 32-bit ARM: ../fs/btrfs/tree-checker.c: In function 'check_extent_data_item': ../fs/btrfs/tree-checker.c:230:43: warning: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'unsigned int' [-Wformat=] 230 | "invalid item size, have %u expect [%lu, %u)", | ~~^ | long unsigned int | %u Fixes: 153a6d299956 ("btrfs: tree-checker: Check item size before reading file extent type") Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andreas Färber <afaerber@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: don't double lock the subvol_sem for rename exchangeJosef Bacik1-6/+4
If we're rename exchanging two subvols we'll try to lock this lock twice, which is bad. Just lock once if either of the ino's are subvols. Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: handle error in btrfs_cache_block_groupJosef Bacik1-2/+18
We have a BUG_ON(ret < 0) in find_free_extent from btrfs_cache_block_group. If we fail to allocate our ctl we'll just panic, which is not good. Instead just go on to another block group. If we fail to find a block group we don't want to return ENOSPC, because really we got a ENOMEM and that's the root of the problem. Save our return from btrfs_cache_block_group(), and then if we still fail to make our allocation return that ret so we get the right error back. Tested with inject-error.py from bcc. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: do not call synchronize_srcu() in inode_tree_delJosef Bacik1-2/+0
Testing with the new fsstress uncovered a pretty nasty deadlock with lookup and snapshot deletion. Process A unlink -> final iput -> inode_tree_del -> synchronize_srcu(subvol_srcu) Process B btrfs_lookup <- srcu_read_lock() acquired here -> btrfs_iget -> find inode that has I_FREEING set -> __wait_on_freeing_inode() We're holding the srcu_read_lock() while doing the iget in order to make sure our fs root doesn't go away, and then we are waiting for the inode to finish freeing. However because the free'ing process is doing a synchronize_srcu() we deadlock. Fix this by dropping the synchronize_srcu() in inode_tree_del(). We don't need people to stop accessing the fs root at this point, we're only adding our empty root to the dead roots list. A larger much more invasive fix is forthcoming to address how we deal with fs roots, but this fixes the immediate problem. Fixes: 76dda93c6ae2 ("Btrfs: add snapshot/subvolume destroy ioctl") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13Btrfs: fix cloning range with a hole when using the NO_HOLES featureFilipe Manana1-2/+2
When using the NO_HOLES feature if we clone a range that contains a hole and a temporary ENOSPC happens while dropping extents from the target inode's range, we can end up failing and aborting the transaction with -EEXIST or with a corrupt file extent item, that has a length greater than it should and overlaps with other extents. For example when cloning the following range from inode A to inode B: Inode A: extent A1 extent A2 [ ----------- ] [ hole, implicit, 4MB length ] [ ------------- ] 0 1MB 5MB 6MB Range to clone: [1MB, 6MB) Inode B: extent B1 extent B2 extent B3 extent B4 [ ---------- ] [ --------- ] [ ---------- ] [ ---------- ] 0 1MB 1MB 2MB 2MB 5MB 5MB 6MB Target range: [1MB, 6MB) (same as source, to make it easier to explain) The following can happen: 1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents(); 2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents() set 'drop_end' to 2MB, meaning it was able to drop only extent B2; 3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 2MB - 1MB = 1MB; 4) We then attempt to insert a file extent item at inode B with a file offset of 5MB, which is the value of clone_info->file_offset. This fails with error -EEXIST because there's already an extent at that offset (extent B4); 5) We abort the current transaction with -EEXIST and return that error to user space as well. Another example, for extent corruption: Inode A: extent A1 extent A2 [ ----------- ] [ hole, implicit, 10MB length ] [ ------------- ] 0 1MB 11MB 12MB Inode B: extent B1 extent B2 [ ----------- ] [ --------- ] [ ----------------------------- ] 0 1MB 1MB 5MB 5MB 12MB Target range: [1MB, 12MB) (same as source, to make it easier to explain) 1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents(); 2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents() set 'drop_end' to 5MB, meaning it was able to drop only extent B2; 3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 5MB - 1MB = 4MB; 4) We then insert a file extent item at inode B with a file offset of 11MB which is the value of clone_info->file_offset, and a length of 4MB (the value of 'clone_len'). So we get 2 extents items with ranges that overlap and an extent length of 4MB, larger then the extent A2 from inode A (1MB length); 5) After that we end the transaction, balance the btree dirty pages and then start another or join the previous transaction. It might happen that the transaction which inserted the incorrect extent was committed by another task so we end up with extent corruption if a power failure happens. So fix this by making sure we attempt to insert the extent to clone at the destination inode only if we are past dropping the sub-range that corresponds to a hole. Fixes: 690a5dbfc51315 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: Fix error messages in qgroup_rescan_initNikolay Borisov1-2/+2
The branch of qgroup_rescan_init which is executed from the mount path prints wrong errors messages. The textual print out in case BTRFS_QGROUP_STATUS_FLAG_RESCAN/BTRFS_QGROUP_STATUS_FLAG_ON are not set are transposed. Fix it by exchanging their place. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: drop bdev argument from submit_extent_pageDavid Sterba1-8/+3
After previous patches removing bdev being passed around to set it to bio, it has become unused in submit_extent_page. So it now has "only" 13 parameters. Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: remove extent_map::bdevDavid Sterba8-31/+2
We can now remove the bdev from extent_map. Previous patches made sure that bio_set_dev is correctly in all places and that we don't need to grab it from latest_bdev or pass it around inside the extent map. Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: drop bio_set_dev where not neededDavid Sterba2-11/+0
bio_set_dev sets a bdev to a bio and is not only setting a pointer bug also changing some state bits if there was a different bdev set before. This is one thing that's not needed. Another thing is that setting a bdev at bio allocation time is too early and actually does not work with plain redundancy profiles, where each time we submit a bio to a device, the bdev is set correctly. In many places the bio bdev is set to latest_bdev that seems to serve as a stub pointer "just to put something to bio". But we don't have to do that. Where do we know which bdev to set: * for regular IO: submit_stripe_bio that's called by btrfs_map_bio * repair IO: repair_io_failure, read or write from specific device * super block write (using buffer_heads but uses raw bdev) and barriers * scrub: this does not use all regular IO paths as it needs to reach all copies, verify and fixup eventually, and for that all bdev management is independent * raid56: rbio_add_io_page, for the RMW write * integrity-checker: does it's own low-level block tracking Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: get bdev directly from fs_devices in submit_extent_pageDavid Sterba1-1/+4
This is preparatory patch to remove @bdev parameter from submit_extent_page. It can't be removed completely, because the cgroups need it for wbc when initializing the bio wbc_init_bio bio_associate_blkg_from_css dereference bdev->bi_disk->queue The bdev pointer is the same as latest_bdev, thus no functional change. We can retrieve it from fs_devices that's reachable through several dereferences. The local variable shadows the parameter, but that's only temporary. Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: record all roots for rename exchange on a subvolJosef Bacik1-0/+3
Testing with the new fsstress support for subvolumes uncovered a pretty bad problem with rename exchange on subvolumes. We're modifying two different subvolumes, but we only start the transaction on one of them, so the other one is not added to the dirty root list. This is caught by btrfs_cow_block() with a warning because the root has not been updated, however if we do not modify this root again we'll end up pointing at an invalid root because the root item is never updated. Fix this by making sure we add the destination root to the trans list, the same as we do with normal renames. This fixes the corruption. Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18Btrfs: fix block group remaining RO forever after error during device replaceFilipe Manana3-45/+2
When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we set the block group to RO mode and then wait for any ongoing writes into extents of the block group to complete. While doing that wait we overwrite the value of the variable 'ret' and can break out of the loop if an error happens without turning the block group back into RW mode. So what happens is the following: 1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group to RO mode (its ->ro field set to 1 or incremented to some value > 1); 2) Then btrfs_wait_ordered_roots() returns a value > 0; 3) Then if either joining or committing the transaction fails, we break out of the loop wihtout calling btrfs_dec_block_group_ro(), leaving the block group in RO mode forever. To fix this, just remove the code that waits for ongoing writes to extents of the block group, since it's not needed because in the initial setup phase of a device replace operation, before starting to find all chunks and their extents, we set the target device for replace while holding fs_info->dev_replace->rwsem, which ensures that after releasing that semaphore, any writes into the source device are made to the target device as well (__btrfs_map_block() guarantees that). So while at scrub_enumerate_chunks() we only need to worry about finding and copying extents (from the source device to the target device) that were written before we started the device replace operation. Fixes: f0e9b7d6401959 ("Btrfs: fix race setting block group readonly during device replace") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: scrub: Don't check free space before marking a block group ROQu Wenruo4-20/+54
[BUG] When running btrfs/072 with only one online CPU, it has a pretty high chance to fail: btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg) - output mismatch (see xfstests-dev/results//btrfs/072.out.bad) --- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800 +++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800 @@ -1,2 +1,3 @@ QA output created by 072 Silence is golden +Scrub find errors in "-m dup -d single" test ... And with the following call trace: BTRFS info (device dm-5): scrub: started on devid 1 ------------[ cut here ]------------ BTRFS: Transaction aborted (error -27) WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs] CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs] Call Trace: __btrfs_end_transaction+0xdb/0x310 [btrfs] btrfs_end_transaction+0x10/0x20 [btrfs] btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs] scrub_enumerate_chunks+0x264/0x940 [btrfs] btrfs_scrub_dev+0x45c/0x8f0 [btrfs] btrfs_ioctl+0x31a1/0x3fb0 [btrfs] do_vfs_ioctl+0x636/0xaa0 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x43/0x50 do_syscall_64+0x79/0xe0 entry_SYSCALL_64_after_hwframe+0x49/0xbe ---[ end trace 166c865cec7688e7 ]--- [CAUSE] The error number -27 is -EFBIG, returned from the following call chain: btrfs_end_transaction() |- __btrfs_end_transaction() |- btrfs_create_pending_block_groups() |- btrfs_finish_chunk_alloc() |- btrfs_add_system_chunk() This happens because we have used up all space of btrfs_super_block::sys_chunk_array. The root cause is, we have the following bad loop of creating tons of system chunks: 1. The only SYSTEM chunk is being scrubbed It's very common to have only one SYSTEM chunk. 2. New SYSTEM bg will be allocated As btrfs_inc_block_group_ro() will check if we have enough space after marking current bg RO. If not, then allocate a new chunk. 3. New SYSTEM bg is still empty, will be reclaimed During the reclaim, we will mark it RO again. 4. That newly allocated empty SYSTEM bg get scrubbed We go back to step 2, as the bg is already mark RO but still not cleaned up yet. If the cleaner kthread doesn't get executed fast enough (e.g. only one CPU), then we will get more and more empty SYSTEM chunks, using up all the space of btrfs_super_block::sys_chunk_array. [FIX] Since scrub/dev-replace doesn't always need to allocate new extent, especially chunk tree extent, so we don't really need to do chunk pre-allocation. To break above spiral, here we introduce a new parameter to btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we need extra chunk pre-allocation. For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass @do_chunk_alloc=false. This should keep unnecessary empty chunks from popping up for scrub. Also, since there are two parameters for btrfs_inc_block_group_ro(), add more comment for it. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: change btrfs_fs_devices::rotating to boolJohannes Thumshirn2-4/+4
struct btrfs_fs_devices::rotating currently is declared as an integer variable but only used as a boolean. Change the variable definition to bool and update to code touching it to set 'true' and 'false'. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: change btrfs_fs_devices::seeding to boolJohannes Thumshirn2-5/+5
struct btrfs_fs_devices::seeding currently is declared as an integer variable but only used as a boolean. Change the variable definition to bool and update to code touching it to set 'true' and 'false'. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: rename btrfs_block_group_cacheDavid Sterba25-281/+266
The type name is misleading, a single entry is named 'cache' while this normally means a collection of objects. Rename that everywhere. Also the identifier was quite long, making function prototypes harder to format. Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: block-group: Reuse the item key from caller of read_one_block_group()Qu Wenruo1-9/+8
For read_one_block_group(), its only caller has already got the item key to search next block group item. So we can use that key directly without doing our own convertion on stack. Also, since that key used in btrfs_read_block_groups() is vital for block group item search, add 'const' keyword for that parameter to prevent read_one_block_group() to modify it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: block-group: Refactor btrfs_read_block_groups()Qu Wenruo1-111/+108
Refactor the work inside the loop of btrfs_read_block_groups() into one separate function, read_one_block_group(). This allows read_one_block_group to be reused for later BG_TREE feature. The refactor does the following extra fix: - Use btrfs_fs_incompat() to replace open-coded feature check Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: document extent buffer lockingDavid Sterba1-14/+158
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: access eb::blocking_writers according to ACCESS_ONCE policiesDavid Sterba1-11/+21
A nice writeup of the LKMM (Linux Kernel Memory Model) rules for access once policies can be found here https://lwn.net/Articles/799218/#Access-Marking%20Policies . The locked and unlocked access to eb::blocking_writers should be annotated accordingly, following this: Writes: - locked write must use ONCE, may use plain read - unlocked write must use ONCE Reads: - unlocked read must use ONCE - locked read may use plain read iff not mixed with unlocked read - unlocked read then locked must use ONCE There's one difference on the assembly level, where btrfs_tree_read_lock_atomic and btrfs_try_tree_read_lock used the cached value and did not reevaluate it after taking the lock. This could have missed some opportunities to take the lock in case blocking writers changed between the calls, but the window is just a few instructions long. As this is in try-lock, the callers handle that. Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: set blocking_writers directly, no increment or decrementDavid Sterba1-2/+2
The increment and decrement was inherited from previous version that used atomics, switched in commit 06297d8cefca ("btrfs: switch extent_buffer blocking_writers from atomic to int"). The only possible values are 0 and 1 so we can set them directly. The generated assembly (gcc 9.x) did the direct value assignment in btrfs_set_lock_blocking_write (asm diff after change in 06297d8cefca): 5d: test %eax,%eax 5f: je 62 <btrfs_set_lock_blocking_write+0x22> 61: retq - 62: lock incl 0x44(%rdi) - 66: add $0x50,%rdi - 6a: jmpq 6f <btrfs_set_lock_blocking_write+0x2f> + 62: movl $0x1,0x44(%rdi) + 69: add $0x50,%rdi + 6d: jmpq 72 <btrfs_set_lock_blocking_write+0x32> The part in btrfs_tree_unlock did a decrement because BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but otherwise the output looks safe: - lock decl 0x44(%rdi) + sub $0x1,%eax + mov %eax,0x44(%rdi) Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: merge blocking_writers branches in btrfs_tree_read_lockDavid Sterba1-13/+14
There are two ifs that use eb::blocking_writers. As this is a variable modified inside and outside of locks, we could minimize number of accesses to avoid problems with getting different results at different times. The access here is locked so this can only race with btrfs_tree_unlock that sets blocking_writers to 0 without lock and unsets the lock owner. The first branch is taken only if the same thread already holds the lock, the second if checks for blocking writers. Here we'd either unlock and wait, or proceed. Both are valid states of the locking protocol. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: drop incompat bit for raid1c34 after last block group is goneDavid Sterba1-9/+18
When there are no raid1c3 or raid1c4 block groups left after balance (either convert or with other filters applied), remove the incompat bit. This is already done for RAID56, do the same for RAID1C34. Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: add incompat for raid1 with 3, 4 copiesDavid Sterba3-1/+13
The new raid1c3 and raid1c4 profiles are backward incompatible and the name shall be 'raid1c34', the status can be found in the global supported features in /sys/fs/btrfs/features or in the per-filesystem directory. Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: add support for 4-copy replication (raid1c4)David Sterba4-2/+18
Add new block group profile to store 4 copies in a simliar way that current RAID1 does. The profile attributes and constraints are defined in the raid table and used by the same code that already handles the 2- and 3-copy RAID1. The minimum number of devices is 4, the maximum number of devices/chunks that can be lost/damaged is 3. There is no comparable traditional RAID level, the profile is added for future needs to accompany triple-parity and beyond. Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: add support for 3-copy replication (raid1c3)David Sterba4-4/+23
Add new block group profile to store 3 copies in a simliar way that current RAID1 does. The profile attributes and constraints are defined in the raid table and used by the same code that already handles the 2-copy RAID1. The minimum number of devices is 3, the maximum number of devices/chunks that can be lost/damaged is 2. Like RAID6 but with 33% space utilization. Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: sink write flags to cow_file_range_asyncDavid Sterba1-5/+3
In commit "Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios", cow_file_range_async gained wbc as a parameter and this makes passing write flags redundant. Set it inside the function and remove the parameter. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: sink write_flags to __extent_writepage_ioDavid Sterba1-5/+3
__extent_writepage reads write flags from wbc and passes both to __extent_writepage_io. This makes write_flags redundant and we can remove it. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18Btrfs: send, skip backreference walking for extents with many referencesFilipe Manana1-1/+24
Backreference walking, which is used by send to figure if it can issue clone operations instead of write operations, can be very slow and use too much memory when extents have many references. This change simply skips backreference walking when an extent has more than 64 references, in which case we fallback to a write operation instead of a clone operation. This limit is conservative and in practice I observed no signicant slowdown with up to 100 references and still low memory usage up to that limit. This is a temporary workaround until there are speedups in the backref walking code, and as such it does not attempt to add extra interfaces or knobs to tweak the threshold. Reported-by: Atemu <atemu.main@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18Btrfs: send, allow clone operations within the same fileFilipe Manana1-5/+13
For send we currently skip clone operations when the source and destination files are the same. This is so because clone didn't support this case in its early days, but support for it was added back in May 2013 by commit a96fbc72884fcb ("Btrfs: allow file data clone within a file"). This change adds support for it. Example: $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt/sdd $ xfs_io -f -c "pwrite -S 0xab -b 64K 0 64K" /mnt/sdd/foobar $ xfs_io -c "reflink /mnt/sdd/foobar 0 64K 64K" /mnt/sdd/foobar $ btrfs subvolume snapshot -r /mnt/sdd /mnt/sdd/snap $ mkfs.btrfs -f /dev/sde $ mount /dev/sde /mnt/sde $ btrfs send /mnt/sdd/snap | btrfs receive /mnt/sde Without this change file foobar at the destination has a single 128Kb extent: $ filefrag -v /mnt/sde/snap/foobar Filesystem type is: 9123683e File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 31: 0.. 31: 32: last,unknown_loc,delalloc,eof /mnt/sde/snap/foobar: 1 extent found With this we get a single 64Kb extent that is shared at file offsets 0 and 64K, just like in the source filesystem: $ filefrag -v /mnt/sde/snap/foobar Filesystem type is: 9123683e File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 15: 3328.. 3343: 16: shared 1: 16.. 31: 3328.. 3343: 16: 3344: last,shared,eof /mnt/sde/snap/foobar: 2 extents found Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Ensure we trim ranges across block group boundaryQu Wenruo2-12/+35
[BUG] When deleting large files (which cross block group boundary) with discard mount option, we find some btrfs_discard_extent() calls only trimmed part of its space, not the whole range: btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50% type: bbio->map_type, in above case, it's SINGLE DATA. start: Logical address of this trim len: Logical length of this trim trimmed: Physically trimmed bytes ratio: trimmed / len Thus leaving some unused space not discarded. [CAUSE] When discard mount option is specified, after a transaction is fully committed (super block written to disk), we begin to cleanup pinned extents in the following call chain: btrfs_commit_transaction() |- btrfs_finish_extent_commit() |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY); |- btrfs_discard_extent() However, pinned extents are recorded in an extent_io_tree, which can merge adjacent extent states. When a large file gets deleted and it has adjacent file extents across block group boundary, we will get a large merged range like this: |<--- BG1 --->|<--- BG2 --->| |//////|<-- Range to discard --->|/////| To discard that range, we have the following calls: btrfs_discard_extent() |- btrfs_map_block() | Returned bbio will end at BG1's end. As btrfs_map_block() | never returns result across block group boundary. |- btrfs_issuse_discard() Issue discard for each stripe. So we will only discard the range in BG1, not the remaining part in BG2. Furthermore, this bug is not that reliably observed, for above case, if there is no other extent in BG2, BG2 will be empty and btrfs will trim all space of BG2, covering up the bug. [FIX] - Allow __btrfs_map_block_for_discard() to modify @length parameter btrfs_map_block() uses its @length paramter to notify the caller how many bytes are mapped in current call. With __btrfs_map_block_for_discard() also modifing the @length, btrfs_discard_extent() now understands when to do extra trim. - Call btrfs_map_block() in a loop until we hit the range end Since we now know how many bytes are mapped each time, we can iterate through each block group boundary and issue correct trim for each range. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: volumes: Use more straightforward way to calculate map lengthQu Wenruo1-1/+1
The old code goes: offset = logical - em->start; length = min_t(u64, em->len - offset, length); Where @length calculation is dependent on offset, it can take reader several more seconds to find it's just the same code as: offset = logical - em->start; length = min_t(u64, em->start + em->len - logical, length); Use above code to make the length calculate independent from other variable, thus slightly increase the readability. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: tree-checker: Check item size before reading file extent typeQu Wenruo1-0/+11
In check_extent_data_item(), we read file extent type without verifying if the item size is valid. Add such check to ensure the file extent type we read is correct. The check is not as accurate as we need to cover both inline and regular extents, so it only checks if the item size is larger or equal to inline header. So the existing size checks on inline/regular extents are still needed. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: clean up locking name in scrub_enumerate_chunks()Dan Carpenter1-3/+3
The "&fs_info->dev_replace.rwsem" and "&dev_replace->rwsem" refer to the same lock but Smatch is not clever enough to figure that out so it leads to static checker warnings. It's better to use it consistently anyway. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Streamline btrfs_fs_info::backup_root_index semanticsNikolay Borisov1-39/+12
The backup_root_index member stores the index at which the backup root should be saved upon next transaction commit. However, there is a small deviation from this behavior in the form of a check in backup_super_roots which checks if current root generation equals to the generation of the previous root. This can trigger in the following scenario: slot0: gen-2 slot1: gen-1 slot2: gen slot3: unused Now suppose slot3 (which is also the root specified in the super block) is corrupted hence init_tree_roots chooses to use the backup root at slot2, meaning read_backup_root will read slot2 and assign the superblock generation to gen-1. Despite this backup_root_index will point at slot3 because its init happens in init_backup_root_slot, long before any parsing of the backup roots occur. Then on next transaction start, gen-1 will be incremented by 1 making the root's generation equal gen. Subsequently, on transaction commit the following check triggers: if (btrfs_backup_tree_root_gen(root_backup) == btrfs_header_generation(info->tree_root->node)) This causes the 'next_backup', which is the index at which the backup is going to be written to, to set to last_backup, which will be slot2. All of this is a very confusing way of expressing the following invariant: Always write a backup root at the index following the last used backup root. This commit streamlines this logic by setting backup_root_index to the next index after the one used for mount. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Rename find_oldest_super_backup to init_backup_root_slotNikolay Borisov1-6/+4
The old name name was an awful misnomer because it didn't really find the oldest super backup per-se but rather its slot. For example if we have: slot0: gen - 2 slot1: gen - 1 slot2: gen slot3: empty init_backup_root_slot will return slot3 and not slot0. The new name is more appropriate since the function doesn't care whether there is a valid backup in the returned slot or not. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Remove unused next_root_backup functionNikolay Borisov1-50/+0
This function has been superseded by previous commits and is no longer used so just remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Don't use objectid_mutex during mountNikolay Borisov1-3/+4
Since the filesystem is not well formed and no trees are loaded it's pointless holding the objectid_mutex. Just remove its usage. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Factor out tree roots initialization during mountNikolay Borisov1-57/+85
The code responsible for reading and initializing tree roots is scattered in open_ctree among 2 labels, emulating a loop. This is rather confusing to reason about. Instead, factor the code to a new function, init_tree_roots which implements the same logical flow. There are a couple of notable differences, namely: * Instead of using next_backup_root it's using the newly introduced read_backup_root. * If read_backup_root returns an error init_tree_roots propagates the error and there is no special handling of that case e.g. the code jumps straight to 'fail_tree_roots' label. The old code, however, was (erroneously) jumping to 'fail_block_groups' label if next_backup_root did fail, this was unnecessary since the tree roots init logic doesn't modify the state of block groups. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Add read_backup_rootNikolay Borisov1-0/+44
This function will replace next_root_backup with a much saner/cleaner interface. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Remove newest_gen argument from find_oldest_super_backupNikolay Borisov1-4/+2
It's no longer needed following cleanups around find_newest_backup_root Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Cleanup and simplify find_newest_super_backupNikolay Borisov1-20/+11
Backup roots are always written in a circular manner. By definition we can only ever have 1 backup root whose generation equals to that of the superblock. Hence, the 'if' in the for loop will trigger at most once. This is sufficient to return the newest backup root. Furthermore the newest_gen parameter is always set to the generation of the superblock. This value can be obtained from the fs_info. This patch removes the unnecessary code dealing with the wraparound case and makes 'newest_gen' a local variable. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18Btrfs: remove unnecessary delalloc mutex for inodesFilipe Manana3-20/+5
The inode delalloc mutex was added a long time ago by commit f248679e86fea ("Btrfs: add a delalloc mutex to inodes for delalloc reservations"), and the reason for its introduction is not very clear from the change log. It claims it solves bogus warnings from lockdep, however it lacks an example report/warning from lockdep, or any explanation. Since we have enough concurrentcy protection from the locks of the space info and block reserve objects, and such lockdep warnings don't seem to exist anymore (at least on a 5.3 kernel I couldn't get them with fstests, ltp, fs_mark, etc), remove it, simplifying things a bit and decreasing the size of the btrfs_inode structure. With some quick fio tests doing direct IO and mmap writes I couldn't observe any significant performance increase either (direct IO writes that don't increase the file's size don't hold the inode's lock for their entire duration and mmap writes don't hold the inode's lock at all), which are the only type of writes that could see any performance gain due to less serialization. Review feedback from Josef: The problem was taking the i_mutex in mmap, which is how I was protecting delalloc reservations originally. The delalloc mutex didn't come with all of the other dependencies. That's what the lockdep messages were about, removing the lock isn't going to make them appear again. We _had_ to lock around this because we used to do tricks to keep from over-reserving, and if we didn't serialize delalloc reservations we'd end up with ugly accounting problems when we tried to clean things up. However with my recentish changes this isn't the case anymore. Every operation is responsible for reserving its space, and then adding it to the inode. Then cleaning up is straightforward and can't be mucked up by other users. So we no longer need the delalloc mutex to safe us from ourselves. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18Btrfs: remove wait queue from space_info structureFilipe Manana2-2/+0
It is not used anymore since commit 957780eb2788d8 ("Btrfs: introduce ticketed enospc infrastructure"), so just remove it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: remove cached space_info in btrfs_statfs()Johannes Thumshirn1-2/+1
In btrfs_statfs() we cache fs_info::space_info in a local variable only to use it once in a list_for_each_rcu() statement. Not only is the local variable unnecessary it even makes the code harder to follow as it's not clear which list it is iterating. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: add dedicated members for start and length of a block groupDavid Sterba14-216/+202
The on-disk format of block group item makes use of the key that stores the offset and length. This is further used in the code, although this makes thing harder to understand. The key is also packed so the offset/length is not properly aligned as u64. Add start (key.objectid) and length (key.offset) members to block group and remove the embedded key. When the item is searched or written, a local variable for key is used. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: rename extent buffer block group item accessorsDavid Sterba2-6/+6
Accessors defined by BTRFS_SETGET_FUNCS take a raw extent buffer and manipulate the items there, there's no special prefix required. The block group accessors had _disk_ because previously the names were occupied by the on-stack accessors. As this has been addressed in the previous patch, we can now unify the naming. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: rename block_group_item on-stack accessors to follow namingDavid Sterba3-17/+17
All accessors defined by BTRFS_SETGET_STACK_FUNCS contain _stack_ in the name, the block group ones were not following that scheme, so let's switch them. Signed-off-by: David Sterba <dsterba@suse.com>