summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2014-03-10btrfs: Add threshold workqueue based on kernel workqueueQu Wenruo2-9/+101
The original btrfs_workers has thresholding functions to dynamically create or destroy kthreads. Though there is no such function in kernel workqueue because the worker is not created manually, we can still use the workqueue_set_max_active to simulated the behavior, mainly to achieve a better HDD performance by setting a high threshold on submit_workers. (Sadly, no resource can be saved) So in this patch, extra workqueue pending counters are introduced to dynamically change the max active of each btrfs_workqueue_struct, hoping to restore the behavior of the original thresholding function. Also, workqueue_set_max_active use a mutex to protect workqueue_struct, which is not meant to be called too frequently, so a new interval mechanism is applied, that will only call workqueue_set_max_active after a count of work is queued. Hoping to balance both the random and sequence performance on HDD. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Add high priority workqueue support for btrfs_workqueue_structQu Wenruo2-13/+83
Add high priority function to btrfs_workqueue. This is implemented by embedding a btrfs_workqueue into a btrfs_workqueue and use some helper functions to differ the normal priority wq and high priority wq. So the high priority wq is completely independent from the normal workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Added btrfs_workqueue_struct implemented ordered execution based on ↵Qu Wenruo2-0/+164
kernel workqueue Use kernel workqueue to implement a new btrfs_workqueue_struct, which has the ordering execution feature like the btrfs_worker. The func is executed in a concurrency way, and the ordred_func/ordered_free is executed in the sequence them are queued after the corresponding func is done. The new btrfs_workqueue works much like the original one, one workqueue for normal work and a list for ordered work. When a work is queued, ordered work will be added to the list and helper function will be queued into the workqueue. The helper function will execute a normal work and then check and execute as many ordered work as possible in the sequence they were queued. At this patch, high priority work queue or thresholding is not added yet. The high priority feature and thresholding will be added in the following patches. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Cleanup the unused struct async_sched.Qu Wenruo1-7/+0
The struct async_sched is not used by any codes and can be removed. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fusionio.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: skip search tree for REG filesLiu Bo1-4/+15
It is really unnecessary to search tree again for @gen, @mode and @rdev in the case of REG inodes' creation, as we've got btrfs_inode_item in sctx, and @gen, @mode and @rdev can easily be fetched. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix preallocate vs double nocow writeMiao Xie2-8/+17
We can not release the reserved metadata space for the first write if we find the write position is pre-allocated. Because the kernel might write the data on the disk before we do the second write but after the can-nocow check, if we release the space for the first write, we might fail to update the metadata because of no space. Fix this problem by end nocow write if there is dirty data in the range whose space is pre-allocated. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix wrong lock range and write size in check_can_nocow()Miao Xie1-2/+3
The write range may not be sector-aligned, for example: |--------|--------| <- write range, sector-unaligned, size: 2blocks |--------|--------|--------| <- correct lock range, size: 3blocks But according to the old code, we used the size of write range to calculate the lock range directly, not considered the offset, we would get a wrong lock range: |--------|--------| <- write range, sector-unaligned, size: 2blocks |--------|--------| <- wrong lock range, size: 2blocks And besides that, the old code also had the same problem when calculating the real write size. Correct them. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: simplify allocation code in fs_path_ensure_bufDavid Sterba1-18/+12
Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: fix old buffer length in fs_path_ensure_bufDavid Sterba1-3/+3
In "btrfs: send: lower memory requirements in common case" the code to save the old_buf_len was incorrectly moved to a wrong place and broke the original logic. Reported-by: Filipe David Manana <fdmanana@gmail.com> Signed-off-by: David Sterba <dsterba@suse.cz> Reviewed-by: Filipe David Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: more efficient btrfs_drop_extent_cacheFilipe Manana3-14/+45
While droping extent map structures from the extent cache that cover our target range, we would remove each extent map structure from the red black tree and then add either 1 or 2 new extent map structures if the former extent map covered sections outside our target range. This change simply attempts to replace the existing extent map structure with a new one that covers the subsection we're not interested in, instead of doing a red black remove operation followed by an insertion operation. The number of elements in an inode's extent map tree can get very high for large files under random writes. For example, while running the following test: sysbench --test=fileio --file-num=1 --file-total-size=10G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --max-requests=500000 --file-rw-ratio=2 [prepare|run] I captured the following histogram capturing the number of extent_map items in the red black tree while that test was running: Count: 122462 Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981 Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000 1.000 - 5.231: 452 | 5.231 - 187.392: 87 | 187.392 - 585.911: 206 | 585.911 - 1827.438: 623 | 1827.438 - 5695.245: 1962 # 5695.245 - 17744.861: 6204 #### 17744.861 - 55283.764: 21115 ############ 55283.764 - 172231.000: 91813 ##################################################### Benchmark: sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \ --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \ --file-io-mode=sync --file-fsync-freq=0 [prepare|run] Before this change: 122.1Mb/sec After this change: 125.07Mb/sec (averages of 5 test runs) Test machine: quad core intel i5-3570K, 32Gb of ram, SSD Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: more efficient split extent state insertionFilipe Manana1-5/+8
When we split an extent state there's no need to start the rbtree search from the root node - we can start it from the original extent state node, since we would end up in its subtree if we do the search starting at the root node anyway. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: remove unneeded field / smaller extent_map structureFilipe Manana3-13/+12
We don't need to have an unsigned int field in the extent_map struct to tell us whether the extent map is in the inode's extent_map tree or not. We can use the rb_node struct field and the RB_CLEAR_NODE and RB_EMPTY_NODE macros to achieve the same task. This reduces sizeof(struct extent_map) from 152 bytes to 144 bytes (on a 64 bits system). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: skip locking when searching commit rootWang Shilong1-1/+3
We won't change commit root, skip locking dance with commit root when walking backrefs, this can speed up btrfs send operations. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: wake up @scrub_pause_wait as much as we canWang Shilong1-0/+10
check if @scrubs_running=@scrubs_paused condition inside wait_event() is not an atomic operation which means we may inc/dec @scrub_running/ paused at any time. Let's wake up @scrub_pause_wait as much as we can to let commit transaction blocked less. An example below: Thread1 Thread2 |->scrub_blocked_if_needed() |->scrub_pending_trans_workers_inc |->increase @scrub_paused |->increase @scrub_running |->wake up scrub_pause_wait list |->scrub blocked |->increase @scrub_paused Thread3 is commiting transaction which is blocked at btrfs_scrub_pause(). So after Thread2 increase @scrub_paused, we meet the condition @scrub_paused=@scrub_running, but transaction will be still blocked until another calling to wake up @scrub_pause_wait. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: cancel scrub on transaction abortionWang Shilong1-0/+1
If we fail to commit transaction, we'd better cancel scrub operations. Suggested-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: device_replace: fix deadlock for nocow caseWang Shilong1-2/+15
commit cb7ab02156e4 cause a following deadlock found by xfstests,btrfs/011: Thread1 is commiting transaction which is blocked at btrfs_scrub_pause(). Thread2 is calling btrfs_file_aio_write() which has held inode's @i_mutex and commit transaction(blocked because Thread1 is committing transaction). Thread3 is copy_nocow_page worker which will also try to hold inode @i_mutex, so thread3 will wait Thread1 finished. Thread4 is waiting pending workers finished which will wait Thread3 finished. So the problem is like this: Thread1--->Thread4--->Thread3--->Thread2---->Thread1 Deadlock happens! we fix it by letting Thread1 go firstly, which means we won't block transaction commit while we are waiting pending workers finished. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix a possible deadlock between scrub and transaction committingWang Shilong1-11/+11
btrfs_scrub_continue() will be called when cleaning up transaction.However, this can only be called if btrfs_scrub_pause() is called before. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: Use PTR_ERR_OR_ZEROSachin Kamat1-1/+2
PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it also include missing err.h header. Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix send issuing outdated paths for utimes, chown and chmodFilipe Manana1-19/+12
When doing an incremental send, if we had a directory pending a move/rename operation and none of its parents, except for the immediate parent, were pending a move/rename, after processing the directory's references, we would be issuing utimes, chown and chmod intructions against am outdated path - a path which matched the one in the parent root. This change also simplifies a bit the code that deals with building a path for a directory which has a move/rename operation delayed. Steps to reproduce: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/c/d/e $ mkdir /mnt/btrfs/a/b/c/f $ chmod 0777 /mnt/btrfs/a/b/c/d/e $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send $ mv /mnt/btrfs/a/b/c/f /mnt/btrfs/a/b/f2 $ mv /mnt/btrfs/a/b/c/d/e /mnt/btrfs/a/b/f2/e2 $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2 $ mv /mnt/btrfs/a/b/c2/d /mnt/btrfs/a/b/c2/d2 $ chmod 0700 /mnt/btrfs/a/b/f2/e2 $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send $ umount /mnt/btrfs $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ btrfs receive /mnt/btrfs -f /tmp/base.send $ btrfs receive /mnt/btrfs -f /tmp/incremental.send The second btrfs receive command failed with: ERROR: chmod a/b/c/d/e failed. No such file or directory A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: correctly determine if blocks are shared in btrfs_compare_treesFilipe Manana1-1/+10
Just comparing the pointers (logical disk addresses) of the btree nodes is not completely bullet proof, we have to check if their generation numbers match too. It is guaranteed that a COW operation will result in a block with a different logical disk address than the original block's address, but over time we can reuse that former logical disk address. For example, creating a 2Gb filesystem on a loop device, and having a script running in a loop always updating the access timestamp of a file, resulted in the same logical disk address being reused for the same fs btree block in about only 4 minutes. This could make us skip entire subtrees when doing an incremental send (which is currently the only user of btrfs_compare_trees). However the odds of getting 2 blocks at the same tree level, with the same logical disk address, equal first slot keys and different generations, should hopefully be very low. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix send attempting to rmdir non-empty directoriesFilipe Manana1-26/+221
The incremental send algorithm assumed that it was possible to issue a directory remove (rmdir) if the the inode number it was currently processing was greater than (or equal) to any inode that referenced the directory's inode. This wasn't a valid assumption because any such inode might be a child directory that is pending a move/rename operation, because it was moved into a directory that has a higher inode number and was moved/renamed too - in other words, the case the following commit addressed: 9f03740a956d7ac6a1b8f8c455da6fa5cae11c22 (Btrfs: fix infinite path build loops in incremental send) This made an incremental send issue an rmdir operation before the target directory was actually empty, which made btrfs receive fail. Therefore it needs to wait for all pending child directory inodes to be moved/renamed before sending an rmdir operation. Simple steps to reproduce this issue: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/c/x $ mkdir /mnt/btrfs/a/b/y $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send $ mv /mnt/btrfs/a/b/y /mnt/btrfs/a/b/YY $ mv /mnt/btrfs/a/b/c/x /mnt/btrfs/a/b/YY $ rmdir /mnt/btrfs/a/b/c $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send $ umount /mnt/btrfs $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ btrfs receive /mnt/btrfs -f /tmp/base.send $ btrfs receive /mnt/btrfs -f /tmp/incremental.send The second btrfs receive command failed with: ERROR: rmdir o259-6-0 failed. Directory not empty A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: send, don't send rmdir for same target multiple timesFilipe Manana1-1/+4
When doing an incremental send, if we delete a directory that has N > 1 hardlinks for the same file and that file has the highest inode number inside the directory contents, an incremental send would send N times an rmdir operation against the directory. This made the btrfs receive command fail on the second rmdir instruction, as the target directory didn't exist anymore. Steps to reproduce the issue: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/c $ echo 'ola mundo' > /mnt/btrfs/a/b/c/foo.txt $ ln /mnt/btrfs/a/b/c/foo.txt /mnt/btrfs/a/b/c/bar.txt $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send $ rm -f /mnt/btrfs/a/b/c/foo.txt $ rm -f /mnt/btrfs/a/b/c/bar.txt $ rmdir /mnt/btrfs/a/b/c $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send $ umount /mnt/btrfs $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ btrfs receive /mnt/btrfs -f /tmp/base.send $ btrfs receive /mnt/btrfs -f /tmp/incremental.send The second btrfs receive command failed with: ERROR: rmdir o259-6-0 failed. No such file or directory A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: incremental send, fix invalid path after dir renameFilipe Manana1-7/+34
This fixes yet one more case not caught by the commit titled: Btrfs: fix infinite path build loops in incremental send In this case, even before the initial full send, we have a directory which is a child of a directory with a higher inode number. Then we perform the initial send, and after we rename both the child and the parent, without moving them around. After doing these 2 renames, an incremental send sent a rename instruction for the child directory which contained an invalid "from" path (referenced the parent's old name, not the new one), which made the btrfs receive command fail. Steps to reproduce: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b $ mkdir /mnt/btrfs/d $ mkdir /mnt/btrfs/a/b/c $ mv /mnt/btrfs/d /mnt/btrfs/a/b/c $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/x $ mv /mnt/btrfs/a/b/x/d /mnt/btrfs/a/b/x/y $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send $ umout /mnt/btrfs $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ btrfs receive /mnt/btrfs -f /tmp/base.send $ btrfs receive /mnt/btrfs -f /tmp/incremental.send The second btrfs receive command failed with: "ERROR: rename a/b/c/d -> a/b/x/y failed. No such file or directory" A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: don't insert useless holes when punching beyond the inode's sizeFilipe Manana1-11/+17
If we punch beyond the size of an inode, we'll correctly remove any prealloc extents, but we'll also insert file extent items representing holes (disk bytenr == 0) that start with a key offset that lies beyond the inode's size and are not contiguous with the last file extent item. Example: $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo $XFS_IO_PROG -c "fpunch 582007 864596" $SCRATCH_MNT/foo $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo btrfs-debug-tree output: item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160 inode generation 6 transid 6 size 132254 block group 0 mode 100600 links 1 item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13 inode ref index 2 namelen 3 name: foo item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53 extent data disk byte 0 nr 0 gen 6 extent data offset 0 nr 90112 ram 122880 extent compression 0 item 7 key (257 EXTENT_DATA 90112) itemoff 15766 itemsize 53 extent data disk byte 12845056 nr 4096 gen 6 extent data offset 0 nr 45056 ram 45056 extent compression 2 item 8 key (257 EXTENT_DATA 585728) itemoff 15713 itemsize 53 extent data disk byte 0 nr 0 gen 6 extent data offset 0 nr 860160 ram 860160 extent compression 0 The last extent item, which represents a hole, is useless as it lies beyond the inode's size. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: cleanup delayed-ref.c:find_ref_head()Filipe Manana1-18/+6
The argument last wasn't used, all callers supplied a NULL value for it. Also removed unnecessary intermediate storage of the result of key comparisons. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: remove unnecessary ref heads rb tree searchFilipe Manana1-4/+3
When we didn't find the exact ref head we were looking for, if return_bigger != 0 we set a new search key to match either the next node after the last one we found or the first one in the ref heads rb tree, and then did another full tree search. For both cases this ended up being pointless as we would end up returning an entry we already had before repeating the search. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: wake up transaction thread upon remountJustin Maggard1-0/+1
Now that we can adjust the commit interval with a remount, we need to wake up the transaction thread or else he will continue to sleep until the previous transaction interval has elapsed before waking up. So, if we go from a large commit interval to something smaller, the transaction thread will not wake up until the large interval has expired. This also causes the cleaner thread to stay sleeping, since it gets woken up by the transaction thread. Fix it by simply waking up the transaction thread during a remount. Signed-off-by: Justin Maggard <jmaggard10@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: stop joining the log transaction if sync log failsMiao Xie1-2/+14
If the log sync fails, there is something wrong in the log tree, we should not continue to join the log transaction and log the metadata. What we should do is to do a full commit. This patch fixes this problem by setting ->last_trans_log_full_commit to the current transaction id, it will tell the tasks not to join the log transaction, and do a full commit. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: just wait or commit our own log sub-transactionMiao Xie4-23/+47
We might commit the log sub-transaction which didn't contain the metadata we logged. It was because we didn't record the log transid and just select the current log sub-transaction to commit, but the right one might be committed by the other task already. Actually, we needn't do anything and it is safe that we go back directly in this case. This patch improves the log sync by the above idea. We record the transid of the log sub-transaction in which we log the metadata, and the transid of the log sub-transaction we have committed. If the committed transid is >= the transid we record when logging the metadata, we just go back. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix skipped error handle when log sync failedMiao Xie5-31/+111
It is possible that many tasks sync the log tree at the same time, but only one task can do the sync work, the others will wait for it. But those wait tasks didn't get the result of the log sync, and returned 0 when they ended the wait. It caused those tasks skipped the error handle, and the serious problem was they told the users the file sync succeeded but in fact they failed. This patch fixes this problem by introducing a log context structure, we insert it into the a global list. When the sync fails, we will set the error number of every log context in the list, then the waiting tasks get the error number of the log context and handle the error if need. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: use signed integer instead of unsigned long integer for log transidMiao Xie3-11/+11
The log trans id is initialized to be 0 every time we create a log tree, and the log tree need be re-created after a new transaction is started, it means the log trans id is unlikely to be a huge number, so we can use signed integer instead of unsigned long integer to save a bit space. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: remove unnecessary memory barrier in btrfs_sync_log()Miao Xie1-3/+0
Mutex unlock implies certain memory barriers to make sure all the memory operation completes before the unlock, and the next mutex lock implies memory barriers to make sure the all the memory happens after the lock. So it is a full memory barrier(smp_mb), we needn't add memory barriers. Remove them. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: don't start the log transaction if the log tree init failsMiao Xie1-12/+14
The old code would start the log transaction even the log tree init failed, it was unnecessary. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix the skipped transaction commit during the file syncMiao Xie1-10/+16
We may abort the wait earlier if ->last_trans_log_full_commit was set to the current transaction id, at this case, we need commit the current transaction instead of the log sub-transaction. But the current code didn't tell the caller to do it (return 0, not -EAGAIN). Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: use ACCESS_ONCE to prevent the optimize accesses to ↵Miao Xie1-7/+10
->last_trans_log_full_commit ->last_trans_log_full_commit may be changed by the other tasks without lock, so we need prevent the compiler from the optimize access just like tmp = fs_info->last_trans_log_full_commit if (tmp == ...) ... <do something> if (tmp == ...) ... In fact, we need get the new value of ->last_trans_log_full_commit during the second access. Fix it by ACCESS_ONCE(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: avoid warning bomb of btrfs_invalidate_inodesLiu Bo1-1/+2
So after transaction is aborted, we need to cleanup inode resources by calling btrfs_invalidate_inodes(), and btrfs_invalidate_inodes() hopes roots' refs to be zero in old times and sets a WARN_ON(), however, this is not always true within cleaning up transaction, so we get to detect transaction abortion and not warn at all. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: fix possible deadlock in btrfs_cleanup_transactionLiu Bo1-1/+3
[13654.480669] ====================================================== [13654.480905] [ INFO: possible circular locking dependency detected ] [13654.481003] 3.12.0+ #4 Tainted: G W O [13654.481060] ------------------------------------------------------- [13654.481060] btrfs-transacti/9347 is trying to acquire lock: [13654.481060] (&(&root->ordered_extent_lock)->rlock){+.+...}, at: [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs] [13654.481060] but task is already holding lock: [13654.481060] (&(&fs_info->ordered_root_lock)->rlock){+.+...}, at: [<ffffffffa02d3015>] btrfs_cleanup_transaction+0x1e5/0x570 [btrfs] [13654.481060] which lock already depends on the new lock. [13654.481060] the existing dependency chain (in reverse order) is: [13654.481060] -> #1 (&(&fs_info->ordered_root_lock)->rlock){+.+...}: [13654.481060] [<ffffffff810c4103>] lock_acquire+0x93/0x130 [13654.481060] [<ffffffff81689991>] _raw_spin_lock+0x41/0x50 [13654.481060] [<ffffffffa02f011b>] __btrfs_add_ordered_extent+0x39b/0x450 [btrfs] [13654.481060] [<ffffffffa02f0202>] btrfs_add_ordered_extent+0x32/0x40 [btrfs] [13654.481060] [<ffffffffa02df6aa>] run_delalloc_nocow+0x78a/0x9d0 [btrfs] [13654.481060] [<ffffffffa02dfc0d>] run_delalloc_range+0x31d/0x390 [btrfs] [13654.481060] [<ffffffffa02f7c00>] __extent_writepage+0x310/0x780 [btrfs] [13654.481060] [<ffffffffa02f830a>] extent_write_cache_pages.isra.29.constprop.48+0x29a/0x410 [btrfs] [13654.481060] [<ffffffffa02f879d>] extent_writepages+0x4d/0x70 [btrfs] [13654.481060] [<ffffffffa02d9f68>] btrfs_writepages+0x28/0x30 [btrfs] [13654.481060] [<ffffffff8114be91>] do_writepages+0x21/0x50 [13654.481060] [<ffffffff81140d49>] __filemap_fdatawrite_range+0x59/0x60 [13654.481060] [<ffffffff81140e13>] filemap_fdatawrite_range+0x13/0x20 [13654.481060] [<ffffffffa02f1db9>] btrfs_wait_ordered_range+0x49/0x140 [btrfs] [13654.481060] [<ffffffffa0318fe2>] __btrfs_write_out_cache+0x682/0x8b0 [btrfs] [13654.481060] [<ffffffffa031952d>] btrfs_write_out_cache+0x8d/0xe0 [btrfs] [13654.481060] [<ffffffffa02c7083>] btrfs_write_dirty_block_groups+0x593/0x680 [btrfs] [13654.481060] [<ffffffffa0345307>] commit_cowonly_roots+0x14b/0x20d [btrfs] [13654.481060] [<ffffffffa02d7c1a>] btrfs_commit_transaction+0x43a/0x9d0 [btrfs] [13654.481060] [<ffffffffa030061a>] btrfs_create_uuid_tree+0x5a/0x100 [btrfs] [13654.481060] [<ffffffffa02d5a8a>] open_ctree+0x21da/0x2210 [btrfs] [13654.481060] [<ffffffffa02ab6fe>] btrfs_mount+0x68e/0x870 [btrfs] [13654.481060] [<ffffffff811b2409>] mount_fs+0x39/0x1b0 [13654.481060] [<ffffffff811cd653>] vfs_kern_mount+0x63/0xf0 [13654.481060] [<ffffffff811cfcce>] do_mount+0x23e/0xa90 [13654.481060] [<ffffffff811d05a3>] SyS_mount+0x83/0xc0 [13654.481060] [<ffffffff81692b52>] system_call_fastpath+0x16/0x1b [13654.481060] -> #0 (&(&root->ordered_extent_lock)->rlock){+.+...}: [13654.481060] [<ffffffff810c340a>] __lock_acquire+0x150a/0x1a70 [13654.481060] [<ffffffff810c4103>] lock_acquire+0x93/0x130 [13654.481060] [<ffffffff81689991>] _raw_spin_lock+0x41/0x50 [13654.481060] [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs] [13654.481060] [<ffffffffa02d35ce>] transaction_kthread+0x22e/0x270 [btrfs] [13654.481060] [<ffffffff81079efa>] kthread+0xea/0xf0 [13654.481060] [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0 [13654.481060] other info that might help us debug this: [13654.481060] Possible unsafe locking scenario: [13654.481060] CPU0 CPU1 [13654.481060] ---- ---- [13654.481060] lock(&(&fs_info->ordered_root_lock)->rlock); [13654.481060] lock(&(&root->ordered_extent_lock)->rlock); [13654.481060] lock(&(&fs_info->ordered_root_lock)->rlock); [13654.481060] lock(&(&root->ordered_extent_lock)->rlock); [13654.481060] *** DEADLOCK *** [...] ====================================================== btrfs_destroy_all_ordered_extents() gets &fs_info->ordered_root_lock __BEFORE__ acquiring &root->ordered_extent_lock, while btrfs_[add,remove]_ordered_extent() acquires &fs_info->ordered_root_lock __AFTER__ getting &root->ordered_extent_lock. This patch fixes the above problem. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: faster/more efficient insertion of file extent itemsFilipe David Borba Manana1-22/+30
This is an extension to my previous commit titled: "Btrfs: faster file extent item replace operations" (hash 1acae57b161ef1282f565ef907f72aeed0eb71d9) Instead of inserting the new file extent item if we deleted existing file extent items covering our target file range, also allow to insert the new file extent item if we didn't find any existing items to delete and replace_extent != 0, since in this case our caller would do another tree search to insert the new file extent item anyway, therefore just combine the two tree searches into a single one, saving cpu time, reducing lock contention and reducing btree node/leaf COW operations. This covers the case where applications keep doing tail append writes to files, which for example is the case of Apache CouchDB (its database and view index files are always open with O_APPEND). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: always choose work from prio_head firstStanislaw Gruszka1-4/+5
In case we do not refill, we can overwrite cur pointer from prio_head by one from not prioritized head, what looks as something that was not intended. This change make we always take works from prio_head first until it's not empty. Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Revert "Btrfs: remove transaction from btrfs send"Wang Shilong1-0/+33
This reverts commit 41ce9970a8a6a362ae8df145f7a03d789e9ef9d2. Previously i was thinking we can use readonly root's commit root safely while it is not true, readonly root may be cowed with the following cases. 1.snapshot send root will cow source root. 2.balance,device operations will also cow readonly send root to relocate. So i have two ideas to make us safe to use commit root. -->approach 1: make it protected by transaction and end transaction properly and we research next item from root node(see btrfs_search_slot_for_read()). -->approach 2: add another counter to local root structure to sync snapshot with send. and add a global counter to sync send with exclusive device operations. So with approach 2, send can use commit root safely, because we make sure send root can not be cowed during send. Unfortunately, it make codes *ugly* and more complex to maintain. To make snapshot and send exclusively, device operations and send operation exclusively with each other is a little confusing for common users. So why not drop into previous way. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: skip readonly root for snapshot-aware defragmentWang Shilong1-0/+5
Btrfs send is assuming readonly root won't change, let's skip readonly root. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: switch to btrfs_previous_extent_item()Wang Shilong1-31/+6
Since we have introduced btrfs_previous_extent_item() to search previous extent item, just switch into it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: skip submitting barrier for missing deviceHidetoshi Seto1-0/+4
I got an error on v3.13: BTRFS error (device sdf1) in write_all_supers:3378: errno=-5 IO failure (errors while submitting device barriers.) how to reproduce: > mkfs.btrfs -f -d raid1 /dev/sdf1 /dev/sdf2 > wipefs -a /dev/sdf2 > mount -o degraded /dev/sdf1 /mnt > btrfs balance start -f -sconvert=single -mconvert=single -dconvert=single /mnt The reason of the error is that barrier_all_devices() failed to submit barrier to the missing device. However it is clear that we cannot do anything on missing device, and also it is not necessary to care chunks on the missing device. This patch stops sending/waiting barrier if device is missing. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: unlock extent and pages on error in cow_file_rangeJosef Bacik1-1/+2
When I converted the BUG_ON() for the free_space_cache_inode in cow_file_range I made it so we just return an error instead of unlocking all of our various stuff. This is a mistake and causes us to hang when we run into this. This patch fixes this problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: balance delayed inode updatesJosef Bacik1-0/+4
While trying to reproduce a delayed ref problem I noticed the box kept falling over using all 80gb of my ram with btrfs_inode's and btrfs_delayed_node's. Turns out this is because we only throttle delayed inode updates in btrfs_dirty_inode, which doesn't actually get called that often, especially when all you are doing is creating a bunch of files. So balance delayed inode updates everytime we create a new inode. With this patch we no longer use up all of our ram with delayed inode updates. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: add simple debugfs interfaceDavid Sterba2-6/+32
Help during debugging to export various interesting infromation and tunables without the need of extra mount options or ioctls. Usage: * declare your variable in sysfs.h, and include where you need it * define the variable in sysfs.c and make it visible via debugfs_create_TYPE Depends on CONFIG_DEBUG_FS. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: lower memory requirements in common caseDavid Sterba1-69/+37
The fs_path structure uses an inline buffer and falls back to a chain of allocations, but vmalloc is not necessary because PATH_MAX fits into PAGE_SIZE. The size of fs_path has been reduced to 256 bytes from PAGE_SIZE, usually 4k. Experimental measurements show that most paths on a single filesystem do not exceed 200 bytes, and these get stored into the inline buffer directly, which is now 230 bytes. Longer paths are kmalloced when needed. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: make some tree searches in send.c more efficientFilipe David Borba Manana1-41/+64
We have this pattern where we do search for a contiguous group of items in a tree and everytime we find an item, we process it, then we release our path, increment the offset of the search key, do another full tree search and repeat these steps until a tree search can't find more items we're interested in. Instead of doing these full tree searches after processing each item, just process the next item/slot in our leaf and don't release the path. Since all these trees are read only and we always use the commit root for a search and skip node/leaf locks, we're not affecting concurrency on the trees. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10Btrfs: use right extent item position in send when finding extent clonesFilipe David Borba Manana1-2/+0
This was a leftover from the commit: 74dd17fbe3d65829e75d84f00a9525b2ace93998 (Btrfs: fix btrfs send for inline items and compression) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10btrfs: send: remove BUG_ON from name_cache_deleteDavid Sterba1-2/+9
If cleaning the name cache fails, we could try to proceed at the cost of some memory leak. This is not expected to happen often. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>