Age | Commit message | Author | Files | Lines
2012-10-04 | Btrfs: fix race when getting the eb out of page->private | Josef Bacik | 1 | -4/+19
We can race between checking whether PagePrivate is set on a page and actually reading the eb saved in the page's private pointer. The page could easily have been written out and released between the pagevec lookup and the point where we get around to looking at it. So use mapping->private_lock to ensure we get a consistent view of the page->private pointer. This is in line with the alloc and releasepage paths, which use private_lock when manipulating page->private. Thanks, Reported-by: David Sterba <dave@jikos.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-04 | Btrfs: do not hold the write_lock on the extent tree while logging | Josef Bacik | 3 | -5/+20
Dave Sterba pointed out a sleeping while atomic bug while doing fsync. This is because I'm an idiot and didn't realize that rwlocks were spin locks, so we've been holding this thing while doing allocations and such, which is not good. This patch fixes this by dropping the write lock before we do anything heavy and re-acquiring it when we are done. We also need to take a ref on the ems in case their corresponding pages are evicted, and mark them as being logged so that releasepage does not free them or take them off our local list. Thanks, Reported-by: Dave Sterba <dave@jikos.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-04 | Btrfs: fix race with freeze and free space inodes | Josef Bacik | 1 | -2/+11
We start our freeze, then somebody comes in and does an fsync() on a file where we have to commit a transaction for whatever reason, and we deadlock: the freeze is waiting for FS_FREEZE writers to stop writing to the file system, but the transaction is waiting for its free space inodes to be written out, which are in turn waiting on sb_start_intwrite() while trying to write the file extents. To fix this we'll just skip the sb_start_intwrite() if we are doing a TRANS_JOIN_NOLOCK, since in that case we're being waited on by a transaction commit, so we're safe with respect to freeze and this keeps us from deadlocking. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-04 | Btrfs: kill obsolete arguments in btrfs_wait_ordered_extents | Liu Bo | 6 | -18/+7
nocow_only is now an obsolete argument. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-04 | Btrfs: cleanup fs_info->hashers | Liu Bo | 2 | -2/+0
fs_info->hashers is now obsolete. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-04 | Btrfs: cleanup for duplicated code in find_free_extent | Liu Bo | 1 | -4/+0
There is already an 'add free space' step in front of this one, so we needn't redo it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-04 | Btrfs: fix race in sync and freeze again | Josef Bacik | 3 | -10/+18
I screwed this up: there is a race between checking whether there is a running transaction and actually starting a transaction in sync, where we could race with a freezer and get ourselves into trouble. To fix this we need to make a new join type that only does the try lock on the freeze stuff. If it fails we'll return EPERM and just return from sync. This fixes a hang Liu Bo reported when running xfstest 68 in a loop. Thanks, Reported-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-04 | btrfs: return EPERM upon rmdir on a subvolume | David Sterba | 1 | -2/+3
A subvolume cannot be deleted via rmdir, but the error code ENOTEMPTY is confusing. Return EPERM instead, as this is not permitted. Signed-off-by: David Sterba <dsterba@suse.cz>
2012-10-04 | Btrfs: using for_each_set_bit_from to simplify the code | Wei Yongjun | 1 | -6/+2
Using for_each_set_bit_from() to simplify the code. spatch with a semantic match was used to find this. (http://coccinelle.lip6.fr/) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
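To illustrate the pattern this cleanup relies on, here is a minimal userspace sketch of iterating over set bits starting from a given position. The macro `sketch_for_each_set_bit_from` and the helper `next_set_bit` are hypothetical stand-ins that handle a single 64-bit word only; they are not the kernel implementation, which works on arbitrary-length bitmaps.

```c
#include <stdint.h>

/* Hypothetical helper: index of the first set bit at or after 'from',
 * or 'size' if there is none. Handles a single 64-bit word only. */
static unsigned int next_set_bit(uint64_t word, unsigned int size,
                                 unsigned int from)
{
    for (unsigned int i = from; i < size; i++)
        if (word & (UINT64_C(1) << i))
            return i;
    return size;
}

/* Continue from the current value of 'bit' and visit every set bit. */
#define sketch_for_each_set_bit_from(bit, word, size)          \
    for ((bit) = next_set_bit((word), (size), (bit));          \
         (bit) < (size);                                       \
         (bit) = next_set_bit((word), (size), (bit) + 1))

/* Count set bits at positions >= start; this is the kind of open-coded
 * find-next-bit loop that the macro form replaces. */
static unsigned int count_set_bits_from(uint64_t word, unsigned int start)
{
    unsigned int bit = start;
    unsigned int count = 0;

    sketch_for_each_set_bit_from(bit, word, 64)
        count++;
    return count;
}
```

The point of the macro form is that the caller sets the starting bit once and the loop header carries the "find next, test bound, advance" bookkeeping.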
2012-10-04 | Btrfs: write_buf is now callable outside send.c | Anand Jain | 2 | -5/+7
Developing service cmds needs it. Signed-off-by: Anand Jain <anand.jain@oracle.com>
2012-10-04 | Btrfs: remove unnecessary code in btree_get_extent() | Tsutomu Itoh | 1 | -7/+1
Unnecessary lookup_extent_mapping() is removed because an error is returned to the caller. This patch was made based on the advice from Stefan Behrens, thanks. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-10-04 | Btrfs: cleanup of error processing in btree_get_extent() | Tsutomu Itoh | 1 | -9/+5
This patch simplifies the somewhat complex error processing in btree_get_extent(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-10-01 | Revert "Btrfs: do not do filemap_write_and_wait_range in fsync" | Miao Xie | 1 | -3/+11
This reverts commit 0885ef5b5601e9b007c383e77c172769b1f214fd. After applying the above patch, performance slowed down because the dirty page flush can only be done by one task, so revert it. The following is the test result of sysbench:
  Before    After
  24MB/s    39MB/s
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 | Btrfs: remove bytes argument from do_chunk_alloc | Josef Bacik | 1 | -15/+10
Everybody is just making stuff up, and it's just used to see if we really do need to alloc a chunk, and since we do this when we already know we really do it's just a waste of space. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 | Btrfs: delay block group item insertion | Josef Bacik | 4 | -67/+79
So we have lots of places where we try to preallocate chunks in order to make sure we have enough space as we make our allocations. This has historically meant that we're constantly tweaking when we should allocate a new chunk, and historically we have gotten this horribly wrong so we way over allocate either metadata or data. To try and keep this from happening we are going to make it so that the block group item insertion is done out of band at the end of a transaction. This will allow us to create chunks even if we are trying to make an allocation for the extent tree. With this patch my enospc tests run faster (didn't expect this) and more efficiently use the disk space (this is what I wanted). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 | btrfs: Kill some bi_idx references | Kent Overstreet | 2 | -17/+2
For immutable bio vecs, I've been auditing and removing bi_idx references. These were harmless, but removing them will make auditing easier. scrub_bio_end_io_worker() was open coding a bio_reset(), but this doesn't appear to have been needed for anything, as right after it does a bio_put(), and perusing the code it doesn't appear anything else was holding a reference to the bio. The other use, in end_bio_extent_readpage(), was just for a pr_debug(); changed it to something that might be a bit more useful. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Chris Mason <chris.mason@oracle.com> CC: Stefan Behrens <sbehrens@giantdisaster.de>
2012-10-01 | Btrfs: fix unnecessary warning when the fragments make the space alloc fail | Miao Xie | 1 | -1/+1
When we wrote data in compress mode into a btrfs filesystem that was full of fragments, the kernel reported: BTRFS warning (device xxx): Aborting unused transaction. The reason is: we cannot find a free space extent long enough to store the compressed data because the free space is fragmented, and the compressed data cannot be split, so the kernel output the above message. In fact, btrfs can deal with this problem very well: it falls back to uncompressed IO, splits the uncompressed data into small pieces, and then stores them in the fragmented free space. So we shouldn't output the above warning message. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 | Btrfs: create a pinned em when writing to a prealloc range in DIO | Josef Bacik | 1 | -0/+55
Wade Cline reported a problem where he was getting garbage and warnings when writing to a preallocated range via O_DIRECT. This is because we weren't creating our normal pinned extent_map for the range we were writing to, which was causing all sorts of issues. This patch fixes the problem and makes his testcase much happier. Thanks, Reported-by: Wade Cline <clinew@linux.vnet.ibm.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 | Btrfs: move the sb_end_intwrite until after the throttle logic | Josef Bacik | 1 | -2/+2
Sage reported the following lockdep backtrace ===================================== [ BUG: bad unlock balance detected! ] 3.6.0-rc2-ceph-00171-gc7ed62d #1 Not tainted ------------------------------------- btrfs-cleaner/7607 is trying to release lock (sb_internal) at: [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs] but there are no more locks to release! other info that might help us debug this: 1 lock held by btrfs-cleaner/7607: #0: (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b405>] cleaner_kthread+0x95/0x120 [btrfs] stack backtrace: Pid: 7607, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00171-gc7ed62d #1 Call Trace: [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs] [<ffffffff810afa9e>] print_unlock_inbalance_bug+0xfe/0x110 [<ffffffff810b289e>] lock_release_non_nested+0x1ee/0x310 [<ffffffff81172f9b>] ? kmem_cache_free+0x7b/0x160 [<ffffffffa004106c>] ? put_transaction+0x8c/0x130 [btrfs] [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs] [<ffffffff810b2a95>] lock_release+0xd5/0x220 [<ffffffff81173071>] ? kmem_cache_free+0x151/0x160 [<ffffffff8117d9ed>] __sb_end_write+0x7d/0x90 [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs] [<ffffffff81079850>] ? __init_waitqueue_head+0x60/0x60 [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40 [<ffffffffa0042758>] __btrfs_end_transaction+0x368/0x3c0 [btrfs] [<ffffffffa0042808>] btrfs_end_transaction_throttle+0x18/0x20 [btrfs] [<ffffffffa00318f0>] btrfs_drop_snapshot+0x410/0x600 [btrfs] [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0 [<ffffffffa00430ef>] btrfs_clean_old_snapshots+0xaf/0x150 [btrfs] [<ffffffffa003b405>] ? cleaner_kthread+0x95/0x120 [btrfs] [<ffffffffa003b419>] cleaner_kthread+0xa9/0x120 [btrfs] [<ffffffffa003b370>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs] [<ffffffff810791ee>] kthread+0xae/0xc0 [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10 [<ffffffff81635430>] ? 
retint_restore_args+0x13/0x13 [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0 [<ffffffff8163e740>] ? gs_change+0x13/0x13
This is because the throttle stuff can commit the transaction, which expects to be the one stopping the intwrite stuff, but we've already done it in __btrfs_end_transaction(). Moving the sb_end_intwrite() after this logic makes the lockdep warning go away. Thanks, Tested-by: Sage Weil <sage@inktank.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 | Btrfs: use larger limit for translation of logical to inode | Liu Bo | 2 | -4/+5
This is the kernel side of the change. Translation of logical to inode used to have an upper limit of 4k on the inode container's size, but that limit is not large enough for data with a great many refs, so when resolving a logical address we can end up with "ioctl ret=0, bytes_left=0, bytes_missing=19944, cnt=510, missed=2493". This change makes 64k the upper limit and uses vmalloc instead of kmalloc to get memory more easily. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
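The numbers in that ioctl message are consistent with a fixed-size result buffer filling up. As a rough worked example, assume the result container has a 16-byte header (four 32-bit counters) followed by 64-bit values. This layout and the names below are assumptions made for illustration; the commit message does not spell them out.

```c
#include <stdint.h>

/* Assumed layout for illustration: four 32-bit counters, then u64 values. */
#define SKETCH_HEADER_BYTES  (4 * sizeof(uint32_t))
#define SKETCH_VALUE_BYTES   sizeof(uint64_t)

/* Number of 64-bit values that fit in a container of 'size' bytes. */
static uint32_t container_capacity(uint32_t size)
{
    if (size < SKETCH_HEADER_BYTES)
        return 0;
    return (uint32_t)((size - SKETCH_HEADER_BYTES) / SKETCH_VALUE_BYTES);
}

/* Bytes needed to hold 'missed' values that did not fit. */
static uint32_t bytes_for_values(uint32_t missed)
{
    return (uint32_t)(missed * SKETCH_VALUE_BYTES);
}
```

Under that assumption a 4k container holds (4096 - 16) / 8 = 510 values, matching cnt=510, and the 2493 values that did not fit need 2493 * 8 = 19944 bytes, matching bytes_missing=19944. A 64k container would hold 8190 values, enough for the 3003 values in this example.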
2012-10-01 | Btrfs: use helper for logical resolve | Liu Bo | 1 | -16/+3
We already have a helper, iterate_inodes_from_logical(), for logical resolve, so just use it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 | Btrfs: fix a bug in parsing return value in logical resolve | Liu Bo | 5 | -20/+34
In logical resolve, we parse extent_from_logical()'s 'ret' as a kind of flag. It is possible to lose our errors because (-EXXXX & BTRFS_EXTENT_FLAG_TREE_BLOCK) is true. I'm not sure if it is on purpose, it just looks too hacky if it is. I'd rather use a real flag and a 'ret' to catch errors. Acked-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Liu Bo <liub.liubo@gmail.com>
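A small sketch of why overloading the return value is fragile: a negative errno has most of its high bits set, so masking it with a flag bit can come out non-zero. The flag value (bit 1) and all function names here are assumptions chosen for illustration, not the kernel's definitions.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed flag value for illustration (bit 1). */
#define SKETCH_FLAG_TREE_BLOCK (UINT64_C(1) << 1)

/* Old style: the return value doubles as a flag word, so a negative
 * errno can be mistaken for "this is a tree block". */
static int looks_like_tree_block(int ret)
{
    return ((uint64_t)ret & SKETCH_FLAG_TREE_BLOCK) != 0;
}

/* New style: errors travel in the return value, flags in an out-param.
 * 'simulate_error' is a positive errno to fail with, or 0 for success. */
static int lookup_extent_sketch(int simulate_error, uint64_t *flags)
{
    if (simulate_error)
        return -simulate_error;
    *flags = SKETCH_FLAG_TREE_BLOCK;
    return 0;
}
```

With EIO equal to 5, as on Linux, -EIO has bit 1 set in two's complement, so the old-style check treats the error as a valid flag. In the new style the caller checks `ret < 0` first and only then looks at `*flags`.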
2012-10-01 | Btrfs: update delayed ref's tracepoints to show sequence | Liu Bo | 1 | -4/+10
We've added a new field 'sequence' to delayed ref node, so update related tracepoints. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 | Btrfs: cleanup for unused ref cache stuff | liubo | 2 | -8/+0
As the ref cache has been removed from btrfs, there are no users of its lock and its check. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 | Btrfs: fix corrupted metadata in the snapshot | Miao Xie | 3 | -18/+32
When we delete an inode, we remove all the delayed items, including the delayed inode update, and then truncate all the related metadata. If there is a lot of metadata, we end the current transaction and start a new one to truncate the remaining metadata. In this way, we leave an inode item whose link counter is > 0, and may also leave some directory index items in the fs/file tree after the current transaction ends. In other words, the metadata in this fs/file tree is inconsistent. If we create a snapshot of this tree now, we will find an inode with corrupted metadata in the new snapshot, and we won't continue to drop the remaining metadata, because its link counter is not 0. We fix this problem by updating the inode item before the current transaction ends. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 | btrfs: polish names of kmem caches | David Sterba | 4 | -9/+9
Use case: watch 'grep btrfs < /proc/slabinfo' makes it easy to watch all caches in one go. Signed-off-by: David Sterba <dsterba@suse.cz>
2012-10-01 | Btrfs: fix our overcommit math | Josef Bacik | 1 | -29/+42
I noticed I was seeing large lags when running my torrent test in a vm on my laptop. While trying to make it lag less I noticed that our overcommit math was taking into account the number of bytes we wanted to reclaim, not the number of bytes we actually wanted to allocate, which means we wouldn't overcommit as often. This patch fixes the overcommit math and makes shrink_delalloc() use that logic so that it will stop looping faster. We still have pretty high spikes of latency, but the test now takes 3 minutes less time (about 5% faster). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 | Btrfs: wait on async pages when shrinking delalloc | Josef Bacik | 1 | -0/+7
Mitch reported a problem where you could get an ENOSPC error when untarring a kernel git tree onto a 16gb file system with compress-force=zlib. This is because compression is a huge pain, it will return from ->writepages() without having actually created any ordered extents. To get around this we check to see if the async submit counter is up, and if it is wait until it drops to 0 before doing our normal ordered wait dance. With this patch I can now untar a kernel git tree onto a 16gb file system without getting ENOSPC errors. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 | Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag | Liu Bo | 5 | -14/+28
We're going to use the EXTENT_DEFRAG flag to indicate which ranges belong to a defragment operation so that we can implement snapshot-aware defrag: we set the EXTENT_DEFRAG flag when dirtying the extents that need to be defragmented, so that later on the writeback thread can differentiate between normal writeback and writeback started by defragmentation. Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 | Btrfs: check return value of ulist_alloc() properly | Tsutomu Itoh | 1 | -0/+8
ulist_alloc() can return NULL, so it is necessary to check the return value. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-10-01 | Btrfs: fix error handling in delete_block_group_cache() | Tsutomu Itoh | 1 | -2/+2
btrfs_iget() never returns NULL, so the NULL check is unnecessary. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-10-01 | Btrfs: fix wrong size for the reservation when doing, file pre-allocation. | Miao Xie | 1 | -2/+2
When we ran fsstress (a program in xfstests), the filesystem hung when it was full. This was because the space reserved in btrfs_fallocate() was wrong: btrfs_fallocate() just used the size of the pre-allocation to reserve the space and didn't take block size alignment into account, so the reserved space was smaller than the allocated space. This caused the over reserve problem and made the filesystem hang when invoking cow_file_range(). Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
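As a rough illustration of why the raw request length under-reserves, here is a minimal sketch that rounds both ends of the requested range to the block size. The helper names and the 4096-byte block size in the example are assumptions for illustration, not the code from the patch.

```c
#include <stdint.h>

/* Round down to a power-of-two block size. */
static uint64_t align_down(uint64_t x, uint64_t blocksize)
{
    return x & ~(blocksize - 1);
}

/* Round up to a power-of-two block size. */
static uint64_t align_up(uint64_t x, uint64_t blocksize)
{
    return (x + blocksize - 1) & ~(blocksize - 1);
}

/* Bytes to reserve for a pre-allocation of 'len' bytes at 'offset'
 * once both ends of the range are aligned to the block size. */
static uint64_t reserve_size(uint64_t offset, uint64_t len, uint64_t blocksize)
{
    return align_up(offset + len, blocksize) - align_down(offset, blocksize);
}
```

For example, a request for 5000 bytes at offset 1000 with 4096-byte blocks covers the aligned range 0 to 8192, so 8192 bytes are actually allocated while a reservation based on the raw length would cover only 5000.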
2012-10-01 | Btrfs: output more information when aborting a unused transaction handle | Miao Xie | 1 | -1/+7
Though we dump the stack when aborting an unused transaction handle, we can't tell the exact place where we decided to abort the handle if one function has several places that invoke the transaction abort function and then jump to the same label. Besides that, we also don't know the reason why we jumped to abort the current handle. So I modify the transaction abort function to make it output the function name, line and error information. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 | Btrfs: fix unprotected ->log_batch | Miao Xie | 4 | -11/+9
We forgot to protect ->log_batch when syncing a file; this patch fixes the problem by using atomic operations. Also, ->log_batch is only used to check whether there are parallel sync operations, so it is unnecessary to reset it to 0 after the sync operation of the current log tree completes. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
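The idea can be sketched in userspace with C11 atomics. The struct and function names below are hypothetical, and the comparison logic is only one plausible way such a counter could be used, not necessarily what the kernel does.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a log root carrying a batch counter. */
struct sketch_log_root {
    atomic_int log_batch;
};

/* Each syncing task bumps the counter without taking a lock. */
static void sketch_note_sync(struct sketch_log_root *root)
{
    atomic_fetch_add(&root->log_batch, 1);
}

/* A committer compares the counter with a value it sampled earlier;
 * if it moved, other tasks started syncing in the meantime. */
static int sketch_others_joined(struct sketch_log_root *root, int seen_before)
{
    return atomic_load(&root->log_batch) != seen_before;
}
```

Because the check only compares two samples of the counter, the absolute value never matters, which is why resetting it to 0 after a sync is unnecessary.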
2012-10-01 | Btrfs: fix wrong size for the reservation of the, snapshot creation | Miao Xie | 2 | -4/+4
We should insert/update 6 items (root ref, root backref, dir item, dir index, root item and parent inode) when creating a snapshot, not 5 items. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 | Btrfs: fix the snapshot that should not exist | Miao Xie | 1 | -15/+53
The snapshot should be an image of the fs tree as it was before the snapshot was created, so the metadata of the snapshot itself should not exist in its own tree. But at present the directory item and directory name index are found in both the snapshot tree and the fs tree. This introduces some problems and confuses users:
  # mkfs.btrfs /dev/sda1
  # mount /dev/sda1 /mnt
  # mkdir /mnt/1
  # cd /mnt/1
  # btrfs subvolume snapshot /mnt snap0
  # ls -a /mnt/1/snap0/1
  . ..
  [no other file/dir]
  # ll /mnt/1/snap0/
  total 0
  drwxr-xr-x 1 root root 10 Ju1 24 12:11 1
                         ^^^ There is no file/dir in it, but its size is 10
  # cd /mnt/1/snap0/1/snap0
  [Enter a nonexistent directory successfully...]
There is nothing in directory 1 in snap0, but btrfs reports the length of this directory as 10. Besides that, we can enter a nonexistent directory, which is very strange for users.
  # btrfs subvolume snapshot /mnt/1/snap0 /mnt/snap1
  # ll /mnt/1/snap0/1/
  total 0
  [None]
  # ll /mnt/snap1/1/
  total 0
  drwxr-xr-x 1 root root 0 Ju1 24 12:14 snap0
The source of snap1 did not have any directory in directory 1, but snap1 has a snap0 there, so the source and the snapshot differ. So I think we should insert the directory item and directory name index and update the parent inode as the last step of snapshot creation, and not leave the useless metadata in the file tree. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 | Btrfs: add a new "type" field into the block reservation structure | Miao Xie | 8 | -22/+39
Sometimes we need to choose the reservation method according to the type of the block reservation, for example the reservation for the delayed inode update. Currently we identify the type just by comparing the address of the reservation variable, which is very ugly for a temporary one because we need to compare it with all the common reservation variables. So we add a new "type" field to record the type of the reservation variable. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 | Btrfs: use a slab for ordered extents allocation | Miao Xie | 3 | -3/+31
The ordered extent allocation is in the fast path of the IO, so use a slab to improve the speed of the allocation. "Size of the struct is 280, so this will fall into the size-512 bucket, giving 8 objects per page, while own slab will pack 14 objects into a page. Another benefit I see is to check for leaked objects when the module is removed (and the cache destroy takes place)." -- David Sterba Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
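The packing figures quoted above can be reproduced with simple arithmetic. This sketch assumes a 4096-byte page, power-of-two general-purpose buckets, and no per-slab metadata or alignment padding; these are simplifications, and the function names are hypothetical.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */

/* Smallest power-of-two bucket (at least 32 bytes) that fits 'size'. */
static size_t kmalloc_bucket_sketch(size_t size)
{
    size_t bucket = 32;

    while (bucket < size)
        bucket <<= 1;
    return bucket;
}

/* Objects per page when each object occupies 'slot' bytes. */
static size_t objects_per_page(size_t slot)
{
    return SKETCH_PAGE_SIZE / slot;
}
```

A 280-byte object lands in the 512-byte bucket, giving 4096 / 512 = 8 objects per page, while a dedicated cache with 280-byte slots fits 4096 / 280 = 14 objects per page (14 * 280 = 3920 bytes).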
2012-10-01 | Btrfs: fix file extent discount problem in the, snapshot | Miao Xie | 2 | -44/+25
If a snapshot is created while we are writing data into a file, the i_size of the corresponding file in the snapshot will be wrong: it will be beyond the end of the last file extent. btrfsck will then report:
  root 256 inode 257 errors 100
Steps to reproduce:
  # mkfs.btrfs <partition>
  # mount <partition> <mnt>
  # cd <mnt>
  # dd if=/dev/zero of=tmpfile bs=4M count=1024 &
  # for ((i=0; i<4; i++))
  > do
  > btrfs sub snap . $i
  > done
This is because the algorithm for updating disk_i_size is wrong. Although there are some ordered extents behind the current one (the one we use to update disk_i_size), that does not mean those extents will be dealt with in the same transaction. So we shouldn't use the offset of those extents to update disk_i_size, or we will get the wrong i_size in the snapshot. We fix this problem by recording the max real i_size. If we find an ordered extent that is in front of the current one and has not completed, we record the end of the current one in that ordered extent. Of course, if the current extent already holds the end of another extent (which must be greater than the end of the current one, because that extent is behind the current one), we record the number that the current extent holds instead. In this way, we exclude the ordered extents that may not be dealt with in the same transaction, and it is easy to know the real disk_i_size. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 | Btrfs: fix full backref problem when inserting shared block reference | Miao Xie | 1 | -0/+4
If we create several snapshots at the same time, the following BUG_ON() will be triggered:
  kernel BUG at fs/btrfs/extent-tree.c:6047!
Steps to reproduce:
  # mkfs.btrfs <partition>
  # mount <partition> <mnt>
  # cd <mnt>
  # for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done
  # for ((i=0; i<4; i++))
  > do
  > mkdir $i
  > for ((j=0; j<200; j++))
  > do
  > btrfs sub snap . $i/$j
  > done &
  > done
The reason is: Before the transaction commit, some operations changed the fs tree and new tree blocks were allocated because of COW. We used implicit non-shared back references for those newly allocated tree blocks because they were not shared by two or more trees. Then we created the first snapshot of the fs tree. According to the back reference rules, we also used implicit back refs for the child tree blocks of the root node of the fs tree, and now those child nodes/leaves were shared by two trees. We then did not process the delayed references, and continued to change the fs tree (created the second snapshot and inserted the dir item of the new snapshot into the fs tree). According to the back reference rules, we added full back refs for those tree blocks whose parents had been shared by two trees. Now some newly allocated tree blocks had two types of references. As we know, the delayed reference system handles these delayed references from back to front, and the full delayed references are inserted after the implicit ones. So when we dealt with the back references of those newly allocated tree blocks, the full references were dealt with first. If the first reference is a shared back reference and the tree block that the reference points to is newly allocated, the block is considered to be one that was shared by two or more trees when it was allocated, so its reference should be a full back reference, not an implicit one, and the flags of its reference should also be set to FULL_BACKREF. But in fact, it was a non-shared tree block with an implicit reference at the beginning, so it was not compulsory to set the flags to FULL_BACKREF. So the BUG_ON was triggered.
We have several methods to fix this bug:
  1. Deal with the delayed references after the snapshot is created and before we change the source tree of the snapshot. This is the easiest and safest way.
  2. Modify the sort method of the delayed reference tree so that the full delayed references are inserted before the implicit ones. This is also very easy, but I don't know whether it will introduce problems.
  3. Modify select_delayed_ref() to make it select the implicit delayed references first. This way is not so good because it may waste CPU time if we have lots of delayed references.
  4. Set the flags to FULL_BACKREF. This method is a little complex compared with the 1st way.
I chose the 1st way to fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 | Btrfs: fix error path in create_pending_snapshot() | Miao Xie | 1 | -23/+17
This patch fixes the following problems:
- If we fail to deal with the delayed dir items, we should abort the transaction, just as the comment says. Fix it.
- If the root reference or root back reference insertion fails, we should abort the transaction. Fix it.
- Fix the double free of pending->inherit.
- Do not restore trans->rsv if we didn't change it.
- Make the error path clearer.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 | Btrfs: fix possible memory leak in scrub_setup_recheck_block() | Wei Yongjun | 1 | -0/+1
bbio has been allocated in btrfs_map_block() and should be freed before leaving in the error handling cases. spatch with a semantic match was used to find this problem. (http://coccinelle.lip6.fr/) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
2012-10-01 | Btrfs: btrfs_drop_extent_cache should never fail | Josef Bacik | 2 | -6/+11
I noticed this when I was doing the fsync stuff, we allocate split extents if we drop an extent range that is in the middle of an existing extent. This BUG()'s if we fail to allocate memory, but the fact is this is just a cache, we will just regenerate the cache if we need it, the important part is that we free the range we are given. This can be done without allocations, so if we fail to allocate splits just skip the splitting stage and free our em and look for more extents to drop. This also makes btrfs_drop_extent_cache a void since nobody was checking the return value anyway. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 | Btrfs: do not take cleanup_work_sem in btrfs_run_delayed_iputs() | Sage Weil | 1 | -2/+0
Josef has suggested that this is not necessary. Removing it also avoids this lockdep splat (after the new sb_internal locking stuff was added): [ 604.090449] ====================================================== [ 604.114819] [ INFO: possible circular locking dependency detected ] [ 604.139262] 3.6.0-rc2-ceph-00144-g463b030 #1 Not tainted [ 604.162193] ------------------------------------------------------- [ 604.186139] btrfs-cleaner/6669 is trying to acquire lock: [ 604.209555] (sb_internal#2){.+.+..}, at: [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs] [ 604.257100] [ 604.257100] but task is already holding lock: [ 604.300366] (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs] [ 604.352989] [ 604.352989] which lock already depends on the new lock. [ 604.352989] [ 604.427104] [ 604.427104] the existing dependency chain (in reverse order) is: [ 604.478493] [ 604.478493] -> #1 (&fs_info->cleanup_work_sem){.+.+..}: [ 604.529313] [<ffffffff810b2c82>] lock_acquire+0xa2/0x140 [ 604.559621] [<ffffffff81632b69>] down_read+0x39/0x4e [ 604.589382] [<ffffffffa004db98>] btrfs_lookup_dentry+0x218/0x550 [btrfs] [ 604.596161] btrfs: unlinked 1 orphans [ 604.675002] [<ffffffffa006aadd>] create_subvol+0x62d/0x690 [btrfs] [ 604.708859] [<ffffffffa006d666>] btrfs_mksubvol.isra.52+0x346/0x3a0 [btrfs] [ 604.772466] [<ffffffffa006d7f2>] btrfs_ioctl_snap_create_transid+0x132/0x190 [btrfs] [ 604.842245] [<ffffffffa006d8ae>] btrfs_ioctl_snap_create+0x5e/0x80 [btrfs] [ 604.912852] [<ffffffffa00708ae>] btrfs_ioctl+0x138e/0x1990 [btrfs] [ 604.951888] [<ffffffff8118e9b8>] do_vfs_ioctl+0x98/0x560 [ 604.989961] [<ffffffff8118ef11>] sys_ioctl+0x91/0xa0 [ 605.026628] [<ffffffff8163d569>] system_call_fastpath+0x16/0x1b [ 605.064404] [ 605.064404] -> #0 (sb_internal#2){.+.+..}: [ 605.126832] [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90 [ 605.163671] [<ffffffff810b2c82>] lock_acquire+0xa2/0x140 [ 605.200228] 
[<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0 [ 605.236818] [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs] [ 605.274029] [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs] [ 605.340520] [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs] [ 605.378720] [<ffffffff811972c8>] evict+0xb8/0x1c0 [ 605.416057] [<ffffffff811974d5>] iput+0x105/0x210 [ 605.452373] [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs] [ 605.521627] [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs] [ 605.560520] [<ffffffff810791ee>] kthread+0xae/0xc0 [ 605.598094] [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10 [ 605.636499] [ 605.636499] other info that might help us debug this: [ 605.636499] [ 605.736504] Possible unsafe locking scenario: [ 605.736504] [ 605.801931] CPU0 CPU1 [ 605.835126] ---- ---- [ 605.867093] lock(&fs_info->cleanup_work_sem); [ 605.898594] lock(sb_internal#2); [ 605.931954] lock(&fs_info->cleanup_work_sem); [ 605.965359] lock(sb_internal#2); [ 605.994758] [ 605.994758] *** DEADLOCK *** [ 605.994758] [ 606.075281] 2 locks held by btrfs-cleaner/6669: [ 606.104528] #0: (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b5d5>] cleaner_kthread+0x95/0x120 [btrfs] [ 606.165626] #1: (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs] [ 606.231297] [ 606.231297] stack backtrace: [ 606.287723] Pid: 6669, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00144-g463b030 #1 [ 606.347823] Call Trace: [ 606.376184] [<ffffffff8162a77c>] print_circular_bug+0x1fb/0x20c [ 606.409243] [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90 [ 606.441343] [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs] [ 606.474583] [<ffffffff810b2c82>] lock_acquire+0xa2/0x140 [ 606.505934] [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs] [ 606.539429] [<ffffffff8132babd>] ? 
do_raw_spin_unlock+0x5d/0xb0 [ 606.571719] [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0 [ 606.603498] [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs] [ 606.637405] [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs] [ 606.670165] [<ffffffff81172e75>] ? kmem_cache_alloc+0xb5/0x160 [ 606.702144] [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs] [ 606.735562] [<ffffffffa00256a6>] ? block_rsv_add_bytes+0x56/0x80 [btrfs] [ 606.769861] [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs] [ 606.804575] [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs] [ 606.838756] [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40 [ 606.872010] [<ffffffff811972c8>] evict+0xb8/0x1c0 [ 606.903800] [<ffffffff811974d5>] iput+0x105/0x210 [ 606.935416] [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs] [ 606.970510] [<ffffffffa003b5d5>] ? cleaner_kthread+0x95/0x120 [btrfs] [ 607.005648] [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs] [ 607.040724] [<ffffffffa003b540>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs] [ 607.104740] [<ffffffff810791ee>] kthread+0xae/0xc0 [ 607.137119] [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10 [ 607.169797] [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10 [ 607.202472] [<ffffffff81635430>] ? retint_restore_args+0x13/0x13 [ 607.235884] [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0 [ 607.268731] [<ffffffff8163e740>] ? gs_change+0x13/0x13 Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 | Btrfs: set journal_info in async trans commit worker | Sage Weil | 1 | -0/+2
We expect current->journal_info to point to the trans handle we are committing. Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 | Btrfs: pass lockdep rwsem metadata to async commit transaction | Sage Weil | 1 | -0/+16
The freeze rwsem is taken by sb_start_intwrite() and dropped during the commit_ or end_transaction(). In the async case, that happens in a worker thread. Tell lockdep the calling thread is releasing ownership of the rwsem and the async thread is picking it up. XFS plays the same trick in fs/xfs/xfs_aops.c. Signed-off-by: Sage Weil <sage@inktank.com>
2012-10-01 | Btrfs: add hole punching | Josef Bacik | 5 | -13/+355
This patch adds hole punching via fallocate. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 | Btrfs: remove unused hint byte argument for btrfs_drop_extents | Josef Bacik | 5 | -30/+14
I audited all users of btrfs_drop_extents and found that nobody actually uses the hint_byte argument. I'm sure it was used for something at some point but it's not used now, and the way the pinning works the disk bytenr would never be immediately useful anyway, so let's just remove it. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 | Btrfs: check if an inode has no checksum when logging it | Liu Bo | 1 | -11/+12
This is based on Josef's "Btrfs: turbo charge fsync". If an inode is a BTRFS_INODE_NODATASUM one, we don't need to look for csum items any more. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 | Btrfs: fix a bug in checking whether a inode is already in log | Liu Bo | 4 | -8/+10
This is based on Josef's "Btrfs: turbo charge fsync". Currently btrfs checks whether an inode is in the log by comparing the root's last_log_commit to the inode's last_sub_trans[1]. But the problem is that root->last_log_commit is shared among inodes. Say we have N inodes to be logged: after the first inode, the root's last_log_commit is updated and the remaining N-1 inodes will be skipped. This fixes the bug by keeping a local copy of the root's last_log_commit inside each inode, and this local copy is maintained by the inode itself. [1]: we regard each log transaction as a subset of btrfs's transaction, i.e. sub_trans Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
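The failure mode can be modelled with a few lines of C. This is a heavily simplified, hypothetical model: the struct and function names are made up for illustration, and the real check involves more state than the single comparison shown here.

```c
/* Hypothetical, simplified model of the "already in log?" check. */
struct sketch_root {
    int last_log_commit;   /* shared by every inode in the root */
};

struct sketch_inode {
    int last_sub_trans;    /* sub-transaction that last modified the inode */
    int last_log_commit;   /* per-inode copy, as introduced by the fix */
};

/* Buggy check: compares against the value shared through the root. */
static int in_log_shared(const struct sketch_root *root,
                         const struct sketch_inode *inode)
{
    return inode->last_sub_trans <= root->last_log_commit;
}

/* Fixed check: compares against the inode's own copy. */
static int in_log_local(const struct sketch_inode *inode)
{
    return inode->last_sub_trans <= inode->last_log_commit;
}

/* Logging one inode advances the shared value and that inode's copy. */
static void log_inode_sketch(struct sketch_root *root,
                             struct sketch_inode *inode)
{
    root->last_log_commit = inode->last_sub_trans;
    inode->last_log_commit = inode->last_sub_trans;
}
```

In this model, once the first of two inodes modified in the same sub-transaction has been logged, the shared check reports the second inode as already logged even though it never was, while the per-inode check still reports that it needs logging.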