summaryrefslogtreecommitdiffstats
path: root/fs/btrfs
AgeCommit message (Collapse)AuthorFilesLines
2016-08-04block: add missing group association in bio-cloning functionsPaolo Valente1-6/+0
When a bio is cloned, the newly created bio must be associated with the same blkcg as the original bio (if BLK_CGROUP is enabled). If this operation is not performed, then the new bio is not associated with any group, and the group of the current task is returned when the group of the bio is requested. Depending on the cloning frequency, this may cause a large percentage of the bios belonging to a given group to be treated as if belonging to other groups (in most cases as if belonging to the root group). The expected group isolation may thereby be broken. This commit adds the missing association in bio-cloning functions. Fixes: da2f0f74cf7d ("Btrfs: add support for blkio controllers") Cc: stable@vger.kernel.org # v4.3+ Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Reviewed-by: Nikolay Borisov <kernel@kyup.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-31Merge branch 'for-linus-4.8' of ↵Linus Torvalds6-338/+544
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This pull is dedicated to Josef's enospc rework, which we've been testing for a few releases now. It fixes some early enospc problems and is dramatically faster. This also includes an updated fix for the delalloc accounting that happens after a fault in copy_from_user. My patch in v4.7 was almost but not quite enough" * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix delalloc accounting after copy_from_user faults Btrfs: avoid deadlocks during reservations in btrfs_truncate_block Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytes Btrfs: fill relocation block rsv after allocation Btrfs: always use trans->block_rsv for orphans Btrfs: change how we calculate the global block rsv Btrfs: use root when checking need_async_flush Btrfs: don't bother kicking async if there's nothing to reclaim Btrfs: fix release reserved extents trace points Btrfs: add fsid to some tracepoints Btrfs: add tracepoints for flush events Btrfs: fix delalloc reservation amount tracepoint Btrfs: trace pinned extents Btrfs: introduce ticketed enospc infrastructure Btrfs: add tracepoint for adding block groups Btrfs: warn_on for unaccounted spaces Btrfs: change delayed reservation fallback behavior Btrfs: always reserve metadata for delalloc extents Btrfs: fix callers of btrfs_block_rsv_migrate Btrfs: add bytes_readonly to the spaceinfo at once
2016-07-26Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+2
Merge updates from Andrew Morton: - a few misc bits - ocfs2 - most(?) of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits) thp: fix comments of __pmd_trans_huge_lock() cgroup: remove unnecessary 0 check from css_from_id() cgroup: fix idr leak for the first cgroup root mm: memcontrol: fix documentation for compound parameter mm: memcontrol: remove BUG_ON in uncharge_list mm: fix build warnings in <linux/compaction.h> mm, thp: convert from optimistic swapin collapsing to conservative mm, thp: fix comment inconsistency for swapin readahead functions thp: update Documentation/{vm/transhuge,filesystems/proc}.txt shmem: split huge pages beyond i_size under memory pressure thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE khugepaged: add support of collapse for tmpfs/shmem pages shmem: make shmem_inode_info::lock irq-safe khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page() thp: extract khugepaged from mm/huge_memory.c shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings shmem: add huge pages support shmem: get_unmapped_area align huge page shmem: prepare huge= mount option and sysfs knob mm, rmap: account shmem thp pages ...
2016-07-26mm, memcg: use consistent gfp flags during readaheadMichal Hocko1-1/+2
Vladimir has noticed that we might declare memcg oom even during readahead because read_pages only uses GFP_KERNEL (with mapping_gfp restriction) while __do_page_cache_readahead uses page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from OOMs. This gfp mask discrepancy is really unfortunate and easily fixable. Drop page_cache_alloc_readahead() which only has one user and outsource the gfp_mask logic into readahead_gfp_mask and propagate this mask from __do_page_cache_readahead down to read_pages. This alone would have only very limited impact as most filesystems are implementing ->readpages and the common implementation mpage_readpages does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to use readahead_gfp_mask instead as this function is called only during readahead as well. The same applies to read_cache_pages. ext4 has its own ext4_mpage_readpages but the path which has pages != NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs are doing a very similar pattern to mpage_readpages so the same can be applied to them as well. [akpm@linux-foundation.org: coding-style fixes] [mhocko@suse.com: restrict gfp mask in mpage_alloc] Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Chris Mason <clm@fb.com> Cc: Steve French <sfrench@samba.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Cc: Mike Marshall <hubcap@omnibond.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Changman Lee <cm224.lee@samsung.com> Cc: Chao Yu <yuchao0@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-blockLinus Torvalds14-209/+222
Pull core block updates from Jens Axboe: - the big change is the cleanup from Mike Christie, cleaning up our uses of command types and modified flags. This is what will throw some merge conflicts - regression fix for the above for btrfs, from Vincent - following up to the above, better packing of struct request from Christoph - a 2038 fix for blktrace from Arnd - a few trivial/spelling fixes from Bart Van Assche - a front merge check fix from Damien, which could cause issues on SMR drives - Atari partition fix from Gabriel - convert cfq to highres timers, since jiffies isn't granular enough for some devices these days. From Jan and Jeff - CFQ priority boost fix idle classes, from me - cleanup series from Ming, improving our bio/bvec iteration - a direct issue fix for blk-mq from Omar - fix for plug merging not involving the IO scheduler, like we do for other types of merges. From Tahsin - expose DAX type internally and through sysfs. From Toshi and Yigal * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits) block: Fix front merge check block: do not merge requests without consulting with io scheduler block: Fix spelling in a source code comment block: expose QUEUE_FLAG_DAX in sysfs block: add QUEUE_FLAG_DAX for devices to advertise their DAX support Btrfs: fix comparison in __btrfs_map_block() block: atari: Return early for unsupported sector size Doc: block: Fix a typo in queue-sysfs.txt cfq-iosched: Charge at least 1 jiffie instead of 1 ns cfq-iosched: Fix regression in bonnie++ rewrite performance cfq-iosched: Convert slice_resid from u64 to s64 block: Convert fifo_time from ulong to u64 blktrace: avoid using timespec block/blk-cgroup.c: Declare local symbols static block/bio-integrity.c: Add #include "blk.h" block/partition-generic.c: Remove a set-but-not-used variable block: bio: kill BIO_MAX_SIZE cfq-iosched: temporarily boost queue priority for idle classes block: drbd: avoid to use BIO_MAX_SIZE block: bio: remove BIO_MAX_SECTORS ...
2016-07-21Btrfs: fix delalloc accounting after copy_from_user faultsChris Mason1-7/+5
Commit 56244ef151c3cd11 was almost but not quite enough to fix the reservation math after btrfs_copy_from_user returned partial copies. Some users are still seeing warnings in btrfs_destroy_inode, and with a long enough test run I'm able to trigger them as well. This patch fixes the accounting math again, bringing it much closer to the way it was before the sectorsize conversion Chandan did. The problem is accounting for the offset into the page/sector when we do a partial copy. This one just uses the dirty_sectors variable which should already be updated properly. Signed-off-by: Chris Mason <clm@fb.com> cc: stable@vger.kernel.org # v4.6+
2016-07-20Btrfs: avoid deadlocks during reservations in btrfs_truncate_blockJosef Bacik1-0/+5
The new enospc code makes it possible to deadlock if we don't use FLUSH_LIMIT during reservations inside a transaction. This enforces the correct flush type to avoid both deadlocks and assertions Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2016-07-18Btrfs: fix comparison in __btrfs_map_block()Vincent Stehlé1-1/+1
Add missing comparison to op in expression, which was forgotten when doing the REQ_OP transition. Fixes: b3d3fa519905 ("btrfs: update __btrfs_map_block for REQ_OP transition") Signed-off-by: Vincent Stehlé <vincent.stehle@intel.com> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytesJosef Bacik2-17/+22
We used to allow you to set FLUSH_ALL and then just wouldn't do things like commit transactions or wait on ordered extents if we noticed you were in a transaction. However now that all the flushing for FLUSH_ALL is asynchronous we've lost the ability to tell, and we could end up deadlocking. So instead use FLUSH_LIMIT in reserve_metadata_bytes in relocation and then return -EAGAIN if we error out to preserve the previous behavior. I've also added an ASSERT() to catch anybody else who tries to do this. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: fill relocation block rsv after allocationJosef Bacik1-0/+6
Since we set the reloc control before we've reserved our space for relocation we could race with a root being dirtied and not actually have space to do our init reloc root. So once we've allocated it and set it up go ahead and make our reservation before setting the relocate control, that way anybody who tries to do the reloc root init has space to use. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: always use trans->block_rsv for orphansJosef Bacik1-1/+7
This is the case all the time anyway except for relocation which could be doing a reloc root for a non ref counted root, in which case we'd end up with some random block rsv rather than the one we have our reservation in. If there isn't enough space in the block rsv we are trying to steal from we'll BUG() because we expect there to be space for the orphan to make its reservation. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: change how we calculate the global block rsvJosef Bacik1-36/+9
Traditionally we've calculated the global block rsv by guessing how much of the metadata used amount was the extent tree, and then taking the data size and figuring out how large the csum tree would have to be to hold that much data. This is imprecise and falls down on MIXED file systems as we can't trust the data used amount. This resulted in failures for xfstests generic/333 because it creates lots of clones, which explodes out the extent tree. Our global reserve calculations were woefully inaccurate in this case which meant we got into a situation where we did not have enough reserved to do our work. We know we only use the global block rsv for the extent, csum, and root trees, so just get the bytes used for these trees and use that as the basis of our global reserve. Since these are not reference counted trees the bytes_used value will be accurate. This fixed the transaction aborts seen with generic/333. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: use root when checking need_async_flushJosef Bacik1-5/+6
Instead of doing fs_info->fs_root in need_async_flush, which may not be set during recovery when mounting, just pass the root itself in, which makes more sense as thats what btrfs_calc_reclaim_metadata_size takes. Signed-off-by: Josef Bacik <jbacik@fb.com> Reported-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: don't bother kicking async if there's nothing to reclaimJosef Bacik1-0/+3
We do this check when we start the async reclaimer thread, might as well check before we kick it off to save us some cycles. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: fix release reserved extents trace pointsJosef Bacik1-5/+1
We were doing trace_btrfs_release_reserved_extent() in pin_down_extent which isn't quite right because we will go through and free that extent later when we unpin, so it messes up apps that are accounting for the reservation space. We were also unconditionally doing it in __btrfs_free_reserved_extent(), when we only actually free the reservation instead of pinning the extent. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: add tracepoints for flush eventsJosef Bacik2-10/+21
We want to track when we're triggering flushing from our reservation code and what flushing is being done when we start flushing. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: fix delalloc reservation amount tracepointJosef Bacik1-1/+3
We can sometimes drop the reservation we had for our inode, so we need to remove that amount from to_reserve so that our tracepoint reports a valid amount of space. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: trace pinned extentsJosef Bacik1-0/+8
Pinned extents are an important metric to keep track of for enospc. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: introduce ticketed enospc infrastructureJosef Bacik2-151/+380
Our enospc flushing sucks. It is born from a time where we were early enospc'ing constantly because multiple threads would race in for the same reservation and randomly starve other ones out. So I came up with this solution to block any other reservations from happening while one guy tried to flush stuff to satisfy his reservation. This gives us pretty good correctness, but completely crap latency. The solution I've come up with is ticketed reservations. Basically we try to make our reservation, and if we can't we put a ticket on a list in order and kick off an async flusher thread. This async flusher thread does the same old flushing we always did, just asynchronously. As space is freed and added back to the space_info it checks and sees if we have any tickets that need satisfying, and adds space to the tickets and wakes up anything we've satisfied. Once the flusher thread stops making progress it wakes up all the current tickets and tells them to take a hike. There is a priority list for things that can't flush, since the async flusher could do anything we need to avoid deadlocks. These guys get priority for having their reservation made, and will still do manual flushing themselves in case the async flusher isn't running. This patch gives us significantly better latencies. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: add tracepoint for adding block groupsJosef Bacik1-0/+2
I'm writing a tool to visualize the enospc system inside btrfs, I need this tracepoint in order to keep track of the block groups in the system. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: warn_on for unaccounted spacesJosef Bacik1-6/+8
These were hidden behind enospc_debug, which isn't helpful as they indicate actual bugs, unlike the rest of the enospc_debug stuff which is really debug information. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: change delayed reservation fallback behaviorJosef Bacik1-41/+23
We reserve space for the inode update when we first reserve space for writing to a file. However there are lots of ways that we can use this reservation and not have it for subsequent ordered extents. Previously we'd fall through and try to reserve metadata bytes for this, then we'd just steal the full reservation from the delalloc_block_rsv, and if that didn't have enough space we'd steal the full reservation from the global reserve. The problem with this is we can easily just return ENOSPC and fallback to updating the inode item directly. In the worst case (assuming 4k nodesize) we'd steal 64kib from the global reserve if we fall all the way through, however if we just fallback and update the inode directly we'd only steal 4k * BTRFS_PATH_MAX in the worst case which is 32kib. We would have also just added the extent item for the inode so we likely will have already cow'ed down most of the way to the leaf containing the inode item, so we are more often than not only need one or two nodesize's worth of reservations. Given the reservation for the extent itself is also a worst case we will likely already have space to cover the inode update. This change will make us behave better in the theoretical worst case, and much better in the case that we don't have our reservation and cannot reserve more metadata. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: always reserve metadata for delalloc extentsJosef Bacik1-22/+13
There are a few races in the metadata reservation stuff. First we add the bytes to the block_rsv well after we've set the bit on the inode saying that we have space for it and after we've reserved the bytes. So use the normal btrfs_block_rsv_add helper for this case. Secondly we can flush delalloc extents when we try to reserve space for our write, which means that we could have used up the space for the inode and we wouldn't know because we only check before the reservation. So instead make sure we are always reserving space for the inode update, and then if we don't need it release those bytes afterward. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: fix callers of btrfs_block_rsv_migrateJosef Bacik6-25/+18
So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes. Not only this but it unconditionally changes the size of the block_rsv. This isn't a bug strictly speaking, but it makes truncate block rsv's look funny because every time we migrate bytes over its size grows, even though we only want it to be a specific size. So collapse this into one function that takes an update_size argument and make truncate and evict not update the size for consistency sake. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-07Btrfs: add bytes_readonly to the spaceinfo at onceJosef Bacik1-18/+11
For some reason we're adding bytes_readonly to the space info after we update the space info with the block group info. This creates a tiny race where we could over-reserve space because we haven't yet taken out the bytes_readonly bit. Since we already know this information at the time we call update_space_info, just pass it along so it can be updated all at once. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-25Merge branch 'for-linus-4.7-part2' of ↵Linus Torvalds3-13/+34
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes part 2 from Chris Mason: "This has one patch from Omar to bring iterate_shared back to btrfs. We have a tree of work we queue up for directory items and it doesn't lend itself well to shared access. While we're cleaning it up, Omar has changed things to use an exclusive lock when there are delayed items" * 'for-linus-4.7-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes
2016-06-25Merge branch 'for-linus-4.7' of ↵Linus Torvalds11-36/+57
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "I have a two part pull this time because one of the patches Dave Sterba collected needed to be against v4.7-rc2 or higher (we used rc4). I try to make my for-linus-xx branch testable on top of the last major so we can hand fixes to people on the list more easily, so I've split this pull in two. This first part has some fixes and two performance improvements that we've been testing for some time. Josef's two performance fixes are most notable. The transid tracking patch makes a big improvement on pretty much every workload" * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: Force stripesize to the value of sectorsize btrfs: fix disk_i_size update bug when fallocate() fails Btrfs: fix error handling in map_private_extent_buffer Btrfs: fix error return code in btrfs_init_test_fs() Btrfs: don't do nocow check unless we have to btrfs: fix deadlock in delayed_ref_async_start Btrfs: track transid for delayed ref flushing
2016-06-25Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodesOmar Sandoval3-13/+34
Commit fe742fd4f90f ("Revert "btrfs: switch to ->iterate_shared()"") backed out the conversion to ->iterate_shared() for Btrfs because the delayed inode handling in btrfs_real_readdir() is racy. However, we can still do readdir in parallel if there are no delayed nodes. This is a temporary fix which upgrades the shared inode lock to an exclusive lock only when we have delayed items until we come up with a more complete solution. While we're here, rename the btrfs_{get,put}_delayed_items functions to make it very clear that they're just for readdir. Tested with xfstests and by doing a parallel kernel build: while make tinyconfig && make -j4 && git clean dqfx; do : done along with a bunch of parallel finds in another shell: while true; do for ((i=0; i<4; i++)); do find . >/dev/null & done wait done Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23Btrfs: Force stripesize to the value of sectorsizeChandan Rajendra2-6/+4
Btrfs code currently assumes stripesize to be same as sectorsize. However Btrfs-progs (until commit df05c7ed455f519e6e15e46196392e4757257305) has been setting btrfs_super_block->stripesize to a value of 4096. This commit makes sure that the value of btrfs_super_block->stripesize is a power of 2. Later, it unconditionally sets btrfs_root->stripesize to sectorsize. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23btrfs: fix disk_i_size update bug when fallocate() failsWang Xiaoguang1-1/+2
When doing truncate operation, btrfs_setsize() will first call truncate_setsize() to set new inode->i_size, but if later btrfs_truncate() fails, btrfs_setsize() will call "i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the inmemory inode size, now bug occurs. It's because for truncate case btrfs_ordered_update_i_size() directly uses inode->i_size to update BTRFS_I(inode)->disk_i_size, indeed we should use the "offset" argument to update disk_i_size. Here is the call graph: ==>btrfs_truncate() ====>btrfs_truncate_inode_items() ======>btrfs_ordered_update_i_size(inode, last_size, NULL); Here btrfs_ordered_update_i_size()'s offset argument is last_size. And below test case can reveal this bug: dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100 dev=$(losetup --show -f fs.img) mkdir -p /mnt/mntpoint mkfs.btrfs -f $dev mount $dev /mnt/mntpoint cd /mnt/mntpoint echo "workdir is: /mnt/mntpoint" blocksize=$((128 * 1024)) dd if=/dev/zero of=testfile bs=$blocksize count=1 sync count=$((17*1024*1024*1024/blocksize)) echo "file size is:" $((count*blocksize)) for ((i = 1; i <= $count; i++)); do i=$((i + 1)) dst_offset=$((blocksize * i)) xfs_io -f -c "reflink testfile 0 $dst_offset $blocksize"\ testfile > /dev/null done sync truncate --size 0 testfile ls -l testfile du -sh testfile exit In this case, truncate operation will fail for enospc reason and "du -sh testfile" returns value greater than 0, but testfile's size is 0, we need to reflect correct inode->i_size. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23Btrfs: fix error handling in map_private_extent_bufferLiu Bo2-2/+11
map_private_extent_buffer() can return -EINVAL in two different cases, 1. when the requested contents span two pages if nodesize is larger than pagesize, 2. when it detects something insane. The 2nd one used to be only a WARN_ON(1), and we decided to return a error to callers, but we didn't fix up all its callers, which will be addressed by this patch. Without this, btrfs may end up with 'general protection', ie. reading invalid memory. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23Btrfs: fix error return code in btrfs_init_test_fs()Wei Yongjun1-1/+1
Fix to return a negative error code from the kern_mount() error handling case instead of 0(ret is set to 0 by register_filesystem), as done elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-22Btrfs: don't do nocow check unless we have toJosef Bacik1-22/+22
Before we write into prealloc/nocow space we have to make sure that there are no references to the extents we are writing into, which means checking the extent tree and csum tree in the case of nocow. So we don't want to do the nocow dance unless we can't reserve data space, since it's a serious drag on performance. With the following sequence fallocate -l10737418240 /mnt/btrfs-test/file cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \ --end_fsync=1 we get the worst case scenario where we have to fall back on to doing the check anyway. Without this patch lat (usec): min=5, max=111598, avg=27.65, stdev=124.51 write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec With this patch lat (usec): min=3, max=91210, avg=14.09, stdev=110.62 write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec We get twice the throughput, half of the runtime, and half of the average latency. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> [ PAGE_CACHE_ removal related fixups ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-22btrfs: fix deadlock in delayed_ref_async_startChris Mason1-9/+11
"Btrfs: track transid for delayed ref flushing" was deadlocking on btrfs_attach_transaction because its not safe to call from the async delayed ref start code. This commit brings back btrfs_join_transaction instead and checks for a blocked commit. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-22Btrfs: track transid for delayed ref flushingJosef Bacik4-5/+16
Using the offwakecputime bpf script I noticed most of our time was spent waiting on the delayed ref throttling. This is what is supposed to happen, but sometimes the transaction can commit and then we're waiting for throttling that doesn't matter anymore. So change this stuff to be a little smarter by tracking the transid we were in when we initiated the throttling. If the transaction we get is different then we can just bail out. This resulted in a 50% speedup in my fs_mark test, and reduced the amount of time spent throttling by 60 seconds over the entire run (which is about 30 minutes). Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-18Merge branch 'for-linus-4.7' of ↵Linus Torvalds11-43/+61
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "The most user visible change here is a fix for our recent superblock validation checks that were causing problems on non-4k pagesized systems" * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: btrfs_check_super_valid: Allow 4096 as stripesize btrfs: remove build fixup for qgroup_account_snapshot btrfs: use new error message helper in qgroup_account_snapshot btrfs: avoid blocking open_ctree from cleaner_kthread Btrfs: don't BUG_ON() in btrfs_orphan_add btrfs: account for non-CoW'd blocks in btrfs_abort_transaction Btrfs: check if extent buffer is aligned to sectorsize btrfs: Use correct format specifier
2016-06-17Btrfs: btrfs_check_super_valid: Allow 4096 as stripesizeChandan Rajendra1-1/+2
Older btrfs-progs/mkfs.btrfs sets 4096 as the stripesize. Hence restricting stripesize to be equal to sectorsize would cause super block validation to return an error on architectures where PAGE_SIZE is not equal to 4096. Hence as a workaround, this commit allows stripesize to be set to 4096 bytes. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17btrfs: remove build fixup for qgroup_account_snapshotDavid Sterba1-5/+0
Introduced in 2c1984f244838477aab ("btrfs: build fixup for qgroup_account_snapshot") as temporary bisectability build fixup. Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17btrfs: use new error message helper in qgroup_account_snapshotDavid Sterba1-1/+1
We've renamed btrfs_std_error, this one is left from last merge. Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17btrfs: avoid blocking open_ctree from cleaner_kthreadZygo Blaxell2-15/+12
This fixes a problem introduced in commit 2f3165ecf103599f82bf0ea254039db335fb5005 "btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes". open_ctree eventually calls btrfs_replay_log which in turn calls btrfs_commit_super which tries to lock the cleaner_mutex, causing a recursive mutex deadlock during mount. Instead of playing whack-a-mole trying to keep up with all the functions that may want to lock cleaner_mutex, put all the cleaner_mutex lockers back where they were, and attack the problem more directly: keep cleaner_kthread asleep until the filesystem is mounted. When filesystems are mounted read-only and later remounted read-write, open_ctree did not set fs_info->open and neither does anything else. Set this flag in btrfs_remount so that neither btrfs_delete_unused_bgs nor cleaner_kthread get confused by the common case of "/" filesystem read-only mount followed by read-write remount. Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17Btrfs: don't BUG_ON() in btrfs_orphan_addJosef Bacik1-1/+10
This is just a screwup for developers, so change it to an ASSERT() so developers notice when things go wrong and deal with the error appropriately if ASSERT() isn't enabled. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17btrfs: account for non-CoW'd blocks in btrfs_abort_transactionJeff Mahoney4-4/+7
The test for !trans->blocks_used in btrfs_abort_transaction is insufficient to determine whether it's safe to drop the transaction handle on the floor. btrfs_cow_block, informed by should_cow_block, can return blocks that have already been CoW'd in the current transaction. trans->blocks_used is only incremented for new block allocations. If an operation overlaps the blocks in the current transaction entirely and must abort the transaction, we'll happily let it clean up the trans handle even though it may have modified the blocks and will commit an incomplete operation. In the long-term, I'd like to do closer tracking of when the fs is actually modified so we can still recover as gracefully as possible, but that approach will need some discussion. In the short term, since this is the only code using trans->blocks_used, let's just switch it to a bool indicating whether any blocks were used and set it when should_cow_block returns false. Cc: stable@vger.kernel.org # 3.4+ Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17Btrfs: check if extent buffer is aligned to sectorsizeLiu Bo6-15/+28
Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer via alloc_extent_buffer(). An unaligned eb can have more pages than it should have, which ends up extent buffer's leak or some corrupted content in extent buffer. This adds a warning to let us quickly know what was happening. Now that alloc_extent_buffer() no more returns NULL, this changes its caller and callers of its caller to match with the new error handling. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17btrfs: Use correct format specifierHeinrich Schuchardt1-1/+1
Component mirror_num of struct btrfsic_block is defined as unsigned int. Use %u as format specifier. Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-10Merge branch 'for-linus-4.7' of ↵Linus Torvalds18-357/+568
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Has some fixes and some new self tests for btrfs. The self tests are usually disabled in the .config file (unless you're doing btrfs dev work), and this bunch is meant to find problems with the 64K page size patches. Jeff has a patch to help people see if they are using the hardware assist crc32c module, which really helps us nail down problems when people ask why crcs are using so much CPU. Otherwise, it's small fixes" * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsize Btrfs: self-tests: Use macros instead of constants and add missing newline Btrfs: self-tests: Support testing all possible sectorsizes and nodesizes Btrfs: self-tests: Execute page straddling test only when nodesize < PAGE_SIZE btrfs: advertise which crc32c implementation is being used at module load Btrfs: add validadtion checks for chunk loading Btrfs: add more validation checks for superblock Btrfs: clear uptodate flags of pages in sys_array eb Btrfs: self-tests: Support non-4k page size Btrfs: Fix integer overflow when calculating bytes_per_bitmap Btrfs: test_check_exists: Fix infinite loop when searching for free space entries Btrfs: end transaction if we abort when creating uuid root btrfs: Use __u64 in exported linux/btrfs.h.
2016-06-08Merge branch 'misc-fixes-4.7' of ↵Chris Mason5-17/+114
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.7
2016-06-08Merge branch 'for-chris' of ↵Chris Mason15-340/+454
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.7
2016-06-07block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSHMike Christie1-4/+4
To avoid confusion between REQ_OP_FLUSH, which is handled by request_fn drivers, and upper layers requesting the block layer perform a flush sequence along with possibly a WRITE, this patch renames REQ_FLUSH to REQ_PREFLUSH. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block, drivers, fs: shrink bi_rw from long to intMike Christie2-3/+3
We don't need bi_rw to be so large on 64 bit archs, so reduce it to unsigned int. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07btrfs: use bio fields for op and flagsMike Christie9-67/+54
The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is no need to pass around the rq_flag_bits bits too. btrfs users should should access the bio insead. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>