summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/disk-io.c
AgeCommit message (Collapse)AuthorFilesLines
2018-03-26btrfs: Remove custom crc32c init codeNikolay Borisov1-2/+2
The custom crc32 init code was introduced in 14a958e678cd ("Btrfs: fix btrfs boot when compiled as built-in") to enable using btrfs as a built-in. However, later as pointed out by 60efa5eb2e88 ("Btrfs: use late_initcall instead of module_init") this wasn't enough and finally btrfs was switched to late_initcall which comes after the generic crc32c implementation is initiliased. The latter commit superseeded the former. Now that we don't have to maintain our own code let's just remove it and switch to using the generic implementation. Despite touching a lot of files the patch is really simple. Here is the gist of the changes: 1. Select LIBCRC32C rather than the low-level modules. 2. s/btrfs_crc32c/crc32c/g 3. replace hash.h with linux/crc32c.h 4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly. I've tested this with btrfs being both a module and a built-in and xfstest doesn't complain. Does seem to fix the longstanding problem of not automatically selectiong the crc32c module when btrfs is used. Possibly there is a workaround in dracut. The modinfo confirms that now all the module dependencies are there: before: depends: zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate after: depends: libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add more info to changelog from mails ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: tree-checker: Replace root parameter with fs_infoQu Wenruo1-3/+3
When inspecting the error message with real corruption, the "root=%llu" always shows "1" (root tree), instead of the correct owner. The problem is that we are getting @root from page->mapping->host, which points the same btree inode, so we will always get the same root. This makes the root owner output meaningless, and harder to port tree-checker to btrfs-progs. So get rid of the false and meaningless @root parameter and replace it with @fs_info. To get the owner, we can only rely on btrfs_header_owner() now. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: Use schedule_timeout_interruptibleNikolay Borisov1-3/+1
Instead of manually fiddling with the state of the task (RUNNING->INTERRUPTIBLE->RUNNING) again just use schedule_timeout_interruptible which adjusts the task state as needed. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: remove unused function btrfs_async_submit_limit()Anand Jain1-8/+0
Commit [1] removed the need to use btrfs_async_submit_limit(), so delete it. [1] commit 736cd52e0c720103f52ab9da47b6cc3af6b083f6 Btrfs: remove nr_async_submits and async_submit_draining Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: Document consistency of transaction->io_bgs listNikolay Borisov1-0/+4
The reason why io_bgs can be modified without holding any lock is non-obvious. Document it and reference that documentation from the respective call sites. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: Remove invalid null checks from btrfs_cleanup_dirty_bgsNikolay Borisov1-9/+0
list_first_entry is essentially a wrapper over cotnainer_of. The latter can never return null even if it's working on inconsistent list since it will either crash or return some offset in the wrong struct. Additionally, for the dirty_bgs list the iteration is done under dirty_bgs_lock which ensures consistency of the list. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: manage thread_pool mount option as %uAnand Jain1-2/+2
The mount option thread_pool is always unsigned. Manage it that way all around. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: print error if primary super block write failsHoward McLauchlan1-1/+14
Presently, failing a primary super block write but succeeding in at least one super block write in general will appear to users as if nothing important went wrong. However, upon unmounting and re-mounting, the file system will be in a rolled back state. This was discovered with a BCC program that uses bpf_override_return() to fail super block writes. This patch outputs an error clarifying that the primary super block write has failed, so users can expect potentially erroneous behaviour. It also forces wait_dev_supers() to return an error to its caller if the primary super block write fails. Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds1-0/+2
Pull networking updates from David Miller: 1) Significantly shrink the core networking routing structures. Result of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf 2) Add netdevsim driver for testing various offloads, from Jakub Kicinski. 3) Support cross-chip FDB operations in DSA, from Vivien Didelot. 4) Add a 2nd listener hash table for TCP, similar to what was done for UDP. From Martin KaFai Lau. 5) Add eBPF based queue selection to tun, from Jason Wang. 6) Lockless qdisc support, from John Fastabend. 7) SCTP stream interleave support, from Xin Long. 8) Smoother TCP receive autotuning, from Eric Dumazet. 9) Lots of erspan tunneling enhancements, from William Tu. 10) Add true function call support to BPF, from Alexei Starovoitov. 11) Add explicit support for GRO HW offloading, from Michael Chan. 12) Support extack generation in more netlink subsystems. From Alexander Aring, Quentin Monnet, and Jakub Kicinski. 13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From Russell King. 14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso. 15) Many improvements and simplifications to the NFP driver bpf JIT, from Jakub Kicinski. 16) Support for ipv6 non-equal cost multipath routing, from Ido Schimmel. 17) Add resource abstration to devlink, from Arkadi Sharshevsky. 18) Packet scheduler classifier shared filter block support, from Jiri Pirko. 19) Avoid locking in act_csum, from Davide Caratti. 20) devinet_ioctl() simplifications from Al viro. 21) More TCP bpf improvements from Lawrence Brakmo. 22) Add support for onlink ipv6 route flag, similar to ipv4, from David Ahern. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits) tls: Add support for encryption using async offload accelerator ip6mr: fix stale iterator net/sched: kconfig: Remove blank help texts openvswitch: meter: Use 64-bit arithmetic instead of 32-bit tcp_nv: fix potential integer overflow in tcpnv_acked r8169: fix RTL8168EP take too long to complete driver initialization. qmi_wwan: Add support for Quectel EP06 rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK ipmr: Fix ptrdiff_t print formatting ibmvnic: Wait for device response when changing MAC qlcnic: fix deadlock bug tcp: release sk_frag.page in tcp_disconnect ipv4: Get the address of interface correctly. net_sched: gen_estimator: fix lockdep splat net: macb: Handle HRESP error net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring ipv6: addrconf: break critical section in addrconf_verify_rtnl() ipv6: change route cache aging logic i40e/i40evf: Update DESC_NEEDED value to reflect larger value bnxt_en: cleanup DIM work on device shutdown ...
2018-01-22btrfs: fail mount when sb flag is not in BTRFS_SUPER_FLAG_SUPPAnand Jain1-2/+4
It appears from the original commit [1] that there isn't any design specific reason not to fail the mount instead of just warning. This patch will change it to fail. [1] commit 319e4d0661e5323c9f9945f0f8fb5905e5fe74c3 btrfs: Enhance super validation check Fixes: 319e4d0661e5323 ("btrfs: Enhance super validation check") Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22btrfs: define SUPER_FLAG_METADUMP_V2Anand Jain1-1/+2
btrfs-progs uses super flag bit BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34). So just define that in kernel so that we know its been used. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22btrfs: factor btrfs_check_rw_degradable() to check given deviceAnand Jain1-2/+2
Update btrfs_check_rw_degradable() to check against the given device if its lost. We can use this function to know if the volume is going to be in degraded mode OR failed state, when the given device fails. Which is needed when we are handling the device failed state. A preparatory patch does not affect the flow as such. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ enhance comment ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22btrfs: sink unlock_extent parameter gfp_flagsDavid Sterba1-1/+1
All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the parameter to the function, though we lose some of the slightly better semantics of GFP_KERNEL in some places, it's worth cleaning up the callchains. Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22btrfs: cleanup device states define BTRFS_DEV_STATE_FLUSH_SENTAnand Jain1-3/+3
Currently device state is being managed by each individual int variable such as struct btrfs_device::is_tgtdev_for_dev_replace. Instead of that declare btrfs_device::dev_state BTRFS_DEV_STATE_FLUSH_SENT and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22btrfs: cleanup device states define BTRFS_DEV_STATE_MISSINGAnand Jain1-2/+2
Currently device state is being managed by each individual int variable such as struct btrfs_device::missing. Instead of that declare btrfs_device::dev_state BTRFS_DEV_STATE_MISSING and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by : Nikolay Borisov <nborisov@suse.com> [ whitespace adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATAAnand Jain1-4/+4
Currently device state is being managed by each individual int variable such as struct btrfs_device::in_fs_metadata. Instead of that declare device state BTRFS_DEV_STATE_IN_FS_METADATA and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ whitespace adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLEAnand Jain1-4/+8
Currently device state is being managed by each individual int variable such as struct btrfs_device::writeable. Instead of that declare device state BTRFS_DEV_STATE_WRITEABLE and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> [ whitespace adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22btrfs: switch to on-stack csum buffer in csum_tree_blockDavid Sterba1-13/+3
The maximum size of a checksum buffer is known, BTRFS_CSUM_SIZE, and we don't have to allocate it dynamically. This code path is not used at all as we have only the crc32c and use an on-stack buffer already. Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22btrfs: sink get_extent parameter to read_extent_buffer_pagesDavid Sterba1-4/+4
All callers pass btree_get_extent, which needs to be exported. Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22btrfs: use non-RCU list traversal in write_all_supers calleesDavid Sterba1-4/+5
We take the fs_devices::device_list_mutex mutex in write_all_supers which will prevent any add/del changes to the device list. Therefore we don't need to use the RCU variant list_for_each_entry_rcu in any of the called functions. Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22btrfs: Don't generate UUID for non-fs treeQu Wenruo1-2/+3
btrfs_create_tree() will unconditionally generate UUID for any root. So for quota tree and data reloc tree created by kernel, they will have unique UUIDs. However UUID in root item is only referred by UUID tree, which only records UUID for fs trees. This makes unique UUIDs for quota/data reloc tree meaningless. Leave the UUID as zero for non-fs tree, making btrfs-debug-tree output less confusing. Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-12error-injection: Add injectable error typesMasami Hiramatsu1-1/+1
Add injectable error types for each error-injectable function. One motivation of error injection test is to find software flaws, mistakes or mis-handlings of expectable errors. If we find such flaws by the test, that is a program bug, so we need to fix it. But if the tester miss input the error (e.g. just return success code without processing anything), it causes unexpected behavior even if the caller is correctly programmed to handle any errors. That is not what we want to test by error injection. To clarify what type of errors the caller must expect for each injectable function, this introduces injectable error types: - EI_ETYPE_NULL : means the function will return NULL if it fails. No ERR_PTR, just a NULL. - EI_ETYPE_ERRNO : means the function will return -ERRNO if it fails. - EI_ETYPE_ERRNO_NULL : means the function will return -ERRNO (ERR_PTR) or NULL. ALLOW_ERROR_INJECTION() macro is expanded to get one of NULL, ERRNO, ERRNO_NULL to record the error type for each function. e.g. ALLOW_ERROR_INJECTION(open_ctree, ERRNO) This error types are shown in debugfs as below. ==== / # cat /sys/kernel/debug/error_injection/list open_ctree [btrfs] ERRNO io_ctl_init [btrfs] ERRNO ==== Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-12error-injection: Separate error-injection from kprobeMasami Hiramatsu1-2/+2
Since error-injection framework is not limited to be used by kprobes, nor bpf. Other kernel subsystems can use it freely for checking safeness of error-injection, e.g. livepatch, ftrace etc. So this separate error-injection framework from kprobes. Some differences has been made: - "kprobe" word is removed from any APIs/structures. - BPF_ALLOW_ERROR_INJECTION() is renamed to ALLOW_ERROR_INJECTION() since it is not limited for BPF too. - CONFIG_FUNCTION_ERROR_INJECTION is the config item of this feature. It is automatically enabled if the arch supports error injection feature for kprobe or ftrace etc. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-0/+2
Daniel Borkmann says: ==================== pull-request: bpf-next 2017-12-18 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Allow arbitrary function calls from one BPF function to another BPF function. As of today when writing BPF programs, __always_inline had to be used in the BPF C programs for all functions, unnecessarily causing LLVM to inflate code size. Handle this more naturally with support for BPF to BPF calls such that this __always_inline restriction can be overcome. As a result, it allows for better optimized code and finally enables to introduce core BPF libraries in the future that can be reused out of different projects. x86 and arm64 JIT support was added as well, from Alexei. 2) Add infrastructure for tagging functions as error injectable and allow for BPF to return arbitrary error values when BPF is attached via kprobes on those. This way of injecting errors generically eases testing and debugging without having to recompile or restart the kernel. Tags for opting-in for this facility are added with BPF_ALLOW_ERROR_INJECTION(), from Josef. 3) For BPF offload via nfp JIT, add support for bpf_xdp_adjust_head() helper call for XDP programs. First part of this work adds handling of BPF capabilities included in the firmware, and the later patches add support to the nfp verifier part and JIT as well as some small optimizations, from Jakub. 4) The bpftool now also gets support for basic cgroup BPF operations such as attaching, detaching and listing current BPF programs. As a requirement for the attach part, bpftool can now also load object files through 'bpftool prog load'. This reuses libbpf which we have in the kernel tree as well. bpftool-cgroup man page is added along with it, from Roman. 5) Back then commit e87c6bc3852b ("bpf: permit multiple bpf attachments for a single perf event") added support for attaching multiple BPF programs to a single perf event. Given they are configured through perf's ioctl() interface, the interface has been extended with a PERF_EVENT_IOC_QUERY_BPF command in this work in order to return an array of one or multiple BPF prog ids that are currently attached, from Yonghong. 6) Various minor fixes and cleanups to the bpftool's Makefile as well as a new 'uninstall' and 'doc-uninstall' target for removing bpftool itself or prior installed documentation related to it, from Quentin. 7) Add CONFIG_CGROUP_BPF=y to the BPF kernel selftest config file which is required for the test_dev_cgroup test case to run, from Naresh. 8) Fix reporting of XDP prog_flags for nfp driver, from Jakub. 9) Fix libbpf's exit code from the Makefile when libelf was not found in the system, also from Jakub. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-12btrfs: make open_ctree error injectableJosef Bacik1-0/+2
This allows us to do error injection with BPF for open_ctree. Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-07Btrfs: disable FUA if mounted with nobarrierOmar Sandoval1-7/+5
I was seeing disk flushes still happening when I mounted a Btrfs filesystem with nobarrier for testing. This is because we use FUA to write out the first super block, and on devices without FUA support, the block layer translates FUA to a flush. Even on devices supporting true FUA, using FUA when we asked for no barriers is surprising. Fixes: 387125fc722a8ed ("Btrfs: fix barrier flushes") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-28btrfs: tree-checker: Fix false panic for sanity testQu Wenruo1-2/+8
[BUG] If we run btrfs with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y, it will instantly cause kernel panic like: ------ ... assertion failed: 0, file: fs/btrfs/disk-io.c, line: 3853 ... Call Trace: btrfs_mark_buffer_dirty+0x187/0x1f0 [btrfs] setup_items_for_insert+0x385/0x650 [btrfs] __btrfs_drop_extents+0x129a/0x1870 [btrfs] ... ----- [Cause] Btrfs will call btrfs_check_leaf() in btrfs_mark_buffer_dirty() to check if the leaf is valid with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y. However quite some btrfs_mark_buffer_dirty() callers(*) don't really initialize its item data but only initialize its item pointers, leaving item data uninitialized. This makes tree-checker catch uninitialized data as error, causing such panic. *: These callers include but not limited to setup_items_for_insert() btrfs_split_item() btrfs_expand_item() [Fix] Add a new parameter @check_item_data to btrfs_check_leaf(). With @check_item_data set to false, item data check will be skipped and fallback to old btrfs_check_leaf() behavior. So we can still get early warning if we screw up item pointers, and avoid false panic. Cc: Filipe Manana <fdmanana@gmail.com> Reported-by: Lakshmipathi.G <lakshmipathi.g@gmail.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01btrfs: track refs in a rb_tree instead of a listJosef Bacik1-4/+6
If we get a significant amount of delayed refs for a single block (think modifying multiple snapshots) we can end up spending an ungodly amount of time looping through all of the entries trying to see if they can be merged. This is because we only add them to a list, so we have O(2n) for every ref head. This doesn't make any sense as we likely have refs for different roots, and so they cannot be merged. Tracking in a tree will allow us to break as soon as we hit an entry that doesn't match, making our worst case O(n). With this we can also merge entries more easily. Before we had to hope that matching refs were on the ends of our list, but with the tree we can search down to exact matches and merge them at insert time. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01btrfs: make the delalloc block rsv per inodeJosef Bacik1-10/+8
The way we handle delalloc metadata reservations has gotten progressively more complicated over the years. There is so much cruft and weirdness around keeping the reserved count and outstanding counters consistent and handling the error cases that it's impossible to understand. Fix this by making the delalloc block rsv per-inode. This way we can calculate the actual size of the outstanding metadata reservations every time we make a change, and then reserve the delta based on that amount. This greatly simplifies the code everywhere, and makes the error handling in btrfs_delalloc_reserve_metadata far less terrifying. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30btrfs: Replace opencoded sizes with their symbolic constantsNikolay Borisov1-1/+1
Currently btrfs' code uses a mix of opencoded sizes and defines from sizes.h. Let's unifiy the code base to always use the symbolic constants. No functional changes Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30btrfs: remove delayed_ref_node from ref_headJosef Bacik1-6/+6
This is just excessive information in the ref_head, and makes the code complicated. It is a relic from when we had the heads and the refs in the same tree, which is no longer the case. With this removal I've cleaned up a bunch of the cruft around this old assumption as well. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30Btrfs: add a extent ref verify toolJosef Bacik1-0/+6
We were having corruption issues that were tied back to problems with the extent tree. In order to track them down I built this tool to try and find the culprit, which was pretty successful. If you compile with this tool on it will live verify every ref update that the fs makes and make sure it is consistent and valid. I've run this through with xfstests and haven't gotten any false positives. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update error messages, add fixup from Dan Carpenter to handle errors of read_tree_block ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30Btrfs: remove nr_async_submits and async_submit_drainingLiu Bo1-24/+0
Now that we have the combo of flushing twice, which can make sure IO have started since the second flush will wait for page lock which won't be unlocked unless setting page writeback and queuing ordered extents, we don't need %async_submit_draining, %async_delalloc_pages and %nr_async_submits to tell whether the IO has actually started. Moreover, all the flushers in use are followed by functions that wait for ordered extents to complete, so %nr_async_submits, which tracks whether bio's async submit has made progress, doesn't really make sense. However, %async_delalloc_pages is still required by shrink_delalloc() as that function doesn't flush twice in the normal case (just issues a writeback with WB_REASON_FS_FREE_SPACE). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30Btrfs: remove nr_async_biosLiu Bo1-1/+0
This was intended to congest higher layers to not send bios, but as 1) the congested bit has been taken by writeback Async bios come from buffered writes and DIO writes. For DIO writes, we want to submit them ASAP, while for buffered writes, writeback uses balance_dirty_pages() to throttle how much dirty pages we can have. 2) and no one is waiting for %nr_async_bios down to zero, Historically, it was introduced along with changes which let checksumming workload spread accross different cpus. And at that time, pdflush was used instead of per-bdi flushing, perhaps pdflush did not have the necessary information for writeback to do throttling. We can safely remove them now. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> [ additional explanation from mails, removed unused variable 'limit' ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30btrfs: Move leaf and node validation checker to tree-checker.cQu Wenruo1-281/+4
It's no doubt the comprehensive tree block checker will become larger, so moving them into their own files is quite reasonable. Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com> [ wording adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30btrfs: Add checker for EXTENT_CSUMQu Wenruo1-0/+24
EXTENT_CSUM checker is a relatively easy one, only needs to check: 1) Objectid Fixed to BTRFS_EXTENT_CSUM_OBJECTID 2) Key offset alignment Must be aligned to sectorsize 3) Item size alignedment Must be aligned to csum size Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30btrfs: Add sanity check for EXTENT_DATA when reading out leafQu Wenruo1-0/+103
Add extra checks for item with EXTENT_DATA type. This checks the following thing: 0) Key offset All key offsets must be aligned to sectorsize. Inline extent must have 0 for key offset. 1) Item size Uncompressed inline file extent size must match item size. (Compressed inline file extent has no information about its on-disk size.) Regular/preallocated file extent size must be a fixed value. 2) Every member of regular file extent item Including alignment for bytenr and offset, possible value for compression/encryption/type. 3) Type/compression/encode must be one of the valid values. This should be the most comprehensive and strict check in the context of btrfs_item for EXTENT_DATA. Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ switch to BTRFS_FILE_EXTENT_TYPES, similar to what BTRFS_COMPRESS_TYPES does ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30btrfs: Check if item pointer overlaps with the item itselfQu Wenruo1-0/+7
Function check_leaf() checks if any item pointer points outside of the leaf, but it doesn't check if the pointer overlaps with the item itself. Normally only the last item may be the victim, but adding such check is never a bad idea anyway. Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30btrfs: Refactor check_leaf function for later expansionQu Wenruo1-23/+27
Current check_leaf() function does a good job checking key order and item offset/size. However it only checks from slot 0 to the last but one slot, this is good but makes later expansion hard. So this refactoring iterates from slot 0 to the last slot. For key comparison, it uses a key with all 0 as initial key, so all valid keys should be larger than that. And for item size/offset checks, it compares current item end with previous item offset. For slot 0, use leaf end as a special case. This makes later item/key offset checks and item size checks easier to be implemented. Also, makes check_leaf() to return -EUCLEAN other than -EIO to indicate error. Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30Btrfs: remove bio_flags which indicates a meta block of log-treeLiu Bo1-4/+2
Since both committing transaction and writing log-tree are doing plugging on metadata IO, we can unify to use %sync_writers to benefit both cases, instead of checking bio_flags while writing meta blocks of log-tree. We can remove this bio_flags because in order to write dirty blocks, log tree also uses btrfs_write_marked_extents(), inside which we have enabled %sync_writers, therefore, every write goes in a synchronous way, so does checksuming. Please also note that, bio_flags is applied per-context while %sync_writers is applied per-inode, so this might incur some overhead, ie. 1) while log tree is flushing its dirty blocks via btrfs_write_marked_extents(), in which %sync_writers is increased by one. 2) in the meantime, some writeback operations may happen upon btrfs's metadata inode, so these writes go synchronously, too. However, AFAICS, the overhead is not a big one while the win is that we unify the two places that needs synchronous way and remove a special hack/flag. This removes the bio_flags related stuff for writing log-tree. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30Btrfs: make plug in writing meta blocks really workLiu Bo1-2/+4
We have started plug in btrfs_write_and_wait_marked_extents() but the generated IOs actually go to device's schedule IO list where the work is doing in another task, thus the started plug doesn't make any sense. And since we wait for IOs immediately after writing meta blocks, it's the same case as writing log tree, doing sync submit can merge more IOs. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30btrfs: copy fsid to super_block s_uuidAnand Jain1-0/+1
We didn't copy fsid to struct super_block.s_uuid so Overlay disables index feature with btrfs as the lower FS. kernel: overlayfs: fs on '/lower' does not support file handles, falling back to index=off. Fix this by publishing the fsid through struct super_block.s_uuid. [ dsterba: I think that setting s_uuid is the last missing bit. Overlay needs the file handle encoding support from the lower filesystem, which is supported. Filling the whole filesystem id is correct, the subvolume id is encoded in the file handle buffer from inside btrfs_encode_fh. ] Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-29Merge branch 'for-4.14-rc3' of ↵Linus Torvalds1-1/+8
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "We've collected a bunch of isolated fixes, for crashes, user-visible behaviour or missing bits from other subsystem cleanups from the past. The overall number is not small but I was not able to make it significantly smaller. Most of the patches are supposed to go to stable" * 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: log csums for all modified extents Btrfs: fix unexpected result when dio reading corrupted blocks btrfs: Report error on removing qgroup if del_qgroup_item fails Btrfs: skip checksum when reading compressed data if some IO have failed Btrfs: fix kernel oops while reading compressed data Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block Btrfs: do not backup tree roots when fsync btrfs: remove BTRFS_FS_QUOTA_DISABLING flag btrfs: propagate error to btrfs_cmp_data_prepare caller btrfs: prevent to set invalid default subvolid Btrfs: send: fix error number for unknown inode types btrfs: fix NULL pointer dereference from free_reloc_roots() btrfs: finish ordered extent cleaning if no progress is found btrfs: clear ordered flag on cleaning up ordered extents Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO Btrfs: do not reset bio->bi_ops while writing bio Btrfs: use the new helper wbc_to_write_flags
2017-09-26Btrfs: do not backup tree roots when fsyncLiu Bo1-1/+8
It doesn't make sense to backup tree roots when doing fsync, since during fsync those tree roots have not been consistent on disk. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-14Merge branch 'work.mount' of ↵Linus Torvalds1-6/+6
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount flag updates from Al Viro: "Another chunk of fmount preparations from dhowells; only trivial conflicts for that part. It separates MS_... bits (very grotty mount(2) ABI) from the struct super_block ->s_flags (kernel-internal, only a small subset of MS_... stuff). This does *not* convert the filesystems to new constants; only the infrastructure is done here. The next step in that series is where the conflicts would be; that's the conversion of filesystems. It's purely mechanical and it's better done after the merge, so if you could run something like list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$') sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \ -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \ -e 's/\<MS_NODEV\>/SB_NODEV/g' \ -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \ -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \ -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \ -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \ -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \ -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \ -e 's/\<MS_SILENT\>/SB_SILENT/g' \ -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \ -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \ -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \ -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \ $list and commit it with something along the lines of 'convert filesystems away from use of MS_... constants' as commit message, it would save a quite a bit of headache next cycle" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: VFS: Differentiate mount flags (MS_*) from internal superblock flags VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-09-14Merge branch 'zstd-minimal' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull zstd support from Chris Mason: "Nick Terrell's patch series to add zstd support to the kernel has been floating around for a while. After talking with Dave Sterba, Herbert and Phillip, we decided to send the whole thing in as one pull request. zstd is a big win in speed over zlib and in compression ratio over lzo, and the compression team here at FB has gotten great results using it in production. Nick will continue to update the kernel side with new improvements from the open source zstd userland code. Nick has a number of benchmarks for the main zstd code in his lib/zstd commit: I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is 211,988,480 B large. Run the following commands for the benchmark: sudo modprobe zstd_compress_test sudo mknod zstd_compress_test c 245 0 sudo cp silesia.tar zstd_compress_test The time is reported by the time of the userland `cp`. The MB/s is computed with 1,536,217,008 B / time(buffer size, hash) which includes the time to copy from userland. The Adjusted MB/s is computed with 1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)). The memory reported is the amount of memory the compressor requests. | Method | Size (B) | Time (s) | Ratio | MB/s | Adj MB/s | Mem (MB) | |----------|----------|----------|-------|---------|----------|----------| | none | 11988480 | 0.100 | 1 | 2119.88 | - | - | | zstd -1 | 73645762 | 1.044 | 2.878 | 203.05 | 224.56 | 1.23 | | zstd -3 | 66988878 | 1.761 | 3.165 | 120.38 | 127.63 | 2.47 | | zstd -5 | 65001259 | 2.563 | 3.261 | 82.71 | 86.07 | 2.86 | | zstd -10 | 60165346 | 13.242 | 3.523 | 16.01 | 16.13 | 13.22 | | zstd -15 | 58009756 | 47.601 | 3.654 | 4.45 | 4.46 | 21.61 | | zstd -19 | 54014593 | 102.835 | 3.925 | 2.06 | 2.06 | 60.15 | | zlib -1 | 77260026 | 2.895 | 2.744 | 73.23 | 75.85 | 0.27 | | zlib -3 | 72972206 | 4.116 | 2.905 | 51.50 | 52.79 | 0.27 | | zlib -6 | 68190360 | 9.633 | 3.109 | 22.01 | 22.24 | 0.27 | | zlib -9 | 67613382 | 22.554 | 3.135 | 9.40 | 9.44 | 0.27 | I benchmarked zstd decompression using the same method on the same machine. The benchmark file is located in the upstream zstd repo under `contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is the amount of memory required to decompress data compressed with the given compression level. If you know the maximum size of your input, you can reduce the memory usage of decompression irrespective of the compression level. | Method | Time (s) | MB/s | Adjusted MB/s | Memory (MB) | |----------|----------|---------|---------------|-------------| | none | 0.025 | 8479.54 | - | - | | zstd -1 | 0.358 | 592.15 | 636.60 | 0.84 | | zstd -3 | 0.396 | 535.32 | 571.40 | 1.46 | | zstd -5 | 0.396 | 535.32 | 571.40 | 1.46 | | zstd -10 | 0.374 | 566.81 | 607.42 | 2.51 | | zstd -15 | 0.379 | 559.34 | 598.84 | 4.61 | | zstd -19 | 0.412 | 514.54 | 547.77 | 8.80 | | zlib -1 | 0.940 | 225.52 | 231.68 | 0.04 | | zlib -3 | 0.883 | 240.08 | 247.07 | 0.04 | | zlib -6 | 0.844 | 251.17 | 258.84 | 0.04 | | zlib -9 | 0.837 | 253.27 | 287.64 | 0.04 | I ran a long series of tests and benchmarks on the btrfs side and the gains are very similar to the core benchmarks Nick ran" * 'zstd-minimal' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: squashfs: Add zstd support btrfs: Add zstd support lib: Add zstd modules lib: Add xxhash module
2017-09-09Merge branch 'for-4.14' of ↵Linus Torvalds1-148/+94
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The changes range through all types: cleanups, core chagnes, sanity checks, fixes, other user visible changes, detailed list below: - deprecated: user transaction ioctl - mount option ssd does not change allocation alignments - degraded read-write mount is allowed if all the raid profile constraints are met, now based on more accurate check - defrag: do not reset compression afterwards; the NOCOMPRESS flag can be now overriden by defrag - prep work for better extent reference tracking (related to the qgroup slowness with balance) - prep work for compression heuristics - memory allocation reductions (may help latencies on a loaded system) - better accounting for io waiting states - error handling improvements (removed BUGs) - added more sanity checks for shared refs - fix readdir vs pagefault deadlock under some circumstances - fix for 'no-hole' mode, certain combination of compressed and inline extents - send: fix emission of invalid clone operations - fixup file mode if setting acls fail - more fixes from fuzzing - oher cleanups" * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits) btrfs: submit superblock io with REQ_META and REQ_PRIO btrfs: remove unnecessary memory barrier in btrfs_direct_IO btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent btrfs: pass fs_info to btrfs_del_root instead of tree_root Btrfs: add one more sanity check for shared ref type Btrfs: remove BUG_ON in __add_tree_block Btrfs: remove BUG() in add_data_reference Btrfs: remove BUG() in print_extent_item Btrfs: remove BUG() in btrfs_extent_inline_ref_size Btrfs: convert to use btrfs_get_extent_inline_ref_type Btrfs: add a helper to retrive extent inline ref type btrfs: scrub: simplify scrub worker initialization btrfs: scrub: clean up division in scrub_find_csum btrfs: scrub: clean up division in __scrub_mark_bitmap btrfs: scrub: use bool for flush_all_writes btrfs: preserve i_mode if __btrfs_set_acl() fails btrfs: Remove extraneous chunk_objectid variable btrfs: Remove chunk_objectid argument from btrfs_make_block_group btrfs: Remove extra parentheses from condition in copy_items() ...
2017-09-07Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
Pull block layer updates from Jens Axboe: "This is the first pull request for 4.14, containing most of the code changes. It's a quiet series this round, which I think we needed after the churn of the last few series. This contains: - Fix for a registration race in loop, from Anton Volkov. - Overflow complaint fix from Arnd for DAC960. - Series of drbd changes from the usual suspects. - Conversion of the stec/skd driver to blk-mq. From Bart. - A few BFQ improvements/fixes from Paolo. - CFQ improvement from Ritesh, allowing idling for group idle. - A few fixes found by Dan's smatch, courtesy of Dan. - A warning fixup for a race between changing the IO scheduler and device remova. From David Jeffery. - A few nbd fixes from Josef. - Support for cgroup info in blktrace, from Shaohua. - Also from Shaohua, new features in the null_blk driver to allow it to actually hold data, among other things. - Various corner cases and error handling fixes from Weiping Zhang. - Improvements to the IO stats tracking for blk-mq from me. Can drastically improve performance for fast devices and/or big machines. - Series from Christoph removing bi_bdev as being needed for IO submission, in preparation for nvme multipathing code. - Series from Bart, including various cleanups and fixes for switch fall through case complaints" * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits) kernfs: checking for IS_ERR() instead of NULL drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set drbd: Fix allyesconfig build, fix recent commit drbd: switch from kmalloc() to kmalloc_array() drbd: abort drbd_start_resync if there is no connection drbd: move global variables to drbd namespace and make some static drbd: rename "usermode_helper" to "drbd_usermode_helper" drbd: fix race between handshake and admin disconnect/down drbd: fix potential deadlock when trying to detach during handshake drbd: A single dot should be put into a sequence. drbd: fix rmmod cleanup, remove _all_ debugfs entries drbd: Use setup_timer() instead of init_timer() to simplify the code. drbd: fix potential get_ldev/put_ldev refcount imbalance during attach drbd: new disk-option disable-write-same drbd: Fix resource role for newly created resources in events2 drbd: mark symbols static where possible drbd: Send P_NEG_ACK upon write error in protocol != C drbd: add explicit plugging when submitting batches drbd: change list_for_each_safe to while(list_first_entry_or_null) drbd: introduce drbd_recv_header_maybe_unplug ...
2017-08-24Btrfs: fix blk_status_t/errno confusionOmar Sandoval1-2/+2
This fixes several instances of blk_status_t and bare errno ints being mixed up, some of which are real bugs. In the normal case, 0 matches BLK_STS_OK, so we don't observe any effects of the missing conversion, but in case of errors or passes through the repair/retry paths, the errors get mixed up. The changes were identified using 'sparse', we don't have reports of the buggy behaviour. Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-23block: replace bi_bdev with a gendisk pointer and partitions indexChristoph Hellwig1-1/+1
This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>