summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2015-11-04Merge tag 'driver-core-4.4-rc1' of ↵Linus Torvalds3-75/+112
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here's the "big" driver core updates for 4.4-rc1. Primarily a bunch of debugfs updates, with a smattering of minor driver core fixes and updates as well. All have been in linux-next for a long time" * tag 'driver-core-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: debugfs: Add debugfs_create_ulong() of: to support binding numa node to specified device in devicetree debugfs: Add read-only/write-only bool file ops debugfs: Add read-only/write-only size_t file ops debugfs: Add read-only/write-only x64 file ops debugfs: Consolidate file mode checks in debugfs_create_*() Revert "mm: Check if section present during memory block (un)registering" driver-core: platform: Provide helpers for multi-driver modules mm: Check if section present during memory block (un)registering devres: fix a for loop bounds check CMA: fix CONFIG_CMA_SIZE_MBYTES overflow in 64bit base/platform: assert that dev_pm_domain callbacks are called unconditionally sysfs: correctly handle short reads on PREALLOC attrs. base: soc: siplify ida usage kobject: move EXPORT_SYMBOL() macros next to corresponding definitions kobject: explain what kobject's sd field is debugfs: document that debugfs_remove*() accepts NULL and error values debugfs: Pass bool pointer to debugfs_create_bool() ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'
2015-11-04Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
Pull block integrity updates from Jens Axboe: ""This is the joint work of Dan and Martin, cleaning up and improving the support for block data integrity" * 'for-4.4/integrity' of git://git.kernel.dk/linux-block: block, libnvdimm, nvme: provide a built-in blk_integrity nop profile block: blk_flush_integrity() for bio-based drivers block: move blk_integrity to request_queue block: generic request_queue reference counting nvme: suspend i/o during runtime blk_integrity_unregister md: suspend i/o during runtime blk_integrity_unregister md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown block: Inline blk_integrity in struct gendisk block: Export integrity data interval size in sysfs block: Reduce the size of struct blk_integrity block: Consolidate static integrity profile properties block: Move integrity kobject to struct gendisk
2015-11-04Merge branch 'for-4.4/core' of git://git.kernel.dk/linux-blockLinus Torvalds2-11/+21
Pull core block updates from Jens Axboe: "This is the core block pull request for 4.4. I've got a few more topic branches this time around, some of them will layer on top of the core+drivers changes and will come in a separate round. So not a huge chunk of changes in this round. This pull request contains: - Enable blk-mq page allocation tracking with kmemleak, from Catalin. - Unused prototype removal in blk-mq from Christoph. - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two xchg()'s, from Davidlohr. - A plug flush fix from Jeff. - Also from Jeff, a fix that means we don't have to update shared tag sets at init time unless we do a state change. This cuts down boot times on thousands of devices a lot with scsi/blk-mq. - blk-mq waitqueue barrier fix from Kosuke. - Various fixes from Ming: - Fixes for segment merging and splitting, and checks, for the old core and blk-mq. - Potential blk-mq speedup by marking ctx pending at the end of a plug insertion batch in blk-mq. - direct-io no page dirty on kernel direct reads. - A WRITE_SYNC fix for mpage from Roman" * 'for-4.4/core' of git://git.kernel.dk/linux-block: blk-mq: avoid excessive boot delays with large lun counts blktrace: re-write setting q->blk_trace blk-mq: mark ctx as pending at batch in flush plug path blk-mq: fix for trace_block_plug() block: check bio_mergeable() early before merging blk-mq: check bio_mergeable() early before merging block: avoid to merge splitted bio block: setup bi_phys_segments after splitting block: fix plug list flushing for nomerge queues blk-mq: remove unused blk_mq_clone_flush_request prototype blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write block: kmemleak: Track the page allocations for struct request
2015-11-04Merge branch 'for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: "There is only one new feature in this pull for the 4.4 merge window, most of it is small enhancements, cleanup and bug fixes: - Add the s390 backend for the software dirty bit tracking. This adds two new pgtable functions pte_clear_soft_dirty and pmd_clear_soft_dirty which is why there is a hit to arch/x86/include/asm/pgtable.h in this pull request. - A series of cleanup patches for the AP bus, this includes the removal of the support for two outdated crypto cards (PCICC and PCICA). - The irq handling / signaling on buffer full in the runtime instrumentation code is dropped. - Some micro optimizations: remove unnecessary memory barriers for a couple of functions: [smb_]rmb, [smb_]wmb, atomics, bitops, and for spin_unlock. Use the builtin bswap if available and make test_and_set_bit_lock more cache friendly. - Statistics and a tracepoint for the diagnose calls to the hypervisor. - The CPU measurement facility support to sample KVM guests is improved. - The vector instructions are now always enabled for user space processes if the hardware has the vector facility. This simplifies the FPU handling code. The fpu-internal.h header is split into fpu internals, api and types just like x86. - Cleanup and improvements for the common I/O layer. - Rework udelay to solve a problem with kprobe. udelay has busy loop semantics but still uses an idle processor state for the wait" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (66 commits) s390: remove runtime instrumentation interrupts s390/cio: de-duplicate subchannel validation s390/css: unneeded initialization in for_each_subchannel s390/Kconfig: use builtin bswap s390/dasd: fix disconnected device with valid path mask s390/dasd: fix invalid PAV assignment after suspend/resume s390/dasd: fix double free in dasd_eckd_read_conf s390/kernel: fix ptrace peek/poke for floating point registers s390/cio: move ccw_device_stlck functions s390/cio: move ccw_device_call_handler s390/topology: reduce per_cpu() invocations s390/nmi: reduce size of percpu variable s390/nmi: fix terminology s390/nmi: remove casts s390/nmi: remove pointless error strings s390: don't store registers on disabled wait anymore s390: get rid of __set_psw_mask() s390/fpu: split fpu-internal.h into fpu internals, api, and type headers s390/dasd: fix list_del corruption after lcu changes s390/spinlock: remove unneeded serializations at unlock ...
2015-11-03Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU changes from Ingo Molnar: "The main changes in this cycle were: - Improvements to expedited grace periods (Paul E McKenney) - Performance improvements to and locktorture tests for percpu-rwsem (Oleg Nesterov, Paul E McKenney) - Torture-test changes (Paul E McKenney, Davidlohr Bueso) - Documentation updates (Paul E McKenney) - Miscellaneous fixes (Paul E McKenney, Boqun Feng, Oleg Nesterov, Patrick Marlier)" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs() rcu: Better hotplug handling for synchronize_sched_expedited() rcu: Enable stall warnings for synchronize_rcu_expedited() rcu: Add tasks to expedited stall-warning messages rcu: Add online/offline info to expedited stall warning message rcu: Consolidate expedited CPU selection rcu: Prepare for consolidating expedited CPU selection cpu: Remove try_get_online_cpus() rcu: Stop excluding CPU hotplug in synchronize_sched_expedited() rcu: Stop silencing lockdep false positive for expedited grace periods rcu: Switch synchronize_sched_expedited() to IPI locktorture: Fix module unwind when bad torture_type specified torture: Forgive non-plural arguments rcutorture: Fix unused-function warning for torturing_tasks() rcutorture: Fix module unwind when bad torture_type specified rcu_sync: Cleanup the CONFIG_PROVE_RCU checks locking/percpu-rwsem: Clean up the lockdep annotations in percpu_down_read() locking/percpu-rwsem: Fix the comments outdated by rcu_sync locking/percpu-rwsem: Make use of the rcu_sync infrastructure locking/percpu-rwsem: Make percpu_free_rwsem() after kzalloc() safe ...
2015-11-03Merge branch 'core-debug-for-linus' of ↵Linus Torvalds2-8/+17
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull wchan kernel address hiding from Ingo Molnar: "This fixes a wchan related information leak in /proc/PID/stat. There's a bit of an ABI twist to it: instead of setting the wchan field to 0 (which is our usual technique) we set it conditionally to a 0/1 flag to keep ABI compatibility with older procps versions that only fetches /proc/PID/wchan (symbolic names) if the absolute wchan address is nonzero" * 'core-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: fs/proc, core/debug: Don't expose absolute kernel addresses via wchan
2015-11-01mm: get rid of 'vmalloc_info' from /proc/meminfoLinus Torvalds1-5/+2
It turns out that at least some versions of glibc end up reading /proc/meminfo at every single startup, because glibc wants to know the amount of memory the machine has. And while that's arguably insane, it's just how things are. And it turns out that it's not all that expensive most of the time, but the vmalloc information statistics (amount of virtual memory used in the vmalloc space, and the biggest remaining chunk) can be rather expensive to compute. The 'get_vmalloc_info()' function actually showed up on my profiles as 4% of the CPU usage of "make test" in the git source repository, because the git tests are lots of very short-lived shell-scripts etc. It turns out that apparently this same silly vmalloc info gathering shows up on the facebook servers too, according to Dave Jones. So it's not just "make test" for git. We had two patches to just cache the information (one by me, one by Ingo) to mitigate this issue, but the whole vmalloc information of of rather dubious value to begin with, and people who *actually* want to know what the situation is wrt the vmalloc area should just look at the much more complete /proc/vmallocinfo instead. In fact, according to my testing - and perhaps more importantly, according to that big search engine in the sky: Google - there is nothing out there that actually cares about those two expensive fields: VmallocUsed and VmallocChunk. So let's try to just remove them entirely. Actually, this just removes the computation and reports the numbers as zero for now, just to try to be minimally intrusive. If this breaks anything, we'll obviously have to re-introduce the code to compute this all and add the caching patches on top. But if given the option, I'd really prefer to just remove this bad idea entirely rather than add even more code to work around our historical mistake that likely nobody really cares about. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-01Merge branch 'fs-file-descriptor-optimization'Linus Torvalds1-5/+37
Merge file descriptor allocation speedup. Eric Dumazet has a test-case for a fairly common network deamon load pattern: openign and closing a lot of sockets that each have very little work done on them. It turns out that in that case, the cost of just finding the correct file descriptor number can be a dominating factor. We've long had a trivial optimization for allocating file descriptors sequentially, but that optimization ends up being not very effective when other file descriptors are being closed concurrently, and the fd patterns are not some simple FIFO pattern. In such cases we ended up spending a lot of time just scanning the bitmap of open file descriptors in order to find the next file descriptor number to open. This trivial patch-series mitigates that by simply introducing a second-level bitmap of which words in the first bitmap are already fully allocated. That cuts down the cost of scanning by an order of magnitude in some pathological (but realistic) cases. The second patch is an even more trivial patch to avoid unnecessarily dirtying the cacheline for the close-on-exec bit array that normally ends up being all empty. * fs-file-descriptor-optimization: vfs: conditionally clear close-on-exec flag vfs: Fix pathological performance case for __alloc_fd()
2015-10-31vfs: conditionally clear close-on-exec flagLinus Torvalds1-1/+2
We clear the close-on-exec flag when opening and closing files, and the bit was almost always already clear before. Avoid dirtying the cacheline if the clearning isn't necessary. That avoids unnecessary cacheline dirtying and bouncing in multi-socket environments. Eric Dumazet has a file descriptor benchmark that goes 4% faster from this on his two-socket machine. It's probably partly superlinear improvement due to getting slightly less spinlock contention on the file_lock spinlock due to less work in the critical section. Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-31vfs: Fix pathological performance case for __alloc_fd()Linus Torvalds1-4/+35
Al Viro points out that: > > * [Linux-specific aside] our __alloc_fd() can degrade quite badly > > with some use patterns. The cacheline pingpong in the bitmap is probably > > inevitable, unless we accept considerably heavier memory footprint, > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard > > to trigger - close(3);open(...); will have the next open() after that > > scanning the entire in-use bitmap. And Eric Dumazet has a somewhat realistic multithreaded microbenchmark that opens and closes a lot of sockets with minimal work per socket. This patch largely fixes it. We keep a 2nd-level bitmap of the open file bitmaps, showing which words are already full. So then we can traverse that second-level bitmap to efficiently skip already allocated file descriptors. On his benchmark, this improves performance by up to an order of magnitude, by avoiding the excessive open file bitmap scanning. Tested-and-acked-by: Eric Dumazet <edumazet@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-31Merge branch 'overlayfs-linus' of ↵Linus Torvalds3-3/+8
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs bug fixes from Miklos Szeredi: "This contains fixes for bugs that appeared in earlier kernels (all are marked for -stable)" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: free lower_mnt array in ovl_put_super ovl: free stack of paths in ovl_fill_super ovl: fix open in stacked overlay ovl: fix dentry reference leak ovl: use O_LARGEFILE in ovl_copy_up()
2015-10-28fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in ↵Tejun Heo1-2/+2
bdi_split_work_to_wbs() bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue() to walk @bdi->wb_list. To set up the initial iteration condition, it uses list_entry_rcu() to calculate the entry pointer corresponding to the list head; however, this isn't an actual RCU dereference and using list_entry_rcu() for it ended up breaking a proposed list_entry_rcu() change because it was feeding an non-lvalue pointer into the macro. Don't use the RCU variant for simple pointer offsetting. Use list_entry() instead. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Darren Hart <dvhart@linux.intel.com> Cc: David Howells <dhowells@redhat.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Patrick Marlier <patrick.marlier@gmail.com> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: pranith kumar <bobby.prani@gmail.com> Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-24Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds1-11/+24
Pull block layer fixes from Jens Axboe: "A final set of fixes for 4.3. It is (again) bigger than I would have liked, but it's all been through the testing mill and has been carefully reviewed by multiple parties. Each fix is either a regression fix for this cycle, or is marked stable. You can scold me at KS. The pull request contains: - Three simple fixes for NVMe, fixing regressions since 4.3. From Arnd, Christoph, and Keith. - A single xen-blkfront fix from Cathy, fixing a NULL dereference if an error is returned through the staste change callback. - Fixup for some bad/sloppy code in nbd that got introduced earlier in this cycle. From Markus Pargmann. - A blk-mq tagset use-after-free fix from Junichi. - A backing device lifetime fix from Tejun, fixing a crash. - And finally, a set of regression/stable fixes for cgroup writeback from Tejun" * 'for-linus' of git://git.kernel.dk/linux-block: writeback: remove broken rbtree_postorder_for_each_entry_safe() usage in cgwb_bdi_destroy() NVMe: Fix memory leak on retried commands block: don't release bdi while request_queue has live references nvme: use an integer value to Linux errno values blk-mq: fix use-after-free in blk_mq_free_tag_set() nvme: fix 32-bit build warning writeback: fix incorrect calculation of available memory for memcg domains writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions writeback: bdi_writeback iteration must not skip dying ones writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback() writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration nbd: Add locking for tasks xen-blkfront: check for null drvdata in blkback_changed (XenbusStateClosing)
2015-10-24Merge branch 'for-linus-4.3' of ↵Linus Torvalds2-2/+5
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "I have two more small fixes this week: Qu's fix avoids unneeded COW during fallocate, and Christian found a memory leak in the error handling of an earlier fix" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix possible leak in btrfs_ioctl_balance() btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode size
2015-10-23ocfs2/dlm: unlock lockres spinlock before dlm_lockres_putJoseph Qi2-2/+3
dlm_lockres_put will call dlm_lockres_release if it is the last reference, and then it may call dlm_print_one_lock_resource and take lockres spinlock. So unlock lockres spinlock before dlm_lockres_put to avoid deadlock. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-21btrfs: fix possible leak in btrfs_ioctl_balance()Christian Engelmayer1-1/+4
Commit 8eb934591f8b ("btrfs: check unsupported filters in balance arguments") adds a jump to exit label out_bargs in case the argument check fails. At this point in addition to the bargs memory, the memory for struct btrfs_balance_control has already been allocated. Ownership of bctl is passed to btrfs_balance() in the good case, thus the memory is not freed due to the introduced jump. Make sure that the memory gets freed in any case as necessary. Detected by Coverity CID 1328378. Signed-off-by: Christian Engelmayer <cengelma@gmx.at> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21block: Inline blk_integrity in struct gendiskMartin K. Petersen1-1/+1
Up until now the_integrity profile has been dynamically allocated and attached to struct gendisk after the disk has been made active. This causes problems because NVMe devices need to register the profile prior to the partition table being read due to a mandatory metadata buffer requirement. In addition, DM goes through hoops to deal with preallocating, but not initializing integrity profiles. Since the integrity profile is small (4 bytes + a pointer), Christoph suggested moving it to struct gendisk proper. This requires several changes: - Moving the blk_integrity definition to genhd.h. - Inlining blk_integrity in struct gendisk. - Removing the dynamic allocation code. - Adding helper functions which allow gendisk to set up and tear down the integrity sysfs dir when a disk is added/deleted. - Adding a blk_integrity_revalidate() callback for updating the stable pages bdi setting. - The calls that depend on whether a device has an integrity profile or not now key off of the bi->profile pointer. - Simplifying the integrity support routines in DM (Mike Snitzer). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-20btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode sizeQu Wenruo1-1/+1
Current code will always truncate tailing page if its alloc_start is smaller than inode size. For example, the file extent layout is like: 0 4K 8K 16K 32K |<-----Extent A---------------->| |<--Inode size: 18K---------->| But if calling fallocate even for range [0,4K), it will cause btrfs to re-truncate the range [16,32K), causing COW and a new extent. 0 4K 8K 16K 32K |///////| <- Fallocate call range |<-----Extent A-------->|<--B-->| The cause is quite easy, just a careless btrfs_truncate_inode() in a else branch without extra judgment. Fix it by add judgment on whether the fallocate range is beyond isize. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-18debugfs: Add debugfs_create_ulong()Viresh Kumar1-0/+48
Add debugfs_create_ulong() for the users of type 'unsigned long'. These will be 32 bits long on a 32 bit machine and 64 bits long on a 64 bit machine. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-17debugfs: Add read-only/write-only bool file opsStephen Boyd1-1/+14
There aren't any read-only or write-only bool file ops, but there is a caller of debugfs_create_bool() that calls it with mode equal to 0400. This leads to the possibility of userspace modifying the file, so let's use the newly created debugfs_create_mode() helper here to fix this. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-17debugfs: Add read-only/write-only size_t file opsStephen Boyd1-1/+4
There aren't any read-only or write-only size_t file ops, but there is a caller of debugfs_create_size_t() that calls it with mode equal to 0400. This leads to the possibility of userspace modifying the file, so let's use the newly created debugfs_create_mode() helper here to fix this. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-17debugfs: Add read-only/write-only x64 file opsStephen Boyd1-1/+4
There aren't any read-only or write-only x64 file ops, but there is a caller of debugfs_create_x64() that calls it with mode equal to S_IRUGO. This leads to the possibility of userspace modifying the file, so let's use the newly created debugfs_create_mode() helper here to fix this. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-17debugfs: Consolidate file mode checks in debugfs_create_*()Stephen Boyd1-66/+32
The code that creates debugfs file with different file ops based on the file mode is duplicated in each debugfs_create_*() API. Consolidate that code into debugfs_create_mode(), that takes three file ops structures so that we don't have to keep copy/pasting that logic. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-16Merge branch 'for-linus-4.3' of ↵Linus Torvalds3-5/+16
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "I have two more bug fixes for btrfs. My commit fixes a bug we hit last week at FB, a combination of lots of hard links and an admin command to resolve inode numbers. Dave is adding checks to make sure balance on current kernels ignores filters it doesn't understand. The penalty for being wrong is just doing more work (not crashing etc), but it's a good fix" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix use after free iterating extrefs btrfs: check unsupported filters in balance arguments
2015-10-16Merge branch 'akpm' (patches from Andrew)Linus Torvalds5-54/+46
Merge misc fixes from Andrew Morton: "6 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: sh: add copy_user_page() alias for __copy_user() lib/Kconfig: ZLIB_DEFLATE must select BITREVERSE mm, dax: fix DAX deadlocks memcg: convert threshold to bytes builddeb: remove debian/files before build mm, fs: obey gfp_mapping for add_to_page_cache()
2015-10-16mm, dax: fix DAX deadlocksRoss Zwisler1-41/+29
The following two locking commits in the DAX code: commit 843172978bb9 ("dax: fix race between simultaneous faults") commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX") introduced a number of deadlocks and other issues which need to be fixed for the v4.3 kernel. The list of issues in DAX after these commits (some newly introduced by the commits, some preexisting) can be found here: https://lkml.org/lkml/2015/9/25/602 (Subject: "Re: [PATCH] dax: fix deadlock in __dax_fault"). This undoes most of the changes introduced by those two commits, essentially returning us to the DAX locking scheme that was used in v4.2. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Tested-by: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-16mm, fs: obey gfp_mapping for add_to_page_cache()Michal Hocko4-13/+17
Commit 6afdb859b710 ("mm: do not ignore mapping_gfp_mask in page cache allocation paths") has caught some users of hardcoded GFP_KERNEL used in the page cache allocation paths. This, however, wasn't complete and there were others which went unnoticed. Dave Chinner has reported the following deadlock for xfs on loop device: : With the recent merge of the loop device changes, I'm now seeing : XFS deadlock on my single CPU, 1GB RAM VM running xfs/073. : : The deadlocked is as follows: : : kloopd1: loop_queue_read_work : xfs_file_iter_read : lock XFS inode XFS_IOLOCK_SHARED (on image file) : page cache read (GFP_KERNEL) : radix tree alloc : memory reclaim : reclaim XFS inodes : log force to unpin inodes : <wait for log IO completion> : : xfs-cil/loop1: <does log force IO work> : xlog_cil_push : xlog_write : <loop issuing log writes> : xlog_state_get_iclog_space() : <blocks due to all log buffers under write io> : <waits for IO completion> : : kloopd1: loop_queue_write_work : xfs_file_write_iter : lock XFS inode XFS_IOLOCK_EXCL (on image file) : <wait for inode to be unlocked> : : i.e. the kloopd, with it's split read and write work queues, has : introduced a dependency through memory reclaim. i.e. that writes : need to be able to progress for reads make progress. : : The problem, fundamentally, is that mpage_readpages() does a : GFP_KERNEL allocation, rather than paying attention to the inode's : mapping gfp mask, which is set to GFP_NOFS. : : The didn't used to happen, because the loop device used to issue : reads through the splice path and that does: : : error = add_to_page_cache_lru(page, mapping, index, : GFP_KERNEL & mapping_gfp_mask(mapping)); This has changed by commit aa4d86163e4 ("block: loop: switch to VFS ITER_BVEC"). This patch changes mpage_readpage{s} to follow gfp mask set for the mapping. There are, however, other places which are doing basically the same. lustre:ll_dir_filler is doing GFP_KERNEL from the function which apparently uses GFP_NOFS for other allocations so let's make this consistent. cifs:readpages_get_pages is called from cifs_readpages and __cifs_readpages_from_fscache called from the same path obeys mapping gfp. ramfs_nommu_expand_for_mapping is hardcoding GFP_KERNEL as well regardless it uses mapping_gfp_mask for the page allocation. ext4_mpage_readpages is the called from the page cache allocation path same as read_pages and read_cache_pages As I've noticed in my previous post I cannot say I would be happy about sprinkling mapping_gfp_mask all over the place and it sounds like we should drop gfp_mask argument altogether and use it internally in __add_to_page_cache_locked that would require all the filesystems to use mapping gfp consistently which I am not sure is the case here. From a quick glance it seems that some file system use it all the time while others are selective. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Dave Chinner <david@fromorbit.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Ming Lei <ming.lei@canonical.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-15Merge branch 'for_linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext4 Kconfig description fixup from Jan Kara: "A small fixup in description of EXT4_USE_FOR_EXT2 config option" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext4: Update EXT4_USE_FOR_EXT2 description
2015-10-14mm: add architecture primitives for software dirty bit clearingMartin Schwidefsky1-2/+2
There are primitives to create and query the software dirty bits in a pte or pmd. But the clearing of the software dirty bits is done in common code with x86 specific page table functions. Add the missing architecture primitives to clear the software dirty bits to allow the feature to be used on non-x86 systems, e.g. the s390 architecture. Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-10-13btrfs: fix use after free iterating extrefsChris Mason1-5/+3
The code for btrfs inode-resolve has never worked properly for files with enough hard links to trigger extrefs. It was trying to get the leaf out of a path after freeing the path: btrfs_release_path(path); leaf = path->nodes[0]; item_size = btrfs_item_size_nr(leaf, slot); The fix here is to use the extent buffer we cloned just a little higher up to avoid deadlocks caused by using the leaf in the path. Signed-off-by: Chris Mason <clm@fb.com> cc: stable@vger.kernel.org # v3.7+ cc: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de>
2015-10-13btrfs: check unsupported filters in balance argumentsDavid Sterba2-0/+13
We don't verify that all the balance filter arguments supplemented by the flags are actually known to the kernel. Thus we let it silently pass and do nothing. At the moment this means only the 'limit' filter, but we're going to add a few more soon so it's better to have that fixed. Also in older stable kernels so that it works with newer userspace tools. Cc: stable@vger.kernel.org # 3.16+ Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-13Merge tag 'nfsd-4.3-2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds1-8/+0
Pull nfsd fixes from Bruce Fields: "Two nfsd fixes, one for an RDMA crash, one for a pnfs/block protocol bug" * tag 'nfsd-4.3-2' of git://linux-nfs.org/~bfields/linux: svcrdma: Fix NFS server crash triggered by 1MB NFS WRITE nfsd/blocklayout: accept any minlength
2015-10-12writeback: bdi_writeback iteration must not skip dying onesTejun Heo1-9/+22
bdi_for_each_wb() is used in several places to wake up or issue writeback work items to all wb's (bdi_writeback's) on a given bdi. The iteration is performed by walking bdi->cgwb_tree; however, the tree only indexes wb's which are currently active. For example, when a memcg gets associated with a different blkcg, the old wb is removed from the tree so that the new one can be indexed. The old wb starts dying from then on but will linger till all its inodes are drained. As these dying wb's may still host dirty inodes, writeback operations which affect all wb's must include them. bdi_for_each_wb() skipping dying wb's led to sync(2) missing and failing to sync the inodes belonging to those wb's. This patch adds a RCU protected @bdi->wb_list which lists all wb's beloinging to that bdi. wb's are added on creation and removed on release rather than on the start of destruction. bdi_for_each_wb() usages are replaced with list_for_each[_continue]_rcu() iterations over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed. v2: Updated as per Jan. last_wb ref leak in bdi_split_work_to_wbs() fixed and unnecessary list head severing in cgwb_bdi_destroy() removed. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com> Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()") Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-12writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback()Tejun Heo1-2/+2
wakeup_dirtytime_writeback() walks and wakes up all wb's of all bdi's; unfortunately, it was always waking up bdi->wb instead of the wb being walked. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 001fe6f617b1 ("writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's") Reviewed-by: Jan Kara <jack@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-12ovl: free lower_mnt array in ovl_put_superKonstantin Khlebnikov1-0/+1
This fixes memory leak after umount. Kmemleak report: unreferenced object 0xffff8800ba791010 (size 8): comm "mount", pid 2394, jiffies 4294996294 (age 53.920s) hex dump (first 8 bytes): 20 1c 13 02 00 88 ff ff ....... backtrace: [<ffffffff811f8cd4>] create_object+0x124/0x2c0 [<ffffffff817a059b>] kmemleak_alloc+0x7b/0xc0 [<ffffffff811dffe6>] __kmalloc+0x106/0x340 [<ffffffffa0152bfc>] ovl_fill_super+0x55c/0x9b0 [overlay] [<ffffffff81200ac4>] mount_nodev+0x54/0xa0 [<ffffffffa0152118>] ovl_mount+0x18/0x20 [overlay] [<ffffffff81201ab3>] mount_fs+0x43/0x170 [<ffffffff81220d34>] vfs_kern_mount+0x74/0x170 [<ffffffff812233ad>] do_mount+0x22d/0xdf0 [<ffffffff812242cb>] SyS_mount+0x7b/0xc0 [<ffffffff817b6bee>] entry_SYSCALL_64_fastpath+0x12/0x76 [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Fixes: dd662667e6d3 ("ovl: add mutli-layer infrastructure") Cc: <stable@vger.kernel.org> # v4.0+
2015-10-12ovl: free stack of paths in ovl_fill_superKonstantin Khlebnikov1-0/+1
This fixes small memory leak after mount. Kmemleak report: unreferenced object 0xffff88003683fe00 (size 16): comm "mount", pid 2029, jiffies 4294909563 (age 33.380s) hex dump (first 16 bytes): 20 27 1f bb 00 88 ff ff 40 4b 0f 36 02 88 ff ff '......@K.6.... backtrace: [<ffffffff811f8cd4>] create_object+0x124/0x2c0 [<ffffffff817a059b>] kmemleak_alloc+0x7b/0xc0 [<ffffffff811dffe6>] __kmalloc+0x106/0x340 [<ffffffffa01b7a29>] ovl_fill_super+0x389/0x9a0 [overlay] [<ffffffff81200ac4>] mount_nodev+0x54/0xa0 [<ffffffffa01b7118>] ovl_mount+0x18/0x20 [overlay] [<ffffffff81201ab3>] mount_fs+0x43/0x170 [<ffffffff81220d34>] vfs_kern_mount+0x74/0x170 [<ffffffff812233ad>] do_mount+0x22d/0xdf0 [<ffffffff812242cb>] SyS_mount+0x7b/0xc0 [<ffffffff817b6bee>] entry_SYSCALL_64_fastpath+0x12/0x76 [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Fixes: a78d9f0d5d5c ("ovl: support multiple lower layers") Cc: <stable@vger.kernel.org> # v4.0+
2015-10-12ovl: fix open in stacked overlayMiklos Szeredi1-0/+3
If two overlayfs filesystems are stacked on top of each other, then we need recursion in ovl_d_select_inode(). I guess d_backing_inode() is supposed to do that. But currently it doesn't and that functionality is open coded in vfs_open(). This is now copied into ovl_d_select_inode() to fix this regression. Reported-by: Alban Crequy <alban.crequy@gmail.com> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay...") Cc: David Howells <dhowells@redhat.com> Cc: <stable@vger.kernel.org> # v4.2+
2015-10-12ovl: fix dentry reference leakDavid Howells1-1/+1
In ovl_copy_up_locked(), newdentry is leaked if the function exits through out_cleanup as this just to out after calling ovl_cleanup() - which doesn't actually release the ref on newdentry. The out_cleanup segment should instead exit through out2 as certainly newdentry leaks - and possibly upper does also, though this isn't caught given the catch of newdentry. Without this fix, something like the following is seen: BUG: Dentry ffff880023e9eb20{i=f861,n=#ffff880023e82d90} still in use (1) [unmount of tmpfs tmpfs] BUG: Dentry ffff880023ece640{i=0,n=bigfile} still in use (1) [unmount of tmpfs tmpfs] when unmounting the upper layer after an error occurred in copyup. An error can be induced by creating a big file in a lower layer with something like: dd if=/dev/zero of=/lower/a/bigfile bs=65536 count=1 seek=$((0xf000)) to create a large file (4.1G). Overlay an upper layer that is too small (on tmpfs might do) and then induce a copy up by opening it writably. Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Cc: <stable@vger.kernel.org> # v3.18+
2015-10-12ovl: use O_LARGEFILE in ovl_copy_up()David Howells1-2/+2
Open the lower file with O_LARGEFILE in ovl_copy_up(). Pass O_LARGEFILE unconditionally in ovl_copy_up_data() as it's purely for catching 32-bit userspace dealing with a file large enough that it'll be mishandled if the application isn't aware that there might be an integer overflow. Inside the kernel, there shouldn't be any problems. Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Cc: <stable@vger.kernel.org> # v3.18+
2015-10-10namei: results of d_is_negative() should be checked after dentry revalidationTrond Myklebust1-2/+6
Leandro Awa writes: "After switching to version 4.1.6, our parallelized and distributed workflows now fail consistently with errors of the form: T34: ./regex.c:39:22: error: config.h: No such file or directory From our 'git bisect' testing, the following commit appears to be the possible cause of the behavior we've been seeing: commit 766c4cbfacd8" Al Viro says: "What happens is that 766c4cbfacd8 got the things subtly wrong. We used to treat d_is_negative() after lookup_fast() as "fall with ENOENT". That was wrong - checking ->d_flags outside of ->d_seq protection is unreliable and failing with hard error on what should've fallen back to non-RCU pathname resolution is a bug. Unfortunately, we'd pulled the test too far up and ran afoul of another kind of staleness. The dentry might have been absolutely stable from the RCU point of view (and we might be on UP, etc), but stale from the remote fs point of view. If ->d_revalidate() returns "it's actually stale", dentry gets thrown away and the original code wouldn't even have looked at its ->d_flags. What we need is to check ->d_flags where 766c4cbfacd8 does (prior to ->d_seq validation) but only use the result in cases where we do not discard this dentry outright" Reported-by: Leandro Awa <lawa@nvidia.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=104911 Fixes: 766c4cbfacd8 ("namei: d_is_negative() should be checked...") Tested-by: Leandro Awa <lawa@nvidia.com> Cc: stable@vger.kernel.org # v4.1+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-09Merge branch 'for-linus-4.3' of ↵Linus Torvalds7-17/+35
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "These are small and assorted. Neil's is the oldest, I dropped the ball thinking he was going to send it in" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: support NFSv2 export Btrfs: open_ctree: Fix possible memory leak Btrfs: fix deadlock when finalizing block group creation Btrfs: update fix for read corruption of compressed and shared extents Btrfs: send, fix corner case for reference overwrite detection
2015-10-09nfsd/blocklayout: accept any minlengthChristoph Hellwig1-8/+0
Recent Linux clients have started to send GETLAYOUT requests with minlength less than blocksize. Servers aren't really allowed to impose this kind of restriction on layouts; see RFC 5661 section 18.43.3 for details. This has been observed to cause indefinite hangs on fsx runs on some clients. Cc: stable@vger.kernel.org Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-09Merge tag 'v4.3-rc4' into for-4.4/coreJens Axboe28-157/+420
Linux 4.3-rc4 Pulling in v4.3-rc4 to avoid conflicts with NVMe fixes that have gone in since for-4.4/core was based.
2015-10-07Merge tag 'nfs-for-4.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds4-11/+23
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Bugfixes: - Fix a use-after-free bug in the RPC/RDMA client - Fix a write performance regression - Fix up page writeback accounting - Don't try to reclaim unused state owners - Fix a NFSv4 nograce recovery hang - reset states to use open_stateid when returning delegation voluntarily - Fix a tracepoint NULL-pointer dereference" * tag 'nfs-for-4.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Fix a tracepoint NULL-pointer dereference nfs4: reset states to use open_stateid when returning delegation voluntarily NFSv4: Fix a nograce recovery hang NFSv4.1: nfs4_opendata_check_deleg needs to handle NFS4_OPEN_CLAIM_DELEG_CUR_FH NFSv4: Don't try to reclaim unused state owners NFS: Fix a write performance regression NFS: Fix up page writeback accounting xprtrdma: disconnect and flush cqs before freeing buffers
2015-10-06NFS: Fix a tracepoint NULL-pointer dereferenceAnna Schumaker1-1/+1
Running xfstest generic/013 with the tracepoint nfs:nfs4_open_file enabled produces a NULL-pointer dereference when calculating fileid and filehandle of the opened file. Fix this by checking if state is NULL before trying to use the inode pointer. Reported-by: Olga Kornievskaia <aglo@umich.edu> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-10-06BTRFS: support NFSv2 exportNeilBrown1-5/+5
The "fh_len" passed to ->fh_to_* is not guaranteed to be that same as that returned by encode_fh - it may be larger. With NFSv2, the filehandle is fixed length, so it may appear longer than expected and be zero-padded. So we must test that fh_len is at least some value, not exactly equal to it. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: David Sterba <dsterba@suse.cz>
2015-10-06Btrfs: open_ctree: Fix possible memory leakchandan1-0/+4
After reading one of chunk or tree root tree's root node from disk, if the root node does not have EXTENT_BUFFER_UPTODATE flag set, we fail to release the memory used by the root node. Fix this. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
2015-10-06Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds3-36/+2
Pull CIFS fixes from Steve French: "Two fixes for problems pointed out by automated tools. Thanks PaX/grsecurity team and Dan Carpenter (and the Smatch tool)" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: [CIFS] Update cifs version number [SMB3] Do not fall back to SMBWriteX in set_file_size error cases [SMB3] Missing null tcon check
2015-10-05Btrfs: fix deadlock when finalizing block group creationFilipe Manana3-1/+10
Josef ran into a deadlock while a transaction handle was finalizing the creation of its block groups, which produced the following trace: [260445.593112] fio D ffff88022a9df468 0 8924 4518 0x00000084 [260445.593119] ffff88022a9df468 ffffffff81c134c0 ffff880429693c00 ffff88022a9df488 [260445.593126] ffff88022a9e0000 ffff8803490d7b00 ffff8803490d7b18 ffff88022a9df4b0 [260445.593132] ffff8803490d7af8 ffff88022a9df488 ffffffff8175a437 ffff8803490d7b00 [260445.593137] Call Trace: [260445.593145] [<ffffffff8175a437>] schedule+0x37/0x80 [260445.593189] [<ffffffffa0850f37>] btrfs_tree_lock+0xa7/0x1f0 [btrfs] [260445.593197] [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0 [260445.593225] [<ffffffffa07eac44>] btrfs_lock_root_node+0x34/0x50 [btrfs] [260445.593253] [<ffffffffa07eff6b>] btrfs_search_slot+0x88b/0xa00 [btrfs] [260445.593295] [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs] [260445.593324] [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs] [260445.593351] [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [260445.593394] [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs] [260445.593427] [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs] [260445.593459] [<ffffffffa0800964>] do_chunk_alloc+0x2a4/0x2e0 [btrfs] [260445.593491] [<ffffffffa0803815>] find_free_extent+0xa55/0xd90 [btrfs] [260445.593524] [<ffffffffa0803c22>] btrfs_reserve_extent+0xd2/0x220 [btrfs] [260445.593532] [<ffffffff8119fe5d>] ? account_page_dirtied+0xdd/0x170 [260445.593564] [<ffffffffa0803e78>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs] [260445.593597] [<ffffffffa080c9de>] ? btree_set_page_dirty+0xe/0x10 [btrfs] [260445.593626] [<ffffffffa07eb5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs] [260445.593654] [<ffffffffa07ebbff>] btrfs_cow_block+0x11f/0x1c0 [btrfs] [260445.593682] [<ffffffffa07ef8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs] [260445.593724] [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs] [260445.593752] [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs] [260445.593830] [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [260445.593905] [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs] [260445.593946] [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs] [260445.593990] [<ffffffffa0815798>] btrfs_commit_transaction+0xa8/0xb40 [btrfs] [260445.594042] [<ffffffffa085abcd>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs] [260445.594089] [<ffffffffa082bc84>] btrfs_sync_file+0x294/0x350 [btrfs] [260445.594115] [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0 [260445.594133] [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180 [260445.594149] [<ffffffff8123e35d>] do_fsync+0x3d/0x70 [260445.594169] [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110 [260445.594187] [<ffffffff8123e600>] SyS_fsync+0x10/0x20 [260445.594204] [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71 This happened because the same transaction handle created a large number of block groups and while finalizing their creation (inserting new items and updating existing items in the chunk and device trees) a new metadata extent had to be allocated and no free space was found in the current metadata block groups, which made find_free_extent() attempt to allocate a new block group via do_chunk_alloc(). However at do_chunk_alloc() we ended up allocating a new system chunk too and exceeded the threshold of 2Mb of reserved chunk bytes, which makes do_chunk_alloc() enter the final part of block group creation again (at btrfs_create_pending_block_groups()) and attempt to lock again the root of the chunk tree when it's already write locked by the same task. Similarly we can deadlock on extent tree nodes/leafs if while we are running delayed references we end up creating a new metadata block group in order to allocate a new node/leaf for the extent tree (as part of a CoW operation or growing the tree), as btrfs_create_pending_block_groups inserts items into the extent tree as well. In this case we get the following trace: [14242.773581] fio D ffff880428ca3418 0 3615 3100 0x00000084 [14242.773588] ffff880428ca3418 ffff88042d66b000 ffff88042a03c800 ffff880428ca3438 [14242.773594] ffff880428ca4000 ffff8803e4b20190 ffff8803e4b201a8 ffff880428ca3460 [14242.773600] ffff8803e4b20188 ffff880428ca3438 ffffffff8175a437 ffff8803e4b20190 [14242.773606] Call Trace: [14242.773613] [<ffffffff8175a437>] schedule+0x37/0x80 [14242.773656] [<ffffffffa057ff07>] btrfs_tree_lock+0xa7/0x1f0 [btrfs] [14242.773664] [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0 [14242.773692] [<ffffffffa0519c44>] btrfs_lock_root_node+0x34/0x50 [btrfs] [14242.773720] [<ffffffffa051ef6b>] btrfs_search_slot+0x88b/0xa00 [btrfs] [14242.773750] [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs] [14242.773758] [<ffffffff811ef4a2>] ? kmem_cache_alloc+0x1d2/0x200 [14242.773786] [<ffffffffa0520ad1>] btrfs_insert_item+0x71/0xf0 [btrfs] [14242.773818] [<ffffffffa052f292>] btrfs_create_pending_block_groups+0x102/0x200 [btrfs] [14242.773850] [<ffffffffa052f96e>] do_chunk_alloc+0x2ae/0x2f0 [btrfs] [14242.773934] [<ffffffffa0532825>] find_free_extent+0xa55/0xd90 [btrfs] [14242.773998] [<ffffffffa0532c22>] btrfs_reserve_extent+0xc2/0x1d0 [btrfs] [14242.774041] [<ffffffffa0532e38>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs] [14242.774078] [<ffffffffa051a5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs] [14242.774118] [<ffffffffa051abff>] btrfs_cow_block+0x11f/0x1c0 [btrfs] [14242.774155] [<ffffffffa051e8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs] [14242.774194] [<ffffffffa0528021>] ? __btrfs_free_extent.isra.70+0x2e1/0xcb0 [btrfs] [14242.774235] [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs] [14242.774274] [<ffffffffa051994a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [14242.774318] [<ffffffffa052c433>] __btrfs_run_delayed_refs+0xbb3/0x1020 [btrfs] [14242.774358] [<ffffffffa052f404>] btrfs_run_delayed_refs.part.78+0x74/0x280 [btrfs] [14242.774391] [<ffffffffa052f627>] btrfs_run_delayed_refs+0x17/0x20 [btrfs] [14242.774432] [<ffffffffa05be236>] commit_cowonly_roots+0x8d/0x2bd [btrfs] [14242.774474] [<ffffffffa059d07f>] ? __btrfs_run_delayed_items+0x1cf/0x210 [btrfs] [14242.774516] [<ffffffffa05adac3>] ? btrfs_qgroup_account_extents+0x83/0x130 [btrfs] [14242.774558] [<ffffffffa0544c40>] btrfs_commit_transaction+0x590/0xb40 [btrfs] [14242.774599] [<ffffffffa0589b9d>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs] [14242.774642] [<ffffffffa055ac54>] btrfs_sync_file+0x294/0x350 [btrfs] [14242.774650] [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0 [14242.774657] [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180 [14242.774663] [<ffffffff8123e35d>] do_fsync+0x3d/0x70 [14242.774669] [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110 [14242.774675] [<ffffffff8123e600>] SyS_fsync+0x10/0x20 [14242.774681] [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71 Fix this by never recursing into the finalization phase of block group creation and making sure we never trigger the finalization of block group creation while running delayed references. Reported-by: Josef Bacik <jbacik@fb.com> Fixes: 00d80e342c0f ("Btrfs: fix quick exhaustion of the system array in the superblock") Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-05Btrfs: update fix for read corruption of compressed and shared extentsFilipe Manana1-8/+11
My previous fix in commit 005efedf2c7d ("Btrfs: fix read corruption of compressed and shared extents") was effective only if the compressed extents cover a file range with a length that is not a multiple of 16 pages. That's because the detection of when we reached a different range of the file that shares the same compressed extent as the previously processed range was done at extent_io.c:__do_contiguous_readpages(), which covers subranges with a length up to 16 pages, because extent_readpages() groups the pages in clusters no larger than 16 pages. So fix this by tracking the start of the previously processed file range's extent map at extent_readpages(). The following test case for fstests reproduces the issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner rm -f $seqres.full test_clone_and_read_compressed_extent() { local mount_opts=$1 _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount $mount_opts # Create our test file with a single extent of 64Kb that is going to # be compressed no matter which compression algo is used (zlib/lzo). $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Now clone the compressed extent into an adjacent file offset. $CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \ $SCRATCH_MNT/foo $SCRATCH_MNT/foo echo "File digest before unmount:" md5sum $SCRATCH_MNT/foo | _filter_scratch # Remount the fs or clear the page cache to trigger the bug in # btrfs. Because the extent has an uncompressed length that is a # multiple of 16 pages, all the pages belonging to the second range # of the file (64K to 128K), which points to the same extent as the # first range (0K to 64K), had their contents full of zeroes instead # of the byte 0xaa. This was a bug exclusively in the read path of # compressed extents, the correct data was stored on disk, btrfs # just failed to fill in the pages correctly. _scratch_remount echo "File digest after remount:" # Must match the digest we got before. md5sum $SCRATCH_MNT/foo | _filter_scratch } echo -e "\nTesting with zlib compression..." test_clone_and_read_compressed_extent "-o compress=zlib" _scratch_unmount echo -e "\nTesting with lzo compression..." test_clone_and_read_compressed_extent "-o compress=lzo" status=0 exit Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: Timofey Titovets <nefelim4ag@gmail.com>