summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2020-05-09hfs: stop using ioctl_by_bdevChristoph Hellwig1-13/+19
Instead just call the CDROM layer functionality directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09bdi: remove the name field in struct backing_dev_infoChristoph Hellwig2-3/+1
The name is only printed for a not registered bdi in writeback. Use the device name there as is more useful anyway for the unlike case that the warning triggers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09bdi: simplify bdi_allocChristoph Hellwig1-1/+1
Merge the _node vs normal version and drop the superflous gfp_t argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09Merge branch 'block-5.7' into for-5.8/blockJens Axboe4-4/+13
Pull in block-5.7 fixes for 5.8. Mostly to resolve a conflict with the blk-iocost changes, but we also need the base of the bdi use-after-free as well as we build on top of it. * block-5.7: nvme: fix possible hang when ns scanning fails during error recovery nvme-pci: fix "slimmer CQ head update" bdi: add a ->dev_name field to struct backing_dev_info bdi: use bdi_dev_name() to get device name bdi: move bdi_dev_name out of line vboxsf: don't use the source name in the bdi name iocost: protect iocg->abs_vdebt with iocg->waitq.lock block: remove the bd_openers checks in blk_drop_partitions nvme: prevent double free in nvme_alloc_ns() error handling null_blk: Cleanup zoned device initialization null_blk: Fix zoned command handling block: remove unused header blk-iocost: Fix error on iocost_ioc_vrate_adj bdev: Reduce time holding bd_mutex in sync in blkdev_close() buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09bdi: use bdi_dev_name() to get device nameYufen Yu1-1/+1
Use the common interface bdi_dev_name() to get device name. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Add missing <linux/backing-dev.h> include BFQ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-07vboxsf: don't use the source name in the bdi nameChristoph Hellwig1-1/+1
Simplify the bdi name to mirror what we are doing elsewhere, and drop them name in favor of just using a number. This avoids a potentially very long bdi name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-04udf: stop using ioctl_by_bdevChristoph Hellwig1-16/+13
Instead just call the CDROM layer functionality directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-04isofs: stop using ioctl_by_bdevChristoph Hellwig1-28/+26
Instead just call the CDROM layer functionality directly, and turn the hot mess in isofs_get_last_session into remotely readable code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-04hfsplus: stop using ioctl_by_bdevChristoph Hellwig1-15/+18
Instead just call the CDROM layer functionality directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-24block: unexport bdev_read_page and bdev_write_pageChristoph Hellwig1-2/+0
Each one just has two callers, both in always built-in code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-21block: remove unused headerMa, Jianpeng1-1/+0
Dax related code already removed from this file. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-20block: fold bdev_unhash_inode into invalidate_partitionChristoph Hellwig1-15/+0
invalidate_partition and bdev_unhash_inode are always paired, and invalidate_partition already does an icache lookup for the block device inode. Piggy back on that to remove the inode from the hash. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-20block: remove the disk argument from blk_drop_partitionsChristoph Hellwig1-1/+1
The gendisk can be trivially deducted from the block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-20bdev: Reduce time holding bd_mutex in sync in blkdev_close()Douglas Anderson1-0/+10
While trying to "dd" to the block device for a USB stick, I encountered a hung task warning (blocked for > 120 seconds). I managed to come up with an easy way to reproduce this on my system (where /dev/sdb is the block device for my USB stick) with: while true; do dd if=/dev/zero of=/dev/sdb bs=4M; done With my reproduction here are the relevant bits from the hung task detector: INFO: task udevd:294 blocked for more than 122 seconds. ... udevd D 0 294 1 0x00400008 Call trace: ... mutex_lock_nested+0x40/0x50 __blkdev_get+0x7c/0x3d4 blkdev_get+0x118/0x138 blkdev_open+0x94/0xa8 do_dentry_open+0x268/0x3a0 vfs_open+0x34/0x40 path_openat+0x39c/0xdf4 do_filp_open+0x90/0x10c do_sys_open+0x150/0x3c8 ... ... Showing all locks held in the system: ... 1 lock held by dd/2798: #0: ffffff814ac1a3b8 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x50/0x204 ... dd D 0 2798 2764 0x00400208 Call trace: ... schedule+0x8c/0xbc io_schedule+0x1c/0x40 wait_on_page_bit_common+0x238/0x338 __lock_page+0x5c/0x68 write_cache_pages+0x194/0x500 generic_writepages+0x64/0xa4 blkdev_writepages+0x24/0x30 do_writepages+0x48/0xa8 __filemap_fdatawrite_range+0xac/0xd8 filemap_write_and_wait+0x30/0x84 __blkdev_put+0x88/0x204 blkdev_put+0xc4/0xe4 blkdev_close+0x28/0x38 __fput+0xe0/0x238 ____fput+0x1c/0x28 task_work_run+0xb0/0xe4 do_notify_resume+0xfc0/0x14bc work_pending+0x8/0x14 The problem appears related to the fact that my USB disk is terribly slow and that I have a lot of RAM in my system to cache things. Specifically my writes seem to be happening at ~15 MB/s and I've got ~4 GB of RAM in my system that can be used for buffering. To write 4 GB of buffer to disk thus takes ~4000 MB / ~15 MB/s = ~267 seconds. The 267 second number is a problem because in __blkdev_put() we call sync_blockdev() while holding the bd_mutex. Any other callers who want the bd_mutex will be blocked for the whole time. The problem is made worse because I believe blkdev_put() specifically tells other tasks (namely udev) to go try to access the device at right around the same time we're going to hold the mutex for a long time. Putting some traces around this (after disabling the hung task detector), I could confirm: dd: 437.608600: __blkdev_put() right before sync_blockdev() for sdb udevd: 437.623901: blkdev_open() right before blkdev_get() for sdb dd: 661.468451: __blkdev_put() right after sync_blockdev() for sdb udevd: 663.820426: blkdev_open() right after blkdev_get() for sdb A simple fix for this is to realize that sync_blockdev() works fine if you're not holding the mutex. Also, it's not the end of the world if you sync a little early (though it can have performance impacts). Thus we can make a guess that we're going to need to do the sync and then do it without holding the mutex. We still do one last sync with the mutex but it should be much, much faster. With this, my hung task warnings for my test case are gone. Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Guenter Roeck <groeck@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-19Merge tag 'timers-urgent-2020-04-19' of ↵Linus Torvalds1-1/+13
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull time namespace fix from Thomas Gleixner: "An update for the proc interface of time namespaces: Use symbolic names instead of clockid numbers. The usability nuisance of numbers was noticed by Michael when polishing the man page" * tag 'timers-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: proc, time/namespace: Show clock symbolic names in /proc/pid/timens_offsets
2020-04-19Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds8-18/+26
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Miscellaneous bug fixes and cleanups for ext4, including a fix for generic/388 in data=journal mode, removing some BUG_ON's, and cleaning up some compiler warnings" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: convert BUG_ON's to WARN_ON's in mballoc.c ext4: increase wait time needed before reuse of deleted inode numbers ext4: remove set but not used variable 'es' in ext4_jbd2.c ext4: remove set but not used variable 'es' ext4: do not zeroout extents beyond i_disksize ext4: fix return-value types in several function comments ext4: use non-movable memory for superblock readahead ext4: use matching invalidatepage in ext4_writepage
2020-04-19Merge tag '5.7-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds4-3/+22
Pull cifs fixes from Steve French: "Three small smb3 fixes: two debug related (helping network tracing for SMB2 mounts, and the other removing an unintended debug line on signing failures), and one fixing a performance problem with 64K pages" * tag '5.7-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb3: remove overly noisy debug line in signing errors cifs: improve read performance for page size 64KB & cache=strict & vers=2.1+ cifs: dump the session id and keys also for SMB2 sessions
2020-04-18Merge tag 'xfs-5.7-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds5-20/+42
Pull xfs fixes from Darrick Wong: "The three commits here fix some livelocks and other clashes with fsfreeze, a potential corruption problem, and a minor race between processes freeing and allocating space when the filesystem is near ENOSPC. Summary: - Fix a partially uninitialized variable. - Teach the background gc threads to apply for fsfreeze protection. - Fix some scaling problems when multiple threads try to flush the filesystem when we're about to hit ENOSPC" * tag 'xfs-5.7-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: move inode flush to the sync workqueue xfs: fix partially uninitialized structure in xfs_reflink_remap_extent xfs: acquire superblock freeze protection on eofblocks scans
2020-04-17buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.Zhiqiang Liu1-1/+1
free_more_memory func has been completely removed in commit bc48f001de12 ("buffer: eliminate the need to call free_more_memory() in __getblk_slow()") So comment and `WB_REASON_FREE_MORE_MEM` reason about free_more_memory are no longer needed. Fixes: bc48f001de12 ("buffer: eliminate the need to call free_more_memory() in __getblk_slow()") Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-17Merge branch 'for-linus' of ↵Linus Torvalds1-0/+7
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull proc fix from Eric Biederman: "While running syzbot happened to spot one more oversight in my rework of proc_flush_task. The fields proc_self and proc_thread_self were not being reinitialized when proc was unmounted, which could cause problems if the mount of proc fails" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: Handle umounts cleanly
2020-04-17Merge tag 'io_uring-5.7-2020-04-17' of git://git.kernel.dk/linux-blockLinus Torvalds1-139/+162
Pull io_uring fixes from Jens Axboe: - wrap up the init/setup cleanup (Pavel) - fix some issues around deferral sequences (Pavel) - fix splice punt check using the wrong struct file member - apply poll re-arm logic for pollable retry too - pollable retry should honor cancelation - fix setup time error handling syzbot reported crash - restore work state when poll is canceled * tag 'io_uring-5.7-2020-04-17' of git://git.kernel.dk/linux-block: io_uring: don't count rqs failed after current one io_uring: kill already cached timeout.seq_offset io_uring: fix cached_sq_head in io_timeout() io_uring: only post events in io_poll_remove_all() if we completed some io_uring: io_async_task_func() should check and honor cancelation io_uring: check for need to re-wait in polled async handling io_uring: correct O_NONBLOCK check for splice punt io_uring: restore req->work when canceling poll request io_uring: move all request init code in one place io_uring: keep all sqe->flags in req->flags io_uring: early submission req fail code io_uring: track mm through current->mm io_uring: remove obsolete @mm_fault
2020-04-17Merge tag 'for-5.7-rc1-tag' of ↵Linus Torvalds1-2/+17
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A regression fix for a warning caused by running balance and snapshot creation in parallel" * tag 'for-5.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix setting last_trans for reloc roots
2020-04-17btrfs: fix setting last_trans for reloc rootsJosef Bacik1-2/+17
I made a mistake with my previous fix, I assumed that we didn't need to mess with the reloc roots once we were out of the part of relocation where we are actually moving the extents. The subtle thing that I missed is that btrfs_init_reloc_root() also updates the last_trans for the reloc root when we do btrfs_record_root_in_trans() for the corresponding fs_root. I've added a comment to make sure future me doesn't make this mistake again. This showed up as a WARN_ON() in btrfs_copy_root() because our last_trans didn't == the current transid. This could happen if we snapshotted a fs root with a reloc root after we set rc->create_reloc_tree = 0, but before we actually merge the reloc root. Worth mentioning that the regression produced the following warning when running snapshot creation and balance in parallel: BTRFS info (device sdc): relocating block group 30408704 flags metadata|dup ------------[ cut here ]------------ WARNING: CPU: 0 PID: 12823 at fs/btrfs/ctree.c:191 btrfs_copy_root+0x26f/0x430 [btrfs] CPU: 0 PID: 12823 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_copy_root+0x26f/0x430 [btrfs] RSP: 0018:ffffb96e044279b8 EFLAGS: 00010202 RAX: 0000000000000009 RBX: ffff9da70bf61000 RCX: ffffb96e04427a48 RDX: ffff9da733a770c8 RSI: ffff9da70bf61000 RDI: ffff9da694163818 RBP: ffff9da733a770c8 R08: fffffffffffffff8 R09: 0000000000000002 R10: ffffb96e044279a0 R11: 0000000000000000 R12: ffff9da694163818 R13: fffffffffffffff8 R14: ffff9da6d2512000 R15: ffff9da714cdac00 FS: 00007fdeacf328c0(0000) GS:ffff9da735e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055a2a5b8a118 CR3: 00000001eed78002 CR4: 00000000003606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? create_reloc_root+0x49/0x2b0 [btrfs] ? kmem_cache_alloc_trace+0xe5/0x200 create_reloc_root+0x8b/0x2b0 [btrfs] btrfs_reloc_post_snapshot+0x96/0x5b0 [btrfs] create_pending_snapshot+0x610/0x1010 [btrfs] create_pending_snapshots+0xa8/0xd0 [btrfs] btrfs_commit_transaction+0x4c7/0xc50 [btrfs] ? btrfs_mksubvol+0x3cd/0x560 [btrfs] btrfs_mksubvol+0x455/0x560 [btrfs] __btrfs_ioctl_snap_create+0x15f/0x190 [btrfs] btrfs_ioctl_snap_create_v2+0xa4/0xf0 [btrfs] ? mem_cgroup_commit_charge+0x6e/0x540 btrfs_ioctl+0x12d8/0x3760 [btrfs] ? do_raw_spin_unlock+0x49/0xc0 ? _raw_spin_unlock+0x29/0x40 ? __handle_mm_fault+0x11b3/0x14b0 ? ksys_ioctl+0x92/0xb0 ksys_ioctl+0x92/0xb0 ? trace_hardirqs_off_thunk+0x1a/0x1c __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fdeabd3bdd7 Fixes: 2abc726ab4b8 ("btrfs: do not init a reloc root if we aren't relocating") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-04-16Merge tag 'nfs-for-5.7-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-1/+2
Pull NFS client bugfix from Trond Myklebust: "Fix an ABBA spinlock issue in pnfs_update_layout()" * tag 'nfs-for-5.7-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Fix an ABBA spinlock issue in pnfs_update_layout()
2020-04-16Merge tag 'ceph-for-5.7-rc2' of git://github.com/ceph/ceph-clientLinus Torvalds3-5/+5
Pull ceph fixes from Ilya Dryomov: - a set of patches for a deadlock on "rbd map" error path - a fix for invalid pointer dereference and uninitialized variable use on asynchronous create and unlink error paths. * tag 'ceph-for-5.7-rc2' of git://github.com/ceph/ceph-client: ceph: fix potential bad pointer deref in async dirops cb's rbd: don't mess with a page vector in rbd_notify_op_lock() rbd: don't test rbd_dev->opts in rbd_dev_image_release() rbd: call rbd_dev_unprobe() after unwatching and flushing notifies rbd: avoid a deadlock on header_rwsem when flushing notifies
2020-04-16smb3: remove overly noisy debug line in signing errorsSteve French1-2/+2
A dump_stack call for signature related errors can be too noisy and not of much value in debugging such problems. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
2020-04-16xfs: move inode flush to the sync workqueueDarrick J. Wong2-19/+27
Move the inode dirty data flushing to a workqueue so that multiple threads can take advantage of a single thread's flushing work. The ratelimiting technique used in bdd4ee4 was not successful, because threads that skipped the inode flush scan due to ratelimiting would ENOSPC early, which caused occasional (but noticeable) changes in behavior and sporadic fstest regressions. Therefore, make all the writer threads wait on a single inode flush, which eliminates both the stampeding hordes of flushers and the small window in which a write could fail with ENOSPC because it lost the ratelimit race after even another thread freed space. Fixes: c6425702f21e ("xfs: ratelimit inode flush on buffered write ENOSPC") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-04-16proc, time/namespace: Show clock symbolic names in /proc/pid/timens_offsetsAndrei Vagin1-1/+13
Michael Kerrisk suggested to replace numeric clock IDs with symbolic names. Now the content of these files looks like this: $ cat /proc/774/timens_offsets monotonic 864000 0 boottime 1728000 0 For setting offsets, both representations of clocks (numeric and symbolic) can be used. As for compatibility, it is acceptable to change things as long as userspace doesn't care. The format of timens_offsets files is very new and there are no userspace tools yet which rely on this format. But three projects crun, util-linux and criu rely on the interface of setting time offsets and this is why it's required to continue supporting the numeric clock IDs on write. Fixes: 04a8682a71be ("fs/proc: Introduce /proc/pid/timens_offsets") Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200411154031.642557-1-avagin@gmail.com
2020-04-15proc: Handle umounts cleanlyEric W. Biederman1-0/+7
syzbot writes: > KASAN: use-after-free Read in dput (2) > > proc_fill_super: allocate dentry failed > ================================================================== > BUG: KASAN: use-after-free in fast_dput fs/dcache.c:727 [inline] > BUG: KASAN: use-after-free in dput+0x53e/0xdf0 fs/dcache.c:846 > Read of size 4 at addr ffff88808a618cf0 by task syz-executor.0/8426 > > CPU: 0 PID: 8426 Comm: syz-executor.0 Not tainted 5.6.0-next-20200412-syzkaller #0 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0x188/0x20d lib/dump_stack.c:118 > print_address_description.constprop.0.cold+0xd3/0x315 mm/kasan/report.c:382 > __kasan_report.cold+0x35/0x4d mm/kasan/report.c:511 > kasan_report+0x33/0x50 mm/kasan/common.c:625 > fast_dput fs/dcache.c:727 [inline] > dput+0x53e/0xdf0 fs/dcache.c:846 > proc_kill_sb+0x73/0xf0 fs/proc/root.c:195 > deactivate_locked_super+0x8c/0xf0 fs/super.c:335 > vfs_get_super+0x258/0x2d0 fs/super.c:1212 > vfs_get_tree+0x89/0x2f0 fs/super.c:1547 > do_new_mount fs/namespace.c:2813 [inline] > do_mount+0x1306/0x1b30 fs/namespace.c:3138 > __do_sys_mount fs/namespace.c:3347 [inline] > __se_sys_mount fs/namespace.c:3324 [inline] > __x64_sys_mount+0x18f/0x230 fs/namespace.c:3324 > do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295 > entry_SYSCALL_64_after_hwframe+0x49/0xb3 > RIP: 0033:0x45c889 > Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 > RSP: 002b:00007ffc1930ec48 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 > RAX: ffffffffffffffda RBX: 0000000001324914 RCX: 000000000045c889 > RDX: 0000000020000140 RSI: 0000000020000040 RDI: 0000000000000000 > RBP: 000000000076bf00 R08: 0000000000000000 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003 > R13: 0000000000000749 R14: 00000000004ca15a R15: 0000000000000013 Looking at the code now that it the internal mount of proc is no longer used it is possible to unmount proc. If proc is unmounted the fields of the pid namespace that were used for filesystem specific state are not reinitialized. Which means that proc_self and proc_thread_self can be pointers to already freed dentries. The reported user after free appears to be from mounting and unmounting proc followed by mounting proc again and using error injection to cause the new root dentry allocation to fail. This in turn results in proc_kill_sb running with proc_self and proc_thread_self still retaining their values from the previous mount of proc. Then calling dput on either proc_self of proc_thread_self will result in double put. Which KASAN sees as a use after free. Solve this by always reinitializing the filesystem state stored in the struct pid_namespace, when proc is unmounted. Reported-by: syzbot+72868dd424eb66c6b95f@syzkaller.appspotmail.com Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Fixes: 69879c01a0c3 ("proc: Remove the now unnecessary internal mount of proc") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-15ext4: convert BUG_ON's to WARN_ON's in mballoc.cTheodore Ts'o1-2/+4
If the in-core buddy bitmap gets corrupted (or out of sync with the block bitmap), issue a WARN_ON and try to recover. In most cases this involves skipping trying to allocate out of a particular block group. We can end up declaring the file system corrupted, which is fair, since the file system probably should be checked before we proceed any further. Link: https://lore.kernel.org/r/20200414035649.293164-1-tytso@mit.edu Google-Bug-Id: 34811296 Google-Bug-Id: 34639169 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-04-15ext4: increase wait time needed before reuse of deleted inode numbersTheodore Ts'o1-1/+1
Current wait times have proven to be too short to protect against inode reuses that lead to metadata inconsistencies. Now that we will retry the inode allocation if we can't find any recently deleted inodes, it's a lot safer to increase the recently deleted time from 5 seconds to a minute. Link: https://lore.kernel.org/r/20200414023925.273867-1-tytso@mit.edu Google-Bug-Id: 36602237 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-04-15ext4: remove set but not used variable 'es' in ext4_jbd2.cJason Yan1-3/+0
Fix the following gcc warning: fs/ext4/ext4_jbd2.c:341:30: warning: variable 'es' set but not used [-Wunused-but-set-variable] struct ext4_super_block *es; ^~ Fixes: 2ea2fc775321 ("ext4: save all error info in save_error_info() and drop ext4_set_errno()") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Link: https://lore.kernel.org/r/20200402034759.29957-1-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-04-15ext4: remove set but not used variable 'es'Jason Yan1-2/+0
Fix the following gcc warning: fs/ext4/super.c:599:27: warning: variable 'es' set but not used [-Wunused-but-set-variable] struct ext4_super_block *es; ^~ Fixes: 2ea2fc775321 ("ext4: save all error info in save_error_info() and drop ext4_set_errno()") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Link: https://lore.kernel.org/r/20200402033939.25303-1-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-04-15ext4: do not zeroout extents beyond i_disksizeJan Kara1-4/+4
We do not want to create initialized extents beyond end of file because for e2fsck it is impossible to distinguish them from a case of corrupted file size / extent tree and so it complains like: Inode 12, i_size is 147456, should be 163840. Fix? no Code in ext4_ext_convert_to_initialized() and ext4_split_convert_extents() try to make sure it does not create initialized extents beyond inode size however they check against inode->i_size which is wrong. They should instead check against EXT4_I(inode)->i_disksize which is the current inode size on disk. That's what e2fsck is going to see in case of crash before all dirty data is written. This bug manifests as generic/456 test failure (with recent enough fstests where fsx got fixed to properly pass FALLOC_KEEP_SIZE_FL flags to the kernel) when run with dioread_lock mount option. CC: stable@vger.kernel.org Fixes: 21ca087a3891 ("ext4: Do not zero out uninitialized extents beyond i_size") Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20200331105016.8674-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-04-15ext4: fix return-value types in several function commentsJosh Triplett2-3/+3
The documentation comments for ext4_read_block_bitmap_nowait and ext4_read_inode_bitmap describe them as returning NULL on error, but they return an ERR_PTR on error; update the documentation to match. The documentation comment for ext4_wait_block_bitmap describes it as returning 1 on error, but it returns -errno on error; update the documentation to match. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Ritesh Harani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/60a3f4996f4932c45515aaa6b75ca42f2a78ec9b.1585512514.git.josh@joshtriplett.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-04-15ext4: use non-movable memory for superblock readaheadRoman Gushchin3-2/+13
Since commit a8ac900b8163 ("ext4: use non-movable memory for the superblock") buffers for ext4 superblock were allocated using the sb_bread_unmovable() helper which allocated buffer heads out of non-movable memory blocks. It was necessarily to not block page migrations and do not cause cma allocation failures. However commit 85c8f176a611 ("ext4: preload block group descriptors") broke this by introducing pre-reading of the ext4 superblock. The problem is that __breadahead() is using __getblk() underneath, which allocates buffer heads out of movable memory. It resulted in page migration failures I've seen on a machine with an ext4 partition and a preallocated cma area. Fix this by introducing sb_breadahead_unmovable() and __breadahead_gfp() helpers which use non-movable memory for buffer head allocations and use them for the ext4 superblock readahead. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Fixes: 85c8f176a611 ("ext4: preload block group descriptors") Signed-off-by: Roman Gushchin <guro@fb.com> Link: https://lore.kernel.org/r/20200229001411.128010-1-guro@fb.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-04-15ext4: use matching invalidatepage in ext4_writepageyangerkun1-1/+1
Run generic/388 with journal data mode sometimes may trigger the warning in ext4_invalidatepage. Actually, we should use the matching invalidatepage in ext4_writepage. Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200226041002.13914-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-04-15cifs: improve read performance for page size 64KB & cache=strict & vers=2.1+Jones Syue2-1/+5
Found a read performance issue when linux kernel page size is 64KB. If linux kernel page size is 64KB and mount options cache=strict & vers=2.1+, it does not support cifs_readpages(). Instead, it is using cifs_readpage() and cifs_read() with maximum read IO size 16KB, which is much slower than read IO size 1MB when negotiated SMB 2.1+. Since modern SMB server supported SMB 2.1+ and Max Read Size can reach more than 64KB (for example 1MB ~ 8MB), this patch check max_read instead of maxBuf to determine whether server support readpages() and improve read performance for page size 64KB & cache=strict & vers=2.1+, and for SMB1 it is more cleaner to initialize server->max_read to server->maxBuf. The client is a linux box with linux kernel 4.2.8, page size 64KB (CONFIG_ARM64_64K_PAGES=y), cpu arm 1.7GHz, and use mount.cifs as smb client. The server is another linux box with linux kernel 4.2.8, share a file '10G.img' with size 10GB, and use samba-4.7.12 as smb server. The client mount a share from the server with different cache options: cache=strict and cache=none, mount -tcifs //<server_ip>/Public /cache_strict -overs=3.0,cache=strict,username=<xxx>,password=<yyy> mount -tcifs //<server_ip>/Public /cache_none -overs=3.0,cache=none,username=<xxx>,password=<yyy> The client download a 10GbE file from the server across 1GbE network, dd if=/cache_strict/10G.img of=/dev/null bs=1M count=10240 dd if=/cache_none/10G.img of=/dev/null bs=1M count=10240 Found that cache=strict (without patch) is slower read throughput and smaller read IO size than cache=none. cache=strict (without patch): read throughput 40MB/s, read IO size is 16KB cache=strict (with patch): read throughput 113MB/s, read IO size is 1MB cache=none: read throughput 109MB/s, read IO size is 1MB Looks like if page size is 64KB, cifs_set_ops() would use cifs_addr_ops_smallbuf instead of cifs_addr_ops, /* check if server can support readpages */ if (cifs_sb_master_tcon(cifs_sb)->ses->server->maxBuf < PAGE_SIZE + MAX_CIFS_HDR_SIZE) inode->i_data.a_ops = &cifs_addr_ops_smallbuf; else inode->i_data.a_ops = &cifs_addr_ops; maxBuf is came from 2 places, SMB2_negotiate() and CIFSSMBNegotiate(), (SMB2_MAX_BUFFER_SIZE is 64KB) SMB2_negotiate(): /* set it to the maximum buffer size value we can send with 1 credit */ server->maxBuf = min_t(unsigned int, le32_to_cpu(rsp->MaxTransactSize),       SMB2_MAX_BUFFER_SIZE); CIFSSMBNegotiate(): server->maxBuf = le32_to_cpu(pSMBr->MaxBufferSize); Page size 64KB and cache=strict lead to read_pages() use cifs_readpage() instead of cifs_readpages(), and then cifs_read() using maximum read IO size 16KB, which is much slower than maximum read IO size 1MB. (CIFSMaxBufSize is 16KB by default) /* FIXME: set up handlers for larger reads and/or convert to async */ rsize = min_t(unsigned int, cifs_sb->rsize, CIFSMaxBufSize); Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Jones Syue <jonessyue@qnap.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-04-15cifs: dump the session id and keys also for SMB2 sessionsRonnie Sahlberg1-0/+15
We already dump these keys for SMB3, lets also dump it for SMB2 sessions so that we can use the session key in wireshark to check and validate that the signatures are correct. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-04-14io_uring: don't count rqs failed after current onePavel Begunkov1-3/+3
When checking for draining with __req_need_defer(), it tries to match how many requests were sent before a current one with number of already completed. Dropped SQEs are included in req->sequence, and they won't ever appear in CQ. To compensate for that, __req_need_defer() substracts ctx->cached_sq_dropped. However, what it should really use is number of SQEs dropped __before__ the current one. In other words, any submitted request shouldn't shouldn't affect dequeueing from the drain queue of previously submitted ones. Instead of saving proper ctx->cached_sq_dropped in each request, substract from req->sequence it at initialisation, so it includes number of properly submitted requests. note: it also changes behaviour of timeouts, but 1. it's already diverge from the description because of using SQ 2. the description is ambiguous regarding dropped SQEs Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-14io_uring: kill already cached timeout.seq_offsetPavel Begunkov1-6/+3
req->timeout.count and req->io->timeout.seq_offset store the same value, which is sqe->off. Kill the second one Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-14io_uring: fix cached_sq_head in io_timeout()Pavel Begunkov1-7/+8
io_timeout() can be executed asynchronously by a worker and without holding ctx->uring_lock 1. using ctx->cached_sq_head there is racy there 2. it should count events from a moment of timeout's submission, but not execution Use req->sequence. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-14Merge tag 'for-5.7-rc1-tag' of ↵Linus Torvalds6-87/+47
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "We have a few regressions and one fix for stable: - revert fsync optimization - fix lost i_size update - fix a space accounting leak - build fix, add back definition of a deprecated ioctl flag - fix search condition for old roots in relocation" * tag 'for-5.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: re-instantiate the removed BTRFS_SUBVOL_CREATE_ASYNC definition btrfs: fix reclaim counter leak of space_info objects btrfs: make full fsyncs always operate on the entire file again btrfs: fix lost i_size update after cloning inline extent btrfs: check commit root generation in should_ignore_root
2020-04-13io_uring: only post events in io_poll_remove_all() if we completed someJens Axboe1-3/+4
syzbot reports this crash: BUG: unable to handle page fault for address: ffffffffffffffe8 PGD f96e17067 P4D f96e17067 PUD f96e19067 PMD 0 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI CPU: 55 PID: 211750 Comm: trinity-c127 Tainted: G B L 5.7.0-rc1-next-20200413 #4 Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 04/12/2017 RIP: 0010:__wake_up_common+0x98/0x290 el/sched/wait.c:87 Code: 40 4d 8d 78 e8 49 8d 7f 18 49 39 fd 0f 84 80 00 00 00 e8 6b bd 2b 00 49 8b 5f 18 45 31 e4 48 83 eb 18 4c 89 ff e8 08 bc 2b 00 <45> 8b 37 41 f6 c6 04 75 71 49 8d 7f 10 e8 46 bd 2b 00 49 8b 47 10 RSP: 0018:ffffc9000adbfaf0 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: ffffffffaa9636b8 RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffffffffffffffe8 RBP: ffffc9000adbfb40 R08: fffffbfff582c5fd R09: fffffbfff582c5fd R10: ffffffffac162fe3 R11: fffffbfff582c5fc R12: 0000000000000000 R13: ffff888ef82b0960 R14: ffffc9000adbfb80 R15: ffffffffffffffe8 FS: 00007fdcba4c4740(0000) GS:ffff889033780000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffe8 CR3: 0000000f776a0004 CR4: 00000000001606e0 Call Trace: __wake_up_common_lock+0xea/0x150 ommon_lock at kernel/sched/wait.c:124 ? __wake_up_common+0x290/0x290 ? lockdep_hardirqs_on+0x16/0x2c0 __wake_up+0x13/0x20 io_cqring_ev_posted+0x75/0xe0 v_posted at fs/io_uring.c:1160 io_ring_ctx_wait_and_kill+0x1c0/0x2f0 l at fs/io_uring.c:7305 io_uring_create+0xa8d/0x13b0 ? io_req_defer_prep+0x990/0x990 ? __kasan_check_write+0x14/0x20 io_uring_setup+0xb8/0x130 ? io_uring_create+0x13b0/0x13b0 ? check_flags.part.28+0x220/0x220 ? lockdep_hardirqs_on+0x16/0x2c0 __x64_sys_io_uring_setup+0x31/0x40 do_syscall_64+0xcc/0xaf0 ? syscall_return_slowpath+0x580/0x580 ? lockdep_hardirqs_off+0x1f/0x140 ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3 ? trace_hardirqs_off_caller+0x3a/0x150 ? trace_hardirqs_off_thunk+0x1a/0x1c entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x7fdcb9dd76ed Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6b 57 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe7fd4e4f8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 RAX: ffffffffffffffda RBX: 00000000000001a9 RCX: 00007fdcb9dd76ed RDX: fffffffffffffffc RSI: 0000000000000000 RDI: 0000000000005d54 RBP: 00000000000001a9 R08: 0000000e31d3caa7 R09: 0082400004004000 R10: ffffffffffffffff R11: 0000000000000246 R12: 0000000000000002 R13: 00007fdcb842e058 R14: 00007fdcba4c46c0 R15: 00007fdcb842e000 Modules linked in: bridge stp llc nfnetlink cn brd vfat fat ext4 crc16 mbcache jbd2 loop kvm_intel kvm irqbypass intel_cstate intel_uncore dax_pmem intel_rapl_perf dax_pmem_core ip_tables x_tables xfs sd_mod tg3 firmware_class libphy hpsa scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: binfmt_misc] CR2: ffffffffffffffe8 ---[ end trace f9502383d57e0e22 ]--- RIP: 0010:__wake_up_common+0x98/0x290 Code: 40 4d 8d 78 e8 49 8d 7f 18 49 39 fd 0f 84 80 00 00 00 e8 6b bd 2b 00 49 8b 5f 18 45 31 e4 48 83 eb 18 4c 89 ff e8 08 bc 2b 00 <45> 8b 37 41 f6 c6 04 75 71 49 8d 7f 10 e8 46 bd 2b 00 49 8b 47 10 RSP: 0018:ffffc9000adbfaf0 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: ffffffffaa9636b8 RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffffffffffffffe8 RBP: ffffc9000adbfb40 R08: fffffbfff582c5fd R09: fffffbfff582c5fd R10: ffffffffac162fe3 R11: fffffbfff582c5fc R12: 0000000000000000 R13: ffff888ef82b0960 R14: ffffc9000adbfb80 R15: ffffffffffffffe8 FS: 00007fdcba4c4740(0000) GS:ffff889033780000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffe8 CR3: 0000000f776a0004 CR4: 00000000001606e0 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x29800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception ]— which is due to error injection (or allocation failure) preventing the rings from being setup. On shutdown, we attempt to remove any pending requests, and for poll request, we call io_cqring_ev_posted() when we've killed poll requests. However, since the rings aren't setup, we won't find any poll requests. Make the calling of io_cqring_ev_posted() dependent on actually having completed requests. This fixes this setup corner case, and removes spurious calls if we remove poll requests and don't find any. Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-13NFS: Fix an ABBA spinlock issue in pnfs_update_layout()Trond Myklebust1-1/+2
We need to drop the inode spinlock while calling nfs4_select_rw_stateid(), since nfs4_copy_delegation_stateid() could take the delegation lock. Note that it is safe to do this, since all other calls to pnfs_update_layout() for that inode will find themselves blocked by the lock we hold on NFS_LAYOUT_FIRST_LAYOUTGET. Fixes: fc51b1cf391d ("NFS: Beware when dereferencing the delegation cred") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-13ceph: fix potential bad pointer deref in async dirops cb'sJeff Layton3-5/+5
The new async dirops callback routines can pass ERR_PTR values to ceph_mdsc_free_path, which could cause an oops. Make ceph_mdsc_free_path ignore ERR_PTR values. Also, ensure that the pr_warn messages look sane even if ceph_mdsc_build_path fails. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-04-13io_uring: io_async_task_func() should check and honor cancelationJens Axboe1-0/+15
If the request has been marked as canceled, don't try and issue it. Instead just fill a canceled event and finish the request. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-13io_uring: check for need to re-wait in polled async handlingJens Axboe1-14/+29
We added this for just the regular poll requests in commit a6ba632d2c24 ("io_uring: retry poll if we got woken with non-matching mask"), we should do the same for the poll handler used pollable async requests. Move the re-wait check and arm into a helper, and call it from io_async_task_func() as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-13xfs: fix partially uninitialized structure in xfs_reflink_remap_extentDarrick J. Wong1-0/+1
In the reflink extent remap function, it turns out that uirec (the block mapping corresponding only to the part of the passed-in mapping that got unmapped) was not fully initialized. Specifically, br_state was not being copied from the passed-in struct to the uirec. This could lead to unpredictable results such as the reflinked mapping being marked unwritten in the destination file. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-04-13xfs: acquire superblock freeze protection on eofblocks scansBrian Foster2-1/+14
The filesystem freeze sequence in XFS waits on any background eofblocks or cowblocks scans to complete before the filesystem is quiesced. At this point, the freezer has already stopped the transaction subsystem, however, which means a truncate or cowblock cancellation in progress is likely blocked in transaction allocation. This results in a deadlock between freeze and the associated scanner. Fix this problem by holding superblock write protection across calls into the block reapers. Since protection for background scans is acquired from the workqueue task context, trylock to avoid a similar deadlock between freeze and blocking on the write lock. Fixes: d6b636ebb1c9f ("xfs: halt auto-reclamation activities while rebuilding rmap") Reported-by: Paul Furtado <paulfurtado91@gmail.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>