summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2018-12-03ext4: fix possible use after free in ext4_quota_enablePan Bian1-1/+1
The function frees qf_inode via iput but then pass qf_inode to lockdep_set_quota_inode on the failure path. This may result in a use-after-free bug. The patch frees df_inode only when it is never used. Fixes: daf647d2dd5 ("ext4: add lockdep annotations for i_data_sem") Cc: stable@kernel.org # 4.6 Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Pan Bian <bianpan2016@163.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-12-03jbd2: avoid long hold times of j_state_lock while committing a transactionJan Kara2-5/+41
We can hold j_state_lock for writing at the beginning of jbd2_journal_commit_transaction() for a rather long time (reportedly for 30 ms) due cleaning revoke bits of all revoked buffers under it. The handling of revoke tables as well as cleaning of t_reserved_list, and checkpoint lists does not need j_state_lock for anything. It is only needed to prevent new handles from joining the transaction. Generally T_LOCKED transaction state prevents new handles from joining the transaction - except for reserved handles which have to allowed to join while we wait for other handles to complete. To prevent reserved handles from joining the transaction while cleaning up lists, add new transaction state T_SWITCH and watch for it when starting reserved handles. With this we can just drop the lock for operations that don't need it. Reported-and-tested-by: Adrian Hunter <adrian.hunter@intel.com> Suggested-by: "Theodore Y. Ts'o" <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-11-25ext4: add ext4_sb_bread() to disambiguate ENOMEM casesTheodore Ts'o5-94/+115
Today, when sb_bread() returns NULL, this can either be because of an I/O error or because the system failed to allocate the buffer. Since it's an old interface, changing would require changing many call sites. So instead we create our own ext4_sb_bread(), which also allows us to set the REQ_META flag. Also fixed a problem in the xattr code where a NULL return in a function could also mean that the xattr was not found, which could lead to the wrong error getting returned to userspace. Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3") Cc: stable@kernel.org # 2.6.19 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-11-25Merge tag 'nfs-for-4.20-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds7-37/+66
Pull NFS client bugfixes from Trond Myklebust: - Fix a NFSv4 state manager deadlock when returning a delegation - NFSv4.2 copy do not allocate memory under the lock - flexfiles: Use the correct stateid for IO in the tightly coupled case * tag 'nfs-for-4.20-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: flexfiles: use per-mirror specified stateid for IO NFSv4.2 copy do not allocate memory under the lock NFSv4: Fix a NFSv4 state manager deadlock
2018-11-24Merge tag 'xarray-4.20-rc4' of git://git.infradead.org/users/willy/linux-daxLinus Torvalds2-28/+36
Pull XArray updates from Matthew Wilcox: "We found some bugs in the DAX conversion to XArray (and one bug which predated the XArray conversion). There were a couple of bugs in some of the higher-level functions, which aren't actually being called in today's kernel, but surfaced as a result of converting existing radix tree & IDR users over to the XArray. Some of the other changes to how the higher-level APIs work were also motivated by converting various users; again, they're not in use in today's kernel, so changing them has a low probability of introducing a bug. Dan can still trigger a bug in the DAX code with hot-offline/online, and we're working on tracking that down" * tag 'xarray-4.20-rc4' of git://git.infradead.org/users/willy/linux-dax: XArray tests: Add missing locking dax: Avoid losing wakeup in dax_lock_mapping_entry dax: Fix huge page faults dax: Fix dax_unlock_mapping_entry for PMD pages dax: Reinstate RCU protection of inode dax: Make sure the unlocking entry isn't locked dax: Remove optimisation from dax_lock_mapping_entry XArray tests: Correct some 64-bit assumptions XArray: Correct xa_store_range XArray: Fix Documentation XArray: Handle NULL pointers differently for allocation XArray: Unify xa_store and __xa_store XArray: Add xa_store_bh() and xa_store_irq() XArray: Turn xa_erase into an exported function XArray: Unify xa_cmpxchg and __xa_cmpxchg XArray: Regularise xa_reserve nilfs2: Use xa_erase_irq XArray: Export __xa_foo to non-GPL modules XArray: Fix xa_for_each with a single element at 0
2018-11-24Merge tag 'xfs-4.20-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds10-46/+104
Pull xfs fixes from Darrick Wong: "Dave and I have continued our work fixing corruption problems that can be found when running long-term burn-in exercisers on xfs. Here are some patches fixing most of the problems, but there will likely be more. :/ - Numerous corruption fixes for copy on write - Numerous corruption fixes for blocksize < pagesize writes - Don't miscalculate AG reservations for small final AGs - Fix page cache truncation to work properly for reflink and extent shifting - Fix use-after-free when retrying failed inode/dquot buffer logging - Fix corruptions seen when using copy_file_range in directio mode" * tag 'xfs-4.20-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: readpages doesn't zero page tail beyond EOF vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP iomap: dio data corruption and spurious errors when pipes fill iomap: sub-block dio needs to zeroout beyond EOF iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents xfs: delalloc -> unwritten COW fork allocation can go wrong xfs: flush removing page cache in xfs_reflink_remap_prep xfs: extent shifting doesn't fully invalidate page cache xfs: finobt AG reserves don't consider last AG can be a runt xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers xfs: uncached buffer tracing needs to print bno xfs: make xfs_file_remap_range() static xfs: fix shared extent data corruption due to missing cow reservation
2018-11-23Merge tag 'pm-4.20-rc4' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix two issues in the Operating Performance Points (OPP) framework, one cpufreq driver issue, one problem related to the tasks freezer and a few build-related issues in the cpupower utility. Specifics: - Fix tasks freezer deadlock in de_thread() that occurs if one of its sub-threads has been frozen already (Chanho Min). - Avoid registering a platform device by the ti-cpufreq driver on platforms that cannot use it (Dave Gerlach). - Fix a mistake in the ti-opp-supply operating performance points (OPP) driver that caused an incorrect reference voltage to be used and make it adjust the minimum voltage dynamically to avoid hangs or crashes in some cases (Keerthy). - Fix issues related to compiler flags in the cpupower utility and correct a linking problem in it by renaming a file with a duplicate name (Jiri Olsa, Konstantin Khlebnikov)" * tag 'pm-4.20-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: exec: make de_thread() freezable cpufreq: ti-cpufreq: Only register platform_device when supported opp: ti-opp-supply: Correct the supply in _get_optimal_vdd_voltage call opp: ti-opp-supply: Dynamically update u_volt_min tools cpupower: Override CFLAGS assignments tools cpupower debug: Allow to use outside build flags tools/power/cpupower: fix compilation with STATIC=true
2018-11-22flexfiles: use per-mirror specified stateid for IOTigran Mkrtchyan3-12/+32
rfc8435 says: For tight coupling, ffds_stateid provides the stateid to be used by the client to access the file. However current implementation replaces per-mirror provided stateid with by open or lock stateid. Ensure that per-mirror stateid is used by ff_layout_write_prepare_v4 and nfs4_ff_layout_prepare_ds. Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de> Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-11-22NFSv4.2 copy do not allocate memory under the lockOlga Kornievskaia2-20/+21
Bruce pointed out that we shouldn't allocate memory while holding a lock in the nfs4_callback_offload() and handle_async_copy() that deal with a racing CB_OFFLOAD and reply to COPY case. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-11-21iomap: readpages doesn't zero page tail beyond EOFDave Chinner1-3/+8
When we read the EOF page of the file via readpages, we need to zero the region beyond EOF that we either do not read or should not contain data so that mmap does not expose stale data to user applications. However, iomap_adjust_read_range() fails to detect EOF correctly, and so fsx on 1k block size filesystems fails very quickly with mapreads exposing data beyond EOF. There are two problems here. Firstly, when calculating the end block of the EOF byte, we have to round the size by one to avoid a block aligned EOF from reporting a block too large. i.e. a size of 1024 bytes is 1 block, which in index terms is block 0. Therefore we have to calculate the end block from (isize - 1), not isize. The second bug is determining if the current page spans EOF, and so whether we need split it into two half, one for the IO, and the other for zeroing. Unfortunately, the code that checks whether we should split the block doesn't actually check if we span EOF, it just checks if the read spans the /offset in the page/ that EOF sits on. So it splits every read into two if EOF is not page aligned, regardless of whether we are reading the EOF block or not. Hence we need to restrict the "does the read span EOF" check to just the page that spans EOF, not every page we read. This patch results in correct EOF detection through readpages: xfs_vm_readpages: dev 259:0 ino 0x43 nr_pages 24 xfs_iomap_found: dev 259:0 ino 0x43 size 0x66c00 offset 0x4f000 count 98304 type hole startoff 0x13c startblock 1368 blockcount 0x4 iomap_readpage_actor: orig pos 323584 pos 323584, length 4096, poff 0 plen 4096, isize 420864 xfs_iomap_found: dev 259:0 ino 0x43 size 0x66c00 offset 0x50000 count 94208 type hole startoff 0x140 startblock 1497 blockcount 0x5c iomap_readpage_actor: orig pos 327680 pos 327680, length 94208, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 331776 pos 331776, length 90112, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 335872 pos 335872, length 86016, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 339968 pos 339968, length 81920, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 344064 pos 344064, length 77824, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 348160 pos 348160, length 73728, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 352256 pos 352256, length 69632, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 356352 pos 356352, length 65536, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 360448 pos 360448, length 61440, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 364544 pos 364544, length 57344, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 368640 pos 368640, length 53248, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 372736 pos 372736, length 49152, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 376832 pos 376832, length 45056, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 380928 pos 380928, length 40960, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 385024 pos 385024, length 36864, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 389120 pos 389120, length 32768, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 393216 pos 393216, length 28672, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 397312 pos 397312, length 24576, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 401408 pos 401408, length 20480, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 405504 pos 405504, length 16384, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 409600 pos 409600, length 12288, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 413696 pos 413696, length 8192, poff 0 plen 4096, isize 420864 iomap_readpage_actor: orig pos 417792 pos 417792, length 4096, poff 0 plen 3072, isize 420864 iomap_readpage_actor: orig pos 420864 pos 420864, length 1024, poff 3072 plen 1024, isize 420864 As you can see, it now does full page reads until the last one which is split correctly at the block aligned EOF, reading 3072 bytes and zeroing the last 1024 bytes. The original version of the patch got this right, but it got another case wrong. The EOF detection crossing really needs to the the original length as plen, while it starts at the end of the block, will be shortened as up-to-date blocks are found on the page. This means "orig_pos + plen" no longer points to the end of the page, and so will not correctly detect EOF crossing. Hence we have to use the length passed in to detect this partial page case: xfs_filemap_fault: dev 259:1 ino 0x43 write_fault 0 xfs_vm_readpage: dev 259:1 ino 0x43 nr_pages 1 xfs_iomap_found: dev 259:1 ino 0x43 size 0x2cc00 offset 0x2c000 count 4096 type hole startoff 0xb0 startblock 282 blockcount 0x4 iomap_readpage_actor: orig pos 180224 pos 181248, length 4096, poff 1024 plen 2048, isize 183296 xfs_iomap_found: dev 259:1 ino 0x43 size 0x2cc00 offset 0x2cc00 count 1024 type hole startoff 0xb3 startblock 285 blockcount 0x1 iomap_readpage_actor: orig pos 183296 pos 183296, length 1024, poff 3072 plen 1024, isize 183296 Heere we see a trace where the first block on the EOF page is up to date, hence poff = 1024 bytes. The offset into the page of EOF is 3072, so the range we want to read is 1024 - 3071, and the range we want to zero is 3072 - 4095. You can see this is split correctly now. This fixes the stale data beyond EOF problem that fsx quickly uncovers on 1k block size filesystems. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-21vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPPDave Chinner1-8/+7
It returns EINVAL when the operation is not supported by the filesystem. Fix it to return EOPNOTSUPP to be consistent with the man page and clone_file_range(). Clean up the inconsistent error return handling while I'm there. (I know, lipstick on a pig, but every little bit helps...) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-21iomap: dio data corruption and spurious errors when pipes fillDave Chinner1-3/+19
When doing direct IO to a pipe for do_splice_direct(), then pipe is trivial to fill up and overflow as it can only hold 16 pages. At this point bio_iov_iter_get_pages() then returns -EFAULT, and we abort the IO submission process. Unfortunately, iomap_dio_rw() propagates the error back up the stack. The error is converted from the EFAULT to EAGAIN in generic_file_splice_read() to tell the splice layers that the pipe is full. do_splice_direct() completely fails to handle EAGAIN errors (it aborts on error) and returns EAGAIN to the caller. copy_file_write() then completely fails to handle EAGAIN as well, and so returns EAGAIN to userspace, having failed to copy the data it was asked to. Avoid this whole steaming pile of fail by having iomap_dio_rw() silently swallow EFAULT errors and so do short reads. To make matters worse, iomap_dio_actor() has a stale data exposure bug bio_iov_iter_get_pages() fails - it does not zero the tail block that it may have been left uncovered by partial IO. Fix the error handling case to drop to the sub-block zeroing rather than immmediately returning the -EFAULT error. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-21iomap: sub-block dio needs to zeroout beyond EOFDave Chinner1-1/+8
If we are doing sub-block dio that extends EOF, we need to zero the unused tail of the block to initialise the data in it it. If we do not zero the tail of the block, then an immediate mmap read of the EOF block will expose stale data beyond EOF to userspace. Found with fsx running sub-block DIO sizes vs MAPREAD/MAPWRITE operations. Fix this by detecting if the end of the DIO write is beyond EOF and zeroing the tail if necessary. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-21iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extentsDave Chinner1-5/+6
When we write into an unwritten extent via direct IO, we dirty metadata on IO completion to convert the unwritten extent to written. However, when we do the FUA optimisation checks, the inode may be clean and so we issue a FUA write into the unwritten extent. This means we then bypass the generic_write_sync() call after unwritten extent conversion has ben done and we don't force the modified metadata to stable storage. This violates O_DSYNC semantics. The window of exposure is a single IO, as the next DIO write will see the inode has dirty metadata and hence will not use the FUA optimisation. Calling generic_write_sync() after completion of the second IO will also sync the first write and it's metadata. Fix this by avoiding the FUA optimisation when writing to unwritten extents. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-21xfs: delalloc -> unwritten COW fork allocation can go wrongDave Chinner1-1/+4
Long saga. There have been days spent following this through dead end after dead end in multi-GB event traces. This morning, after writing a trace-cmd wrapper that enabled me to be more selective about XFS trace points, I discovered that I could get just enough essential tracepoints enabled that there was a 50:50 chance the fsx config would fail at ~115k ops. If it didn't fail at op 115547, I stopped fsx at op 115548 anyway. That gave me two traces - one where the problem manifested, and one where it didn't. After refining the traces to have the necessary information, I found that in the failing case there was a real extent in the COW fork compared to an unwritten extent in the working case. Walking back through the two traces to the point where the CWO fork extents actually diverged, I found that the bad case had an extra unwritten extent in it. This is likely because the bug it led me to had triggered multiple times in those 115k ops, leaving stray COW extents around. What I saw was a COW delalloc conversion to an unwritten extent (as they should always be through xfs_iomap_write_allocate()) resulted in a /written extent/: xfs_writepage: dev 259:0 ino 0x83 pgoff 0x17000 size 0x79a00 offset 0 length 0 xfs_iext_remove: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/2 offset 32 block 152 count 20 flag 1 caller xfs_bmap_add_extent_delay_real xfs_bmap_pre_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 4503599627239429 count 31 flag 0 caller xfs_bmap_add_extent_delay_real xfs_bmap_post_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 121 count 51 flag 0 caller xfs_bmap_add_ex Basically, Cow fork before: 0 1 32 52 +H+DDDDDDDDDDDD+UUUUUUUUUUU+ PREV RIGHT COW delalloc conversion allocates: 1 32 +uuuuuuuuuuuu+ NEW And the result according to the xfs_bmap_post_update trace was: 0 1 32 52 +H+wwwwwwwwwwwwwwwwwwwwwwww+ PREV Which is clearly wrong - it should be a merged unwritten extent, not an unwritten extent. That lead me to look at the LEFT_FILLING|RIGHT_FILLING|RIGHT_CONTIG case in xfs_bmap_add_extent_delay_real(), and sure enough, there's the bug. It takes the old delalloc extent (PREV) and adds the length of the RIGHT extent to it, takes the start block from NEW, removes the RIGHT extent and then updates PREV with the new extent. What it fails to do is update PREV.br_state. For delalloc, this is always XFS_EXT_NORM, while in this case we are converting the delayed allocation to unwritten, so it needs to be updated to XFS_EXT_UNWRITTEN. This LF|RF|RC case does not do this, and so the resultant extent is always written. And that's the bug I've been chasing for a week - a bmap btree bug, not a reflink/dedupe/copy_file_range bug, but a BMBT bug introduced with the recent in core extent tree scalability enhancements. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-21xfs: flush removing page cache in xfs_reflink_remap_prepDave Chinner3-5/+17
On a sub-page block size filesystem, fsx is failing with a data corruption after a series of operations involving copying a file with the destination offset beyond EOF of the destination of the file: 8093(157 mod 256): TRUNCATE DOWN from 0x7a120 to 0x50000 ******WWWW 8094(158 mod 256): INSERT 0x25000 thru 0x25fff (0x1000 bytes) 8095(159 mod 256): COPY 0x18000 thru 0x1afff (0x3000 bytes) to 0x2f400 8096(160 mod 256): WRITE 0x5da00 thru 0x651ff (0x7800 bytes) HOLE 8097(161 mod 256): COPY 0x2000 thru 0x5fff (0x4000 bytes) to 0x6fc00 The second copy here is beyond EOF, and it is to sub-page (4k) but block aligned (1k) offset. The clone runs the EOF zeroing, landing in a pre-existing post-eof delalloc extent. This zeroes the post-eof extents in the page cache just fine, dirtying the pages correctly. The problem is that xfs_reflink_remap_prep() now truncates the page cache over the range that it is copying it to, and rounds that down to cover the entire start page. This removes the dirty page over the delalloc extent from the page cache without having written it back. Hence later, when the page cache is flushed, the page at offset 0x6f000 has not been written back and hence exposes stale data, which fsx trips over less than 10 operations later. Fix this by changing xfs_reflink_remap_prep() to use xfs_flush_unmap_range(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-20xfs: extent shifting doesn't fully invalidate page cacheDave Chinner1-7/+1
The extent shifting code uses a flush and invalidate mechainsm prior to shifting extents around. This is similar to what xfs_free_file_space() does, but it doesn't take into account things like page cache vs block size differences, and it will fail if there is a page that it currently busy. xfs_flush_unmap_range() handles all of these cases, so just convert xfs_prepare_shift() to us that mechanism rather than having it's own special sauce. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-20xfs: finobt AG reserves don't consider last AG can be a runtDave Chinner1-4/+7
The last AG may be very small comapred to all other AGs, and hence AG reservations based on the superblock AG size may actually consume more space than the AG actually has. This results on assert failures like: XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319 [ 48.932891] xfs_ag_resv_init+0x1bd/0x1d0 [ 48.933853] xfs_fs_reserve_ag_blocks+0x37/0xb0 [ 48.934939] xfs_mountfs+0x5b3/0x920 [ 48.935804] xfs_fs_fill_super+0x462/0x640 [ 48.936784] ? xfs_test_remount_options+0x60/0x60 [ 48.937908] mount_bdev+0x178/0x1b0 [ 48.938751] mount_fs+0x36/0x170 [ 48.939533] vfs_kern_mount.part.43+0x54/0x130 [ 48.940596] do_mount+0x20e/0xcb0 [ 48.941396] ? memdup_user+0x3e/0x70 [ 48.942249] ksys_mount+0xba/0xd0 [ 48.943046] __x64_sys_mount+0x21/0x30 [ 48.943953] do_syscall_64+0x54/0x170 [ 48.944835] entry_SYSCALL_64_after_hwframe+0x49/0xbe Hence we need to ensure the finobt per-ag space reservations take into account the size of the last AG rather than treat it like all the other full size AGs. Note that both refcountbt and rmapbt already take the size of the AG into account via reading the AGF length directly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-20xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffersDave Chinner1-7/+21
When retrying a failed inode or dquot buffer, xfs_buf_resubmit_failed_buffers() clears all the failed flags from the inde/dquot log items. In doing so, it also drops all the reference counts on the buffer that the failed log items hold. This means it can drop all the active references on the buffer and hence free the buffer before it queues it for write again. Putting the buffer on the delwri queue takes a reference to the buffer (so that it hangs around until it has been written and completed), but this goes bang if the buffer has already been freed. Hence we need to add the buffer to the delwri queue before we remove the failed flags from the log items attached to the buffer to ensure it always remains referenced during the resubmit process. Reported-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-20xfs: uncached buffer tracing needs to print bnoDave Chinner1-1/+4
Useless: xfs_buf_get_uncached: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_unlock: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_submit: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_hold: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_iowait: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_iodone: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_iowait_done: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_rele: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... Useful: xfs_buf_get_uncached: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_unlock: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ... xfs_buf_submit: dev 253:32 bno 0x200b5 nblks 0x1 ... xfs_buf_hold: dev 253:32 bno 0x200b5 nblks 0x1 ... xfs_buf_iowait: dev 253:32 bno 0x200b5 nblks 0x1 ... xfs_buf_iodone: dev 253:32 bno 0x200b5 nblks 0x1 ... xfs_buf_iowait_done: dev 253:32 bno 0x200b5 nblks 0x1 ... xfs_buf_rele: dev 253:32 bno 0x200b5 nblks 0x1 ... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-19NFSv4: Fix a NFSv4 state manager deadlockTrond Myklebust2-5/+13
Fix a deadlock whereby the NFSv4 state manager can get stuck in the delegation return code, waiting for a layout return to complete in another thread. If the server reboots before that other thread completes, then we need to be able to start a second state manager thread in order to perform recovery. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-11-19xfs: make xfs_file_remap_range() staticEric Biggers1-1/+1
xfs_file_remap_range() is only used in fs/xfs/xfs_file.c, so make it static. This addresses a gcc warning when -Wmissing-prototypes is enabled. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-19xfs: fix shared extent data corruption due to missing cow reservationBrian Foster1-0/+1
Page writeback indirectly handles shared extents via the existence of overlapping COW fork blocks. If COW fork blocks exist, writeback always performs the associated copy-on-write regardless if the underlying blocks are actually shared. If the blocks are shared, then overlapping COW fork blocks must always exist. fstests shared/010 reproduces a case where a buffered write occurs over a shared block without performing the requisite COW fork reservation. This ultimately causes writeback to the shared extent and data corruption that is detected across md5 checks of the filesystem across a mount cycle. The problem occurs when a buffered write lands over a shared extent that crosses an extent size hint boundary and that also happens to have a partial COW reservation that doesn't cover the start and end blocks of the data fork extent. For example, a buffered write occurs across the file offset (in FSB units) range of [29, 57]. A shared extent exists at blocks [29, 35] and COW reservation already exists at blocks [32, 34]. After accommodating a COW extent size hint of 32 blocks and the existing reservation at offset 32, xfs_reflink_reserve_cow() allocates 32 blocks of reservation at offset 0 and returns with COW reservation across the range of [0, 34]. The associated data fork extent is still [29, 35], however, which isn't fully covered by the COW reservation. This leads to a buffered write at file offset 35 over a shared extent without associated COW reservation. Writeback eventually kicks in, performs an overwrite of the underlying shared block and causes the associated data corruption. Update xfs_reflink_reserve_cow() to accommodate the fact that a delalloc allocation request may not fully cover the extent in the data fork. Trim the data fork extent appropriately, just as is done for shared extent boundaries and/or existing COW reservations that happen to overlap the start of the data fork extent. This prevents shared/010 failures due to data corruption on reflink enabled filesystems. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-1/+10
Pull networking fixes from David Miller: 1) Fix some potentially uninitialized variables and use-after-free in kvaser_usb can drier, from Jimmy Assarsson. 2) Fix leaks in qed driver, from Denis Bolotin. 3) Socket leak in l2tp, from Xin Long. 4) RSS context allocation fix in bnxt_en from Michael Chan. 5) Fix cxgb4 build errors, from Ganesh Goudar. 6) Route leaks in ipv6 when removing exceptions, from Xin Long. 7) Memory leak in IDR allocation handling of act_pedit, from Davide Caratti. 8) Use-after-free of bridge vlan stats, from Nikolay Aleksandrov. 9) When MTU is locked, do not force DF bit on ipv4 tunnels. From Sabrina Dubroca. 10) When NAPI cached skb is reused, we must set it to the proper initial state which includes skb->pkt_type. From Eric Dumazet. 11) Lockdep and non-linear SKB handling fix in tipc from Jon Maloy. 12) Set RX queue properly in various tuntap receive paths, from Matthew Cover. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits) tuntap: fix multiqueue rx ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF tipc: don't assume linear buffer when reading ancillary data tipc: fix lockdep warning when reinitilaizing sockets net-gro: reset skb->pkt_type in napi_reuse_skb() tc-testing: tdc.py: Guard against lack of returncode in executed command tc-testing: tdc.py: ignore errors when decoding stdout/stderr ip_tunnel: don't force DF when MTU is locked MAINTAINERS: Add entry for CAKE qdisc net: bridge: fix vlan stats use-after-free on destruction socket: do a generic_file_splice_read when proto_ops has no splice_read net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs Revert "net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs" net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs net/sched: act_pedit: fix memory leak when IDR allocation fails net: lantiq: Fix returned value in case of error in 'xrx200_probe()' ipv6: fix a dst leak when removing its exception net: mvneta: Don't advertise 2.5G modes drivers/net/ethernet/qlogic/qed/qed_rdma.h: fix typo net/mlx4: Fix UBSAN warning of signed integer overflow ...
2018-11-19dax: Avoid losing wakeup in dax_lock_mapping_entryMatthew Wilcox1-0/+1
After calling get_unlocked_entry(), you have to call put_unlocked_entry() to avoid subsequent waiters losing wakeups. Fixes: c2a7d2a11552 ("filesystem-dax: Introduce dax_lock_mapping_entry()") Cc: stable@vger.kernel.org Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-11-19exec: make de_thread() freezableChanho Min1-2/+3
Suspend fails due to the exec family of functions blocking the freezer. The casue is that de_thread() sleeps in TASK_UNINTERRUPTIBLE waiting for all sub-threads to die, and we have the deadlock if one of them is frozen. This also can occur with the schedule() waiting for the group thread leader to exit if it is frozen. In our machine, it causes freeze timeout as bellows. Freezing of tasks failed after 20.010 seconds (1 tasks refusing to freeze, wq_busy=0): setcpushares-ls D ffffffc00008ed70 0 5817 1483 0x0040000d Call trace: [<ffffffc00008ed70>] __switch_to+0x88/0xa0 [<ffffffc000d1c30c>] __schedule+0x1bc/0x720 [<ffffffc000d1ca90>] schedule+0x40/0xa8 [<ffffffc0001cd784>] flush_old_exec+0xdc/0x640 [<ffffffc000220360>] load_elf_binary+0x2a8/0x1090 [<ffffffc0001ccff4>] search_binary_handler+0x9c/0x240 [<ffffffc00021c584>] load_script+0x20c/0x228 [<ffffffc0001ccff4>] search_binary_handler+0x9c/0x240 [<ffffffc0001ce8e0>] do_execveat_common.isra.14+0x4f8/0x6e8 [<ffffffc0001cedd0>] compat_SyS_execve+0x38/0x48 [<ffffffc00008de30>] el0_svc_naked+0x24/0x28 To fix this, make de_thread() freezable. It looks safe and works fine. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Chanho Min <chanho.min@lge.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Pavel Machek <pavel@ucw.cz> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-11-18ocfs2: free up write context when direct IO failedWengang Wang2-2/+19
The write context should also be freed even when direct IO failed. Otherwise a memory leak is introduced and entries remain in oi->ip_unwritten_list causing the following BUG later in unlink path: ERROR: bug expression: !list_empty(&oi->ip_unwritten_list) ERROR: Clear inode of 215043, inode has unwritten extents ... Call Trace: ? __set_current_blocked+0x42/0x68 ocfs2_evict_inode+0x91/0x6a0 [ocfs2] ? bit_waitqueue+0x40/0x33 evict+0xdb/0x1af iput+0x1a2/0x1f7 do_unlinkat+0x194/0x28f SyS_unlinkat+0x1b/0x2f do_syscall_64+0x79/0x1ae entry_SYSCALL_64_after_hwframe+0x151/0x0 This patch also logs, with frequency limit, direct IO failures. Link: http://lkml.kernel.org/r/20181102170632.25921-1-wen.gang.wang@oracle.com Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Changwei Ge <ge.changwei@h3c.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-18mm: don't reclaim inodes with many attached pagesRoman Gushchin1-2/+5
Spock reported that commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects") leads to a regression on his setup: periodically the majority of the pagecache is evicted without an obvious reason, while before the change the amount of free memory was balancing around the watermark. The reason behind is that the mentioned above change created some minimal background pressure on the inode cache. The problem is that if an inode is considered to be reclaimed, all belonging pagecache page are stripped, no matter how many of them are there. So, if a huge multi-gigabyte file is cached in the memory, and the goal is to reclaim only few slab objects (unused inodes), we still can eventually evict all gigabytes of the pagecache at once. The workload described by Spock has few large non-mapped files in the pagecache, so it's especially noticeable. To solve the problem let's postpone the reclaim of inodes, which have more than 1 attached page. Let's wait until the pagecache pages will be evicted naturally by scanning the corresponding LRU lists, and only then reclaim the inode structure. Link: http://lkml.kernel.org/r/20181023164302.20436-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reported-by: Spock <dairinin@gmail.com> Tested-by: Spock <dairinin@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: <stable@vger.kernel.org> [4.19.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-17dax: Fix huge page faultsMatthew Wilcox1-8/+4
Using xas_load() with a PMD-sized xa_state would work if either a PMD-sized entry was present or a PTE sized entry was present in the first 64 entries (of the 512 PTEs in a PMD on x86). If there was no PTE in the first 64 entries, grab_mapping_entry() would believe there were no entries present, allocate a PMD-sized entry and overwrite the PTE in the page cache. Use xas_find_conflict() instead which turns out to simplify both get_unlocked_entry() and grab_mapping_entry(). Also remove a WARN_ON_ONCE from grab_mapping_entry() as it will have already triggered in get_unlocked_entry(). Fixes: cfc93c6c6c96 ("dax: Convert dax_insert_pfn_mkwrite to XArray") Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-11-17dax: Fix dax_unlock_mapping_entry for PMD pagesMatthew Wilcox1-9/+8
Device DAX PMD pages do not set the PageHead bit for compound pages. Fix for now by retrieving the PMD bit from the entry, but eventually we will be passed the page size by the caller. Reported-by: Dan Williams <dan.j.williams@intel.com> Fixes: 9f32d221301c ("dax: Convert dax_lock_mapping_entry to XArray") Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-11-16dax: Reinstate RCU protection of inodeMatthew Wilcox1-3/+19
For the device-dax case, it is possible that the inode can go away underneath us. The rcu_read_lock() was there to prevent it from being freed, and not (as I thought) to protect the tree. Bring back the rcu_read_lock() protection. Also add a little kernel-doc; while this function is not exported to modules, it is used from outside dax.c Reported-by: Dan Williams <dan.j.williams@intel.com> Fixes: 9f32d221301c ("dax: Convert dax_lock_mapping_entry to XArray") Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-11-16dax: Make sure the unlocking entry isn't lockedMatthew Wilcox1-0/+1
I wrote the semantics in the commit message, but didn't document it in the source code. Use a BUG_ON instead (if any code does do this, it's really buggy; we can't recover and it's worth taking the machine down). Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-11-16dax: Remove optimisation from dax_lock_mapping_entryMatthew Wilcox1-5/+2
Skipping some of the revalidation after we sleep can lead to returning a mapping which has already been freed. Just drop this optimisation. Reported-by: Dan Williams <dan.j.williams@intel.com> Fixes: 9f32d221301c ("dax: Convert dax_lock_mapping_entry to XArray") Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-11-16Merge tag 'fsnotify_for_v4.20-rc3' of ↵Linus Torvalds2-7/+10
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify fix from Jan Kara: "One small fsnotify fix for duplicate events" * tag 'fsnotify_for_v4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: fix handling of events on child sub-directory
2018-11-16Merge tag 'gfs2-4.20.fixes3' of ↵Linus Torvalds2-28/+29
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull bfs2 fixes from Andreas Gruenbacher: "Fix two bugs leading to leaked buffer head references: - gfs2: Put bitmap buffers in put_super - gfs2: Fix iomap buffer head reference counting bug And one bug leading to significant slow-downs when deleting large files: - gfs2: Fix metadata read-ahead during truncate (2)" * tag 'gfs2-4.20.fixes3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Fix iomap buffer head reference counting bug gfs2: Fix metadata read-ahead during truncate (2) gfs2: Put bitmap buffers in put_super
2018-11-16gfs2: Fix iomap buffer head reference counting bugAndreas Gruenbacher1-23/+17
GFS2 passes the inode buffer head (dibh) from gfs2_iomap_begin to gfs2_iomap_end in iomap->private. It sets that private pointer in gfs2_iomap_get. Users of gfs2_iomap_get other than gfs2_iomap_begin would have to release iomap->private, but this isn't done correctly, leading to a leak of buffer head references. To fix this, move the code for setting iomap->private from gfs2_iomap_get to gfs2_iomap_begin. Fixes: 64bc06bb32 ("gfs2: iomap buffered write support") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-16Merge tag 'fuse-fixes-4.20-rc3' of ↵Linus Torvalds2-5/+15
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "A couple of fixes, all bound for -stable (i.e. not regressions in this cycle)" * tag 'fuse-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix use-after-free in fuse_direct_IO() fuse: fix possibly missed wake-up after abort fuse: fix leaked notify reply
2018-11-15rxrpc: Fix life checkDavid Howells1-1/+10
The life-checking function, which is used by kAFS to make sure that a call is still live in the event of a pending signal, only samples the received packet serial number counter; it doesn't actually provoke a change in the counter, rather relying on the server to happen to give us a packet in the time window. Fix this by adding a function to force a ping to be transmitted. kAFS then keeps track of whether there's been a stall, and if so, uses the new function to ping the server, resetting the timeout to allow the reply to come back. If there's a stall, a ping and the call is *still* stalled in the same place after another period, then the call will be aborted. Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals") Fixes: f4d15fb6f99a ("rxrpc: Provide functions for allowing cleaner handling of signals") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-15Merge tag 'nfs-for-4.20-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds3-8/+17
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable fixes: - Don't exit the NFSv4 state manager without clearing NFS4CLNT_MANAGER_RUNNING Bugfixes: - Fix an Oops when destroying the RPCSEC_GSS credential cache - Fix an Oops during delegation callbacks - Ensure that the NFSv4 state manager exits the loop on SIGKILL - Fix a bogus get/put in generic_key_to_expire()" * tag 'nfs-for-4.20-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4: Fix an Oops during delegation callbacks SUNRPC: Fix a bogus get/put in generic_key_to_expire() SUNRPC: Fix a Oops when destroying the RPCSEC_GSS credential cache NFSv4: Ensure that the state manager exits the loop on SIGKILL NFSv4: Don't exit the state manager without clearing NFS4CLNT_MANAGER_RUNNING
2018-11-14Merge tag 'nfsd-4.20-1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds1-0/+3
Pull nfsd fixes from Bruce Fields: "Three nfsd bugfixes. None are new bugs, but they all take a little effort to hit, which might explain why they weren't found sooner" * tag 'nfsd-4.20-1' of git://linux-nfs.org/~bfields/linux: SUNRPC: drop pointless static qualifier in xdr_get_next_encode_buffer() nfsd: COPY and CLONE operations require the saved filehandle to be set sunrpc: correct the computation for page_ptr when truncating
2018-11-14Merge branch 'for-linus' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace fix from Eric Biederman: "Benjamin Coddington noticed an unkillable busy loop in the kernel that anyone who is sufficiently motivated can trigger. This bug did not exist in earlier kernels making this bug a regression. I have tested the change personally and confirmed that the bug exists and that the fix works. This fix has been picked up by linux-next and hopefully the automated testing bots and no problems have been reported from those sources. Ordinarily I would let something like this sit a little longer but I am going to be away at Linux Plumbers the rest of this week and I am afraid if I don't send the pull request now this fix will get lost" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: mnt: fix __detach_mounts infinite loop
2018-11-13NFSv4: Fix an Oops during delegation callbacksTrond Myklebust2-4/+11
If the server sends a CB_GETATTR or a CB_RECALL while the filesystem is being unmounted, then we can Oops when releasing the inode in nfs4_callback_getattr() and nfs4_callback_recall(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-11-12NFSv4: Ensure that the state manager exits the loop on SIGKILLTrond Myklebust1-1/+1
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-11-12NFSv4: Don't exit the state manager without clearing NFS4CLNT_MANAGER_RUNNINGTrond Myklebust1-3/+5
If we exit the NFSv4 state manager due to a umount, then we can end up leaving the NFS4CLNT_MANAGER_RUNNING flag set. If another mount causes the nfs4_client to be rereferenced before it is destroyed, then we end up never being able to recover state. Fixes: 47c2199b6eb5 ("NFSv4.1: Ensure state manager thread dies on last ...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v4.15+
2018-11-12mnt: fix __detach_mounts infinite loopBenjamin Coddington1-3/+3
Since commit ff17fa561a04 ("d_invalidate(): unhash immediately") immediately unhashes the dentry, we'll never return the mountpoint in lookup_mountpoint(), which can lead to an unbreakable loop in d_invalidate(). I have reports of NFS clients getting into this condition after the server removes an export of an existing mount created through follow_automount(), but I suspect there are various other ways to produce this problem if we hunt down users of d_invalidate(). For example, it is possible to get into this state by using XFS' d_invalidate() call in xfs_vn_unlink(): truncate -s 100m img{1,2} mkfs.xfs -q -n version=ci img1 mkfs.xfs -q -n version=ci img2 mkdir -p /mnt/xfs mount img1 /mnt/xfs mkdir /mnt/xfs/sub1 mount img2 /mnt/xfs/sub1 cat > /mnt/xfs/sub1/foo & umount -l /mnt/xfs/sub1 mount img2 /mnt/xfs/sub1 mount --make-private /mnt/xfs mkdir /mnt/xfs/sub2 mount --move /mnt/xfs/sub1 /mnt/xfs/sub2 rmdir /mnt/xfs/sub1 Fix this by moving the check for an unlinked dentry out of the detach_mounts() path. Fixes: ff17fa561a04 ("d_invalidate(): unhash immediately") Cc: stable@vger.kernel.org Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2018-11-11Merge tag 'for-4.20-rc1-tag' of ↵Linus Torvalds8-57/+107
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Several fixes to recent release (4.19, fixes tagged for stable) and other fixes" * tag 'for-4.20-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix missing delayed iputs on unmount Btrfs: fix data corruption due to cloning of eof block Btrfs: fix infinite loop on inode eviction after deduplication of eof block Btrfs: fix deadlock on tree root leaf when finding free extent btrfs: avoid link error with CONFIG_NO_AUTO_INLINE btrfs: tree-checker: Fix misleading group system information Btrfs: fix missing data checksums after a ranged fsync (msync) btrfs: fix pinned underflow after transaction aborted Btrfs: fix cur_offset in the error case for nocow
2018-11-11Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds5-31/+51
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "A large number of ext4 bug fixes, mostly buffer and memory leaks on error return cleanup paths" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: missing !bh check in ext4_xattr_inode_write() ext4: fix buffer leak in __ext4_read_dirblock() on error path ext4: fix buffer leak in ext4_expand_extra_isize_ea() on error path ext4: fix buffer leak in ext4_xattr_move_to_block() on error path ext4: release bs.bh before re-using in ext4_xattr_block_find() ext4: fix buffer leak in ext4_xattr_get_block() on error path ext4: fix possible leak of s_journal_flag_rwsem in error path ext4: fix possible leak of sbi->s_group_desc_leak in error path ext4: remove unneeded brelse call in ext4_xattr_inode_update_ref() ext4: avoid possible double brelse() in add_new_gdb() on error path ext4: avoid buffer leak in ext4_orphan_add() after prior errors ext4: avoid buffer leak on shutdown in ext4_mark_iloc_dirty() ext4: fix possible inode leak in the retry loop of ext4_resize_fs() ext4: fix missing cleanup if ext4_alloc_flex_bg_array() fails while resizing ext4: add missing brelse() update_backups()'s error path ext4: add missing brelse() add_new_gdb_meta_bg()'s error path ext4: add missing brelse() in set_flexbg_block_bitmap()'s error path ext4: avoid potential extra brelse in setup_new_flex_group_blocks()
2018-11-10Merge branch 'for-linus' of ↵Linus Torvalds1-5/+17
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace fixes from Eric Biederman: "I believe all of these are simple obviously correct bug fixes. These fall into two groups: - Fixing the implementation of MNT_LOCKED which prevents lesser privileged users from seeing unders mounts created by more privileged users. - Fixing the extended uid and group mapping in user namespaces. As well as ensuring the code looks correct I have spot tested these changes as well and in my testing the fixes are working" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: mount: Prevent MNT_DETACH from disconnecting locked mounts mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts mount: Retest MNT_LOCKED in do_umount userns: also map extents in the reverse map to kernel IDs
2018-11-09Merge tag 'ceph-for-4.20-rc2' of https://github.com/ceph/ceph-clientLinus Torvalds3-12/+14
Pull Ceph fixes from Ilya Dryomov: "Two CephFS fixes (copy_file_range and quota) and a small feature bit cleanup" * tag 'ceph-for-4.20-rc2' of https://github.com/ceph/ceph-client: libceph: assume argonaut on the server side ceph: quota: fix null pointer dereference in quota check ceph: add destination file data sync before doing any remote copy
2018-11-09ext4: missing !bh check in ext4_xattr_inode_write()Vasily Averin1-0/+6
According to Ted Ts'o ext4_getblk() called in ext4_xattr_inode_write() should not return bh = NULL The only time that bh could be NULL, then, would be in the case of something really going wrong; a programming error elsewhere (perhaps a wild pointer dereference) or I/O error causing on-disk file system corruption (although that would be highly unlikely given that we had *just* allocated the blocks and so the metadata blocks in question probably would still be in the cache). Fixes: e50e5129f384 ("ext4: xattr-in-inode support") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org # 4.13