path: root/fs
Age | Commit message | Author | Files | Lines
2016-03-04 | Merge branch 'for-linus2' of git://git.kernel.dk/linux-block | Linus Torvalds | 2 | -13/+42
Pull block fixes from Jens Axboe:
 "Round 2 of this. I cut back to the bare necessities, the patch is still larger than it usually would be at this time, due to the number of NVMe fixes in there.
  This pull request contains:
   - The 4 core fixes from Ming, that fix both problems with exceeding the virtual boundary limit in case of merging, and the gap checking for cloned bio's.
   - NVMe fixes from Keith and Christoph:
       - Regression on larger user commands, causing problems with reading log pages (for instance). This touches both NVMe, and the block core since that is now generally utilized also for these types of commands.
       - Hot removal fixes.
       - User exploitable issue with passthrough IO commands, if !length is given, causing us to fault on writing to the zero page.
       - Fix for a hang under error conditions
   - And finally, the current series regression for umount with cgroup writeback, where the final flush would happen async and hence open up window after umount where the device wasn't consistent. fsck right after umount would show this. From Tejun"
* 'for-linus2' of git://git.kernel.dk/linux-block:
  block: support large requests in blk_rq_map_user_iov
  block: fix blk_rq_get_max_sectors for driver private requests
  nvme: fix max_segments integer truncation
  nvme: set queue limits for the admin queue
  writeback: flush inode cgroup wb switches instead of pinning super_block
  NVMe: Fix 0-length integrity payload
  NVMe: Don't allow unsupported flags
  NVMe: Move error handling to failed reset handler
  NVMe: Simplify device reset failure
  NVMe: Fix namespace removal deadlock
  NVMe: Use IDA for namespace disk naming
  NVMe: Don't unmap controller registers on reset
  block: merge: get the 1st and last bvec via helpers
  block: get the 1st and last bvec via helpers
  block: check virt boundary in bio_will_gap()
  block: bio: introduce helpers to get the 1st and last bvec
2016-03-04 | Merge tag 'for-linus-20160304' of git://git.infradead.org/linux-mtd | Linus Torvalds | 5 | -51/+91
Pull jffs2 fixes from David Woodhouse:
 "This contains two important JFFS2 fixes marked for stable:
   - a lock ordering problem between the page lock and the internal f->sem mutex, which was causing occasional deadlocks in garbage collection
   - a scan failure causing moved directories to sometimes end up appearing to have hard links.
  There are also a couple of trivial MAINTAINERS file updates"
* tag 'for-linus-20160304' of git://git.infradead.org/linux-mtd:
  MAINTAINERS: add maintainer entry for FREESCALE GPMI NAND driver
  Fix directory hardlinks from deleted directories
  jffs2: Fix page lock / f->sem deadlock
  Revert "jffs2: Fix lock acquisition order bug in jffs2_write_begin"
  MAINTAINERS: update Han's email
2016-03-04 | Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs | Linus Torvalds | 1 | -1/+9
Pull btrfs fix from Chris Mason:
 "Filipe nailed down a problem where tree log replay would do some work that orphan code wasn't expecting to be done yet, leading to BUG_ON"
* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix loading of orphan roots leading to BUG_ON
2016-03-03 | Btrfs: fix loading of orphan roots leading to BUG_ON | Filipe Manana | 1 | -1/+9
When looking for orphan roots during mount we can end up hitting a BUG_ON() (at root-tree.c:btrfs_find_orphan_roots()) if a log tree is replayed and qgroups are enabled. This is because after a log tree is replayed, a transaction commit is made, which triggers qgroup extent accounting, which in turn does backref walking, which ends up reading and inserting all roots in the radix tree fs_info->fs_root_radix, including orphan roots (deleted snapshots). So after the log tree is replayed, when finding orphan roots we hit the BUG_ON with the following trace:

  [118209.182438] ------------[ cut here ]------------
  [118209.183279] kernel BUG at fs/btrfs/root-tree.c:314!
  [118209.184074] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  [118209.185123] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic ppdev xor raid6_pq evdev sg parport_pc parport acpi_cpufreq tpm_tis tpm psmouse processor i2c_piix4 serio_raw pcspkr i2c_core button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
  [118209.186318] CPU: 14 PID: 28428 Comm: mount Tainted: G W 4.5.0-rc5-btrfs-next-24+ #1
  [118209.186318] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
  [118209.186318] task: ffff8801ec131040 ti: ffff8800af34c000 task.ti: ffff8800af34c000
  [118209.186318] RIP: 0010:[<ffffffffa04237d7>] [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
  [118209.186318] RSP: 0018:ffff8800af34faa8 EFLAGS: 00010246
  [118209.186318] RAX: 00000000ffffffef RBX: 00000000ffffffef RCX: 0000000000000001
  [118209.186318] RDX: 0000000080000000 RSI: 0000000000000001 RDI: 00000000ffffffff
  [118209.186318] RBP: ffff8800af34fb08 R08: 0000000000000001 R09: 0000000000000000
  [118209.186318] R10: ffff8800af34f9f0 R11: 6db6db6db6db6db7 R12: ffff880171b97000
  [118209.186318] R13: ffff8801ca9d65e0 R14: ffff8800afa2e000 R15: 0000160000000000
  [118209.186318] FS: 00007f5bcb914840(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
  [118209.186318] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  [118209.186318] CR2: 00007f5bcaceb5d9 CR3: 00000000b49b5000 CR4: 00000000000006e0
  [118209.186318] Stack:
  [118209.186318] fffffbffffffffff 010230ffffffffff 0101000000000000 ff84000000000000
  [118209.186318] fbffffffffffffff 30ffffffffffffff 0000000000000101 ffff880082348000
  [118209.186318] 0000000000000000 ffff8800afa2e000 ffff8800afa2e000 0000000000000000
  [118209.186318] Call Trace:
  [118209.186318] [<ffffffffa042e2db>] open_ctree+0x1e37/0x21b9 [btrfs]
  [118209.186318] [<ffffffffa040a753>] btrfs_mount+0x97e/0xaed [btrfs]
  [118209.186318] [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
  [118209.186318] [<ffffffff8117b87e>] mount_fs+0x67/0x131
  [118209.186318] [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
  [118209.186318] [<ffffffffa0409f81>] btrfs_mount+0x1ac/0xaed [btrfs]
  [118209.186318] [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
  [118209.186318] [<ffffffff8108c26b>] ? lockdep_init_map+0xb9/0x1b3
  [118209.186318] [<ffffffff8117b87e>] mount_fs+0x67/0x131
  [118209.186318] [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
  [118209.186318] [<ffffffff81195637>] do_mount+0x8a6/0x9e8
  [118209.186318] [<ffffffff8119598d>] SyS_mount+0x77/0x9f
  [118209.186318] [<ffffffff81493017>] entry_SYSCALL_64_fastpath+0x12/0x6b
  [118209.186318] Code: 64 00 00 85 c0 89 c3 75 24 f0 41 80 4c 24 20 20 49 8b bc 24 f0 01 00 00 4c 89 e6 e8 e8 65 00 00 85 c0 89 c3 74 11 83 f8 ef 75 02 <0f> 0b 4c 89 e7 e8 da 72 00 00 eb 1c 41 83 bc 24 00 01 00 00 00
  [118209.186318] RIP [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
  [118209.186318] RSP <ffff8800af34faa8>
  [118209.230735] ---[ end trace 83938f987d85d477 ]---

So fix this by not treating the error -EEXIST, returned when attempting to insert a root already inserted by the backref walking code, as an error.
The following test case for xfstests reproduces the bug:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      cd /
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_dm_target flakey
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  _run_btrfs_util_prog quota enable $SCRATCH_MNT

  # Create 2 directories with one file in one of them.
  # We use these just to trigger a transaction commit later, moving the file from
  # directory a to directory b and doing an fsync against directory a.
  mkdir $SCRATCH_MNT/a
  mkdir $SCRATCH_MNT/b
  touch $SCRATCH_MNT/a/f
  sync

  # Create our test file with 2 4K extents.
  $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io

  # Create a snapshot and delete it. This doesn't really delete the snapshot
  # immediately, just makes it inaccessible and invisible to user space, the
  # snapshot is deleted later by a dedicated kernel thread (cleaner kthread)
  # which is woken up at the next transaction commit.
  # A root orphan item is inserted into the tree of tree roots, so that if a
  # power failure happens before the dedicated kernel thread does the snapshot
  # deletion, the next time the filesystem is mounted it resumes the snapshot
  # deletion.
  _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap
  _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap

  # Now overwrite half of the extents we wrote before. Because we made a snapshot
  # before, which isn't really deleted yet (since no transaction commit happened
  # after we did the snapshot delete request), the non overwritten extents get
  # referenced twice, once by the default subvolume and once by the snapshot.
  $XFS_IO_PROG -c "pwrite -S 0xbb 4K 8K" $SCRATCH_MNT/foobar | _filter_xfs_io

  # Now move file f from directory a to directory b and fsync directory a.
  # The fsync on the directory a triggers a transaction commit (because a file
  # was moved from it to another directory) and the file fsync leaves a log tree
  # with file extent items to replay.
  mv $SCRATCH_MNT/a/f $SCRATCH_MNT/a/b
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar

  echo "File digest before power failure:"
  md5sum $SCRATCH_MNT/foobar | _filter_scratch

  # Now simulate a power failure and mount the filesystem to replay the log tree.
  # After the log tree was replayed, we used to hit a BUG_ON() when processing
  # the root orphan item for the deleted snapshot. This is because when processing
  # an orphan root the code expected to be the first code inserting the root into
  # the fs_info->fs_root_radix radix tree, while in reality it was the second
  # caller attempting to do it - the first caller was the transaction commit that
  # took place after replaying the log tree, when updating the qgroup counters.
  _flakey_drop_and_remount

  echo "File digest after power failure:"
  # Must match what we got before the power failure.
  md5sum $SCRATCH_MNT/foobar | _filter_scratch

  _unmount_flakey
  status=0
  exit

Fixes: 2d9e97761087 ("Btrfs: use btrfs_get_fs_root in resolve_indirect_ref")
Cc: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
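The fix in this commit amounts to tolerating -EEXIST when another code path has already inserted the root. A minimal userspace sketch of that pattern in C, not the btrfs code itself: the `root_table` type and the `insert_root()` and `load_orphan_root()` helpers are hypothetical names, and a small array stands in for the fs_info->fs_root_radix radix tree.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_ROOTS 16

/* Hypothetical stand-in for the radix tree of already-loaded roots. */
struct root_table {
	unsigned long ids[MAX_ROOTS];
	size_t count;
};

/* Insert a root id: 0 on success, -EEXIST if present, -ENOMEM if full. */
static int insert_root(struct root_table *t, unsigned long id)
{
	assert(t != NULL);
	for (size_t i = 0; i < t->count; i++)
		if (t->ids[i] == id)
			return -EEXIST;
	if (t->count == MAX_ROOTS)
		return -ENOMEM;
	t->ids[t->count++] = id;
	return 0;
}

/*
 * Orphan processing: an earlier code path (in the commit, the backref
 * walk) may already have inserted the root, so -EEXIST counts as
 * success instead of being treated as a fatal condition.
 */
static int load_orphan_root(struct root_table *t, unsigned long id)
{
	int ret = insert_root(t, id);

	if (ret == -EEXIST)
		ret = 0;
	return ret;
}
```

Any other error, such as -ENOMEM in this sketch, is still passed back to the caller.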
2016-03-03 | writeback: flush inode cgroup wb switches instead of pinning super_block | Tejun Heo | 2 | -13/+42
If cgroup writeback is in use, inodes can be scheduled for asynchronous wb switching. Before 5ff8eaac1636 ("writeback: keep superblock pinned during cgroup writeback association switches"), this could race with umount leading to super_block being destroyed while inodes are pinned for wb switching. 5ff8eaac1636 fixed it by bumping s_active while wb switches are in flight; however, this allowed in-flight wb switches to make umounts asynchronous when the userland expected synchronicity - e.g. fsck immediately following umount may fail because the device is still busy.

This patch removes the problematic super_block pinning and instead makes generic_shutdown_super() flush in-flight wb switches. wb switches are now executed on a dedicated isw_wq so that they can be flushed and isw_nr_in_flight keeps track of the number of in-flight wb switches so that flushing can be avoided in most cases.

v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE check in inode_switch_wbs() as Jan and Al suggested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tahsin Erdogan <tahsin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: 5ff8eaac1636 ("writeback: keep superblock pinned during cgroup writeback association switches")
Cc: stable@vger.kernel.org #v4.5
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
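The in-flight counter idea from this commit can be sketched in userspace C as follows. This is an illustration under assumptions, not the kernel implementation: `nr_in_flight` only models the role the message gives to isw_nr_in_flight, and `switch_queue()`, `switch_finish()` and `umount_needs_flush()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: switches that were queued but have not finished. */
static atomic_int nr_in_flight;

/* Called when a switch is scheduled on the work queue. */
static void switch_queue(void)
{
	atomic_fetch_add(&nr_in_flight, 1);
}

/* Called when a scheduled switch has completed. */
static void switch_finish(void)
{
	int prev = atomic_fetch_sub(&nr_in_flight, 1);

	assert(prev > 0);	/* every finish must match an earlier queue */
}

/*
 * Unmount path: flushing the work queue is only needed while at least
 * one switch is outstanding, so the common case skips the flush.
 */
static bool umount_needs_flush(void)
{
	return atomic_load(&nr_in_flight) != 0;
}
```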
2016-03-02 | Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 | Linus Torvalds | 4 | -22/+36
Pull cifs fixes from Steve French:
 "Various small CIFS/SMB3 fixes for stable:
  Fixes address oops that can occur when accessing Macs with SMB3, and another problem found to Samba when read responses queued (e.g. with gluster under Samba)"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: Fix duplicate line introduced by clone_file_range patch
  Fix cifs_uniqueid_to_ino_t() function for s390x
  CIFS: Fix SMB2+ interim response processing for read requests
  cifs: fix out-of-bounds access in lease parsing
2016-03-02 | userfaultfd: don't block on the last VM updates at exit time | Linus Torvalds | 1 | -0/+6
The exit path will do some final updates to the VM of an exiting process to inform others of the fact that the process is going away.

That happens, for example, for robust futex state cleanup, but also if the parent has asked for a TID update when the process exits (we clear the child tid field in user space).

However, at the time we do those final VM accesses, we've already stopped accepting signals, so the usual "stop waiting for userfaults on signal" code in fs/userfaultfd.c no longer works, and the process can become an unkillable zombie waiting for something that will never happen.

To solve this, just make handle_userfault() abort any user fault handling if we're already in the exit path past the signal handling state being dead (marked by PF_EXITING).

This VM special case is pretty ugly, and it is possible that we should look at finalizing signals later (or move the VM final accesses earlier). But in the meantime this is a fairly minimally intrusive fix.

Reported-and-tested-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
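The check this commit describes can be modelled in a few lines of userspace C. This is a sketch only: `struct task`, `TASK_FLAG_EXITING` and `may_wait_for_userfault()` are hypothetical stand-ins for the kernel's task structure, PF_EXITING and the test added to handle_userfault().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit standing in for the kernel's PF_EXITING. */
#define TASK_FLAG_EXITING 0x4u

struct task {
	unsigned int flags;
};

/*
 * Decide whether a fault may sleep waiting for a user-space handler.
 * An exiting task no longer takes signals, so nothing could interrupt
 * the wait; in that case the fault handling is aborted instead.
 */
static bool may_wait_for_userfault(const struct task *t)
{
	assert(t != NULL);
	return !(t->flags & TASK_FLAG_EXITING);
}
```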
2016-03-01 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds | 1 | -15/+5
Pull d_inode/d_flags race fix from Al Viro.

I love this fix. Not only does it fix the race in the dentry type handling, it entirely gets rid of the nasty and subtle memory ordering rules for d_type and d_inode, and replaces them with the basic dentry locking rules (sequence numbers under RCU, d_lock elsewhere).

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  use ->d_seq to get coherency between ->d_inode and ->d_flags
2016-03-01 | CIFS: Fix duplicate line introduced by clone_file_range patch | Steve French | 1 | -1/+0
Commit 04b38d601239b4 ("vfs: pull btrfs clone API to vfs layer") added a duplicated line (in cifsfs.c) which causes a sparse compile warning.

Signed-off-by: Steve French <steve.french@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-02-29 | use ->d_seq to get coherency between ->d_inode and ->d_flags | Al Viro | 1 | -15/+5
Games with ordering and barriers are way too brittle. Just bump ->d_seq before and after updating ->d_inode and ->d_flags type bits, so that verifying ->d_seq would guarantee they are coherent.

Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
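The "bump the sequence count before and after the update" pattern can be sketched with C11 atomics. This is a simplified userspace model, not the dcache code: `struct entry`, `entry_update()` and `entry_read()` are hypothetical names, the fields only stand in for ->d_seq, ->d_inode and the ->d_flags type bits, and writers are assumed to be serialized by some outer lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct entry {
	atomic_uint seq;	/* even: stable, odd: update in progress */
	atomic_int inode;	/* stand-in for ->d_inode */
	atomic_uint flags;	/* stand-in for the ->d_flags type bits */
};

/* Writer side: callers are assumed to hold a lock excluding other writers. */
static void entry_update(struct entry *e, int inode, unsigned int flags)
{
	atomic_fetch_add(&e->seq, 1);	/* count becomes odd */
	atomic_store(&e->inode, inode);
	atomic_store(&e->flags, flags);
	atomic_fetch_add(&e->seq, 1);	/* count becomes even again */
}

/*
 * Reader side: returns true only if both fields were sampled with no
 * update in between, so the pair is coherent. On false the outputs
 * must be ignored and the caller retries or falls back to locking.
 */
static bool entry_read(struct entry *e, int *inode, unsigned int *flags)
{
	unsigned int start;

	assert(e != NULL && inode != NULL && flags != NULL);
	start = atomic_load(&e->seq);
	if (start & 1)
		return false;		/* a writer is in the middle of an update */
	*inode = atomic_load(&e->inode);
	*flags = atomic_load(&e->flags);
	return atomic_load(&e->seq) == start;
}
```

The sketch uses sequentially consistent atomics throughout for simplicity; a tuned implementation would use weaker orderings with explicit barriers.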
2016-02-29 | Fix cifs_uniqueid_to_ino_t() function for s390x | Yadan Fan | 1 | -8/+4
This issue is caused by commit 02323db17e3a7 ("cifs: fix cifs_uniqueid_to_ino_t not to ever return 0"). When BITS_PER_LONG is 64 on s390x, the corresponding cifs_uniqueid_to_ino_t() function casts the 64-bit fileid to 32 bits with (ino_t)fileid, because ino_t (a typedef of __kernel_ino_t) is an unsigned int there.

It's defined in arch/s390/include/uapi/asm/posix_types.h

    #ifndef __s390x__
    typedef unsigned long __kernel_ino_t;
    ...
    #else /* __s390x__ */
    typedef unsigned int __kernel_ino_t;

So the #ifdef condition is wrong for s390x. Instead, we can use a single cifs_uniqueid_to_ino_t() function that compares sizeof(ino_t) with sizeof(u64) to choose the correct path.

Signed-off-by: Yadan Fan <ydfan@suse.com>
CC: stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
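A size-based check of the kind this commit describes can be sketched as follows. This is an assumption-laden illustration rather than the CIFS function: `small_ino_t` is a hypothetical 32-bit inode type, `uniqueid_to_ino()` is a made-up name, and the XOR fold is just one simple way to mix the high bits in.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit inode type, matching what the message says about s390x. */
typedef uint32_t small_ino_t;

/* Map a 64-bit server file id to an inode number that is never 0. */
static small_ino_t uniqueid_to_ino(uint64_t fileid)
{
	small_ino_t ino = (small_ino_t)fileid;

	/*
	 * Decide by type size, not by BITS_PER_LONG: fold the high half
	 * in only when the inode type cannot hold the full 64-bit id.
	 */
	if (sizeof(small_ino_t) < sizeof(uint64_t))
		ino ^= (small_ino_t)(fileid >> 32);
	if (ino == 0)
		ino = 1;	/* keep the "never return 0" property */
	assert(ino != 0);
	return ino;
}
```

Because `sizeof` is a compile-time constant, the compiler can drop the unused branch, so one function covers both the narrow and the wide inode type without an #ifdef.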
2016-02-29 | CIFS: Fix SMB2+ interim response processing for read requests | Pavel Shilovsky | 1 | -3/+18
For interim responses we only need to parse the header and update the number of credits. This is currently done for all SMB2+ commands except SMB2_READ, which is wrong. Fix this by adding such processing.

Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Tested-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-02-29 | cifs: fix out-of-bounds access in lease parsing | Justin Maggard | 1 | -10/+14
When opening a file, SMB2_open() attempts to parse the lease state from the SMB2 CREATE Response. However, the parsing code was not careful to ensure that the create contexts are not empty or invalid, which can lead to out-of-bounds memory access. This can be seen easily by trying to read a file from an OS X 10.11 SMB3 server. Here is sample crash output:

  BUG: unable to handle kernel paging request at ffff8800a1a77cc6
  IP: [<ffffffff8828a734>] SMB2_open+0x804/0x960
  PGD 8f77067 PUD 0
  Oops: 0000 [#1] SMP
  Modules linked in:
  CPU: 3 PID: 2876 Comm: cp Not tainted 4.5.0-rc3.x86_64.1+ #14
  Hardware name: NETGEAR ReadyNAS 314 /ReadyNAS 314 , BIOS 4.6.5 10/11/2012
  task: ffff880073cdc080 ti: ffff88005b31c000 task.ti: ffff88005b31c000
  RIP: 0010:[<ffffffff8828a734>] [<ffffffff8828a734>] SMB2_open+0x804/0x960
  RSP: 0018:ffff88005b31fa08 EFLAGS: 00010282
  RAX: 0000000000000015 RBX: 0000000000000000 RCX: 0000000000000006
  RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff88007eb8c8b0
  RBP: ffff88005b31fad8 R08: 666666203d206363 R09: 6131613030383866
  R10: 3030383866666666 R11: 00000000000002b0 R12: ffff8800660fd800
  R13: ffff8800a1a77cc2 R14: 00000000424d53fe R15: ffff88005f5a28c0
  FS: 00007f7c8a2897c0(0000) GS:ffff88007eb80000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: ffff8800a1a77cc6 CR3: 000000005b281000 CR4: 00000000000006e0
  Stack:
   ffff88005b31fa70 ffffffff88278789 00000000000001d3 ffff88005f5a2a80
   ffffffff00000003 ffff88005d029d00 ffff88006fde05a0 0000000000000000
   ffff88005b31fc78 ffff88006fde0780 ffff88005b31fb2f 0000000100000fe0
  Call Trace:
   [<ffffffff88278789>] ? cifsConvertToUTF16+0x159/0x2d0
   [<ffffffff8828cf68>] smb2_open_file+0x98/0x210
   [<ffffffff8811e80c>] ? __kmalloc+0x1c/0xe0
   [<ffffffff882685f4>] cifs_open+0x2a4/0x720
   [<ffffffff88122cef>] do_dentry_open+0x1ff/0x310
   [<ffffffff88268350>] ? cifsFileInfo_get+0x30/0x30
   [<ffffffff88123d92>] vfs_open+0x52/0x60
   [<ffffffff88131dd0>] path_openat+0x170/0xf70
   [<ffffffff88097d48>] ? remove_wait_queue+0x48/0x50
   [<ffffffff88133a29>] do_filp_open+0x79/0xd0
   [<ffffffff8813f2ca>] ? __alloc_fd+0x3a/0x170
   [<ffffffff881240c4>] do_sys_open+0x114/0x1e0
   [<ffffffff881241a9>] SyS_open+0x19/0x20
   [<ffffffff8896e257>] entry_SYSCALL_64_fastpath+0x12/0x6a
  Code: 4d 8d 6c 07 04 31 c0 4c 89 ee e8 47 6f e5 ff 31 c9 41 89 ce 44 89 f1 48 c7 c7 28 b1 bd 88 31 c0 49 01 cd 4c 89 ee e8 2b 6f e5 ff <45> 0f b7 75 04 48 c7 c7 31 b1 bd 88 31 c0 4d 01 ee 4c 89 f6 e8
  RIP [<ffffffff8828a734>] SMB2_open+0x804/0x960
   RSP <ffff88005b31fa08>
  CR2: ffff8800a1a77cc6
  ---[ end trace d9f69ba64feee469 ]---

Signed-off-by: Justin Maggard <jmaggard@netgear.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
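The kind of defensive parsing this commit calls for can be illustrated with a bounds-checked walk over a chain of variable-length records. This is a sketch under assumptions and does not reproduce the SMB2 create-context layout: `struct ctx_hdr`, its two fields and `find_ctx()` are hypothetical, and host byte order is used for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header: byte offset to the next record, 0 = last. */
struct ctx_hdr {
	uint32_t next;
	uint32_t tag;
};

/*
 * Search buf[0..len) for a record with the given tag. Returns its
 * offset, or -1 if the chain is empty, truncated, malformed, or does
 * not contain the tag. No byte outside the buffer is ever read.
 */
static long find_ctx(const uint8_t *buf, size_t len, uint32_t tag)
{
	size_t off = 0;

	assert(buf != NULL || len == 0);
	/* off never exceeds len, so len - off cannot wrap around. */
	while (len - off >= sizeof(struct ctx_hdr)) {
		struct ctx_hdr h;

		memcpy(&h, buf + off, sizeof(h));	/* safe for unaligned data */
		if (h.tag == tag)
			return (long)off;
		/* A zero, too-small, or out-of-range step ends the walk. */
		if (h.next < sizeof(h) || h.next > len - off)
			return -1;
		off += h.next;
	}
	return -1;
}
```

An empty buffer never enters the loop, which corresponds to the "create contexts are empty" case in the commit message.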
2016-02-27 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds | 3 | -37/+22
Pull vfs fixes from Al Viro.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  do_last(): ELOOP failure exit should be done after leaving RCU mode
  should_follow_link(): validate ->d_seq after having decided to follow
  namei: ->d_inode of a pinned dentry is stable only for positives
  do_last(): don't let a bogus return value from ->open() et.al. to confuse us
  fs: return -EOPNOTSUPP if clone is not supported
  hpfs: don't truncate the file when delete fails
2016-02-27 | do_last(): ELOOP failure exit should be done after leaving RCU mode | Al Viro | 1 | -5/+4
... or we risk seeing a bogus value of d_is_symlink() there.

Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27 | should_follow_link(): validate ->d_seq after having decided to follow | Al Viro | 1 | -0/+5
... otherwise d_is_symlink() above might have nothing to do with the inode value we've got.

Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27 | namei: ->d_inode of a pinned dentry is stable only for positives | Al Viro | 1 | -2/+2
Both do_last() and walk_component() risk picking a NULL inode out of a dentry about to become positive, *then* checking its flags and seeing that it's not negative anymore and using the (already stale by then) value they'd fetched earlier. Usually ends up oopsing soon after that...

Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27 | do_last(): don't let a bogus return value from ->open() et.al. to confuse us | Al Viro | 1 | -0/+4
... into returning a positive to path_openat(), which would interpret that as "symlink had been encountered" and proceed to corrupt memory, etc. It can only happen due to a bug in some ->open() instance or in some LSM hook, etc., so we report any such event *and* make sure it doesn't trick us into further unpleasantness.

Cc: stable@vger.kernel.org # v3.6+, at least
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
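The "report it and neutralize it" approach from this commit can be sketched in userspace C. The helper name `sanitize_open_result()` is hypothetical, and the choice of -EINVAL as the replacement error is an assumption made for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Callers expect 0 or a negative errno from an open hook; a positive
 * value carries a special meaning further up the call chain. Report a
 * positive value and turn it into an ordinary error so it cannot be
 * misinterpreted.
 */
static int sanitize_open_result(int error)
{
	if (error > 0) {
		fprintf(stderr, "bogus positive return %d from open hook\n",
			error);
		error = -EINVAL;
	}
	assert(error <= 0);	/* postcondition relied on by callers */
	return error;
}
```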
2016-02-27 | fs: return -EOPNOTSUPP if clone is not supported | Christoph Hellwig | 1 | -2/+4
-EBADF is a rather confusing error if an operation is not supported, and nfsd gets rather upset about it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27 | hpfs: don't truncate the file when delete fails | Mikulas Patocka | 1 | -28/+3
The delete operation can allocate additional space on the HPFS filesystem due to btree split. The HPFS driver checks in advance if there is available space, so that it won't corrupt the btree if we run out of space during splitting.

If there is not enough available space, the HPFS driver attempted to truncate the file, but this results in a deadlock since the commit 7dd29d8d865efdb00c0542a5d2c87af8c52ea6c7 ("HPFS: Introduce a global mutex and lock it on every callback from VFS").

This patch removes the code that tries to truncate the file and -ENOSPC is returned instead. If the user hits -ENOSPC on delete, they should try to delete other files (that are stored in a leaf btree node), so that the delete operation will make some space for deleting the file stored in a non-leaf btree node.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: stable@vger.kernel.org # 2.6.39+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27 | Merge branch 'akpm' (patches from Andrew) | Linus Torvalds | 10 | -19/+61
Merge fixes from Andrew Morton:
 "10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  dax: move writeback calls into the filesystems
  dax: give DAX clearing code correct bdev
  ext4: online defrag not supported with DAX
  ext2, ext4: only set S_DAX for regular inodes
  block: disable block device DAX by default
  ocfs2: unlock inode if deleting inode from orphan fails
  mm: ASLR: use get_random_long()
  drivers: char: random: add get_random_long()
  mm: numa: quickly fail allocations for NUMA balancing on full nodes
  mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED
2016-02-27 | Merge tag 'tags/ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 | Linus Torvalds | 2 | -35/+3
Pull ext2/4 DAX fix from Ted Ts'o:
 "This fixes a file system corruption bug with DAX"
* tag 'tags/ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext2, ext4: fix issue with missing journal entry in ext4_dax_mkwrite()
2016-02-27 | ext2, ext4: fix issue with missing journal entry in ext4_dax_mkwrite() | Ross Zwisler | 2 | -35/+3
As it is currently written ext4_dax_mkwrite() assumes that the call into __dax_mkwrite() will not have to do a block allocation so it doesn't create a journal entry. For a read that creates a zero page to cover a hole followed by a write that actually allocates storage this is incorrect. The ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls get_blocks() to allocate storage.

Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault() as this function already has all the logic needed to allocate a journal entry and call __dax_fault().

Also update the ext2 fault handlers in this same way to remove duplicate code and keep the logic between ext2 and ext4 the same.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-27 | dax: move writeback calls into the filesystems | Ross Zwisler | 5 | -6/+35
Previously calls to dax_writeback_mapping_range() for all DAX filesystems (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().

dax_writeback_mapping_range() needs a struct block_device, and it used to get that from inode->i_sb->s_bdev. This is correct for normal inodes mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time files.

Instead, call dax_writeback_mapping_range() directly from the filesystem ->writepages function so that it can supply us with a valid block device. This also fixes DAX code to properly flush caches in response to sync(2).

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 | dax: give DAX clearing code correct bdev | Ross Zwisler | 5 | -9/+12
dax_clear_blocks() needs a valid struct block_device and previously it was using inode->i_sb->s_bdev in all cases. This is correct for normal inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time devices.

Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change its arguments to take a bdev and a sector instead of an inode and a block. This better reflects what the function does, and it allows the filesystem and raw block device code to pass in an appropriate struct block_device.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 | ext4: online defrag not supported with DAX | Ross Zwisler | 1 | -0/+5
Online defrag operations for ext4 are hard coded to use the page cache. See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page()

When combined with DAX I/O, which circumvents the page cache, this can result in data corruption. This was observed with xfstests ext4/307 and ext4/308.

Fix this by only allowing online defrag for non-DAX files.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 | ext2, ext4: only set S_DAX for regular inodes | Ross Zwisler | 2 | -2/+2
When S_DAX is set on an inode we assume that if there are pages attached to the mapping (mapping->nrpages != 0), those pages are clean zero pages that were used to service reads from holes. Any dirty data associated with the inode should be in the form of DAX exceptional entries (mapping->nrexceptional) that is written back via dax_writeback_mapping_range().

With the current code, though, this isn't always true. For example, ext2 and ext4 directory inodes can have S_DAX set, but have their dirty data stored as dirty page cache entries. For these types of inodes, having S_DAX set doesn't really make sense since their I/O doesn't actually happen through the DAX code path.

Instead, only allow S_DAX to be set for regular inodes for ext2 and ext4. This allows us to have strict DAX vs non-DAX paths in the writeback code.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 | block: disable block device DAX by default | Dan Williams | 1 | -1/+5
The recent *sync enabling discovered that we are inserting into the block_device pagecache counter to the expectations of the dirty data tracking for dax mappings. This can lead to data corruption.

We want to support DAX for block devices eventually, but it requires wider changes to properly manage the pagecache.

  dump_stack+0x85/0xc2
  dax_writeback_mapping_range+0x60/0xe0
  blkdev_writepages+0x3f/0x50
  do_writepages+0x21/0x30
  __filemap_fdatawrite_range+0xc6/0x100
  filemap_write_and_wait+0x4a/0xa0
  set_blocksize+0x70/0xd0
  sb_set_blocksize+0x1d/0x50
  ext4_fill_super+0x75b/0x3360
  mount_bdev+0x180/0x1b0
  ext4_mount+0x15/0x20
  mount_fs+0x38/0x170

Mark the support broken so it's disabled by default, but otherwise still available for testing.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 | ocfs2: unlock inode if deleting inode from orphan fails | Guozhonghua | 1 | -0/+1
When doing append direct io cleanup, if deleting the inode fails, the code bails out without unlocking the inode, which leaves the inode deadlocked.

This issue was introduced by commit cf1776a9e834 ("ocfs2: fix a tiny race when truncate dio orohaned entry").

Signed-off-by: Guozhonghua <guozhonghua@h3c.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 | mm: ASLR: use get_random_long() | Daniel Cashman | 1 | -1/+1
Replace calls to get_random_int() followed by a cast to (unsigned long) with calls to get_random_long(). Also address a shifting bug which, in the case of x86, removed the entropy mask for mmap_rnd_bits values > 31 bits.

Signed-off-by: Daniel Cashman <dcashman@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Kralevich <nnk@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
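The shifting problem mentioned in this commit comes from building a bit mask with an int-typed constant: `1 << bits` is undefined once `bits` reaches the width of int, so a mask for more than 31 bits has to be computed in a wider type. A small C sketch of a mask helper that avoids this, where `rnd_mask()` is a hypothetical name and not a kernel function:

```c
#include <assert.h>
#include <limits.h>

/*
 * Build a mask with the low `bits` bits set, computed in unsigned long
 * so that widths above 31 work on platforms with a 64-bit long.
 */
static unsigned long rnd_mask(unsigned int bits)
{
	/* Shifting by the full width of the type would be undefined. */
	assert(bits < sizeof(unsigned long) * CHAR_BIT);
	return (1UL << bits) - 1;
}
```

A random offset would then be limited with something like `value & rnd_mask(bits)`, where `value` comes from a source that returns a full unsigned long rather than an int.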
2016-02-25Fix directory hardlinks from deleted directoriesDavid Woodhouse2-19/+62
When a directory is deleted, we don't take too much care about killing off all the dirents that belong to it, on the basis that on remount, the scan will conclude that the directory is dead anyway. This doesn't work though, when the deleted directory contained a child directory which was moved *out*. In the early stages of the fs build we can then end up with an apparent hard link, with the child directory appearing both in its true location, and as a child of the original directory which at this stage of the mount process we don't *yet* know is defunct. To resolve this, take out the early special-casing of the "directories shall not have hard links" rule in jffs2_build_inode_pass1(), and let the normal nlink processing happen for directories as well as other inodes. Then later in the build process we can set ic->pino_nlink to the parent inode#, as is required for directories during normal operation, instead of the nlink. And complain only *then* about hard links which are still in evidence even after killing off all the unreachable paths. Reported-by: Liu Song <liu.song11@zte.com.cn> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Cc: stable@vger.kernel.org
2016-02-25jffs2: Fix page lock / f->sem deadlockDavid Woodhouse2-11/+11
With this fix, all code paths should now be obtaining the page lock before f->sem. Reported-by: Szabó Tamás <sztomi89@gmail.com> Tested-by: Thomas Betker <thomas.betker@rohde-schwarz.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Cc: stable@vger.kernel.org
2016-02-25Revert "jffs2: Fix lock acquisition order bug in jffs2_write_begin"Thomas Betker1-21/+18
This reverts commit 5ffd3412ae55 ("jffs2: Fix lock acquisition order bug in jffs2_write_begin"). The commit modified jffs2_write_begin() to remove a deadlock with jffs2_garbage_collect_live(), but this introduced new deadlocks found by multiple users. page_lock() actually has to be called before mutex_lock(&c->alloc_sem) or mutex_lock(&f->sem) because jffs2_write_end() and jffs2_readpage() are called with the page locked, and they acquire c->alloc_sem and f->sem, resp. In other words, the lock order in jffs2_write_begin() was correct, and it is the jffs2_garbage_collect_live() path that has to be changed. Revert the commit to get rid of the new deadlocks, and to clear the way for a better fix of the original deadlock. Reported-by: Deng Chao <deng.chao1@zte.com.cn> Reported-by: Ming Liu <liu.ming50@gmail.com> Reported-by: wangzaiwei <wangzaiwei@top-vision.cn> Signed-off-by: Thomas Betker <thomas.betker@rohde-schwarz.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Cc: stable@vger.kernel.org
2016-02-24Merge branch 'for-linus' of ↵Linus Torvalds4-9/+14
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Assorted fixes - xattr one from this cycle, the rest - stable fodder" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs/pnode.c: treat zero mnt_group_id-s as unequal affs_do_readpage_ofs(): just use kmap_atomic() around memcpy() xattr handlers: plug a lock leak in simple_xattr_list fs: allow no_seek_end_llseek to actually seek
2016-02-23Merge tag 'nfs-for-4.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds4-62/+126
Pull NFS client bugfixes from Trond Myklebust: "Stable bugfixes: - Fix nfs_size_to_loff_t - NFSv4: Fix a dentry leak on alias use Other bugfixes: - Don't schedule a layoutreturn if the layout segment can be freed immediately. - Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode - rpcrdma_bc_receive_call() should init rq_private_buf.len - fix stateid handling for the NFS v4.2 operations - pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page - fix panic in gss_pipe_downcall() in fips mode - Fix a race between layoutget and pnfs_destroy_layout - Fix a race between layoutget and bulk recalls" * tag 'nfs-for-4.5-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4.x/pnfs: Fix a race between layoutget and bulk recalls NFSv4.x/pnfs: Fix a race between layoutget and pnfs_destroy_layout auth_gss: fix panic in gss_pipe_downcall() in fips mode pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page nfs4: fix stateid handling for the NFS v4.2 operations NFSv4: Fix a dentry leak on alias use xprtrdma: rpcrdma_bc_receive_call() should init rq_private_buf.len pNFS: Always set NFS_LAYOUT_RETURN_REQUESTED with lo->plh_return_iomode pNFS: Fix pnfs_mark_matching_lsegs_return() nfs: fix nfs_size_to_loff_t
2016-02-22NFSv4.x/pnfs: Fix a race between layoutget and bulk recallsTrond Myklebust1-11/+6
Replace another case where the layout 'plh_block_lgets' can trigger infinite loops in send_layoutget(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-22NFSv4.x/pnfs: Fix a race between layoutget and pnfs_destroy_layoutTrond Myklebust1-2/+22
If the server reboots while there is a layoutget outstanding, then the call to pnfs_choose_layoutget_stateid() will fail with an EAGAIN error, which causes an infinite loop in send_layoutget(). The reason why we never break out of the loop is that the layout 'plh_block_lgets' field is never cleared. Fix is to replace plh_block_lgets with NFS_LAYOUT_INVALID_STID, which can be reset after a new layoutget. Fixes: ab7d763e477c5 ("pNFS: Ensure nfs4_layoutget_prepare returns...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-20Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds4-18/+101
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "This is unusually large, partly due to the EFI fixes that prevent accidental deletion of EFI variables through efivarfs that may brick machines. These fixes are somewhat involved to maintain compatibility with existing install methods and other usage modes, while trying to turn off the 'rm -rf' bricking vector. Other fixes are for large page ioremap()s and for non-temporal user-memcpy()s" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Fix vmalloc_fault() to handle large pages properly hpet: Drop stale URLs x86/uaccess/64: Handle the caching of 4-byte nocache copies properly in __copy_user_nocache() x86/uaccess/64: Make the __copy_user_nocache() assembly code more readable lib/ucs2_string: Correct ucs2 -> utf8 conversion efi: Add pstore variables to the deletion whitelist efi: Make efivarfs entries immutable by default efi: Make our variable validation list include the guid efi: Do variable name validation tests in utf8 efi: Use ucs2_as_utf8 in efivarfs instead of open coding a bad version lib/ucs2_string: Add ucs2 -> utf8 helper functions
2016-02-20fs/pnode.c: treat zero mnt_group_id-s as unequalMaxim Patlasov1-2/+7
propagate_one(m) calculates "type" argument for copy_tree() like this: > if (m->mnt_group_id == last_dest->mnt_group_id) { > type = CL_MAKE_SHARED; > } else { > type = CL_SLAVE; > if (IS_MNT_SHARED(m)) > type |= CL_MAKE_SHARED; > } The "type" argument then governs clone_mnt() behavior with respect to flags and mnt_master of new mount. When we iterate through a slave group, it is possible that both current "m" and "last_dest" are not shared (although, both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison above erroneously makes new mount shared and sets its mnt_master to last_source->mnt_master. The patch fixes the problem by handling zero mnt_group_id-s as though they are unequal. The similar problem exists in the implementation of "else" clause above when we have to ascend upward in the master/slave tree by calling: > last_source = last_source->mnt_master; > last_dest = last_source->mnt_parent; proper number of times. The last step is governed by "n->mnt_group_id != last_dest->mnt_group_id" condition that may lie if both are zero. The patch fixes this case in the same way as the former one. [AV: don't open-code an obvious helper...] Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
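The "treat zero group ids as unequal" rule described above can be sketched as a small comparison helper. This is an illustration under assumptions: `struct mount_sketch` is a hypothetical stand-in that carries only the one field needed here, and the helper is not claimed to match the code in fs/pnode.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in; the real struct mount has many more fields. */
struct mount_sketch {
    int mnt_group_id;   /* 0 means "not a member of any peer group" */
};

/*
 * Two mounts are peers only if they share a non-zero group id.
 * A plain == comparison would wrongly report two unshared mounts
 * (both with id 0) as belonging to the same peer group.
 */
static bool peers(const struct mount_sketch *m1, const struct mount_sketch *m2)
{
    assert(m1 && m2);
    return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id != 0;
}
```

Both comparisons mentioned in the message (the one choosing the clone type and the one ending the walk up the master/slave tree) would go through the same predicate, so the zero case is handled in one place.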
2016-02-20affs_do_readpage_ofs(): just use kmap_atomic() around memcpy()Al Viro1-3/+2
It forgets kunmap() on a failure exit, but there's really no point keeping the page kmapped at all - after all, what we are doing is a bunch of memcpy() into the parts of page, so kmap_atomic()/kunmap_atomic() just around those memcpy() is enough. Spotted-by: Insu Yun <wuninsu@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-20xattr handlers: plug a lock leak in simple_xattr_listMateusz Guzik1-3/+3
The code could leak xattrs->lock on error. Problem introduced with 786534b92f3ce68f4 "tmpfs: listxattr should include POSIX ACL xattrs". Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
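A lock leak of this kind typically comes from an early return between acquire and release. The sketch below shows the usual remedy, recording the error and falling through to a single release point. Everything in it is hypothetical: `struct toy_lock` is a counter standing in for a real spinlock, and `list_names()` only imitates the shape of a listxattr-style copy loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy lock: a counter is enough to observe whether release was skipped. */
struct toy_lock { int held; };

static void lock_acquire(struct toy_lock *l) { l->held++; }
static void lock_release(struct toy_lock *l) { assert(l->held > 0); l->held--; }

/*
 * Copy NUL-terminated names into buf. Returns the number of bytes used,
 * or -ERANGE if buf is too small. The lock is dropped on every path.
 */
static long list_names(struct toy_lock *lock, const char *const *names,
                       size_t count, char *buf, size_t size)
{
    long ret = 0;
    size_t used = 0;

    lock_acquire(lock);
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(names[i]) + 1;   /* include the NUL */

        if (len > size - used) {
            ret = -ERANGE;   /* record the error, do not return here */
            break;
        }
        memcpy(buf + used, names[i], len);
        used += len;
    }
    lock_release(lock);      /* single exit: runs on success and on error */

    return ret ? ret : (long)used;
}
```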
2016-02-20fs: allow no_seek_end_llseek to actually seekWouter van Kesteren1-1/+2
The user-visible impact of the issue is, for example, that without this patch sensors-detect breaks when trying to seek in /dev/cpu/0/cpuid. '~0ULL' is an 'unsigned long long' that, when converted to a loff_t, which is signed, gets turned into -1. Later, in vfs_setpos, we have 'if (offset > maxsize)', which makes it always return EINVAL. Fixes: b25472f9b961 ("new helpers: no_seek_end_llseek{,_size}()") Signed-off-by: Wouter van Kesteren <woutershep@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
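The conversion problem can be reproduced in userspace. In this sketch `loff_sketch_t` and `setpos_sketch()` are hypothetical stand-ins for loff_t and the bounds check quoted in the message, and using INT64_MAX as the corrected limit is an assumption made for illustration, not a statement of what the patch passes.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef int64_t loff_sketch_t;   /* signed, like loff_t */

/* Imitates the bounds check: reject offsets beyond maxsize. */
static loff_sketch_t setpos_sketch(loff_sketch_t offset, loff_sketch_t maxsize)
{
    if (offset < 0 || offset > maxsize)
        return -EINVAL;
    return offset;
}

void demo_seek(void)
{
    /*
     * Converting ~0ULL to a signed 64-bit type is implementation-defined
     * in C; on the usual two's complement compilers it yields -1.
     */
    loff_sketch_t buggy_max = (loff_sketch_t)~0ULL;

    /* Every non-negative offset is "greater than" a limit of -1. */
    assert(setpos_sketch(16, buggy_max) == -EINVAL);

    /* A limit that is the largest signed value lets the seek through. */
    assert(setpos_sketch(16, INT64_MAX) == 16);
}
```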
2016-02-19Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds13-39/+176
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "Miscellaneous ext4 bug fixes for v4.5" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix crashes in dioread_nolock mode ext4: fix bh->b_state corruption ext4: fix memleak in ext4_readdir() ext4: remove unused parameter "newblock" in convert_initialized_extent() ext4: don't read blocks from disk after extents being swapped ext4: fix potential integer overflow ext4: add a line break for proc mb_groups display ext4: ioctl: fix erroneous return value ext4: fix scheduling in atomic on group checksum failure ext4 crypto: move context consistency check to ext4_file_open() ext4 crypto: revalidate dentry after adding or removing the key
2016-02-19Merge branch 'for-linus-4.5' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "My for-linus-4.5 branch has a btrfs DIO error passing fix. I know how much you love DIO, so I'm going to suggest against reading it. We'll follow up with a patch to drop the error arg from dio_end_io in the next merge window." * 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix direct IO requests not reporting IO error to user space
2016-02-19Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-14/+39
Merge fixes from Andrew Morton: "10 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: slab: free kmem_cache_node after destroy sysfs file ipc/shm: handle removed segments gracefully in shm_mmap() MAINTAINERS: update Kselftest Framework mailing list devm_memremap_release(): fix memremap'd addr handling mm/hugetlb.c: fix incorrect proc nr_hugepages value mm, x86: fix pte_page() crash in gup_pte_range() fsnotify: turn fsnotify reaper thread into a workqueue job Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread" mm: fix regression in remap_file_pages() emulation thp, dax: do not try to withdraw pgtable from non-anon VMA
2016-02-19ext4: fix crashes in dioread_nolock modeJan Kara1-20/+20
Competing overwrite DIO in dioread_nolock mode will just overwrite pointer to io_end in the inode. This may result in data corruption or extent conversion happening from IO completion interrupt because we don't properly set buffer_defer_completion() when unlocked DIO races with locked DIO to unwritten extent. Since unlocked DIO doesn't need io_end for anything, just avoid allocating it and corrupting pointer from inode for locked DIO. A cleaner fix would be to avoid these games with io_end pointer from the inode but that requires more intrusive changes so we leave that for later. Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-19ext4: fix bh->b_state corruptionJan Kara1-2/+30
ext4 can update bh->b_state non-atomically in _ext4_get_block() and ext4_da_get_block_prep(). Usually this is fine since bh is just a temporary storage for mapping information on stack but in some cases it can be fully living bh attached to a page. In such case non-atomic update of bh->b_state can race with an atomic update which then gets lost. Usually when we are mapping bh and thus updating bh->b_state non-atomically, nobody else touches the bh and so things work out fine but there is one case to especially worry about: ext4_finish_bio() uses BH_Uptodate_Lock on the first bh in the page to synchronize handling of PageWriteback state. So when blocksize < pagesize, we can be atomically modifying bh->b_state of a buffer that actually isn't under IO and thus can race e.g. with delalloc trying to map that buffer. The result is that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock. Fix the problem by always updating bh->b_state bits atomically. CC: stable@vger.kernel.org Reported-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
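One common way to update a subset of bits atomically without disturbing the others is a compare-and-exchange loop. The sketch below uses C11 atomics in userspace; the flag names and `update_state()` are hypothetical, and it is offered as an illustration of the technique rather than as the ext4 change itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical flag layout, loosely modelled on buffer state bits. */
#define FLAG_MAPPED    (1UL << 0)
#define FLAG_NEW       (1UL << 1)
#define FLAG_LOCKBIT   (1UL << 2)   /* owned by another code path */
#define MAPPING_FLAGS  (FLAG_MAPPED | FLAG_NEW)

/*
 * Replace only the mapping bits of *state with those from flags.
 * If another thread changes *state between the load and the exchange
 * (for example by setting or clearing FLAG_LOCKBIT), the exchange fails,
 * old is refreshed with the current value, and the loop retries, so the
 * concurrent update is never overwritten.
 */
static void update_state(_Atomic unsigned long *state, unsigned long flags)
{
    unsigned long old, updated;

    assert(state != NULL);
    old = atomic_load(state);
    do {
        updated = (old & ~MAPPING_FLAGS) | (flags & MAPPING_FLAGS);
    } while (!atomic_compare_exchange_weak(state, &old, updated));
}
```

A plain read-modify-write such as `state = (state & ~MAPPING_FLAGS) | flags` has no such retry, which is how a concurrent atomic bit operation can get lost.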
2016-02-18fsnotify: turn fsnotify reaper thread into a workqueue jobJeff Layton1-31/+18
We don't require a dedicated thread for fsnotify cleanup. Switch it over to a workqueue job instead that runs on the system_unbound_wq. In the interest of not thrashing the queued job too often when there are a lot of marks being removed, we delay the reaper job slightly when queueing it, to allow several to gather on the list. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Tested-by: Eryu Guan <guaneryu@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread"Jeff Layton1-14/+52
This reverts commit c510eff6beba ("fsnotify: destroy marks with call_srcu instead of dedicated thread"). Eryu reported that he was seeing some OOM kills kick in when running a testcase that adds and removes inotify marks on a file in a tight loop. The above commit changed the code to use call_srcu to clean up the marks. While that does (in principle) work, the srcu callback job is limited to cleaning up entries in small batches and only once per jiffy. It's easily possible to overwhelm that machinery with too many call_srcu callbacks, and Eryu's reproducer did just that. There's also another potential problem with using call_srcu here. While you can obviously sleep while holding the srcu_read_lock, the callbacks run under local_bh_disable, so you can't sleep there. It's possible when putting the last reference to the fsnotify_mark that we'll end up putting a chain of references including the fsnotify_group, uid, and associated keys. While I don't see any obvious way that could occur, it's probably still best to avoid using call_srcu here after all. This patch reverts the above patch. A later patch will take a different approach to eliminate the dedicated thread here. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reported-by: Eryu Guan <guaneryu@gmail.com> Tested-by: Eryu Guan <guaneryu@gmail.com> Cc: Jan Kara <jack@suse.com> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-17Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds3-5/+18
Pull block fixes from Jens Axboe: "A collection of fixes from the past few weeks that should go into 4.5. This contains: - Overflow fix for sysfs discard show function from Alan. - A stacking limit init fix for max_dev_sectors, so we don't end up artificially capping some use cases. From Keith. - Have blk-mq properly end unstarted requests on a dying queue, instead of pushing that to the driver. From Keith. - NVMe: - Update to Kconfig description for NVME_SCSI, since it was vague and having it on is important for some SUSE distros. From Christoph. - Set of fixes from Keith, around surprise removal. Also kills the no-merge flag, so it supports merging. - Set of fixes for lightnvm from Matias, Javier, and Wenwei. - Fix null_blk oops when asked for lightnvm, but not available. From Matias. - Copy-to-user EINTR fix from Hannes, fixing a case where SG_IO fails if interrupted by a signal. - Two floppy fixes from Jiri, fixing signal handling and blocking open. - A use-after-free fix for O_DIRECT, from Mike Krinkin. - A block module ref count fix from Roman Pen. - An fs IO wait accounting fix for O_DSYNC from Stephane Gasparini. - Smaller realloc fix for xen-blkfront from Bob Liu. - Removal of an unused struct member in the deadline IO scheduler, from Tahsin. - Also from Tahsin, properly initialize inode struct members associated with cgroup writeback, if enabled.
- From Tejun, ensure that we keep the superblock pinned during cgroup writeback" * 'for-linus' of git://git.kernel.dk/linux-block: (25 commits) blk: fix overflow in queue_discard_max_hw_show writeback: initialize inode members that track writeback history writeback: keep superblock pinned during cgroup writeback association switches bio: return EINTR if copying to user space got interrupted NVMe: Rate limit nvme IO warnings NVMe: Poll device while still active during remove NVMe: Requeue requests on suspended queues NVMe: Allow request merges NVMe: Fix io incapable return values blk-mq: End unstarted requests on dying queue block: Initialize max_dev_sectors to 0 null_blk: oops when initializing without lightnvm block: fix module reference leak on put_disk() call for cgroups throttle nvme: fix Kconfig description for BLK_DEV_NVME_SCSI kernel/fs: fix I/O wait not accounted for RW O_DSYNC floppy: refactor open() flags handling lightnvm: allow to force mm initialization lightnvm: check overflow and correct mlc pairs lightnvm: fix request intersection locking in rrpc lightnvm: warn if irqs are disabled in lock laddr ...