summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2019-12-17Merge tag 'for-5.5-rc2-tag' of ↵Linus Torvalds17-56/+127
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A mix of regression fixes and regular fixes for stable trees: - fix swapped error messages for qgroup enable/rescan - fixes for NO_HOLES feature with clone range - fix deadlock between iget/srcu lock/synchronize srcu while freeing an inode - fix double lock on subvolume cross-rename - tree log fixes * fix missing data checksums after replaying a log tree * also teach tree-checker about this problem * skip log replay on orphaned roots - fix maximum devices constraints for RAID1C -3 and -4 - send: don't print warning on read-only mount regarding orphan cleanup - error handling fixes" * tag 'for-5.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: send: remove WARN_ON for readonly mount btrfs: do not leak reloc root if we fail to read the fs root btrfs: skip log replay on orphaned roots btrfs: handle ENOENT in btrfs_uuid_tree_iterate btrfs: abort transaction after failed inode updates in create_subvol Btrfs: fix hole extent items with a zero size after range cloning Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues Btrfs: make tree checker detect checksum items with overlapping ranges Btrfs: fix missing data checksums after replaying a log tree btrfs: return error pointer from alloc_test_extent_buffer btrfs: fix devs_max constraints for raid1c3 and raid1c4 btrfs: tree-checker: Fix error format string for size_t btrfs: don't double lock the subvol_sem for rename exchange btrfs: handle error in btrfs_cache_block_group btrfs: do not call synchronize_srcu() in inode_tree_del Btrfs: fix cloning range with a hole when using the NO_HOLES feature btrfs: Fix error messages in qgroup_rescan_init
2019-12-17Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix the guest-nice cpustat values in /proc" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/cputime, proc/stat: Fix incorrect guest nice cpustat value
2019-12-16Merge tag 'linux-kselftest-5.5-rc2' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kselftest fixes from Shuah Khan: - ftrace and safesetid test fixes from Masami Hiramatsu - Kunit fixes from Brendan Higgins, Iurii Zaikin, and Heidi Fahim - Kselftest framework fixes from SeongJae Park and Michael Ellerman * tag 'linux-kselftest-5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kselftest: Support old perl versions kselftest/runner: Print new line in print of timeout log selftests: Fix dangling documentation references to kselftest_module.sh Documentation: kunit: add documentation for kunit_tool Documentation: kunit: fix typos and gramatical errors kunit: testing kunit: Bug fix in test_run_timeout function fs/ext4/inode-test: Fix inode test on 32 bit platforms. selftests: safesetid: Fix Makefile to set correct test program selftests: safesetid: Check the return value of setuid/setgid selftests: safesetid: Move link library to LDLIBS selftests/ftrace: Fix multiple kprobe testcase selftests/ftrace: Do not to use absolute debugfs path selftests/ftrace: Fix ftrace test cases to check unsupported selftests/ftrace: Fix to check the existence of set_ftrace_filter
2019-12-15Merge branch 'remove-ksys-mount-dup' of ↵Linus Torvalds2-14/+3
git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux Pull ksys_mount() and ksys_dup() removal from Dominik Brodowski: "This small series replaces all in-kernel calls to the userspace-focused ksys_mount() and ksys_dup() with calls to kernel-centric functions: For each replacement of ksys_mount() with do_mount(), one needs to verify that the first and third parameter (char *dev_name, char *type) are strings allocated in kernelspace and that the fifth parameter (void *data) is either NULL or refers to a full page (only occurence in init/do_mounts.c::do_mount_root()). The second and fourth parameters (char *dir_name, unsigned long flags) are passed by ksys_mount() to do_mount() unchanged, and therefore do not require particular care. Moreover, instead of pretending to be userspace, the opening of /dev/console as stdin/stdout/stderr can be implemented using in-kernel functions as well. Thereby, ksys_dup() can be removed for good" [ This doesn't get rid of the special "kernel init runs with KERNEL_DS" case, but it at least removes _some_ of the users of "treat kernel pointers as user pointers for our magical init sequence". One day we'll hopefully be rid of it all, and can initialize our init_thread addr_limit to USER_DS. - Linus ] * 'remove-ksys-mount-dup' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: fs: remove ksys_dup() init: unify opening /dev/console as stdin/stdout/stderr init: use do_mount() instead of ksys_mount() initrd: use do_mount() instead of ksys_mount() devtmpfs: use do_mount() instead of ksys_mount()
2019-12-14Merge tag '5.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds6-3/+26
Pull cfis fixes from Steve French: "Three small smb3 fixes: this addresses two recent issues reported in additional testing during rc1, a refcount underflow and a problem with an intermittent crash in SMB2_open_init" * tag '5.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: CIFS: Close cached root handle only if it has a lease SMB3: Fix crash in SMB2_open_init due to uninitialized field in compounding path smb3: fix refcount underflow warning on unmount when no directory leases
2019-12-14Merge tag 'ovl-fixes-5.5-rc2' of ↵Linus Torvalds8-96/+159
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "Fix some bugs and documentation" * tag 'ovl-fixes-5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: docs: filesystems: overlayfs: Fix restview warnings docs: filesystems: overlayfs: Rename overlayfs.txt to .rst ovl: relax WARN_ON() on rename to self ovl: fix corner case of non-unique st_dev;st_ino ovl: don't use a temp buf for encoding real fh ovl: make sure that real fid is 32bit aligned in memory ovl: fix lookup failure on multi lower squashfs
2019-12-13Merge tag 'io_uring-5.5-20191212' of git://git.kernel.dk/linux-blockLinus Torvalds3-88/+121
Pull io_uring fixes from Jens Axboe: - A tweak to IOSQE_IO_LINK (also marked for stable) to allow links that don't sever if the result is < 0. This is mostly for linked timeouts, where if we ask for a pure timeout we always get -ETIME. This makes links useless for that case, hence allow a case where it works. - Five minor optimizations to fix and improve cases that regressed since v5.4. - An SQTHREAD locking fix. - A sendmsg/recvmsg iov assignment fix. - Net fix where read_iter/write_iter don't honor IOCB_NOWAIT, and subsequently ensuring that works for io_uring. - Fix a case where for an invalid opcode we might return -EBADF instead of -EINVAL, if the ->fd of that sqe was set to an invalid fd value. * tag 'io_uring-5.5-20191212' of git://git.kernel.dk/linux-block: io_uring: ensure we return -EINVAL on unknown opcode io_uring: add sockets to list of files that support non-blocking issue net: make socket read/write_iter() honor IOCB_NOWAIT io_uring: only hash regular files for async work execution io_uring: run next sqe inline if possible io_uring: don't dynamically allocate poll data io_uring: deferred send/recvmsg should assign iov io_uring: sqthread should grab ctx->uring_lock for submissions io-wq: briefly spin for new work after finishing work io-wq: remove worker->wait waitqueue io_uring: allow unbreakable links
2019-12-13Merge tag 'sizeof_field-v5.5-rc2' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull FIELD_SIZEOF conversion from Kees Cook: "A mostly mechanical treewide conversion from FIELD_SIZEOF() to sizeof_field(). This avoids the redundancy of having 2 macros (actually 3) doing the same thing, and consolidates on sizeof_field(). While "field" is not an accurate name, it is the common name used in the kernel, and doesn't result in any unintended innuendo. As there are still users of FIELD_SIZEOF() in -next, I will clean up those during this coming development cycle and send the final old macro removal patch at that time" * tag 'sizeof_field-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: treewide: Use sizeof_field() macro MIPS: OCTEON: Replace SIZEOF_FIELD() macro
2019-12-13btrfs: send: remove WARN_ON for readonly mountAnand Jain1-6/+0
We log warning if root::orphan_cleanup_state is not set to ORPHAN_CLEANUP_DONE in btrfs_ioctl_send(). However if the filesystem is mounted as readonly we skip the orphan item cleanup during the lookup and root::orphan_cleanup_state remains at the init state 0 instead of ORPHAN_CLEANUP_DONE (2). So during send in btrfs_ioctl_send() we hit the warning as below. WARN_ON(send_root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE); WARNING: CPU: 0 PID: 2616 at /Volumes/ws/btrfs-devel/fs/btrfs/send.c:7090 btrfs_ioctl_send+0xb2f/0x18c0 [btrfs] :: RIP: 0010:btrfs_ioctl_send+0xb2f/0x18c0 [btrfs] :: Call Trace: :: _btrfs_ioctl_send+0x7b/0x110 [btrfs] btrfs_ioctl+0x150a/0x2b00 [btrfs] :: do_vfs_ioctl+0xa9/0x620 ? __fget+0xac/0xe0 ksys_ioctl+0x60/0x90 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x49/0x130 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reproducer: mkfs.btrfs -fq /dev/sdb mount /dev/sdb /btrfs btrfs subvolume create /btrfs/sv1 btrfs subvolume snapshot -r /btrfs/sv1 /btrfs/ss1 umount /btrfs mount -o ro /dev/sdb /btrfs btrfs send /btrfs/ss1 -f /tmp/f The warning exists because having orphan inodes could confuse send and cause it to fail or produce incorrect streams. The two cases that would cause such send failures, which are already fixed are: 1) Inodes that were unlinked - these are orphanized and remain with a link count of 0. These caused send operations to fail because it expected to always find at least one path for an inode. However this is no longer a problem since send is now able to deal with such inodes since commit 46b2f4590aab ("Btrfs: fix send failure when root has deleted files still open") and treats them as having been completely removed (the state after an orphan cleanup is performed). 2) Inodes that were in the process of being truncated. These resulted in send not knowing about the truncation and potentially issue write operations full of zeroes for the range from the new file size to the old file size. This is no longer a problem because we no longer create orphan items for truncation since commit f7e9e8fc792f ("Btrfs: stop creating orphan items for truncate"). As such before these commits, the WARN_ON here provided a clue in case something went wrong. Instead of being a warning against the root::orphan_cleanup_state value, it could have been more accurate by checking if there were actually any orphan items, and then issue a warning only if any exists, but that would be more expensive to check. Since orphanized inodes no longer cause problems for send, just remove the warning. Reported-by: Christoph Anton Mitterer <calestyo@scientia.net> Link: https://lore.kernel.org/linux-btrfs/21cb5e8d059f6e1496a903fa7bfc0a297e2f5370.camel@scientia.net/ CC: stable@vger.kernel.org # 4.19+ Suggested-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: do not leak reloc root if we fail to read the fs rootJosef Bacik1-0/+1
If we fail to read the fs root corresponding with a reloc root we'll just break out and free the reloc roots. But we remove our current reloc_root from this list higher up, which means we'll leak this reloc_root. Fix this by adding ourselves back to the reloc_roots list so we are properly cleaned up. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: skip log replay on orphaned rootsJosef Bacik1-2/+21
My fsstress modifications coupled with generic/475 uncovered a failure to mount and replay the log if we hit a orphaned root. We do not want to replay the log for an orphan root, but it's completely legitimate to have an orphaned root with a log attached. Fix this by simply skipping replaying the log. We still need to pin it's root node so that we do not overwrite it while replaying other logs, as we re-read the log root at every stage of the replay. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: handle ENOENT in btrfs_uuid_tree_iterateJosef Bacik1-0/+2
If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the uuid tree we'll just continue and do btrfs_next_item(). However we've done a btrfs_release_path() at this point and no longer have a valid path. So increment the key and go back and do a normal search. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: abort transaction after failed inode updates in create_subvolJosef Bacik1-2/+8
We can just abort the transaction here, and in fact do that for every other failure in this function except these two cases. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13Btrfs: fix hole extent items with a zero size after range cloningFilipe Manana1-11/+5
Normally when cloning a file range if we find an implicit hole at the end of the range we assume it is because the NO_HOLES feature is enabled. However that is not always the case. One well known case [1] is when we have a power failure after mixing buffered and direct IO writes against the same file. In such cases we need to punch a hole in the destination file, and if the NO_HOLES feature is not enabled, we need to insert explicit file extent items to represent the hole. After commit 690a5dbfc51315 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents"), we started to insert file extent items representing the hole with an item size of 0, which is invalid and should be 53 bytes (the size of a btrfs_file_extent_item structure), resulting in all sorts of corruptions and invalid memory accesses. This is detected by the tree checker when we attempt to write a leaf to disk. The problem can be sporadically triggered by test case generic/561 from fstests. That test case does not exercise power failure and creates a new filesystem when it starts, so it does not use a filesystem created by any previous test that tests power failure. However the test does both buffered and direct IO writes (through fsstress) and it's precisely that which is creating the implicit holes in files. That happens even before the commit mentioned earlier. I need to investigate why we get those implicit holes to check if there is a real problem or not. For now this change fixes the regression of introducing file extent items with an item size of 0 bytes. Fix the issue by calling btrfs_punch_hole_range() without passing a btrfs_clone_extent_info structure, which ensures file extent items are inserted to represent the hole with a correct item size. We were passing a btrfs_clone_extent_info with a value of 0 for its 'item_size' field, which was causing the insertion of file extent items with an item size of 0. [1] https://www.spinics.net/lists/linux-btrfs/msg75350.html Reported-by: David Sterba <dsterba@suse.com> Fixes: 690a5dbfc51315 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13Btrfs: fix removal logic of the tree mod log that leads to use-after-free issuesFilipe Manana1-1/+1
When a tree mod log user no longer needs to use the tree it calls btrfs_put_tree_mod_seq() to remove itself from the list of users and delete all no longer used elements of the tree's red black tree, which should be all elements with a sequence number less then our equals to the caller's sequence number. However the logic is broken because it can delete and free elements from the red black tree that have a sequence number greater then the caller's sequence number: 1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the tree mod log; 2) The task which got assigned the sequence number 1 calls btrfs_put_tree_mod_seq(); 3) Sequence number 1 is deleted from the list of sequence numbers; 4) The current minimum sequence number is computed to be the sequence number 2; 5) A task using sequence number 2 is at tree_mod_log_rewind() and gets a pointer to one of its elements from the red black tree through a call to tree_mod_log_search(); 6) The task with sequence number 1 iterates the red black tree of tree modification elements and deletes (and frees) all elements with a sequence number less then or equals to 2 (the computed minimum sequence number) - it ends up only leaving elements with sequence numbers of 3 and 4; 7) The task with sequence number 2 now uses the pointer to its element, already freed by the other task, at __tree_mod_log_rewind(), resulting in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces a trace like the following: [16804.546854] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [16804.547451] CPU: 0 PID: 28257 Comm: pool Tainted: G W 5.4.0-rc8-btrfs-next-51 #1 [16804.548059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [16804.548666] RIP: 0010:rb_next+0x16/0x50 (...) [16804.550581] RSP: 0018:ffffb948418ef9b0 EFLAGS: 00010202 [16804.551227] RAX: 6b6b6b6b6b6b6b6b RBX: ffff90e0247f6600 RCX: 6b6b6b6b6b6b6b6b [16804.551873] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e0247f6600 [16804.552504] RBP: ffff90dffe0d4688 R08: 0000000000000001 R09: 0000000000000000 [16804.553136] R10: ffff90dffa4a0040 R11: 0000000000000000 R12: 000000000000002e [16804.553768] R13: ffff90e0247f6600 R14: 0000000000001663 R15: ffff90dff77862b8 [16804.554399] FS: 00007f4b197ae700(0000) GS:ffff90e036a00000(0000) knlGS:0000000000000000 [16804.555039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [16804.555683] CR2: 00007f4b10022000 CR3: 00000002060e2004 CR4: 00000000003606f0 [16804.556336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [16804.556968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [16804.557583] Call Trace: [16804.558207] __tree_mod_log_rewind+0xbf/0x280 [btrfs] [16804.558835] btrfs_search_old_slot+0x105/0xd00 [btrfs] [16804.559468] resolve_indirect_refs+0x1eb/0xc70 [btrfs] [16804.560087] ? free_extent_buffer.part.19+0x5a/0xc0 [btrfs] [16804.560700] find_parent_nodes+0x388/0x1120 [btrfs] [16804.561310] btrfs_check_shared+0x115/0x1c0 [btrfs] [16804.561916] ? extent_fiemap+0x59d/0x6d0 [btrfs] [16804.562518] extent_fiemap+0x59d/0x6d0 [btrfs] [16804.563112] ? __might_fault+0x11/0x90 [16804.563706] do_vfs_ioctl+0x45a/0x700 [16804.564299] ksys_ioctl+0x70/0x80 [16804.564885] ? trace_hardirqs_off_thunk+0x1a/0x20 [16804.565461] __x64_sys_ioctl+0x16/0x20 [16804.566020] do_syscall_64+0x5c/0x250 [16804.566580] entry_SYSCALL_64_after_hwframe+0x49/0xbe [16804.567153] RIP: 0033:0x7f4b1ba2add7 (...) [16804.568907] RSP: 002b:00007f4b197adc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [16804.569513] RAX: ffffffffffffffda RBX: 00007f4b100210d8 RCX: 00007f4b1ba2add7 [16804.570133] RDX: 00007f4b100210d8 RSI: 00000000c020660b RDI: 0000000000000003 [16804.570726] RBP: 000055de05a6cfe0 R08: 0000000000000000 R09: 00007f4b197add44 [16804.571314] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4b197add48 [16804.571905] R13: 00007f4b197add40 R14: 00007f4b100210d0 R15: 00007f4b197add50 (...) [16804.575623] ---[ end trace 87317359aad4ba50 ]--- Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that have a sequence number equals to the computed minimum sequence number, and not just elements with a sequence number greater then that minimum. Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13Btrfs: make tree checker detect checksum items with overlapping rangesFilipe Manana1-2/+16
Having checksum items, either on the checksums tree or in a log tree, that represent ranges that overlap each other is a sign of a corruption. Such case confuses the checksum lookup code and can result in not being able to find checksums or find stale checksums. So add a check for such case. This is motivated by a recent fix for a case where a log tree had checksum items covering ranges that overlap each other due to extent cloning, and resulted in missing checksums after replaying the log tree. It also helps detect past issues such as stale and outdated checksums due to overlapping, commit 27b9a8122ff71a ("Btrfs: fix csum tree corruption, duplicate and outdated checksums"). CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13Btrfs: fix missing data checksums after replaying a log treeFilipe Manana4-9/+36
When logging a file that has shared extents (reflinked with other files or with itself), we can end up logging multiple checksum items that cover overlapping ranges. This confuses the search for checksums at log replay time causing some checksums to never be added to the fs/subvolume tree. Consider the following example of a file that shares the same extent at offsets 0 and 256Kb: [ bytenr 13893632, offset 64Kb, len 64Kb ] 0 64Kb [ bytenr 13631488, offset 64Kb, len 192Kb ] 64Kb 256Kb [ bytenr 13893632, offset 0, len 256Kb ] 256Kb 512Kb When logging the inode, at tree-log.c:copy_items(), when processing the file extent item at offset 0, we log a checksum item covering the range 13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 + 64Kb + 64Kb, respectively. Later when processing the extent item at offset 256K, we log the checksums for the range from 13893632 to 14155776 (which corresponds to 13893632 + 256Kb). These checksums get merged with the checksum item for the range from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync. So after this we get the two following checksum items in the log tree: (...) item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512 range start 13631488 end 14155776 length 524288 item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64 range start 13959168 end 14024704 length 65536 The first one covers the range from the second one, they overlap. So far this does not cause a problem after replaying the log, because when replaying the file extent item for offset 256K, we copy all the checksums for the extent 13893632 from the log tree to the fs/subvolume tree, since searching for an checksum item for bytenr 13893632 leaves us at the first checksum item, which covers the whole range of the extent. However if we write 64Kb to file offset 256Kb for example, we will not be able to find and copy the checksums for the last 128Kb of the extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb. After writing 64Kb into file offset 256Kb we get the following extent layout for our file: [ bytenr 13893632, offset 64K, len 64Kb ] 0 64Kb [ bytenr 13631488, offset 64Kb, len 192Kb ] 64Kb 256Kb [ bytenr 14155776, offset 0, len 64Kb ] 256Kb 320Kb [ bytenr 13893632, offset 64Kb, len 192Kb ] 320Kb 512Kb After fsync'ing the file, if we have a power failure and then mount the filesystem to replay the log, the following happens: 1) When replaying the file extent item for file offset 320Kb, we lookup for the checksums for the extent range from 13959168 (13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call to btrfs_lookup_csums_range(); 2) btrfs_lookup_csums_range() finds the checksum item that starts precisely at offset 13959168 (item 7 in the log tree, shown before); 3) However that checksum item only covers 64Kb of data, and not 192Kb of data; 4) As a result only the checksums for the first 64Kb of data referenced by the file extent item are found and copied to the fs/subvolume tree. The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get the corresponding data checksums found and copied to the fs/subvolume tree. 5) After replaying the log userspace will not be able to read the file range from 384Kb to 512Kb, because the checksums are missing and resulting in an -EIO error. The following steps reproduce this scenario: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar $ xfs_io -c "fsync" /mnt/sdc/foobar $ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar $ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar $ xfs_io -c "fsync" /mnt/sdc/foobar $ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar $ xfs_io -c "fsync" /mnt/sdc/foobar <power failure> $ mount /dev/sdc /mnt/sdc $ md5sum /mnt/sdc/foobar md5sum: /mnt/sdc/foobar: Input/output error $ dmesg | tail [165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408 [165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504 [165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600 [165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696 [165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792 [165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888 [165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984 [165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080 [165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1 [165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1 Fix this simply by deleting first any checksums, from the log tree, for the range of the extent we are logging at copy_items(). This ensures we do not get checksum items in the log tree that have overlapping ranges. This is a long time issue that has been present since we have the clone (and deduplication) ioctl, and can happen both when an extent is shared between different files and within the same file. A test case for fstests follows soon. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: return error pointer from alloc_test_extent_bufferDan Carpenter3-6/+8
Callers of alloc_test_extent_buffer have not correctly interpreted the return value as error pointer, as alloc_test_extent_buffer should behave as alloc_extent_buffer. The self-tests were unaffected but btrfs_find_create_tree_block could call both functions and that would cause problems up in the call chain. Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: fix devs_max constraints for raid1c3 and raid1c4David Sterba1-2/+2
The value 0 for devs_max means to spread the allocated chunks over all available devices, eg. stripe for RAID0 or RAID5. This got mistakenly copied to the RAID1C3/4 profiles. The intention is to have exactly 3 and 4 copies respectively. Fixes: 47e6f7423b91 ("btrfs: add support for 3-copy replication (raid1c3)") Fixes: 8d6fac0087e5 ("btrfs: add support for 4-copy replication (raid1c4)") Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: tree-checker: Fix error format string for size_tAndreas Färber1-1/+1
Argument BTRFS_FILE_EXTENT_INLINE_DATA_START is defined as offsetof(), which returns type size_t, so we need %zu instead of %lu. This fixes a build warning on 32-bit ARM: ../fs/btrfs/tree-checker.c: In function 'check_extent_data_item': ../fs/btrfs/tree-checker.c:230:43: warning: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'unsigned int' [-Wformat=] 230 | "invalid item size, have %u expect [%lu, %u)", | ~~^ | long unsigned int | %u Fixes: 153a6d299956 ("btrfs: tree-checker: Check item size before reading file extent type") Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andreas Färber <afaerber@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: don't double lock the subvol_sem for rename exchangeJosef Bacik1-6/+4
If we're rename exchanging two subvols we'll try to lock this lock twice, which is bad. Just lock once if either of the ino's are subvols. Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: handle error in btrfs_cache_block_groupJosef Bacik1-2/+18
We have a BUG_ON(ret < 0) in find_free_extent from btrfs_cache_block_group. If we fail to allocate our ctl we'll just panic, which is not good. Instead just go on to another block group. If we fail to find a block group we don't want to return ENOSPC, because really we got a ENOMEM and that's the root of the problem. Save our return from btrfs_cache_block_group(), and then if we still fail to make our allocation return that ret so we get the right error back. Tested with inject-error.py from bcc. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: do not call synchronize_srcu() in inode_tree_delJosef Bacik1-2/+0
Testing with the new fsstress uncovered a pretty nasty deadlock with lookup and snapshot deletion. Process A unlink -> final iput -> inode_tree_del -> synchronize_srcu(subvol_srcu) Process B btrfs_lookup <- srcu_read_lock() acquired here -> btrfs_iget -> find inode that has I_FREEING set -> __wait_on_freeing_inode() We're holding the srcu_read_lock() while doing the iget in order to make sure our fs root doesn't go away, and then we are waiting for the inode to finish freeing. However because the free'ing process is doing a synchronize_srcu() we deadlock. Fix this by dropping the synchronize_srcu() in inode_tree_del(). We don't need people to stop accessing the fs root at this point, we're only adding our empty root to the dead roots list. A larger much more invasive fix is forthcoming to address how we deal with fs roots, but this fixes the immediate problem. Fixes: 76dda93c6ae2 ("Btrfs: add snapshot/subvolume destroy ioctl") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13Btrfs: fix cloning range with a hole when using the NO_HOLES featureFilipe Manana1-2/+2
When using the NO_HOLES feature if we clone a range that contains a hole and a temporary ENOSPC happens while dropping extents from the target inode's range, we can end up failing and aborting the transaction with -EEXIST or with a corrupt file extent item, that has a length greater than it should and overlaps with other extents. For example when cloning the following range from inode A to inode B: Inode A: extent A1 extent A2 [ ----------- ] [ hole, implicit, 4MB length ] [ ------------- ] 0 1MB 5MB 6MB Range to clone: [1MB, 6MB) Inode B: extent B1 extent B2 extent B3 extent B4 [ ---------- ] [ --------- ] [ ---------- ] [ ---------- ] 0 1MB 1MB 2MB 2MB 5MB 5MB 6MB Target range: [1MB, 6MB) (same as source, to make it easier to explain) The following can happen: 1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents(); 2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents() set 'drop_end' to 2MB, meaning it was able to drop only extent B2; 3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 2MB - 1MB = 1MB; 4) We then attempt to insert a file extent item at inode B with a file offset of 5MB, which is the value of clone_info->file_offset. This fails with error -EEXIST because there's already an extent at that offset (extent B4); 5) We abort the current transaction with -EEXIST and return that error to user space as well. Another example, for extent corruption: Inode A: extent A1 extent A2 [ ----------- ] [ hole, implicit, 10MB length ] [ ------------- ] 0 1MB 11MB 12MB Inode B: extent B1 extent B2 [ ----------- ] [ --------- ] [ ----------------------------- ] 0 1MB 1MB 5MB 5MB 12MB Target range: [1MB, 12MB) (same as source, to make it easier to explain) 1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents(); 2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents() set 'drop_end' to 5MB, meaning it was able to drop only extent B2; 3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 5MB - 1MB = 4MB; 4) We then insert a file extent item at inode B with a file offset of 11MB which is the value of clone_info->file_offset, and a length of 4MB (the value of 'clone_len'). So we get 2 extents items with ranges that overlap and an extent length of 4MB, larger then the extent A2 from inode A (1MB length); 5) After that we end the transaction, balance the btree dirty pages and then start another or join the previous transaction. It might happen that the transaction which inserted the incorrect extent was committed by another task so we end up with extent corruption if a power failure happens. So fix this by making sure we attempt to insert the extent to clone at the destination inode only if we are past dropping the sub-range that corresponds to a hole. Fixes: 690a5dbfc51315 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13btrfs: Fix error messages in qgroup_rescan_initNikolay Borisov1-2/+2
The branch of qgroup_rescan_init which is executed from the mount path prints wrong errors messages. The textual print out in case BTRFS_QGROUP_STATUS_FLAG_RESCAN/BTRFS_QGROUP_STATUS_FLAG_ON are not set are transposed. Fix it by exchanging their place. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13CIFS: Close cached root handle only if it has a leasePavel Shilovsky5-4/+25
SMB2_tdis() checks if a root handle is valid in order to decide whether it needs to close the handle or not. However if another thread has reference for the handle, it may end up with putting the reference twice. The extra reference that we want to put during the tree disconnect is the reference that has a directory lease. So, track the fact that we have a directory lease and close the handle only in that case. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-12-13SMB3: Fix crash in SMB2_open_init due to uninitialized field in compounding pathSteve French1-0/+1
Ran into an intermittent crash in SMB2_open_init+0x2f6/0x970 due to oparms.cifs_sb not being initialized when called from: smb2_compound_op+0x45d/0x1690 Zero the whole oparms struct in the compounding path before setting up the oparms so we don't risk any uninitialized fields. Fixes: fdef665ba44a ("smb3: fix mode passed in on create for modetosid mount option") Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-12-12Merge tag 'ceph-for-5.5-rc2' of git://github.com/ceph/ceph-clientLinus Torvalds7-42/+85
Pull ceph fixes from Ilya Dryomov: "A fix to avoid a corner case when scheduling cap reclaim in batches from Xiubo, a patch to add some observability into cap waiters from Jeff and a couple of cleanups" * tag 'ceph-for-5.5-rc2' of git://github.com/ceph/ceph-client: ceph: add more debug info when decoding mdsmap ceph: switch to global cap helper ceph: trigger the reclaim work once there has enough pending caps ceph: show tasks waiting on caps in debugfs caps file ceph: convert int fields in ceph_mount_options to unsigned int
2019-12-12fs: remove ksys_dup()Dominik Brodowski1-6/+1
ksys_dup() is used only at one place in the kernel, namely to duplicate fd 0 of /dev/console to stdout and stderr. The same functionality can be achieved by using functions already available within the kernel namespace. Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2019-12-12init: use do_mount() instead of ksys_mount()Dominik Brodowski1-8/+2
In prepare_namespace(), do_mount() can be used instead of ksys_mount() as the first and third argument are const strings in the kernel, the second and fourth argument are passed through anyway, and the fifth argument is NULL. In do_mount_root(), ksys_mount() is called with the first and third argument being already kernelspace strings, which do not need to be copied over from userspace to kernelspace (again). The second and fourth arguments are passed through to do_mount() anyway. The fifth argument, while already residing in kernelspace, needs to be put into a page of its own. Then, do_mount() can be used instead of ksys_mount(). Once this is done, there are no in-kernel users to ksys_mount() left, which can therefore be removed. Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2019-12-11Merge tag 'afs-fixes-20191211' of ↵Linus Torvalds5-19/+20
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS fixes from David Howells: "Fixes for AFS plus one patch to make debugging easier: - Fix how addresses are matched to server records. This is currently incorrect which means cache invalidation callbacks from the server don't necessarily get delivered correctly. This causes stale data and metadata to be seen under some circumstances. - Make the dynamic root superblock R/W so that rpm/dnf can reapply the SELinux label to it when upgrading the Fedora filesystem-afs package. If the filesystem is R/O, this fails and the upgrade fails. It might be better in future to allow setxattr from an LSM to bypass the R/O protections, if only for pseudo-filesystems. - Fix the parsing of mountpoint strings. The mountpoint object has to have a terminal dot, whereas the source/device string passed to mount should not. This confuses type-forcing suffix detection leading to the wrong volume variant being mounted. - Make lookups in the dynamic root superblock for creation events (such as mkdir) fail with EOPNOTSUPP rather than something like EEXIST. The dynamic root only allows implicit creation by the ->lookup() method - and only if the target cell exists. - Fix the looking up of an AFS superblock to include the cell in the matching key - otherwise all volumes with the same ID number are treated as the same thing, irrespective of which cell they're in. - Show the volume name of each volume in the volume records displayed in /proc/net/afs/<cell>/volumes. This proved useful in debugging as it provides a way to map the volume IDs to names, where the names are what appear in /proc/mounts" * tag 'afs-fixes-20191211' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Show volume name in /proc/net/afs/<cell>/volumes afs: Fix missing cell comparison in afs_test_super() afs: Fix creation calls in the dynamic root to fail with EOPNOTSUPP afs: Fix mountpoint parsing afs: Fix SELinux setting security label on /afs afs: Fix afs_find_server lookups for ipv4 peers
2019-12-11io_uring: ensure we return -EINVAL on unknown opcodeJens Axboe1-7/+14
If we submit an unknown opcode and have fd == -1, io_op_needs_file() will return true as we default to needing a file. Then when we go and assign the file, we find the 'fd' invalid and return -EBADF. We really should be returning -EINVAL for that case, as we normally do for unsupported opcodes. Change io_op_needs_file() to have the following return values: 0 - does not need a file 1 - does need a file < 0 - error value and use this to pass back the right value for this invalid case. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-11Merge tag 'erofs-for-5.5-rc2-fixes' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "Mainly address a regression reported by David recently observed together with overlayfs due to the improper return value of listxattr() without xattr. Update outdated expressions in document as well. Summary: - Fix improper return value of listxattr() with no xattr - Keep up documentation with latest code" * tag 'erofs-for-5.5-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: update documentation erofs: zero out when listxattr is called with no xattr
2019-12-11pipe: simplify signal handling in pipe_read() and add commentsLinus Torvalds1-7/+29
There's no need to separately check for signals while inside the locked region, since we're going to do "wait_event_interruptible()" right afterwards anyway, and the error handling is much simpler there. The check for whether we had already read anything was also redundant, since we no longer do the odd merging of reads when there are pending writers. But perhaps more importantly, this adds commentary about why we still need to wake up possible writers even though we didn't read any data, and why we can skip all the finishing touches now if we get a signal (or had a signal pending) while waiting for more data. [ This is a split-out cleanup from my "make pipe IO use exclusive wait queues" thing, which I can't apply because it triggers a nasty bug in the GNU make jobserver - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-11afs: Show volume name in /proc/net/afs/<cell>/volumesDavid Howells1-3/+4
Show the name of each volume in /proc/net/afs/<cell>/volumes to make it easier to work out the name corresponding to a volume ID. This makes it easier to work out which mounts in /proc/mounts correspond to which volume ID. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
2019-12-11afs: Fix missing cell comparison in afs_test_super()David Howells1-0/+1
Fix missing cell comparison in afs_test_super(). Without this, any pair volumes that have the same volume ID will share a superblock, no matter the cell, unless they're in different network namespaces. Normally, most users will only deal with a single cell and so they won't see this. Even if they do look into a second cell, they won't see a problem unless they happen to hit a volume with the same ID as one they've already got mounted. Before the patch: # ls /afs/grand.central.org/archive linuxdev/ mailman/ moin/ mysql/ pipermail/ stage/ twiki/ # ls /afs/kth.se/ linuxdev/ mailman/ moin/ mysql/ pipermail/ stage/ twiki/ # cat /proc/mounts | grep afs none /afs afs rw,relatime,dyn,autocell 0 0 #grand.central.org:root.cell /afs/grand.central.org afs ro,relatime 0 0 #grand.central.org:root.archive /afs/grand.central.org/archive afs ro,relatime 0 0 #grand.central.org:root.archive /afs/kth.se afs ro,relatime 0 0 After the patch: # ls /afs/grand.central.org/archive linuxdev/ mailman/ moin/ mysql/ pipermail/ stage/ twiki/ # ls /afs/kth.se/ admin/ common/ install/ OldFiles/ service/ system/ bakrestores/ home/ misc/ pkg/ src/ wsadmin/ # cat /proc/mounts | grep afs none /afs afs rw,relatime,dyn,autocell 0 0 #grand.central.org:root.cell /afs/grand.central.org afs ro,relatime 0 0 #grand.central.org:root.archive /afs/grand.central.org/archive afs ro,relatime 0 0 #kth.se:root.cell /afs/kth.se afs ro,relatime 0 0 Fixes: ^1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Carsten Jacobi <jacobi@de.ibm.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Tested-by: Jonathan Billings <jsbillings@jsbillings.org> cc: Todd DeSantis <atd@us.ibm.com>
2019-12-11afs: Fix creation calls in the dynamic root to fail with EOPNOTSUPPDavid Howells1-0/+3
Fix the lookup method on the dynamic root directory such that creation calls, such as mkdir, open(O_CREAT), symlink, etc. fail with EOPNOTSUPP rather than failing with some odd error (such as EEXIST). lookup() itself tries to create automount directories when it is invoked. These are cached locally in RAM and not committed to storage. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Tested-by: Jonathan Billings <jsbillings@jsbillings.org>
2019-12-11afs: Fix mountpoint parsingDavid Howells1-2/+4
Each AFS mountpoint has strings that define the target to be mounted. This is required to end in a dot that is supposed to be stripped off. The string can include suffixes of ".readonly" or ".backup" - which are supposed to come before the terminal dot. To add to the confusion, the "fs lsmount" afs utility does not show the terminal dot when displaying the string. The kernel mount source string parser, however, assumes that the terminal dot marks the suffix and that the suffix is always "" and is thus ignored. In most cases, there is no suffix and this is not a problem - but if there is a suffix, it is lost and this affects the ability to mount the correct volume. The command line mount command, on the other hand, is expected not to include a terminal dot - so the problem doesn't arise there. Fix this by making sure that the dot exists and then stripping it when passing the string to the mount configuration. Fixes: bec5eb614130 ("AFS: Implement an autocell mount capability [ver #2]") Reported-by: Jonathan Billings <jsbillings@jsbillings.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Tested-by: Jonathan Billings <jsbillings@jsbillings.org>
2019-12-11sched/cputime, proc/stat: Fix incorrect guest nice cpustat valueFlavio Leitner1-2/+2
The value being used for guest_nice should be CPUTIME_GUEST_NICE and not CPUTIME_USER. Fixes: 26dae145a76c ("procfs: Use all-in-one vtime aware kcpustat accessor") Signed-off-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191205020344.14940-1-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-10io_uring: add sockets to list of files that support non-blocking issueJens Axboe1-2/+4
In chasing a performance issue between using IORING_OP_RECVMSG and IORING_OP_READV on sockets, tracing showed that we always punt the socket reads to async offload. This is due to io_file_supports_async() not checking for S_ISSOCK on the inode. Since sockets supports the O_NONBLOCK (or MSG_DONTWAIT) flag just fine, add sockets to the list of file types that we can do a non-blocking issue to. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10io_uring: only hash regular files for async work executionJens Axboe1-1/+3
We hash regular files to avoid having multiple threads hammer on the inode mutex, but it should not be needed on other types of files (like sockets). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10io_uring: run next sqe inline if possibleJens Axboe1-4/+11
One major use case of linked commands is the ability to run the next link inline, if at all possible. This is done correctly for async offload, but somewhere along the line we lost the ability to do so when we were able to complete a request without having to punt it. Ensure that we do so correctly. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10io_uring: don't dynamically allocate poll dataJens Axboe1-16/+11
This essentially reverts commit e944475e6984. For high poll ops workloads, like TAO, the dynamic allocation of the wait_queue entry for IORING_OP_POLL_ADD adds considerable extra overhead. Go back to embedding the wait_queue_entry, but keep the usage of wait->private for the pointer stashing. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10io_uring: deferred send/recvmsg should assign iovJens Axboe1-2/+2
Don't just assign it from the main call path, that can miss the case when we're called from issue deferral. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10io_uring: sqthread should grab ctx->uring_lock for submissionsJens Axboe1-5/+2
We use the mutex to guard against registered file updates, for instance. Ensure we're safe in accessing that state against concurrent updates. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10io-wq: briefly spin for new work after finishing workJens Axboe2-5/+26
To avoid going to sleep only to get woken shortly thereafter, spin briefly for new work upon completion of work. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10io-wq: remove worker->wait waitqueueJens Axboe1-8/+2
We only have one cases of using the waitqueue to wake the worker, the rest are using wake_up_process(). Since we can save some cycles not fiddling with the waitqueue io_wqe_worker(), switch the work activation to task wakeup and get rid of the now unused wait_queue_head_t in struct io_worker. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10io_uring: allow unbreakable linksJens Axboe1-38/+46
Some commands will invariably end in a failure in the sense that the completion result will be less than zero. One such example is timeouts that don't have a completion count set, they will always complete with -ETIME unless cancelled. For linked commands, we sever links and fail the rest of the chain if the result is less than zero. Since we have commands where we know that will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever regardless of the completion result. Note that the link will still sever if we fail submitting the parent request, hard links are only resilient in the presence of completion results for requests that did submit correctly. Cc: stable@vger.kernel.org # v5.4 Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-10ovl: relax WARN_ON() on rename to selfAmir Goldstein1-1/+1
In ovl_rename(), if new upper is hardlinked to old upper underneath overlayfs before upper dirs are locked, user will get an ESTALE error and a WARN_ON will be printed. Changes to underlying layers while overlayfs is mounted may result in unexpected behavior, but it shouldn't crash the kernel and it shouldn't trigger WARN_ON() either, so relax this WARN_ON(). Reported-by: syzbot+bb1836a212e69f8e201a@syzkaller.appspotmail.com Fixes: 804032fabb3b ("ovl: don't check rename to self") Cc: <stable@vger.kernel.org> # v4.9+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-12-10ovl: fix corner case of non-unique st_dev;st_inoAmir Goldstein1-1/+7
On non-samefs overlay without xino, non pure upper inodes should use a pseudo_dev assigned to each unique lower fs and pure upper inodes use the real upper st_dev. It is fine for an overlay pure upper inode to use the same st_dev;st_ino values as the real upper inode, because the content of those two different filesystem objects is always the same. In this case, however: - two filesystems, A and B - upper layer is on A - lower layer 1 is also on A - lower layer 2 is on B Non pure upper overlay inode, whose origin is in layer 1 will have the same st_dev;st_ino values as the real lower inode. This may result with a false positive results of 'diff' between the real lower and copied up overlay inode. Fix this by using the upper st_dev;st_ino values in this case. This breaks the property of constant st_dev;st_ino across copy up of this case. This breakage will be fixed by a later patch. Fixes: 5148626b806a ("ovl: allocate anon bdev per unique lower fs") Cc: stable@vger.kernel.org # v4.17+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>