summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2022-07-17proc: fix a dentry lock race between release_task and lookupZhihao Cheng1-8/+38
Commit 7bc3e6e55acf06 ("proc: Use a list of inodes to flush from proc") moved proc_flush_task() behind __exit_signal(). Then, process systemd can take long period high cpu usage during releasing task in following concurrent processes: systemd ps kernel_waitid stat(/proc/tgid) do_wait filename_lookup wait_consider_task lookup_fast release_task __exit_signal __unhash_process detach_pid __change_pid // remove task->pid_links d_revalidate -> pid_revalidate // 0 d_invalidate(/proc/tgid) shrink_dcache_parent(/proc/tgid) d_walk(/proc/tgid) spin_lock_nested(/proc/tgid/fd) // iterating opened fd proc_flush_pid | d_invalidate (/proc/tgid/fd) | shrink_dcache_parent(/proc/tgid/fd) | shrink_dentry_list(subdirs) ↓ shrink_lock_dentry(/proc/tgid/fd) --> race on dentry lock Function d_invalidate() will remove dentry from hash firstly, but why does proc_flush_pid() process dentry '/proc/tgid/fd' before dentry '/proc/tgid'? That's because proc_pid_make_inode() adds proc inode in reverse order by invoking hlist_add_head_rcu(). But proc should not add any inodes under '/proc/tgid' except '/proc/tgid/task/pid', fix it by adding inode into 'pid->inodes' only if the inode is /proc/tgid or /proc/tgid/task/pid. Performance regression: Create 200 tasks, each task open one file for 50,000 times. Kill all tasks when opened files exceed 10,000,000 (cat /proc/sys/fs/file-nr). Before fix: $ time killall -wq aa real 4m40.946s # During this period, we can see 'ps' and 'systemd' taking high cpu usage. After fix: $ time killall -wq aa real 1m20.732s # During this period, we can see 'systemd' taking high cpu usage. Link: https://lkml.kernel.org/r/20220713130029.4133533-1-chengzhihao1@huawei.com Fixes: 7bc3e6e55acf06 ("proc: Use a list of inodes to flush from proc") Link: https://bugzilla.kernel.org/show_bug.cgi?id=216054 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Suggested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Baoquan He <bhe@redhat.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17autofs: remove unused ino field inodeIan Kent1-2/+0
Remove the unused inode field of the autofs dentry info structure. Link: https://lkml.kernel.org/r/165724460393.30914.6511330213821246793.stgit@donald.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17autofs: add comment about autofs_mountpoint_changed()Ian Kent1-3/+20
The function autofs_mountpoint_changed() is unusual, add a comment about two cases for which it is needed. Link: https://lkml.kernel.org/r/165724459804.30914.10974834416046555127.stgit@donald.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17autofs: use dentry info count instead of simple_empty()Ian Kent3-11/+14
The dentry info. field count is used to check if a dentry is in use during expire. But, to be used for this the count field must account for the presence of child dentries in a directory dentry. Therefore it can also be used to check for an empty directory dentry which can be done without having to to take an additional lock or account for the presence of a readdir cursor dentry as is done by simple_empty(). Link: https://lkml.kernel.org/r/165724459238.30914.1504611159945950108.stgit@donald.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17autofs: make dentry info count consistentIan Kent2-4/+1
If an autofs dentry is a mount root directory there's no ->mkdir() call to set its count to one. To make the dentry info count consistent for all autofs dentries set count to one when the dentry info struct is allocated. Link: https://lkml.kernel.org/r/165724458671.30914.2902424437132835325.stgit@donald.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17autofs: use inode permission method for write accessIan Kent1-41/+22
Patch series "autofs: misc patches". This series contains several patches that resulted mostly from comments made by Al Viro (quite a long time ago now). This patch (of 5): Eliminate some code duplication from mkdir/rmdir/symlink/unlink methods by using the inode operation .permission(). Link: https://lkml.kernel.org/r/165724445154.30914.10970894936827635879.stgit@donald.themaw.net Link: https://lkml.kernel.org/r/165724458096.30914.13499431569758625806.stgit@donald.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17epoll: autoremove wakers even more aggressivelyBenjamin Segall1-0/+22
If a process is killed or otherwise exits while having active network connections and many threads waiting on epoll_wait, the threads will all be woken immediately, but not removed from ep->wq. Then when network traffic scans ep->wq in wake_up, every wakeup attempt will fail, and will not remove the entries from the list. This means that the cost of the wakeup attempt is far higher than usual, does not decrease, and this also competes with the dying threads trying to actually make progress and remove themselves from the wq. Handle this by removing visited epoll wq entries unconditionally, rather than only when the wakeup succeeds - the structure of ep_poll means that the only potential loss is the timed_out->eavail heuristic, which now can race and result in a redundant ep_send_events attempt. (But only when incoming data and a timeout actually race, not on every timeout) Shakeel added: : We are seeing this issue in production with real workloads and it has : caused hard lockups. Particularly network heavy workloads with a lot : of threads in epoll_wait() can easily trigger this issue if they get : killed (oom-killed in our case). Link: https://lkml.kernel.org/r/xm26fsjotqda.fsf@google.com Signed-off-by: Ben Segall <bsegall@google.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Roman Penyaev <rpenyaev@suse.de> Cc: Jason Baron <jbaron@akamai.com> Cc: Khazhismel Kumykov <khazhy@google.com> Cc: Heiher <r@hev.cc> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17proc: delete unused <linux/uaccess.h> includesAlexey Dobriyan8-14/+0
Those aren't necessary after seq files won. Link: https://lkml.kernel.org/r/YqnA3mS7KBt8Z4If@localhost.localdomain Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-27Merge branch 'master' into mm-nonmm-stableakpm56-574/+1098
2022-06-26Merge tag 'mm-hotfixes-stable-2022-06-26' of ↵Linus Torvalds1-15/+53
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "Minor things, mainly - mailmap updates, MAINTAINERS updates, etc. Fixes for this merge window: - fix for a damon boot hang, from SeongJae - fix for a kfence warning splat, from Jason Donenfeld - fix for zero-pfn pinning, from Alex Williamson - fix for fallocate hole punch clearing, from Mike Kravetz Fixes for previous releases: - fix for a performance regression, from Marcelo - fix for a hwpoisining BUG from zhenwei pi" * tag 'mm-hotfixes-stable-2022-06-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: add entry for Christian Marangi mm/memory-failure: disable unpoison once hw error happens hugetlbfs: zero partial pages during fallocate hole punch mm: memcontrol: reference to tools/cgroup/memcg_slabinfo.py mm: re-allow pinning of zero pfns mm/kfence: select random number before taking raw lock MAINTAINERS: add maillist information for LoongArch MAINTAINERS: update MM tree references MAINTAINERS: update Abel Vesa's email MAINTAINERS: add MEMORY HOT(UN)PLUG section and add David as reviewer MAINTAINERS: add Miaohe Lin as a memory-failure reviewer mailmap: add alias for jarkko@profian.com mm/damon/reclaim: schedule 'damon_reclaim_timer' only after 'system_wq' is initialized kthread: make it clear that kthread_create_on_node() might be terminated by any fatal signal mm: lru_cache_disable: use synchronize_rcu_expedited mm/page_isolation.c: fix one kernel-doc comment
2022-06-26Merge tag 'for-5.19-rc3-tag' of ↵Linus Torvalds10-27/+145
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - zoned relocation fixes: - fix critical section end for extent writeback, this could lead to out of order write - prevent writing to previous data relocation block group if space gets low - reflink fixes: - fix race between reflinking and ordered extent completion - proper error handling when block reserve migration fails - add missing inode iversion/mtime/ctime updates on each iteration when replacing extents - fix deadlock when running fsync/fiemap/commit at the same time - fix false-positive KCSAN report regarding pid tracking for read locks and data race - minor documentation update and link to new site * tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Documentation: update btrfs list of features and link to readthedocs.io btrfs: fix deadlock with fsync+fiemap+transaction commit btrfs: don't set lock_owner when locking extent buffer for reading btrfs: zoned: fix critical section of relocation inode writeback btrfs: zoned: prevent allocation from previous data relocation BG btrfs: do not BUG_ON() on failure to migrate space when replacing extents btrfs: add missing inode updates on each iteration when replacing extents btrfs: fix race between reflinking and ordered extent completion
2022-06-26Merge tag 'exfat-for-5.19-rc4' of ↵Linus Torvalds1-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat fix from Namjae Jeon: - Use updated exfat_chain directly instead of snapshot values in rename. * tag 'exfat-for-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: use updated exfat_chain directly during renaming
2022-06-26Merge tag '5.19-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds8-139/+366
Pull cifs client fixes from Steve French: "Fixes addressing important multichannel, and reconnect issues. Multichannel mounts when the server network interfaces changed, or ip addresses changed, uncovered problems, especially in reconnect, but the patches for this were held up until recently due to some lock conflicts that are now addressed. Included in this set of fixes: - three fixes relating to multichannel reconnect, dynamically adjusting the list of server interfaces to avoid problems during reconnect - a lock conflict fix related to the above - two important fixes for negotiate on secondary channels (null netname can unintentionally cause multichannel to be disabled to some servers) - a reconnect fix (reporting incorrect IP address in some cases)" * tag '5.19-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: update cifs_ses::ip_addr after failover cifs: avoid deadlocks while updating iface cifs: periodically query network interfaces from server cifs: during reconnect, update interface if necessary cifs: change iface_list from array to sorted linked list smb3: use netname when available on secondary channels smb3: fix empty netname context on secondary channels
2022-06-25Merge tag 'f2fs-for-5.19-rc4' of ↵Linus Torvalds3-20/+32
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs fixes from Jaegeuk Kim: "Some urgent fixes to avoid generating corrupted inodes caused by compressed and inline_data files. In addition, avoid a wrong error report which prevents a roll-forward recovery" * tag 'f2fs-for-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: do not count ENOENT for error case f2fs: fix iostat related lock protection f2fs: attach inline_data after setting compression
2022-06-24cifs: update cifs_ses::ip_addr after failoverPaulo Alcantara1-1/+7
cifs_ses::ip_addr wasn't being updated in cifs_session_setup() when reconnecting SMB sessions thus returning wrong value in /proc/fs/cifs/DebugData. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: stable@kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-24Merge tag 'io_uring-5.19-2022-06-24' of git://git.kernel.dk/linux-blockLinus Torvalds1-11/+15
Pull io_uring fixes from Jens Axboe: "A few fixes that should go into the 5.19 release. All are fixing issues that either happened in this release, or going to stable. In detail: - A small series of fixlets for the poll handling, all destined for stable (Pavel) - Fix a merge error from myself that caused a potential -EINVAL for the recv/recvmsg flag setting (me) - Fix a kbuf recycling issue for partial IO (me) - Use the original request for the inflight tracking (me) - Fix an issue introduced this merge window with trace points using a custom decoder function, which won't work for perf (Dylan)" * tag 'io_uring-5.19-2022-06-24' of git://git.kernel.dk/linux-block: io_uring: use original request task for inflight tracking io_uring: move io_uring_get_opcode out of TP_printk io_uring: fix double poll leak on repolling io_uring: fix wrong arm_poll error handling io_uring: fail links when poll fails io_uring: fix req->apoll_events io_uring: fix merge error in checking send/recv addr2 flags io_uring: mark reissue requests with REQ_F_PARTIAL_IO
2022-06-24cifs: avoid deadlocks while updating ifaceShyam Prasad N2-15/+33
We use cifs_tcp_ses_lock to protect a lot of things. Not only does it protect the lists of connections, sessions, tree connects, open file lists, etc., we also use it to protect some fields in each of it's entries. In this case, cifs_mark_ses_for_reconnect takes the cifs_tcp_ses_lock to traverse the lists, and then calls cifs_update_iface. However, that can end up calling cifs_put_tcp_session, which picks up the same lock again. Avoid this by taking a ref for the session, drop the lock, and then call update iface. Also, in cifs_update_iface, avoid nested locking of iface_lock and chan_lock, as much as possible. When unavoidable, we need to pick iface_lock first. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-23Merge tag 'trace-v5.19-rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: - Check for NULL in kretprobe_dispatcher() NULL can now be passed in, make sure it can handle it - Clean up unneeded #endif #ifdef of the same preprocessor check in the middle of the block. - Comment clean up - Remove unneeded initialization of the "ret" variable in __trace_uprobe_create() * tag 'trace-v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/uprobes: Remove unwanted initialization in __trace_uprobe_create() tracefs: Fix syntax errors in comments tracing: Simplify conditional compilation code in tracing_set_tracer() tracing/kprobes: Check whether get_kretprobe() returns NULL in kretprobe_dispatcher()
2022-06-23io_uring: use original request task for inflight trackingJens Axboe1-1/+1
In prior kernels, we did file assignment always at prep time. This meant that req->task == current. But after deferring that assignment and then pushing the inflight tracking back in, we've got the inflight tracking using current when it should in fact now be using req->task. Fixup that error introduced by adding the inflight tracking back after file assignments got modifed. Fixes: 9cae36a094e7 ("io_uring: reinstate the inflight tracking") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-22cifs: periodically query network interfaces from serverShyam Prasad N4-1/+35
Currently, we only query the server for network interfaces information at the time of mount, and never afterwards. This can be a problem, especially for services like Azure, where the IP address of the channel endpoints can change over time. With this change, we schedule a 600s polling of this info from the server for each tree connect. An alternative for periodic polling was to do this only at the time of reconnect. But this could delay the reconnect time slightly. Also, there are some challenges w.r.t how we have cifs_reconnect implemented today. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-22cifs: during reconnect, update interface if necessaryShyam Prasad N3-0/+88
Going forward, the plan is to periodically query the server for it's interfaces (when multichannel is enabled). This change allows checking for inactive interfaces during reconnect, and reconnect to a new interface if necessary. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-22cifs: change iface_list from array to sorted linked listShyam Prasad N6-129/+201
A server's published interface list can change over time, and needs to be updated. We've storing iface_list as a simple array, which makes it difficult to manipulate an existing list. With this change, iface_list is modified into a linked list of interfaces, which is kept sorted by speed. Also added a reference counter for an iface entry, so that each channel can maintain a backpointer to the iface and drop it easily when needed. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-22smb3: use netname when available on secondary channelsShyam Prasad N1-2/+9
Some servers do not allow null netname contexts, which would cause multichannel to revert to single channel when mounting to some servers (e.g. Azure xSMB). The previous patch fixed that by avoiding incorrectly sending the netname context when there would be a null hostname sent in the netname context, while this patch fixes the null hostname for the secondary channel by using the hostname of the primary channel for the secondary channel. Fixes: 4c14d7043fede ("cifs: populate empty hostnames for extra channels") Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-22Merge tag '9p-for-5.19-rc4' of https://github.com/martinetd/linuxLinus Torvalds4-17/+29
Pull 9pfs fixes from Dominique Martinet: "A couple of fid refcount and fscache fixes: - fid refcounting was incorrect in some corner cases and would leak resources, only freed at umount time. The first three commits fix three such cases - 'cache=loose' or fscache was broken when trying to write a partial page to a file with no read permission since the rework a few releases ago. The fix taken here is just to restore old behavior of using the special 'writeback_fid' for such reads, which is open as root/RDWR and such not get complains that we try to read on a WRONLY fid. Long-term it'd be nice to get rid of this and not issue the read at all (skip cache?) in such cases, but that direction hasn't progressed" * tag '9p-for-5.19-rc4' of https://github.com/martinetd/linux: 9p: fix EBADF errors in cached mode 9p: Fix refcounting during full path walks for fid lookups 9p: fix fid refcount leak in v9fs_vfs_get_link 9p: fix fid refcount leak in v9fs_vfs_atomic_open_dotl
2022-06-21io_uring: fix double poll leak on repollingPavel Begunkov1-0/+1
We have re-polling for partial IO, so a request can be polled twice. If it used two poll entries the first time then on the second io_arm_poll_handler() it will find the old apoll entry and NULL kmalloc()'ed second entry, i.e. apoll->double_poll, so leaking it. Fixes: 10c873334feba ("io_uring: allow re-poll if we made progress") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fee2452494222ecc7f1f88c8fb659baef971414a.1655852245.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-21io_uring: fix wrong arm_poll error handlingPavel Begunkov1-0/+1
Leaving ip.error set when a request was punted to task_work execution is problematic, don't forget to clear it. Fixes: aa43477b04025 ("io_uring: poll rework") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a6c84ef4182c6962380aebe11b35bdcb25b0ccfb.1655852245.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-21io_uring: fail links when poll failsPavel Begunkov1-0/+2
Don't forget to cancel all linked requests of poll request when __io_arm_poll_handler() failed. Fixes: aa43477b04025 ("io_uring: poll rework") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a78aad962460f9fdfe4aa4c0b62425c88f9415bc.1655852245.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-21Merge tag 'for-5.19-rc3-tag' of ↵Linus Torvalds2-9/+51
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - print more error messages for invalid mount option values - prevent remount with v1 space cache for subpage filesystem - fix hang during unmount when block group reclaim task is running * tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: add error messages to all unrecognized mount options btrfs: prevent remounting to v1 space cache for subpage mount btrfs: fix hang during unmount when block group reclaim task is running
2022-06-21afs: Fix dynamic root getattrDavid Howells1-1/+2
The recent patch to make afs_getattr consult the server didn't account for the pseudo-inodes employed by the dynamic root-type afs superblock not having a volume or a server to access, and thus an oops occurs if such a directory is stat'd. Fix this by checking to see if the vnode->volume pointer actually points anywhere before following it in afs_getattr(). This can be tested by stat'ing a directory in /afs. It may be sufficient just to do "ls /afs" and the oops looks something like: BUG: kernel NULL pointer dereference, address: 0000000000000020 ... RIP: 0010:afs_getattr+0x8b/0x14b ... Call Trace: <TASK> vfs_statx+0x79/0xf5 vfs_fstatat+0x49/0x62 Fixes: 2aeb8c86d499 ("afs: Fix afs_getattr() to refetch file status if callback break occurred") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Tested-by: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/165408450783.1031787.7941404776393751186.stgit@warthog.procyon.org.uk/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-06-21f2fs: do not count ENOENT for error caseJaegeuk Kim1-1/+3
Otherwise, we can get a wrong cp_error mark. Cc: <stable@vger.kernel.org> Fixes: a7b8618aa2f0 ("f2fs: avoid infinite loop to flush node pages") Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-06-21io_uring: fix req->apoll_eventsPavel Begunkov1-4/+8
apoll_events should be set once in the beginning of poll arming just as poll->events and not change after. However, currently io_uring resets it on each __io_poll_execute() for no clear reason. There is also a place in __io_arm_poll_handler() where we add EPOLLONESHOT to downgrade a multishot, but forget to do the same thing with ->apoll_events, which is buggy. Fixes: 81459350d581e ("io_uring: cache req->apoll->events in req->cflags") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Hao Xu <howeyxu@tencent.com> Link: https://lore.kernel.org/r/0aef40399ba75b1a4d2c2e85e6e8fd93c02fc6e4.1655814213.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-21io_uring: fix merge error in checking send/recv addr2 flagsJens Axboe1-4/+0
With the dropping of the IOPOLL checking in the per-opcode handlers, we inadvertently left two checks in the recv/recvmsg and send/sendmsg prep handlers for the same thing, and one of them includes addr2 which holds the flags for these opcodes. Fix it up and kill the redundant checks. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-21btrfs: fix deadlock with fsync+fiemap+transaction commitJosef Bacik1-15/+52
We are hitting the following deadlock in production occasionally Task 1 Task 2 Task 3 Task 4 Task 5 fsync(A) start trans start commit falloc(A) lock 5m-10m start trans wait for commit fiemap(A) lock 0-10m wait for 5m-10m (have 0-5m locked) have btrfs_need_log_full_commit !full_sync wait_ordered_extents finish_ordered_io(A) lock 0-5m DEADLOCK We have an existing dependency of file extent lock -> transaction. However in fsync if we tried to do the fast logging, but then had to fall back to committing the transaction, we will be forced to call btrfs_wait_ordered_range() to make sure all of our extents are updated. This creates a dependency of transaction -> file extent lock, because btrfs_finish_ordered_io() will need to take the file extent lock in order to run the ordered extents. Fix this by stopping the transaction if we have to do the full commit and we attempted to do the fast logging. Then attach to the transaction and commit it if we need to. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-21btrfs: don't set lock_owner when locking extent buffer for readingZygo Blaxell1-3/+0
In 196d59ab9ccc "btrfs: switch extent buffer tree lock to rw_semaphore" the functions for tree read locking were rewritten, and in the process the read lock functions started setting eb->lock_owner = current->pid. Previously lock_owner was only set in tree write lock functions. Read locks are shared, so they don't have exclusive ownership of the underlying object, so setting lock_owner to any single value for a read lock makes no sense. It's mostly harmless because write locks and read locks are mutually exclusive, and none of the existing code in btrfs (btrfs_init_new_buffer and print_eb_refs_lock) cares what nonsense is written in lock_owner when no writer is holding the lock. KCSAN does care, and will complain about the data race incessantly. Remove the assignments in the read lock functions because they're useless noise. Fixes: 196d59ab9ccc ("btrfs: switch extent buffer tree lock to rw_semaphore") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-21btrfs: zoned: fix critical section of relocation inode writebackNaohiro Aota1-1/+2
We use btrfs_zoned_data_reloc_{lock,unlock} to allow only one process to write out to the relocation inode. That critical section must include all the IO submission for the inode. However, flush_write_bio() in extent_writepages() is out of the critical section, causing an IO submission outside of the lock. This leads to an out of the order IO submission and fail the relocation process. Fix it by extending the critical section. Fixes: 35156d852762 ("btrfs: zoned: only allow one process to add pages to a relocation inode") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-21btrfs: zoned: prevent allocation from previous data relocation BGNaohiro Aota5-2/+53
After commit 5f0addf7b890 ("btrfs: zoned: use dedicated lock for data relocation"), we observe IO errors on e.g, btrfs/232 like below. [09.0][T4038707] WARNING: CPU: 3 PID: 4038707 at fs/btrfs/extent-tree.c:2381 btrfs_cross_ref_exist+0xfc/0x120 [btrfs] <snip> [09.9][T4038707] Call Trace: [09.5][T4038707] <TASK> [09.3][T4038707] run_delalloc_nocow+0x7f1/0x11a0 [btrfs] [09.6][T4038707] ? test_range_bit+0x174/0x320 [btrfs] [09.2][T4038707] ? fallback_to_cow+0x980/0x980 [btrfs] [09.3][T4038707] ? find_lock_delalloc_range+0x33e/0x3e0 [btrfs] [09.5][T4038707] btrfs_run_delalloc_range+0x445/0x1320 [btrfs] [09.2][T4038707] ? test_range_bit+0x320/0x320 [btrfs] [09.4][T4038707] ? lock_downgrade+0x6a0/0x6a0 [09.2][T4038707] ? orc_find.part.0+0x1ed/0x300 [09.5][T4038707] ? __module_address.part.0+0x25/0x300 [09.0][T4038707] writepage_delalloc+0x159/0x310 [btrfs] <snip> [09.4][ C3] sd 10:0:1:0: [sde] tag#2620 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s [09.5][ C3] sd 10:0:1:0: [sde] tag#2620 Sense Key : Illegal Request [current] [09.9][ C3] sd 10:0:1:0: [sde] tag#2620 Add. Sense: Unaligned write command [09.5][ C3] sd 10:0:1:0: [sde] tag#2620 CDB: Write(16) 8a 00 00 00 00 00 02 f3 63 87 00 00 00 2c 00 00 [09.4][ C3] critical target error, dev sde, sector 396041272 op 0x1:(WRITE) flags 0x800 phys_seg 3 prio class 0 [09.9][ C3] BTRFS error (device dm-1): bdev /dev/mapper/dml_102_2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0 The IO errors occur when we allocate a regular extent in previous data relocation block group. On zoned btrfs, we use a dedicated block group to relocate a data extent. Thus, we allocate relocating data extents (pre-alloc) only from the dedicated block group and vice versa. Once the free space in the dedicated block group gets tight, a relocating extent may not fit into the block group. In that case, we need to switch the dedicated block group to the next one. Then, the previous one is now freed up for allocating a regular extent. The BG is already not enough to allocate the relocating extent, but there is still room to allocate a smaller extent. Now the problem happens. By allocating a regular extent while nocow IOs for the relocation is still on-going, we will issue WRITE IOs (for relocation) and ZONE APPEND IOs (for the regular writes) at the same time. That mixed IOs confuses the write pointer and arises the unaligned write errors. This commit introduces a new bit 'zoned_data_reloc_ongoing' to the btrfs_block_group. We set this bit before releasing the dedicated block group, and no extent are allocated from a block group having this bit set. This bit is similar to setting block_group->ro, but is different from it by allowing nocow writes to start. Once all the nocow IO for relocation is done (hooked from btrfs_finish_ordered_io), we reset the bit to release the block group for further allocation. Fixes: c2707a255623 ("btrfs: zoned: add a dedicated data relocation block group") CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-21btrfs: do not BUG_ON() on failure to migrate space when replacing extentsFilipe Manana1-2/+4
At btrfs_replace_file_extents(), if we fail to migrate reserved metadata space from the transaction block reserve into the local block reserve, we trigger a BUG_ON(). This is because it should not be possible to have a failure here, as we reserved more space when we started the transaction than the space we want to migrate. However having a BUG_ON() is way too drastic, we can perfectly handle the failure and return the error to the caller. So just do that instead, and add a WARN_ON() to make it easier to notice the failure if it ever happens (which is particularly useful for fstests, and the warning will trigger a failure of a test case). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-21btrfs: add missing inode updates on each iteration when replacing extentsFilipe Manana4-0/+23
When replacing file extents, called during fallocate, hole punching, clone and deduplication, we may not be able to replace/drop all the target file extent items with a single transaction handle. We may get -ENOSPC while doing it, in which case we release the transaction handle, balance the dirty pages of the btree inode, flush delayed items and get a new transaction handle to operate on what's left of the target range. By dropping and replacing file extent items we have effectively modified the inode, so we should bump its iversion and update its mtime/ctime before we update the inode item. This is because if the transaction we used for partially modifying the inode gets committed by someone after we release it and before we finish the rest of the range, a power failure happens, then after mounting the filesystem our inode has an outdated iversion and mtime/ctime, corresponding to the values it had before we changed it. So add the missing iversion and mtime/ctime updates. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-21btrfs: fix race between reflinking and ordered extent completionFilipe Manana1-4/+11
While doing a reflink operation, if an ordered extent for a file range that does not overlap with the source and destination ranges of the reflink operation happens, we can end up having a failure in the reflink operation and return -EINVAL to user space. The following sequence of steps explains how this can happen: 1) We have the page at file offset 315392 dirty (under delalloc); 2) A reflink operation for this file starts, using the same file as both source and destination, the source range is [372736, 409600) (length of 36864 bytes) and the destination range is [208896, 245760); 3) At btrfs_remap_file_range_prep(), we flush all delalloc in the source and destination ranges, and wait for any ordered extents in those range to complete; 4) Still at btrfs_remap_file_range_prep(), we then flush all delalloc in the inode, but we neither wait for it to complete nor any ordered extents to complete. This results in starting delalloc for the page at file offset 315392 and creating an ordered extent for that single page range; 5) We then move to btrfs_clone() and enter the loop to find file extent items to copy from the source range to destination range; 6) In the first iteration we end up at last file extent item stored in leaf A: (...) item 131 key (143616 108 315392) itemoff 5101 itemsize 53 extent data disk bytenr 1903988736 nr 73728 extent data offset 12288 nr 61440 ram 73728 This represents the file range [315392, 376832), which overlaps with the source range to clone. @datal is set to 61440, key.offset is 315392 and @next_key_min_offset is therefore set to 376832 (315392 + 61440). @off (372736) is > key.offset (315392), so @new_key.offset is set to the value of @destoff (208896). @new_key.offset == @last_dest_end (208896) so @drop_start is set to 208896 (@new_key.offset). @datal is adjusted to 4096, as @off is > @key.offset. So in this iteration we call btrfs_replace_file_extents() for the range [208896, 212991] (a single page, which is [@drop_start, @new_key.offset + @datal - 1]). @last_dest_end is set to 212992 (@new_key.offset + @datal = 208896 + 4096 = 212992). Before the next iteration of the loop, @key.offset is set to the value 376832, which is @next_key_min_offset; 7) On the second iteration btrfs_search_slot() leaves us again at leaf A, but this time pointing beyond the last slot of leaf A, as that's where a key with offset 376832 should be at if it existed. So end up calling btrfs_next_leaf(); 8) btrfs_next_leaf() releases the path, but before it searches again the tree for the next key/leaf, the ordered extent for the single page range at file offset 315392 completes. That results in trimming the file extent item we processed before, adjusting its key offset from 315392 to 319488, reducing its length from 61440 to 57344 and inserting a new file extent item for that single page range, with a key offset of 315392 and a length of 4096. Leaf A now looks like: (...) item 132 key (143616 108 315392) itemoff 4995 itemsize 53 extent data disk bytenr 1801666560 nr 4096 extent data offset 0 nr 4096 ram 4096 item 133 key (143616 108 319488) itemoff 4942 itemsize 53 extent data disk bytenr 1903988736 nr 73728 extent data offset 16384 nr 57344 ram 73728 9) When btrfs_next_leaf() returns, it gives us a path pointing to leaf A at slot 133, since it's the first key that follows what was the last key we saw (143616 108 315392). In fact it's the same item we processed before, but its key offset was changed, so it counts as a new key; 10) So now we have: @key.offset == 319488 @datal == 57344 @off (372736) is > key.offset (319488), so @new_key.offset is set to 208896 (@destoff value). @new_key.offset (208896) != @last_dest_end (212992), so @drop_start is set to 212992 (@last_dest_end value). @datal is adjusted to 4096 because @off > @key.offset. So in this iteration we call btrfs_replace_file_extents() for the invalid range of [212992, 212991] (which is [@drop_start, @new_key.offset + @datal - 1]). This range is empty, the end offset is smaller than the start offset so btrfs_replace_file_extents() returns -EINVAL, which we end up returning to user space and fail the reflink operation. This all happens because the range of this file extent item was already processed in the previous iteration. This scenario can be triggered very sporadically by fsx from fstests, for example with test case generic/522. So fix this by having btrfs_clone() skip file extent items that cover a file range that we have already processed. CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-20smb3: fix empty netname context on secondary channelsSteve French1-6/+8
Some servers do not allow null netname contexts, which would cause multichannel to revert to single channel when mounting to some servers (e.g. Azure xSMB). Fixes: 4c14d7043fede ("cifs: populate empty hostnames for extra channels") Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-20io_uring: mark reissue requests with REQ_F_PARTIAL_IOJens Axboe1-2/+2
If we mark for reissue, we assume that the buffer will remain stable. Hence if are using a provided buffer, we need to ensure that we stick with it for the duration of that request. This only affects block devices that use provided buffers, as those are the only ones that get marked with REQ_F_REISSUE. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-19f2fs: fix iostat related lock protectionDaeho Jeong1-13/+18
Made iostat related locks safe to be called from irq context again. Cc: <stable@vger.kernel.org> Fixes: a1e09b03e6f5 ("f2fs: use iomap for direct I/O") Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Stanley Chu <stanley.chu@mediatek.com> Tested-by: Eddie Huang <eddie.huang@mediatek.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-06-19f2fs: attach inline_data after setting compressionJaegeuk Kim1-6/+11
This fixes the below corruption. [345393.335389] F2FS-fs (vdb): sanity_check_inode: inode (ino=6d0, mode=33206) should not have inline_data, run fsck to fix Cc: <stable@vger.kernel.org> Fixes: 677a82b44ebf ("f2fs: fix to do sanity check for inline inode") Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-06-19Merge tag 'xfs-5.19-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds7-25/+37
Pull xfs fixes from Darrick Wong: "There's not a whole lot this time around (I'm still on vacation) but here are some important fixes for new features merged in -rc1: - Fix a bug where inode flag changes would accidentally drop nrext64 - Fix a race condition when toggling LARP mode" * tag 'xfs-5.19-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: preserve DIFLAG2_NREXT64 when setting other inode attributes xfs: fix variable state usage xfs: fix TOCTOU race involving the new logged xattrs control knob
2022-06-18Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds9-85/+137
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix a variety of bugs, many of which were found by folks using fuzzing or error injection. Also fix up how test_dummy_encryption mount option is handled for the new mount API. Finally, fix/cleanup a number of comments and ext4 Documentation files" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix a doubled word "need" in a comment ext4: add reserved GDT blocks check ext4: make variable "count" signed ext4: correct the judgment of BUG in ext4_mb_normalize_request ext4: fix bug_on ext4_mb_use_inode_pa ext4: fix up test_dummy_encryption handling for new mount API ext4: use kmemdup() to replace kmalloc + memcpy ext4: fix super block checksum incorrect after mount ext4: improve write performance with disabled delalloc ext4: fix warning when submitting superblock in ext4_commit_super() ext4, doc: remove unnecessary escaping ext4: fix incorrect comment in ext4_bio_write_page() fs: fix jbd2_journal_try_to_free_buffers() kernel-doc comment
2022-06-18Merge tag '5.19-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds3-0/+43
Pull cifs client fixes from Steve French: "Two cifs debugging improvements - one found to deal with debugging a multichannel problem and one for a recent fallocate issue This does include the two larger multichannel reconnect (dynamically adjusting interfaces on reconnect) patches, because we recently found an additional problem with multichannel to one server type that I want to include at the same time" * tag '5.19-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: when a channel is not found for server, log its connection id smb3: add trace point for SMB2_set_eof
2022-06-18ext4: fix a doubled word "need" in a commentXiang wangx1-1/+1
Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com> Link: https://lore.kernel.org/r/20220605091503.12513-1-wangxiang@cdjrlc.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18ext4: add reserved GDT blocks checkZhang Yi1-0/+10
We capture a NULL pointer issue when resizing a corrupt ext4 image which is freshly clear resize_inode feature (not run e2fsck). It could be simply reproduced by following steps. The problem is because of the resize_inode feature was cleared, and it will convert the filesystem to meta_bg mode in ext4_resize_fs(), but the es->s_reserved_gdt_blocks was not reduced to zero, so could we mistakenly call reserve_backup_gdb() and passing an uninitialized resize_inode to it when adding new group descriptors. mkfs.ext4 /dev/sda 3G tune2fs -O ^resize_inode /dev/sda #forget to run requested e2fsck mount /dev/sda /mnt resize2fs /dev/sda 8G ======== BUG: kernel NULL pointer dereference, address: 0000000000000028 CPU: 19 PID: 3243 Comm: resize2fs Not tainted 5.18.0-rc7-00001-gfde086c5ebfd #748 ... RIP: 0010:ext4_flex_group_add+0xe08/0x2570 ... Call Trace: <TASK> ext4_resize_fs+0xbec/0x1660 __ext4_ioctl+0x1749/0x24e0 ext4_ioctl+0x12/0x20 __x64_sys_ioctl+0xa6/0x110 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f2dd739617b ======== The fix is simple, add a check in ext4_resize_begin() to make sure that the es->s_reserved_gdt_blocks is zero when the resize_inode feature is disabled. Cc: stable@kernel.org Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220601092717.763694-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18ext4: make variable "count" signedDing Xiang1-1/+2
Since dx_make_map() may return -EFSCORRUPTED now, so change "count" to be a signed integer so we can correctly check for an error code returned by dx_make_map(). Fixes: 46c116b920eb ("ext4: verify dir block before splitting it") Cc: stable@kernel.org Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com> Link: https://lore.kernel.org/r/20220530100047.537598-1-dingxiang@cmss.chinamobile.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18ext4: correct the judgment of BUG in ext4_mb_normalize_requestBaokun Li1-1/+16
ext4_mb_normalize_request() can move logical start of allocated blocks to reduce fragmentation and better utilize preallocation. However logical block requested as a start of allocation (ac->ac_o_ex.fe_logical) should always be covered by allocated blocks so we should check that by modifying and to or in the assertion. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220528110017.354175-3-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>