Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull cifs fixes from Steve French:
"An important signal handling patch for stable, and two small cleanup
patches"
* tag '5.11-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: do not fail __smb_send_rqst if non-fatal signals are pending
fs/cifs: Simplify bool comparison.
fs/cifs: Assign boolean values to a bool variable
|
|
We need to actively cancel anything that introduces a potential circular
loop, where io_uring holds a reference to itself. If the file in question
is an io_uring file, then add the request to the inflight list.
Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
[ 27.629441] BUG: sleeping function called from invalid context
at fs/file.c:402
[ 27.631317] in_atomic(): 1, irqs_disabled(): 1, non_block: 0,
pid: 1012, name: io_wqe_worker-0
[ 27.633220] 1 lock held by io_wqe_worker-0/1012:
[ 27.634286] #0: ffff888105e26c98 (&ctx->completion_lock)
{....}-{2:2}, at: __io_req_complete.part.102+0x30/0x70
[ 27.649249] Call Trace:
[ 27.649874] dump_stack+0xac/0xe3
[ 27.650666] ___might_sleep+0x284/0x2c0
[ 27.651566] put_files_struct+0xb8/0x120
[ 27.652481] __io_clean_op+0x10c/0x2a0
[ 27.653362] __io_cqring_fill_event+0x2c1/0x350
[ 27.654399] __io_req_complete.part.102+0x41/0x70
[ 27.655464] io_openat2+0x151/0x300
[ 27.656297] io_issue_sqe+0x6c/0x14e0
[ 27.660991] io_wq_submit_work+0x7f/0x240
[ 27.662890] io_worker_handle_work+0x501/0x8a0
[ 27.664836] io_wqe_worker+0x158/0x520
[ 27.667726] kthread+0x134/0x180
[ 27.669641] ret_from_fork+0x1f/0x30
Instead of cleaning files on overflow, return back overflow cancellation
into io_uring_cancel_files(). Previously it was racy to clean
REQ_F_OVERFLOW flag, but we got rid of it, and can do it through
repetitive attempts targeting all matching requests.
Reported-by: Abaci <abaci@linux.alibaba.com>
Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
RHBZ 1848178
The original intent of returning an error in this function
in the patch:
"CIFS: Mask off signals when sending SMB packets"
was to avoid interrupting packet send in the middle of
sending the data (and thus breaking an SMB connection),
but we also don't want to fail the request for non-fatal
signals even before we have had a chance to try to
send it (the reported problem could be reproduced e.g.
by exiting a child process when the parent process was in
the midst of calling futimens to update a file's timestamps).
In addition, since the signal may remain pending when we enter the
sending loop, we may end up not sending the whole packet before
TCP buffers become full. In this case the code returns -EINTR
but what we need here is to return -ERESTARTSYS instead to
allow system calls to be restarted.
Fixes: b30c74c73c78 ("CIFS: Mask off signals when sending SMB packets")
Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Pull ceph fixes from Ilya Dryomov:
"A patch to zero out sensitive cryptographic data and two minor
cleanups prompted by the fact that a bunch of code was moved in this
cycle"
* tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-client:
libceph: fix "Boolean result is used in bitwise operation" warning
libceph, ceph: disambiguate ceph_connection_operations handlers
libceph: zero out session key and connection secret
|
|
Sockets and other non-regular files may actually expect short reads to
happen, don't retry reads for them. Because non-reg files don't set
FMODE_BUF_RASYNC and so it won't do second/retry do_read, we can filter
out those cases after first do_read() attempt with ret>0.
Cc: stable@vger.kernel.org # 5.9+
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
IORING_OP_CLOSE is special in terms of cancelation, since it has an
intermediate state where we've removed the file descriptor but hasn't
closed the file yet. For that reason, it's currently marked with
IO_WQ_WORK_NO_CANCEL to prevent cancelation. This ensures that the op
is always run even if canceled, to prevent leaving us with a live file
but an fd that is gone. However, with SQPOLL, since a cancel request
doesn't carry any resources on behalf of the request being canceled, if
we cancel before any of the close op has been run, we can end up with
io-wq not having the ->files assigned. This can result in the following
oops reported by Joseph:
BUG: kernel NULL pointer dereference, address: 00000000000000d8
PGD 800000010b76f067 P4D 800000010b76f067 PUD 10b462067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 1 PID: 1788 Comm: io_uring-sq Not tainted 5.11.0-rc4 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:__lock_acquire+0x19d/0x18c0
Code: 00 00 8b 1d fd 56 dd 08 85 db 0f 85 43 05 00 00 48 c7 c6 98 7b 95 82 48 c7 c7 57 96 93 82 e8 9a bc f5 ff 0f 0b e9 2b 05 00 00 <48> 81 3f c0 ca 67 8a b8 00 00 00 00 41 0f 45 c0 89 04 24 e9 81 fe
RSP: 0018:ffffc90001933828 EFLAGS: 00010002
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000d8
RBP: 0000000000000246 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff888106e8a140 R15: 00000000000000d8
FS: 0000000000000000(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000d8 CR3: 0000000106efa004 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
lock_acquire+0x31a/0x440
? close_fd_get_file+0x39/0x160
? __lock_acquire+0x647/0x18c0
_raw_spin_lock+0x2c/0x40
? close_fd_get_file+0x39/0x160
close_fd_get_file+0x39/0x160
io_issue_sqe+0x1334/0x14e0
? lock_acquire+0x31a/0x440
? __io_free_req+0xcf/0x2e0
? __io_free_req+0x175/0x2e0
? find_held_lock+0x28/0xb0
? io_wq_submit_work+0x7f/0x240
io_wq_submit_work+0x7f/0x240
io_wq_cancel_cb+0x161/0x580
? io_wqe_wake_worker+0x114/0x360
? io_uring_get_socket+0x40/0x40
io_async_find_and_cancel+0x3b/0x140
io_issue_sqe+0xbe1/0x14e0
? __lock_acquire+0x647/0x18c0
? __io_queue_sqe+0x10b/0x5f0
__io_queue_sqe+0x10b/0x5f0
? io_req_prep+0xdb/0x1150
? mark_held_locks+0x6d/0xb0
? mark_held_locks+0x6d/0xb0
? io_queue_sqe+0x235/0x4b0
io_queue_sqe+0x235/0x4b0
io_submit_sqes+0xd7e/0x12a0
? _raw_spin_unlock_irq+0x24/0x30
? io_sq_thread+0x3ae/0x940
io_sq_thread+0x207/0x940
? do_wait_intr_irq+0xc0/0xc0
? __ia32_sys_io_uring_enter+0x650/0x650
kthread+0x134/0x180
? kthread_create_worker_on_cpu+0x90/0x90
ret_from_fork+0x1f/0x30
Fix this by moving the IO_WQ_WORK_NO_CANCEL until _after_ we've modified
the fdtable. Canceling before this point is totally fine, and running
it in the io-wq context _after_ that point is also fine.
For 5.12, we'll handle this internally and get rid of the no-cancel
flag, as IORING_OP_CLOSE is the only user of it.
Cc: stable@vger.kernel.org
Fixes: b5dba59e0cf7 ("io_uring: add support for IORING_OP_CLOSE")
Reported-by: "Abaci <abaci@linux.alibaba.com>"
Reviewed-and-tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fs and udf fixes from Jan Kara:
"A lazytime handling fix from Eric Biggers and a fix of UDF session
handling for large devices"
* tag 'fs_for_v5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: fix the problem that the disc content is not displayed
fs: fix lazytime expiration handling in __writeback_single_inode()
|
|
Wire up the splice_read and splice_write methods to the default
helpers using ->read_iter and ->write_iter now that those are
implemented for kernfs. This restores support to use splice and
sendfile on kernfs files.
Fixes: 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops")
Reported-by: Siddharth Gupta <sidgup@codeaurora.org>
Tested-by: Siddharth Gupta <sidgup@codeaurora.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210120204631.274206-4-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Switch kernfs to implement the write_iter method instead of plain old
write to prepare to supporting splice and sendfile again.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210120204631.274206-3-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Switch kernfs to implement the read_iter method instead of plain old
read to prepare to supporting splice and sendfile again.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210120204631.274206-2-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more one line fixes for various bugs, stable material.
- fix send when emitting clone operation from the same file and root
- fix double free on error when cleaning backrefs
- lockdep fix during relocation
- handle potential error during reloc when starting transaction
- skip running delayed refs during commit (leftover from code removal
in this dev cycle)"
* tag 'for-5.11-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't clear ret in btrfs_start_dirty_block_groups
btrfs: fix lockdep splat in btrfs_recover_relocation
btrfs: do not double free backref nodes on error
btrfs: don't get an EINTR during drop_snapshot for reloc
btrfs: send: fix invalid clone operations when cloning from the same file and root
btrfs: no need to run delayed refs after commit_fs_roots during commit
|
|
After the recent actions to convert readpages aops to readahead, the
NULL checks of readpages aops in cachefiles_read_or_alloc_page() may
hit falsely. More badly, it's an ASSERT() call, and this panics.
Drop the superfluous NULL checks for fixing this regression.
[DH: Note that cachefiles never actually used readpages, so this check was
never actually necessary]
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=208883
BugLink: https://bugzilla.opensuse.org/show_bug.cgi?id=1175245
Fixes: 9ae326a69004 ("CacheFiles: A cache that backs onto a mounted filesystem")
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Avoid exposing parent of root directory in NFSv3 READDIRPLUS results
- Fix a tracepoint change that went in the initial 5.11 merge
* tag 'nfsd-5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
SUNRPC: Move the svc_xdr_recvfrom tracepoint again
nfsd4: readdirplus shouldn't return parent of export
|
|
If we fail to update a block group item in the loop we'll break, however
we'll do btrfs_run_delayed_refs and lose our error value in ret, and
thus not clean up properly. Fix this by only running the delayed refs
if there was no failure.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
While testing the error paths of relocation I hit the following lockdep
splat:
======================================================
WARNING: possible circular locking dependency detected
5.10.0-rc6+ #217 Not tainted
------------------------------------------------------
mount/779 is trying to acquire lock:
ffffa0e676945418 (&fs_info->balance_mutex){+.+.}-{3:3}, at: btrfs_recover_balance+0x2f0/0x340
but task is already holding lock:
ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-root-00){++++}-{3:3}:
down_read_nested+0x43/0x130
__btrfs_tree_read_lock+0x27/0x100
btrfs_read_lock_root_node+0x31/0x40
btrfs_search_slot+0x462/0x8f0
btrfs_update_root+0x55/0x2b0
btrfs_drop_snapshot+0x398/0x750
clean_dirty_subvols+0xdf/0x120
btrfs_recover_relocation+0x534/0x5a0
btrfs_start_pre_rw_mount+0xcb/0x170
open_ctree+0x151f/0x1726
btrfs_mount_root.cold+0x12/0xea
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x10d/0x380
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
path_mount+0x433/0xc10
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (sb_internal#2){.+.+}-{0:0}:
start_transaction+0x444/0x700
insert_balance_item.isra.0+0x37/0x320
btrfs_balance+0x354/0xf40
btrfs_ioctl_balance+0x2cf/0x380
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_info->balance_mutex){+.+.}-{3:3}:
__lock_acquire+0x1120/0x1e10
lock_acquire+0x116/0x370
__mutex_lock+0x7e/0x7b0
btrfs_recover_balance+0x2f0/0x340
open_ctree+0x1095/0x1726
btrfs_mount_root.cold+0x12/0xea
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x10d/0x380
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
path_mount+0x433/0xc10
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
&fs_info->balance_mutex --> sb_internal#2 --> btrfs-root-00
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-root-00);
lock(sb_internal#2);
lock(btrfs-root-00);
lock(&fs_info->balance_mutex);
*** DEADLOCK ***
2 locks held by mount/779:
#0: ffffa0e60dc040e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
#1: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100
stack backtrace:
CPU: 0 PID: 779 Comm: mount Not tainted 5.10.0-rc6+ #217
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x8b/0xb0
check_noncircular+0xcf/0xf0
? trace_call_bpf+0x139/0x260
__lock_acquire+0x1120/0x1e10
lock_acquire+0x116/0x370
? btrfs_recover_balance+0x2f0/0x340
__mutex_lock+0x7e/0x7b0
? btrfs_recover_balance+0x2f0/0x340
? btrfs_recover_balance+0x2f0/0x340
? rcu_read_lock_sched_held+0x3f/0x80
? kmem_cache_alloc_trace+0x2c4/0x2f0
? btrfs_get_64+0x5e/0x100
btrfs_recover_balance+0x2f0/0x340
open_ctree+0x1095/0x1726
btrfs_mount_root.cold+0x12/0xea
? rcu_read_lock_sched_held+0x3f/0x80
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x10d/0x380
? __kmalloc_track_caller+0x2f2/0x320
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
? capable+0x3a/0x60
path_mount+0x433/0xc10
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This is straightforward to fix, simply release the path before we setup
the balance_ctl.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Zygo reported the following KASAN splat:
BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420
Read of size 8 at addr ffff888112402950 by task btrfs/28836
CPU: 0 PID: 28836 Comm: btrfs Tainted: G W 5.10.0-e35f27394290-for-next+ #23
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
dump_stack+0xbc/0xf9
? btrfs_backref_cleanup_node+0x18a/0x420
print_address_description.constprop.8+0x21/0x210
? record_print_text.cold.34+0x11/0x11
? btrfs_backref_cleanup_node+0x18a/0x420
? btrfs_backref_cleanup_node+0x18a/0x420
kasan_report.cold.10+0x20/0x37
? btrfs_backref_cleanup_node+0x18a/0x420
__asan_load8+0x69/0x90
btrfs_backref_cleanup_node+0x18a/0x420
btrfs_backref_release_cache+0x83/0x1b0
relocate_block_group+0x394/0x780
? merge_reloc_roots+0x4a0/0x4a0
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
? check_flags.part.50+0x6c/0x1e0
? btrfs_relocate_chunk+0x120/0x120
? kmem_cache_alloc_trace+0xa06/0xcb0
? _copy_from_user+0x83/0xc0
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
? __kasan_check_read+0x11/0x20
? check_chain_key+0x1f4/0x2f0
? __asan_loadN+0xf/0x20
? btrfs_ioctl_get_supported_features+0x30/0x30
? kvm_sched_clock_read+0x18/0x30
? check_chain_key+0x1f4/0x2f0
? lock_downgrade+0x3f0/0x3f0
? handle_mm_fault+0xad6/0x2150
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? check_flags.part.50+0x6c/0x1e0
? check_flags.part.50+0x6c/0x1e0
? check_flags+0x26/0x30
? lock_is_held_type+0xc3/0xf0
? syscall_enter_from_user_mode+0x1b/0x60
? do_syscall_64+0x13/0x80
? rcu_read_lock_sched_held+0xa1/0xd0
? __kasan_check_read+0x11/0x20
? __fget_light+0xae/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f4c4bdfe427
Allocated by task 28836:
kasan_save_stack+0x21/0x50
__kasan_kmalloc.constprop.18+0xbe/0xd0
kasan_kmalloc+0x9/0x10
kmem_cache_alloc_trace+0x410/0xcb0
btrfs_backref_alloc_node+0x46/0xf0
btrfs_backref_add_tree_node+0x60d/0x11d0
build_backref_tree+0xc5/0x700
relocate_tree_blocks+0x2be/0xb90
relocate_block_group+0x2eb/0x780
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 28836:
kasan_save_stack+0x21/0x50
kasan_set_track+0x20/0x30
kasan_set_free_info+0x1f/0x30
__kasan_slab_free+0xf3/0x140
kasan_slab_free+0xe/0x10
kfree+0xde/0x200
btrfs_backref_error_cleanup+0x452/0x530
build_backref_tree+0x1a5/0x700
relocate_tree_blocks+0x2be/0xb90
relocate_block_group+0x2eb/0x780
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This occurred because we freed our backref node in
btrfs_backref_error_cleanup(), but then tried to free it again in
btrfs_backref_release_cache(). This is because
btrfs_backref_release_cache() will cycle through all of the
cache->leaves nodes and free them up. However
btrfs_backref_error_cleanup() freed the backref node with
btrfs_backref_free_node(), which simply kfree()d the backref node
without unlinking it from the cache. Change this to a
btrfs_backref_drop_node(), which does the appropriate cleanup and
removes the node from the cache->leaves list, so when we go to free the
remaining cache we don't trip over items we've already dropped.
Fixes: 75bfb9aff45e ("Btrfs: cleanup error handling in build_backref_tree")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This was partially fixed by f3e3d9cc3525 ("btrfs: avoid possible signal
interruption of btrfs_drop_snapshot() on relocation tree"), however it
missed a spot when we restart a trans handle because we need to end the
transaction. The fix is the same, simply use btrfs_join_transaction()
instead of btrfs_start_transaction() when deleting reloc roots.
Fixes: f3e3d9cc3525 ("btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When the capacity of the disc is too large (assuming the 4.7G
specification), the disc (UDF file system) will be burned
multiple times in the windows (Multisession Usage). When the
remaining capacity of the CD is less than 300M (estimated
value, for reference only), open the CD in the Linux system,
the content of the CD is displayed as blank (the kernel will
say "No VRS found"). Windows can display the contents of the
CD normally.
Through analysis, in the "fs/udf/super.c": udf_check_vsd
function, the actual value of VSD_MAX_SECTOR_OFFSET may
be much larger than 0x800000. According to the current code
logic, it is found that the type of sbi->s_session is "__s32",
when the remaining capacity of the disc is less than 300M
(take a set of test values: sector=3154903040,
sbi->s_session=1540464, sb->s_blocksize_bits=11 ), the
calculation result of "sbi->s_session << sb->s_blocksize_bits"
will overflow. Therefore, it is necessary to convert the
type of s_session to "loff_t" (when udf_check_vsd starts,
assign a value to _sector, which is also converted in this
way), so that the result will not overflow, and then the
content of the disc can be displayed normally.
Link: https://lore.kernel.org/r/20210114075741.30448-1-changlianzhi@uniontech.com
Signed-off-by: lianzhi chang <changlianzhi@uniontech.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Fix the follow warnings:
./fs/cifs/connect.c: WARNING: Comparison of 0/1 to bool variable
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Fix the following coccicheck warnings:
./fs/cifs/connect.c:3386:2-21: WARNING: Assignment of 0/1 to
bool variable.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Pull misc vfs fixes from Al Viro:
"Several assorted fixes.
I still think that audit ->d_name race is better fixed this way for
the benefit of backports, with any possibly fancier variants done on
top of it"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
dump_common_audit_data(): fix racy accesses to ->d_name
iov_iter: fix the uaccess area in copy_compat_iovec_from_user
umount(2): move the flag validity checks first
|
|
If there are no requests at the time __io_uring_task_cancel() is called,
tctx_inflight() returns zero and and it terminates not getting a chance
to go through __io_uring_files_cancel() and do
io_disable_sqo_submit(). And we absolutely want them disabled by the
time cancellation ends.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Fixes: d9d05217cb69 ("io_uring: stop SQPOLL submit on creator's death")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
WARNING: CPU: 1 PID: 11100 at fs/io_uring.c:9096
io_uring_flush+0x326/0x3a0 fs/io_uring.c:9096
RIP: 0010:io_uring_flush+0x326/0x3a0 fs/io_uring.c:9096
Call Trace:
filp_close+0xb4/0x170 fs/open.c:1280
close_files fs/file.c:401 [inline]
put_files_struct fs/file.c:416 [inline]
put_files_struct+0x1cc/0x350 fs/file.c:413
exit_files+0x7e/0xa0 fs/file.c:433
do_exit+0xc22/0x2ae0 kernel/exit.c:820
do_group_exit+0x125/0x310 kernel/exit.c:922
get_signal+0x3e9/0x20a0 kernel/signal.c:2770
arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
handle_signal_work kernel/entry/common.c:147 [inline]
exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
exit_to_user_mode_prepare+0x148/0x250 kernel/entry/common.c:201
__syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
entry_SYSCALL_64_after_hwframe+0x44/0xa9
An SQPOLL ring creator task may have gotten rid of its file note during
exit and called io_disable_sqo_submit(), but the io_uring is still left
referenced through fdtable, which will be put during close_files() and
cause a false positive warning.
First split the warning into two for more clarity when is hit, and the
add sqo_dead check to handle the described case.
Reported-by: syzbot+a32b546d58dde07875a1@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
WARNING: CPU: 1 PID: 9094 at fs/io_uring.c:8884
io_disable_sqo_submit+0x106/0x130 fs/io_uring.c:8884
Call Trace:
io_uring_flush+0x28b/0x3a0 fs/io_uring.c:9099
filp_close+0xb4/0x170 fs/open.c:1280
close_fd+0x5c/0x80 fs/file.c:626
__do_sys_close fs/open.c:1299 [inline]
__se_sys_close fs/open.c:1297 [inline]
__x64_sys_close+0x2f/0xa0 fs/open.c:1297
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
io_uring's final close() may be triggered by any task not only the
creator. It's well handled by io_uring_flush() including SQPOLL case,
though a warning in io_disable_sqo_submit() will fallaciously fire by
moving this warning out to the only call site that matters.
Reported-by: syzbot+2f5d1785dc624932da78@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If we're freeing/finishing iopoll requests, ensure we check if the task
is in idling in terms of cancelation. Otherwise we could end up waiting
forever in __io_uring_task_cancel() if the task has active iopoll
requests that need cancelation.
Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull io_uring fixes from Jens Axboe:
"We still have a pending fix for a cancelation issue, but it's still
being investigated. In the meantime:
- Dead mm handling fix (Pavel)
- SQPOLL setup error handling (Pavel)
- Flush timeout sequence fix (Marcelo)
- Missing finish_wait() for one exit case"
* tag 'io_uring-5.11-2021-01-16' of git://git.kernel.dk/linux-block:
io_uring: ensure finish_wait() is always called in __io_uring_task_cancel()
io_uring: flush timeouts that should already have expired
io_uring: do sqo disable on install_fd error
io_uring: fix null-deref in io_disable_sqo_submit
io_uring: don't take files/mm for a dead task
io_uring: drop mm and files after task_work_run
|
|
Turning a pinned page read-only breaks the pinning after COW. Don't do it.
The whole "track page soft dirty" state doesn't work with pinned pages
anyway, since the page might be dirtied by the pinning entity without
ever being noticed in the page tables.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Turning page table entries read-only requires the mmap_sem held for
writing.
So stop doing the odd games with turning things from read locks to write
locks and back. Just get the write lock.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If we enter with requests pending and performm cancelations, we'll have
a different inflight count before and after calling prepare_to_wait().
This causes the loop to restart. If we actually ended up canceling
everything, or everything completed in-between, then we'll break out
of the loop without calling finish_wait() on the waitqueue. This can
trigger a warning on exit_signals(), as we leave the task state in
TASK_UNINTERRUPTIBLE.
Put a finish_wait() after the loop to catch that case.
Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"A number of bug fixes for ext4:
- Fix for the new fast_commit feature
- Fix some error handling codepaths in whiteout handling and
mountpoint sampling
- Fix how we write ext4_error information so it goes through the
journal when journalling is active, to avoid races that can lead to
lost error information, superblock checksum failures, or DIF/DIX
features"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: remove expensive flush on fast commit
ext4: fix bug for rename with RENAME_WHITEOUT
ext4: fix wrong list_splice in ext4_fc_cleanup
ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR
ext4: don't leak old mountpoint samples
ext4: drop ext4_handle_dirty_super()
ext4: fix superblock checksum failure when setting password salt
ext4: use sbi instead of EXT4_SB(sb) in ext4_update_super()
ext4: save error info to sb through journal if available
ext4: protect superblock modifications with a buffer lock
ext4: drop sync argument of ext4_commit_super()
ext4: combine ext4_handle_error() and save_error_info()
|
|
Pull cifs fixes from Steve French:
"Two small cifs fixes for stable (including an important handle leak
fix) and three small cleanup patches"
* tag '5.11-rc3-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: style: replace one-element array with flexible-array
cifs: connect: style: Simplify bool comparison
fs: cifs: remove unneeded variable in smb3_fs_context_dup
cifs: fix interrupted close commands
cifs: check pointer before freeing
|
|
In the fast commit, it adds REQ_FUA and REQ_PREFLUSH on each fast
commit block when barrier is enabled. However, in recovery phase,
ext4 compares CRC value in the tail. So it is sufficient to add
REQ_FUA and REQ_PREFLUSH on the block that has tail.
Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210106013242epcms2p5b6b4ed8ca86f29456fdf56aa580e74b4@epcms2p5
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We got a "deleted inode referenced" warning cross our fsstress test. The
bug can be reproduced easily with following steps:
cd /dev/shm
mkdir test/
fallocate -l 128M img
mkfs.ext4 -b 1024 img
mount img test/
dd if=/dev/zero of=test/foo bs=1M count=128
mkdir test/dir/ && cd test/dir/
for ((i=0;i<1000;i++)); do touch file$i; done # consume all block
cd ~ && renameat2(AT_FDCWD, /dev/shm/test/dir/file1, AT_FDCWD,
/dev/shm/test/dir/dst_file, RENAME_WHITEOUT) # ext4_add_entry in
ext4_rename will return ENOSPC!!
cd /dev/shm/ && umount test/ && mount img test/ && ls -li test/dir/file1
We will get the output:
"ls: cannot access 'test/dir/file1': Structure needs cleaning"
and the dmesg show:
"EXT4-fs error (device loop0): ext4_lookup:1626: inode #2049: comm ls:
deleted inode referenced: 139"
ext4_rename will create a special inode for whiteout and use this 'ino'
to replace the source file's dir entry 'ino'. Once error happens
latter(the error above was the ENOSPC return from ext4_add_entry in
ext4_rename since all space has been consumed), the cleanup do drop the
nlink for whiteout, but forget to restore 'ino' with source file. This
will trigger the bug describle as above.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Fixes: cd808deced43 ("ext4: support RENAME_WHITEOUT")
Link: https://lore.kernel.org/r/20210105062857.3566-1-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After full/fast commit, entries in staging queue are promoted to main
queue. In ext4_fs_cleanup function, it splice to staging queue to
staging queue.
Fixes: aa75f4d3daaeb ("ext4: main fast-commit commit path")
Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201230094851epcms2p6eeead8cc984379b37b2efd21af90fd1a@epcms2p6
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
1: ext4_iget/ext4_find_extent never returns NULL, use IS_ERR
instead of IS_ERR_OR_NULL to fix this.
2: ext4_fc_replay_inode should set the inode to NULL when IS_ERR.
and go to call iput properly.
Fixes: 8016e29f4362 ("ext4: fast commit recovery path")
Signed-off-by: Yi Li <yili@winhong.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201230033827.3996064-1-yili@winhong.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
Right now io_flush_timeouts() checks if the current number of events
is equal to ->timeout.target_seq, but this will miss some timeouts if
there have been more than 1 event added since the last time they were
flushed (possible in io_submit_flush_completions(), for example). Fix
it by recording the last sequence at which timeouts were flushed so
that the number of events seen can be compared to the number of events
needed without overflow.
Signed-off-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is a regular need in the kernel to provide a way to declare
having a dynamically sized set of trailing elements in a structure.
Kernel code should always use "flexible array members"[1] for these
cases. The older style of one-element or zero-length arrays should
no longer be used[2].
[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9/process/
deprecated.html#zero-length-and-one-element-arrays
Signed-off-by: YANG LI <abaci-bugfix@linux.alibaba.com>
Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Fix the following coccicheck warning:
./fs/cifs/connect.c:3740:6-21: WARNING: Comparison of 0/1 to bool
variable
Signed-off-by: YANG LI <abaci-bugfix@linux.alibaba.com>
Reported-by: Abaci Robot<abaci@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
'rc' in smb3_fs_context_dup is not used and can be removed.
Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Retry close command if it gets interrupted to not leak open handles on
the server.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reported-by: Duncan Findlay <duncf@duncf.ca>
Suggested-by: Pavel Shilovsky <pshilov@microsoft.com>
Fixes: 6988a619f5b7 ("cifs: allow syscalls to be restarted in __smb_send_rqst()")
Cc: stable@vger.kernel.org
Reviewd-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
clang static analysis reports this problem
dfs_cache.c:591:2: warning: Argument to kfree() is a constant address
(18446744073709551614), which is not memory allocated by malloc()
kfree(vi);
^~~~~~~~~
In dfs_cache_del_vol() the volume info pointer 'vi' being freed
is the return of a call to find_vol(). The large constant address
is find_vol() returning an error.
Add an error check to dfs_cache_del_vol() similar to the one done
in dfs_cache_update_vol().
Fixes: 54be1f6c1c37 ("cifs: Add DFS cache routines")
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
CC: <stable@vger.kernel.org> # v5.0+
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When lazytime is enabled and an inode is being written due to its
in-memory updated timestamps having expired, either due to a sync() or
syncfs() system call or due to dirtytime_expire_interval having elapsed,
the VFS needs to inform the filesystem so that the filesystem can copy
the inode's timestamps out to the on-disk data structures.
This is done by __writeback_single_inode() calling
mark_inode_dirty_sync(), which then calls ->dirty_inode(I_DIRTY_SYNC).
However, this occurs after __writeback_single_inode() has already
cleared the dirty flags from ->i_state. This causes two bugs:
- mark_inode_dirty_sync() redirties the inode, causing it to remain
dirty. This wastefully causes the inode to be written twice. But
more importantly, it breaks cases where sync_filesystem() is expected
to clean dirty inodes. This includes the FS_IOC_REMOVE_ENCRYPTION_KEY
ioctl (as reported at
https://lore.kernel.org/r/20200306004555.GB225345@gmail.com), as well
as possibly filesystem freezing (freeze_super()).
- Since ->i_state doesn't contain I_DIRTY_TIME when ->dirty_inode() is
called from __writeback_single_inode() for lazytime expiration,
xfs_fs_dirty_inode() ignores the notification. (XFS only cares about
lazytime expirations, and it assumes that i_state will contain
I_DIRTY_TIME during those.) Therefore, lazy timestamps aren't
persisted by sync(), syncfs(), or dirtytime_expire_interval on XFS.
Fix this by moving the call to mark_inode_dirty_sync() to earlier in
__writeback_single_inode(), before the dirty flags are cleared from
i_state. This makes filesystems be properly notified of the timestamp
expiration, and it avoids incorrectly redirtying the inode.
This fixes xfstest generic/580 (which tests
FS_IOC_REMOVE_ENCRYPTION_KEY) when run on ext4 or f2fs with lazytime
enabled. It also fixes the new lazytime xfstest I've proposed, which
reproduces the above-mentioned XFS bug
(https://lore.kernel.org/r/20210105005818.92978-1-ebiggers@kernel.org).
Alternatively, we could call ->dirty_inode(I_DIRTY_SYNC) directly. But
due to the introduction of I_SYNC_QUEUED, mark_inode_dirty_sync() is the
right thing to do because mark_inode_dirty_sync() now knows not to move
the inode to a writeback list if it is currently queued for sync.
Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option")
Cc: stable@vger.kernel.org
Depends-on: 5afced3bf281 ("writeback: Avoid skipping inode writeback")
Link: https://lore.kernel.org/r/20210112190253.64307-2-ebiggers@kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
WARNING: CPU: 0 PID: 8494 at fs/io_uring.c:8717
io_ring_ctx_wait_and_kill+0x4f2/0x600 fs/io_uring.c:8717
Call Trace:
io_uring_release+0x3e/0x50 fs/io_uring.c:8759
__fput+0x283/0x920 fs/file_table.c:280
task_work_run+0xdd/0x190 kernel/task_work.c:140
tracehook_notify_resume include/linux/tracehook.h:189 [inline]
exit_to_user_mode_loop kernel/entry/common.c:174 [inline]
exit_to_user_mode_prepare+0x249/0x250 kernel/entry/common.c:201
__syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
entry_SYSCALL_64_after_hwframe+0x44/0xa9
failed io_uring_install_fd() is a special case, we don't do
io_ring_ctx_wait_and_kill() directly but defer it to fput, though still
need to io_disable_sqo_submit() before.
note: it doesn't fix any real problem, just a warning. That's because
sqring won't be available to the userspace in this case and so SQPOLL
won't submit anything.
Reported-by: syzbot+9c9c35374c0ecac06516@syzkaller.appspotmail.com
Fixes: d9d05217cb69 ("io_uring: stop SQPOLL submit on creator's death")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
general protection fault, probably for non-canonical address
0xdffffc0000000022: 0000 [#1] KASAN: null-ptr-deref
in range [0x0000000000000110-0x0000000000000117]
RIP: 0010:io_ring_set_wakeup_flag fs/io_uring.c:6929 [inline]
RIP: 0010:io_disable_sqo_submit+0xdb/0x130 fs/io_uring.c:8891
Call Trace:
io_uring_create fs/io_uring.c:9711 [inline]
io_uring_setup+0x12b1/0x38e0 fs/io_uring.c:9739
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
io_disable_sqo_submit() might be called before user rings were
allocated, don't do io_ring_set_wakeup_flag() in those cases.
Reported-by: syzbot+ab412638aeb652ded540@syzkaller.appspotmail.com
Fixes: d9d05217cb69 ("io_uring: stop SQPOLL submit on creator's death")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull NFS client fixes from Trond Myklebust:
"Highlights include:
- Fix parsing of link-local IPv6 addresses
- Fix confusing logging of mount errors that was introduced by the
fsopen() patchset.
- Fix a tracing use after free in _nfs4_do_setlk()
- Layout return-on-close fixes when called from nfs4_evict_inode()
- Layout segments were being leaked in
pnfs_generic_clear_request_commit()
- Don't leak DS commits in pnfs_generic_retry_commit()
- Fix an Oopsable use-after-free when nfs_delegation_find_inode_server()
calls iput() on an inode after the super block has gone away"
* tag 'nfs-for-5.11-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: nfs_igrab_and_active must first reference the superblock
NFS: nfs_delegation_find_inode_server must first reference the superblock
NFS/pNFS: Fix a leak of the layout 'plh_outstanding' counter
NFS/pNFS: Don't leak DS commits in pnfs_generic_retry_commit()
NFS/pNFS: Don't call pnfs_free_bucket_lseg() before removing the request
pNFS: Stricter ordering of layoutget and layoutreturn
pNFS: Clean up pnfs_layoutreturn_free_lsegs()
pNFS: We want return-on-close to complete when evicting the inode
pNFS: Mark layout for return if return-on-close was not sent
net: sunrpc: interpret the return value of kstrtou32 correctly
NFS: Adjust fs_context error logging
NFS4: Fix use-after-free in trace_event_raw_event_nfs4_set_lock
|
|
and root
When an incremental send finds an extent that is shared, it checks which
file extent items in the range refer to that extent, and for those it
emits clone operations, while for others it emits regular write operations
to avoid corruption at the destination (as described and fixed by commit
d906d49fc5f4 ("Btrfs: send, fix file corruption due to incorrect cloning
operations")).
However when the root we are cloning from is the send root, we are cloning
from the inode currently being processed and the source file range has
several extent items that partially point to the desired extent, with an
offset smaller than the offset in the file extent item for the range we
want to clone into, it can cause the algorithm to issue a clone operation
that starts at the current eof of the file being processed in the receiver
side, in which case the receiver will fail, with EINVAL, when attempting
to execute the clone operation.
Example reproducer:
$ cat test-send-clone.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV >/dev/null
mount $DEV $MNT
# Create our test file with a single and large extent (1M) and with
# different content for different file ranges that will be reflinked
# later.
xfs_io -f \
-c "pwrite -S 0xab 0 128K" \
-c "pwrite -S 0xcd 128K 128K" \
-c "pwrite -S 0xef 256K 256K" \
-c "pwrite -S 0x1a 512K 512K" \
$MNT/foobar
btrfs subvolume snapshot -r $MNT $MNT/snap1
btrfs send -f /tmp/snap1.send $MNT/snap1
# Now do a series of changes to our file such that we end up with
# different parts of the extent reflinked into different file offsets
# and we overwrite a large part of the extent too, so no file extent
# items refer to that part that was overwritten. This used to confuse
# the algorithm used by the kernel to figure out which file ranges to
# clone, making it attempt to clone from a source range starting at
# the current eof of the file, resulting in the receiver to fail since
# it is an invalid clone operation.
#
xfs_io -c "reflink $MNT/foobar 64K 1M 960K" \
-c "reflink $MNT/foobar 0K 512K 256K" \
-c "reflink $MNT/foobar 512K 128K 256K" \
-c "pwrite -S 0x73 384K 640K" \
$MNT/foobar
btrfs subvolume snapshot -r $MNT $MNT/snap2
btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
echo -e "\nFile digest in the original filesystem:"
md5sum $MNT/snap2/foobar
# Now unmount the filesystem, create a new one, mount it and try to
# apply both send streams to recreate both snapshots.
umount $DEV
mkfs.btrfs -f $DEV >/dev/null
mount $DEV $MNT
btrfs receive -f /tmp/snap1.send $MNT
btrfs receive -f /tmp/snap2.send $MNT
# Must match what we got in the original filesystem of course.
echo -e "\nFile digest in the new filesystem:"
md5sum $MNT/snap2/foobar
umount $MNT
When running the reproducer, the incremental send operation fails due to
an invalid clone operation:
$ ./test-send-clone.sh
wrote 131072/131072 bytes at offset 0
128 KiB, 32 ops; 0.0015 sec (80.906 MiB/sec and 20711.9741 ops/sec)
wrote 131072/131072 bytes at offset 131072
128 KiB, 32 ops; 0.0013 sec (90.514 MiB/sec and 23171.6148 ops/sec)
wrote 262144/262144 bytes at offset 262144
256 KiB, 64 ops; 0.0025 sec (98.270 MiB/sec and 25157.2327 ops/sec)
wrote 524288/524288 bytes at offset 524288
512 KiB, 128 ops; 0.0052 sec (95.730 MiB/sec and 24506.9883 ops/sec)
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
At subvol /mnt/sdi/snap1
linked 983040/983040 bytes at offset 1048576
960 KiB, 1 ops; 0.0006 sec (1.419 GiB/sec and 1550.3876 ops/sec)
linked 262144/262144 bytes at offset 524288
256 KiB, 1 ops; 0.0020 sec (120.192 MiB/sec and 480.7692 ops/sec)
linked 262144/262144 bytes at offset 131072
256 KiB, 1 ops; 0.0018 sec (133.833 MiB/sec and 535.3319 ops/sec)
wrote 655360/655360 bytes at offset 393216
640 KiB, 160 ops; 0.0093 sec (66.781 MiB/sec and 17095.8436 ops/sec)
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
At subvol /mnt/sdi/snap2
File digest in the original filesystem:
9c13c61cb0b9f5abf45344375cb04dfa /mnt/sdi/snap2/foobar
At subvol snap1
At snapshot snap2
ERROR: failed to clone extents to foobar: Invalid argument
File digest in the new filesystem:
132f0396da8f48d2e667196bff882cfc /mnt/sdi/snap2/foobar
The clone operation is invalid because its source range starts at the
current eof of the file in the receiver, causing the receiver to get
an EINVAL error from the clone operation when attempting it.
For the example above, what happens is the following:
1) When processing the extent at file offset 1M, the algorithm checks that
the extent is shared and can be (fully or partially) found at file
offset 0.
At this point the file has a size (and eof) of 1M at the receiver;
2) It finds that our extent item at file offset 1M has a data offset of
64K and, since the file extent item at file offset 0 has a data offset
of 0, it issues a clone operation, from the same file and root, that
has a source range offset of 64K, destination offset of 1M and a length
of 64K, since the extent item at file offset 0 refers only to the first
128K of the shared extent.
After this clone operation, the file size (and eof) at the receiver is
increased from 1M to 1088K (1M + 64K);
3) Now there's still 896K (960K - 64K) of data left to clone or write, so
it checks for the next file extent item, which starts at file offset
128K. This file extent item has a data offset of 0 and a length of
256K, so a clone operation with a source range offset of 256K, a
destination offset of 1088K (1M + 64K) and length of 128K is issued.
After this operation the file size (and eof) at the receiver increases
from 1088K to 1216K (1088K + 128K);
4) Now there's still 768K (896K - 128K) of data left to clone or write, so
it checks for the next file extent item, located at file offset 384K.
This file extent item points to a different extent, not the one we want
to clone, with a length of 640K. So we issue a write operation into the
file range 1216K (1088K + 128K, end of the last clone operation), with
a length of 640K and with a data matching the one we can find for that
range in send root.
After this operation, the file size (and eof) at the receiver increases
from 1216K to 1856K (1216K + 640K);
5) Now there's still 128K (768K - 640K) of data left to clone or write, so
we look into the file extent item, which is for file offset 1M and it
points to the extent we want to clone, with a data offset of 64K and a
length of 960K.
However this matches the file offset we started with, the start of the
range to clone into. So we can't for sure find any file extent item
from here onwards with the rest of the data we want to clone, yet we
proceed and since the file extent item points to the shared extent,
with a data offset of 64K, we issue a clone operation with a source
range starting at file offset 1856K, which matches the file extent
item's offset, 1M, plus the amount of data cloned and written so far,
which is 64K (step 2) + 128K (step 3) + 640K (step 4). This clone
operation is invalid since the source range offset matches the current
eof of the file in the receiver. We should have stopped looking for
extents to clone at this point and instead fallback to write, which
would simply the contain the data in the file range from 1856K to
1856K + 128K.
So fix this by stopping the loop that looks for file ranges to clone at
clone_range() when we reach the current eof of the file being processed,
if we are cloning from the same file and using the send root as the clone
root. This ensures any data not yet cloned will be sent to the receiver
through a write operation.
A test case for fstests will follow soon.
Reported-by: Massimo B. <massimo.b@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
Fixes: 11f2069c113e ("Btrfs: send, allow clone operations within the same file")
CC: stable@vger.kernel.org # 5.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The inode number cache has been removed in this dev cycle, there's one
more leftover. We don't need to run the delayed refs again after
commit_fs_roots as stated in the comment, because btrfs_save_ino_cache
is no more since 5297199a8bca ("btrfs: remove inode number cache
feature").
Nothing else between commit_fs_roots and btrfs_qgroup_account_extents
could create new delayed refs so the qgroup consistency should be safe.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If you export a subdirectory of a filesystem, a READDIRPLUS on the root
of that export will return the filehandle of the parent with the ".."
entry.
The filehandle is optional, so let's just not return the filehandle for
".." if we're at the root of an export.
Note that once the client learns one filehandle outside of the export,
they can trivially access the rest of the export using further lookups.
However, it is also not very difficult to guess filehandles outside of
the export. So exporting a subdirectory of a filesystem should
considered equivalent to providing access to the entire filesystem. To
avoid confusion, we recommend only exporting entire filesystems.
Reported-by: Youjipeng <wangzhibei1999@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"More material for stable trees.
- tree-checker: check item end overflow
- fix false warning during relocation regarding extent type
- fix inode flushing logic, caused notable performance regression
(since 5.10)
- debugging fixups:
- print correct offset for reloc tree key
- pass reliable fs_info pointer to error reporting helper"
* tag 'for-5.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: shrink delalloc pages instead of full inodes
btrfs: reloc: fix wrong file extent type check to avoid false ENOENT
btrfs: tree-checker: check if chunk item end overflows
btrfs: prevent NULL pointer dereference in extent_io_tree_panic
btrfs: print the actual offset in btrfs_root_name
|