Age | Commit message (Collapse) | Author | Files | Lines |
|
Commit 7f9fe614407692 ("btrfs: improve global reserve stealing logic"),
added in the 5.8 merge window, introduced another leak for the space_info's
reclaim_size counter. This is very often triggered by the test cases
generic/269 and generic/416 from fstests, producing a stack trace like the
following during unmount:
[37079.155499] ------------[ cut here ]------------
[37079.156844] WARNING: CPU: 2 PID: 2000423 at fs/btrfs/block-group.c:3422 btrfs_free_block_groups+0x2eb/0x300 [btrfs]
[37079.158090] Modules linked in: dm_snapshot btrfs dm_thin_pool (...)
[37079.164440] CPU: 2 PID: 2000423 Comm: umount Tainted: G W 5.7.0-rc7-btrfs-next-62 #1
[37079.165422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), (...)
[37079.167384] RIP: 0010:btrfs_free_block_groups+0x2eb/0x300 [btrfs]
[37079.168375] Code: bd 58 ff ff ff 00 4c 8d (...)
[37079.170199] RSP: 0018:ffffaa53875c7de0 EFLAGS: 00010206
[37079.171120] RAX: ffff98099e701cf8 RBX: ffff98099e2d4000 RCX: 0000000000000000
[37079.172057] RDX: 0000000000000001 RSI: ffffffffc0acc5b1 RDI: 00000000ffffffff
[37079.173002] RBP: ffff98099e701cf8 R08: 0000000000000000 R09: 0000000000000000
[37079.173886] R10: 0000000000000000 R11: 0000000000000000 R12: ffff98099e701c00
[37079.174730] R13: ffff98099e2d5100 R14: dead000000000122 R15: dead000000000100
[37079.175578] FS: 00007f4d7d0a5840(0000) GS:ffff9809ec600000(0000) knlGS:0000000000000000
[37079.176434] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37079.177289] CR2: 0000559224dcc000 CR3: 000000012207a004 CR4: 00000000003606e0
[37079.178152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[37079.178935] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[37079.179675] Call Trace:
[37079.180419] close_ctree+0x291/0x2d1 [btrfs]
[37079.181162] generic_shutdown_super+0x6c/0x100
[37079.181898] kill_anon_super+0x14/0x30
[37079.182641] btrfs_kill_super+0x12/0x20 [btrfs]
[37079.183371] deactivate_locked_super+0x31/0x70
[37079.184012] cleanup_mnt+0x100/0x160
[37079.184650] task_work_run+0x68/0xb0
[37079.185284] exit_to_usermode_loop+0xf9/0x100
[37079.185920] do_syscall_64+0x20d/0x260
[37079.186556] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[37079.187197] RIP: 0033:0x7f4d7d2d9357
[37079.187836] Code: eb 0b 00 f7 d8 64 89 01 48 (...)
[37079.189180] RSP: 002b:00007ffee4e0d368 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[37079.189845] RAX: 0000000000000000 RBX: 00007f4d7d3fb224 RCX: 00007f4d7d2d9357
[37079.190515] RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000559224dc5c90
[37079.191173] RBP: 0000559224dc1970 R08: 0000000000000000 R09: 00007ffee4e0c0e0
[37079.191815] R10: 0000559224dc7b00 R11: 0000000000000246 R12: 0000000000000000
[37079.192451] R13: 0000559224dc5c90 R14: 0000559224dc1a80 R15: 0000559224dc1ba0
[37079.193096] irq event stamp: 0
[37079.193729] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[37079.194379] hardirqs last disabled at (0): [<ffffffff97ab8935>] copy_process+0x755/0x1ea0
[37079.195033] softirqs last enabled at (0): [<ffffffff97ab8935>] copy_process+0x755/0x1ea0
[37079.195700] softirqs last disabled at (0): [<0000000000000000>] 0x0
[37079.196318] ---[ end trace b32710d864dea887 ]---
In the past commit d611add48b717a ("btrfs: fix reclaim counter leak of
space_info objects") fixed similar cases. That commit however has a date
more recent (April 7 2020) then the commit mentioned before (March 13
2020), however it was merged in kernel 5.7 while the older commit, which
introduces a new leak, was merged only in the 5.8 merge window. So the
leak sneaked in unnoticed.
Fix this by making steal_from_global_rsv() remove the ticket using the
helper remove_ticket(), which decrements the reclaim_size counter of the
space_info object.
Fixes: 7f9fe614407692 ("btrfs: improve global reserve stealing logic")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.
This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing signficant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.
Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.
The following represents an example execution demonstrating the race:
CPU0 CPU1 CPU2
reada_for_search reada_for_search
readahead_tree_block readahead_tree_block
find_create_tree_block find_create_tree_block
alloc_extent_buffer alloc_extent_buffer
find_extent_buffer // not found
allocates eb
lock pages
associate pages to eb
insert eb into radix tree
set TREE_REF, refs == 2
unlock pages
read_extent_buffer_pages // WAIT_NONE
not uptodate (brand new eb)
lock_page
if !trylock_page
goto unlock_exit // not an error
free_extent_buffer
release_extent_buffer
atomic_dec_and_test refs to 1
find_extent_buffer // found
try_release_extent_buffer
take refs_lock
reads refs == 1; no io
atomic_inc_not_zero refs to 2
mark_buffer_accessed
check_buffer_tree_ref
// not STALE, won't take refs_lock
refs == 2; TREE_REF set // no action
read_extent_buffer_pages // WAIT_NONE
clear TREE_REF
release_extent_buffer
atomic_dec_and_test refs to 1
unlock_page
still not uptodate (CPU1 read failed on trylock_page)
locks pages
set io_pages > 0
submit io
return
free_extent_buffer
release_extent_buffer
dec refs to 0
delete from radix tree
btrfs_release_extent_buffer_pages
BUG_ON(io_pages > 0)!!!
We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.
To fix it, we apply check_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.
Stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103] release_extent_buffer+0x39/0x90
[1417839.746913] read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645] btrfs_search_slot+0x260/0x9b0
[1417839.768054] btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427] btrfs_get_extent+0x15f/0x830
[1417839.787665] ? submit_extent_page+0xc4/0x1c0
[1417839.797474] ? __do_readpage+0x299/0x7a0
[1417839.806515] __do_readpage+0x33b/0x7a0
[1417839.815171] ? btrfs_releasepage+0x70/0x70
[1417839.824597] extent_readpages+0x28f/0x400
[1417839.833836] read_pages+0x6a/0x1c0
[1417839.841729] ? startup_64+0x2/0x30
[1417839.849624] __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590] filemap_fault+0x6c7/0x990
[1417839.869252] ? xas_load+0x8/0x80
[1417839.876756] ? xas_find+0x150/0x190
[1417839.884839] ? filemap_map_pages+0x295/0x3b0
[1417839.894652] __do_fault+0x32/0x110
[1417839.902540] __handle_mm_fault+0xacd/0x1000
[1417839.912156] handle_mm_fault+0xaa/0x1c0
[1417839.921004] __do_page_fault+0x242/0x4b0
[1417839.930044] ? page_fault+0x8/0x30
[1417839.937933] page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Convert fall through comments to the pseudo-keyword which is now the
preferred way.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The wait_event_... defines evaluate to long so we should not assign it an int as this may truncate
the value.
Reported-by: Marshall Midden <marshallmidden@gmail.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When xfstest generic/035, we found the target file was deleted
if the rename return -EACESS.
In cifs_rename2, we unlink the positive target dentry if rename
failed with EACESS or EEXIST, even if the target dentry is positived
before rename. Then the existing file was deleted.
We should just delete the target file which created during the
rename.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
|
|
The flag from the primary tcon needs to be copied into the volume info
so that cifs_get_tcon will try to enable extensions on the per-user
tcon. At that point, since posix extensions must have already been
enabled on the superblock, don't try to needlessly adjust the mount
flags.
Fixes: ce558b0e17f8 ("smb3: Add posix create context for smb3.11 posix mounts")
Fixes: b326614ea215 ("smb3: allow "posix" mount option to enable new SMB311 protocol extensions")
Signed-off-by: Paul Aurich <paul@darkrain42.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
|
|
Fixes: ca567eb2b3f0 ("SMB3: Allow persistent handle timeout to be configurable on mount")
Signed-off-by: Paul Aurich <paul@darkrain42.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
|
|
Fixes: 3e7a02d47872 ("smb3: allow disabling requesting leases")
Signed-off-by: Paul Aurich <paul@darkrain42.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
|
|
Without this:
- persistent handles will only be enabled for per-user tcons if the
server advertises the 'Continuous Availabity' capability
- resilient handles would never be enabled for per-user tcons
Signed-off-by: Paul Aurich <paul@darkrain42.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
|
|
Ensure multiuser SMB3 mounts use encryption for all users' tcons if the
mount options are configured to require encryption. Without this, only
the primary tcon and IPC tcons are guaranteed to be encrypted. Per-user
tcons would only be encrypted if the server was configured to require
encryption.
Signed-off-by: Paul Aurich <paul@darkrain42.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
|
|
This is useful for distinguishing SMB sessions on a multiuser mount.
Signed-off-by: Paul Aurich <paul@darkrain42.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat fixes from Namjae Jeon:
- Zero out unused characters of FileName field to avoid a complaint
from some fsck tool.
- Fix memory leak on error paths.
- Fix unnecessary VOL_DIRTY set when calling rmdir on non-empty
directory.
- Call sync_filesystem() for read-only remount (Fix generic/452 test in
xfstests)
- Add own fsync() to flush dirty metadata.
* tag 'exfat-for-5.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: flush dirty metadata in fsync
exfat: move setting VOL_DIRTY over exfat_remove_entries()
exfat: call sync_filesystem for read-only remount
exfat: add missing brelse() calls on error paths
exfat: Set the unused characters of FileName field to the value 0000h
|
|
Since 5.7, we've been using task_work to trigger async running of
requests in the context of the original task. This generally works
great, but there's a case where if the task is currently blocked
in the kernel waiting on a condition to become true, it won't process
task_work. Even though the task is woken, it just checks whatever
condition it's waiting on, and goes back to sleep if it's still false.
This is a problem if that very condition only becomes true when that
task_work is run. An example of that is the task registering an eventfd
with io_uring, and it's now blocked waiting on an eventfd read. That
read could depend on a completion event, and that completion event
won't get trigged until task_work has been run.
Use the TWA_SIGNAL notification for task_work, so that we ensure that
the task always runs the work when queued.
Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In flush_delete_work, instead of flushing each individual pending
delayed work item, cancel and re-queue them for immediate execution.
The waiting isn't needed here because we're already waiting for all
queued work items to complete in gfs2_flush_delete_work. This makes the
code more efficient, but more importantly, it avoids sleeping during a
rhashtable walk, inside rcu_read_lock().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Log flush operations (gfs2_log_flush()) can target a specific transaction.
But if the function encounters errors (e.g. io errors) and withdraws,
the transaction was only freed it if was queued to one of the ail lists.
If the withdraw occurred before the transaction was queued to the ail1
list, function ail_drain never freed it. The result was:
BUG gfs2_trans: Objects remaining in gfs2_trans on __kmem_cache_shutdown()
This patch makes log_flush() add the targeted transaction to the ail1
list so that function ail_drain() will find and free it properly.
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Callers expect gfs2_inode_lookup to return an inode pointer or ERR_PTR(error).
Commit b66648ad6dcf caused it to return NULL instead of ERR_PTR(-ESTALE) in
some cases. Fix that.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: b66648ad6dcf ("gfs2: Move inode generation number check into gfs2_inode_lookup")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
I don't understand this code well, but I'm seeing a warning about a
still-referenced inode on unmount, and every other similar filesystem
does a dput() here.
Fixes: e8a79fb14f6b ("nfsd: add nfsd/clients directory")
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
We don't drop the reference on the nfsdfs filesystem with
mntput(nn->nfsd_mnt) until nfsd_exit_net(), but that won't be called
until the nfsd module's unloaded, and we can't unload the module as long
as there's a reference on nfsdfs. So this prevents module unloading.
Fixes: 2c830dd7209b ("nfsd: persist nfsd filesystem across mounts")
Reported-and-Tested-by: Luo Xiaogang <lxgrxd@163.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
This reverts commit e9c15badbb7b ("fs: Do not check if there is a
fsnotify watcher on pseudo inodes"). The commit intended to eliminate
fsnotify-related overhead for pseudo inodes but it is broken in
concept. inotify can receive events of pipe files under /proc/X/fd and
chromium relies on close and open events for sandboxing. Maxim Levitsky
reported the following
Chromium starts as a white rectangle, shows few white rectangles that
resemble its notifications and then crashes.
The stdout output from chromium:
[mlevitsk@starship ~]$chromium-freeworld
mesa: for the --simplifycfg-sink-common option: may only occur zero or one times!
mesa: for the --global-isel-abort option: may only occur zero or one times!
[3379:3379:0628/135151.440930:ERROR:browser_switcher_service.cc(238)] XXX Init()
../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0072
Received signal 11 SEGV_MAPERR 0000004a9048
Crashes are not universal but even if chromium does not crash, it certainly
does not work properly. While filtering just modify and access might be
safe, the benefit is not worth the risk hence the revert.
Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Fixes: e9c15badbb7b ("fs: Do not check if there is a fsnotify watcher on pseudo inodes")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
generic_file_fsync() exfat used could not guarantee the consistency of
a file because it has flushed not dirty metadata but only dirty data pages
for a file.
Instead of that, use exfat_file_fsync() for files and directories so that
it guarantees to commit both the metadata and data pages for a file.
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
|
|
Move setting VOL_DIRTY over exfat_remove_entries() to avoid unneeded
leaving VOL_DIRTY on -ENOTEMPTY.
Fixes: 5f2aa075070c ("exfat: add inode operations")
Cc: stable@vger.kernel.org # v5.7
Reported-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
|
|
We need to commit dirty metadata and pages to disk
before remounting exfat as read-only.
This fixes a failure in xfstests generic/452
generic/452 does the following:
cp something <exfat>/
mount -o remount,ro <exfat>
the <exfat>/something is corrupted. because while
exfat is remounted as read-only, exfat doesn't
have a chance to commit metadata and
vfs invalidates page caches in a block device.
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Acked-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
|
|
If the second exfat_get_dentry() call fails then we need to release
"old_bh" before returning. There is a similar bug in exfat_move_file().
Fixes: 5f2aa075070c ("exfat: add inode operations")
Reported-by: Markus Elfring <Markus.Elfring@web.de>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
|
|
Some fsck tool complain that padding part of the FileName field
is not set to the value 0000h. So let's maintain filesystem cleaner,
as exfat's spec. recommendation.
Signed-off-by: Hyeongseok.Kim <Hyeongseok@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fixes from Ingo Molnar:
- Fix build regression on v4.8 and older
- Robustness fix for TPM log parsing code
- kobject refcount fix for the ESRT parsing code
- Two efivarfs fixes to make it behave more like an ordinary file
system
- Style fixup for zero length arrays
- Fix a regression in path separator handling in the initrd loader
- Fix a missing prototype warning
- Add some kerneldoc headers for newly introduced stub routines
- Allow support for SSDT overrides via EFI variables to be disabled
- Report CPU mode and MMU state upon entry for 32-bit ARM
- Use the correct stack pointer alignment when entering from mixed mode
* tag 'efi-urgent-2020-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/libstub: arm: Print CPU boot mode and MMU state at boot
efi/libstub: arm: Omit arch specific config table matching array on arm64
efi/x86: Setup stack correctly for efi_pe_entry
efi: Make it possible to disable efivar_ssdt entirely
efi/libstub: Descriptions for stub helper functions
efi/libstub: Fix path separator regression
efi/libstub: Fix missing-prototype warning for skip_spaces()
efi: Replace zero-length array and use struct_size() helper
efivarfs: Don't return -EINTR when rate-limiting reads
efivarfs: Update inode modification time for successful writes
efi/esrt: Fix reference count leak in esre_create_sysfs_entry.
efi/tpm: Verify event log header before parsing
efi/x86: Fix build with gcc 4
|
|
The cell name stored in the afs_cell struct is a 64-char + NUL buffer -
when it needs to be able to handle up to AFS_MAXCELLNAME (256 chars) + NUL.
Fix this by changing the array to a pointer and allocating the string.
Found using Coverity.
Fixes: 989782dcdc91 ("afs: Overhaul cell database management")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull cifs fixes from Steve French:
"Six cifs/smb3 fixes, three of them for stable.
Fixes xfstests 451, 313 and 316"
* tag '5.8-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: misc: Use array_size() in if-statement controlling expression
cifs: update ctime and mtime during truncate
cifs/smb3: Fix data inconsistent when punch hole
cifs/smb3: Fix data inconsistent when zero file range
cifs: Fix double add page to memcg when cifs_readpages
cifs: Fix cached_fid refcnt leak in open_shroot
|
|
Pull NFS client bugfixes from Anna Schumaker:
"Stable Fixes:
- xprtrdma: Fix handling of RDMA_ERROR replies
- sunrpc: Fix rollback in rpc_gssd_dummy_populate()
- pNFS/flexfiles: Fix list corruption if the mirror count changes
- NFSv4: Fix CLOSE not waiting for direct IO completion
- SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment()
Other Fixes:
- xprtrdma: Fix a use-after-free with r_xprt->rx_ep
- Fix other xprtrdma races during disconnect
- NFS: Fix memory leak of export_path"
* tag 'nfs-for-5.8-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment()
NFSv4 fix CLOSE not waiting for direct IO compeletion
pNFS/flexfiles: Fix list corruption if the mirror count changes
nfs: Fix memory leak of export_path
sunrpc: fixed rollback in rpc_gssd_dummy_populate()
xprtrdma: Fix handling of RDMA_ERROR replies
xprtrdma: Clean up disconnect
xprtrdma: Clean up synopsis of rpcrdma_flush_disconnect()
xprtrdma: Use re_connect_status safely in rpcrdma_xprt_connect()
xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed
|
|
Pull io_uring fixes from Jens Axboe:
"Three small fixes:
- Close a corner case for polled IO resubmission (Pavel)
- Toss commands when exiting (Pavel)
- Fix SQPOLL conditional reschedule on perpetually busy submit
(Xuan)"
* tag 'io_uring-5.8-2020-06-26' of git://git.kernel.dk/linux-block:
io_uring: fix current->mm NULL dereference on exit
io_uring: fix hanging iopoll in case of -EAGAIN
io_uring: fix io_sq_thread no schedule when busy
|
|
Merge misx fixes from Andrew Morton:
"31 patches.
Subsystems affected by this patch series: hotfixes, mm/pagealloc,
kexec, ocfs2, lib, mm/slab, mm/slab, mm/slub, mm/swap, mm/pagemap,
mm/vmalloc, mm/memcg, mm/gup, mm/thp, mm/vmscan, x86,
mm/memory-hotplug, MAINTAINERS"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (31 commits)
MAINTAINERS: update info for sparse
mm/memory_hotplug.c: fix false softlockup during pfn range removal
mm: remove vmalloc_exec
arm64: use PAGE_KERNEL_ROX directly in alloc_insn_page
x86/hyperv: allocate the hypercall page with only read and execute bits
mm/memory: fix IO cost for anonymous page
mm/swap: fix for "mm: workingset: age nonresident information alongside anonymous pages"
mm: workingset: age nonresident information alongside anonymous pages
doc: THP CoW fault no longer allocate THP
docs: mm/gup: minor documentation update
mm/memcontrol.c: prevent missed memory.low load tears
mm/memcontrol.c: add missed css_put()
mm: memcontrol: handle div0 crash race condition in memory.low
mm/vmalloc.c: fix a warning while make xmldocs
media: omap3isp: remove cacheflush.h
make asm-generic/cacheflush.h more standalone
mm/debug_vm_pgtable: fix build failure with powerpc 8xx
mm/memory.c: properly pte_offset_map_lock/unlock in vm_insert_pages()
mm: fix swap cache node allocation mask
slub: cure list_slab_objects() from double fix
...
|
|
Figuring out the root case for the REMOVE/CLOSE race and
suggesting the solution was done by Neil Brown.
Currently what happens is that direct IO calls hold a reference
on the open context which is decremented as an asynchronous task
in the nfs_direct_complete(). Before reference is decremented,
control is returned to the application which is free to close the
file. When close is being processed, it decrements its reference
on the open_context but since directIO still holds one, it doesn't
sent a close on the wire. It returns control to the application
which is free to do other operations. For instance, it can delete a
file. Direct IO is finally releasing its reference and triggering
an asynchronous close. Which races with the REMOVE. On the server,
REMOVE can be processed before the CLOSE, failing the REMOVE with
EACCES as the file is still opened.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Suggested-by: Neil Brown <neilb@suse.com>
CC: stable@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
If the mirror count changes in the new layout we pick up inside
ff_layout_pg_init_write(), then we can end up adding the
request to the wrong mirror and corrupting the mirror->pg_list.
Fixes: d600ad1f2bdb ("NFS41: pop some layoutget errors to application")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
The try_location function is called within a loop by nfs_follow_referral.
try_location calls nfs4_pathname_string to created the export_path.
nfs4_pathname_string allocates the memory. export_path is stored in the
nfs_fs_context/fs_context structure similarly as hostname and source.
But whereas the ctx hostname and source are freed before assignment,
export_path is not. So if there are multiple loops, the new export_path
will overwrite the old without the old being freed.
So call kfree for export_path.
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
In the ocfs2 disk layout, slot number is 16 bits, but in ocfs2
implementation, slot number is 32 bits. Usually this will not cause any
issue, because slot number is converted from u16 to u32, but
OCFS2_INVALID_SLOT was defined as -1, when an invalid slot number from
disk was obtained, its value was (u16)-1, and it was converted to u32.
Then the following checking in get_local_system_inode will be always
skipped:
static struct inode **get_local_system_inode(struct ocfs2_super *osb,
int type,
u32 slot)
{
BUG_ON(slot == OCFS2_INVALID_SLOT);
...
}
Link: http://lkml.kernel.org/r/20200616183829.87211-5-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The following kernel panic was captured when running nfs server over
ocfs2, at that time ocfs2_test_inode_bit() was checking whether one
inode locating at "blkno" 5 was valid, that is ocfs2 root inode, its
"suballoc_slot" was OCFS2_INVALID_SLOT(65535) and it was allocted from
//global_inode_alloc, but here it wrongly assumed that it was got from per
slot inode alloctor which would cause array overflow and trigger kernel
panic.
BUG: unable to handle kernel paging request at 0000000000001088
IP: [<ffffffff816f6898>] _raw_spin_lock+0x18/0xf0
PGD 1e06ba067 PUD 1e9e7d067 PMD 0
Oops: 0002 [#1] SMP
CPU: 6 PID: 24873 Comm: nfsd Not tainted 4.1.12-124.36.1.el6uek.x86_64 #2
Hardware name: Huawei CH121 V3/IT11SGCA1, BIOS 3.87 02/02/2018
RIP: _raw_spin_lock+0x18/0xf0
RSP: e02b:ffff88005ae97908 EFLAGS: 00010206
RAX: ffff88005ae98000 RBX: 0000000000001088 RCX: 0000000000000000
RDX: 0000000000020000 RSI: 0000000000000009 RDI: 0000000000001088
RBP: ffff88005ae97928 R08: 0000000000000000 R09: ffff880212878e00
R10: 0000000000007ff0 R11: 0000000000000000 R12: 0000000000001088
R13: ffff8800063c0aa8 R14: ffff8800650c27d0 R15: 000000000000ffff
FS: 0000000000000000(0000) GS:ffff880218180000(0000) knlGS:ffff880218180000
CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000001088 CR3: 00000002033d0000 CR4: 0000000000042660
Call Trace:
igrab+0x1e/0x60
ocfs2_get_system_file_inode+0x63/0x3a0 [ocfs2]
ocfs2_test_inode_bit+0x328/0xa00 [ocfs2]
ocfs2_get_parent+0xba/0x3e0 [ocfs2]
reconnect_path+0xb5/0x300
exportfs_decode_fh+0xf6/0x2b0
fh_verify+0x350/0x660 [nfsd]
nfsd4_putfh+0x4d/0x60 [nfsd]
nfsd4_proc_compound+0x3d3/0x6f0 [nfsd]
nfsd_dispatch+0xe0/0x290 [nfsd]
svc_process_common+0x412/0x6a0 [sunrpc]
svc_process+0x123/0x210 [sunrpc]
nfsd+0xff/0x170 [nfsd]
kthread+0xcb/0xf0
ret_from_fork+0x61/0x90
Code: 83 c2 02 0f b7 f2 e8 18 dc 91 ff 66 90 eb bf 0f 1f 40 00 55 48 89 e5 41 56 41 55 41 54 53 0f 1f 44 00 00 48 89 fb ba 00 00 02 00 <f0> 0f c1 17 89 d0 45 31 e4 45 31 ed c1 e8 10 66 39 d0 41 89 c6
RIP _raw_spin_lock+0x18/0xf0
CR2: 0000000000001088
---[ end trace 7264463cd1aac8f9 ]---
Kernel panic - not syncing: Fatal exception
Link: http://lkml.kernel.org/r/20200616183829.87211-4-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Set global_inode_alloc as OCFS2_FIRST_ONLINE_SYSTEM_INODE, that will
make it load during mount. It can be used to test whether some
global/system inodes are valid. One use case is that nfsd will test
whether root inode is valid.
Link: http://lkml.kernel.org/r/20200616183829.87211-3-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "ocfs2: fix nfsd over ocfs2 issues", v2.
This is a series of patches to fix issues on nfsd over ocfs2. patch 1
is to avoid inode removed while nfsd access it patch 2 & 3 is to fix a
panic issue.
This patch (of 4):
When nfsd is getting file dentry using handle or parent dentry of some
dentry, one cluster lock is used to avoid inode removed from other node,
but it still could be removed from local node, so use a rw lock to avoid
this.
Link: http://lkml.kernel.org/r/20200616183829.87211-1-junxiao.bi@oracle.com
Link: http://lkml.kernel.org/r/20200616183829.87211-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fixlet from Jan Kara:
"A performance improvement to reduce impact of fsnotify for inodes
where it isn't used"
* tag 'fsnotify_for_v5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fs: Do not check if there is a fsnotify watcher on pseudo inodes
|
|
Don't reissue requests from io_iopoll_reap_events(), the task may not
have mm, which ends up with NULL. It's better to kill everything off on
exit anyway.
[ 677.734670] RIP: 0010:io_iopoll_complete+0x27e/0x630
...
[ 677.734679] Call Trace:
[ 677.734695] ? __send_signal+0x1f2/0x420
[ 677.734698] ? _raw_spin_unlock_irqrestore+0x24/0x40
[ 677.734699] ? send_signal+0xf5/0x140
[ 677.734700] io_iopoll_getevents+0x12f/0x1a0
[ 677.734702] io_iopoll_reap_events.part.0+0x5e/0xa0
[ 677.734703] io_ring_ctx_wait_and_kill+0x132/0x1c0
[ 677.734704] io_uring_release+0x20/0x30
[ 677.734706] __fput+0xcd/0x230
[ 677.734707] ____fput+0xe/0x10
[ 677.734709] task_work_run+0x67/0xa0
[ 677.734710] do_exit+0x35d/0xb70
[ 677.734712] do_group_exit+0x43/0xa0
[ 677.734713] get_signal+0x140/0x900
[ 677.734715] do_signal+0x37/0x780
[ 677.734717] ? enqueue_hrtimer+0x41/0xb0
[ 677.734718] ? recalibrate_cpu_khz+0x10/0x10
[ 677.734720] ? ktime_get+0x3e/0xa0
[ 677.734721] ? lapic_next_deadline+0x26/0x30
[ 677.734723] ? tick_program_event+0x4d/0x90
[ 677.734724] ? __hrtimer_get_next_event+0x4d/0x80
[ 677.734726] __prepare_exit_to_usermode+0x126/0x1c0
[ 677.734741] prepare_exit_to_usermode+0x9/0x40
[ 677.734742] idtentry_exit_cond_rcu+0x4c/0x60
[ 677.734743] sysvec_reschedule_ipi+0x92/0x160
[ 677.734744] ? asm_sysvec_reschedule_ipi+0xa/0x20
[ 677.734745] asm_sysvec_reschedule_ipi+0x12/0x20
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_do_iopoll() won't do anything with a request unless
req->iopoll_completed is set. So io_complete_rw_iopoll() has to set
it, otherwise io_do_iopoll() will poll a file again and again even
though the request of interest was completed long time ago.
Also, remove -EAGAIN check from io_issue_sqe() as it races with
the changed lines. The request will take the long way and be
resubmitted from io_iopoll*().
io_kiocb's result and iopoll_completed")
Fixes: bbde017a32b3 ("io_uring: add memory barrier to synchronize
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fix from Gao Xiang:
"Fix a regression which uses potential uninitialized high 32-bit value
unexpectedly recently observed with specific compiler options"
* tag 'erofs-for-5.8-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix partially uninitialized misuse in z_erofs_onlinepage_fixup
|
|
Hongyu reported "id != index" in z_erofs_onlinepage_fixup() with
specific aarch64 environment easily, which wasn't shown before.
After digging into that, I found that high 32 bits of page->private
was set to 0xaaaaaaaa rather than 0 (due to z_erofs_onlinepage_init
behavior with specific compiler options). Actually we only use low
32 bits to keep the page information since page->private is only 4
bytes on most 32-bit platforms. However z_erofs_onlinepage_fixup()
uses the upper 32 bits by mistake.
Let's fix it now.
Reported-and-tested-by: Hongyu Jin <hongyu.jin@unisoc.com>
Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support")
Cc: <stable@vger.kernel.org> # 4.19+
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20200618234349.22553-1-hsiangkao@aol.com
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
|
|
Use array_size() instead of the open-coded version in the controlling
expression of the if statement.
Also, while there, use the preferred form for passing a size of a struct.
The alternative form where struct name is spelled out hurts readability
and introduces an opportunity for a bug when the pointer variable type is
changed but the corresponding sizeof that is passed as argument is not.
This issue was found with the help of Coccinelle and, audited and fixed
manually.
Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
As the man description of the truncate, if the size changed,
then the st_ctime and st_mtime fields should be updated. But
in cifs, we doesn't do it.
It lead the xfstests generic/313 failed.
So, add the ATTR_MTIME|ATTR_CTIME flags on attrs when change
the file size
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When punch hole success, we also can read old data from file:
# strace -e trace=pread64,fallocate xfs_io -f -c "pread 20 40" \
-c "fpunch 20 40" -c"pread 20 40" file
pread64(3, " version 5.8.0-rc1+"..., 40, 20) = 40
fallocate(3, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 20, 40) = 0
pread64(3, " version 5.8.0-rc1+"..., 40, 20) = 40
CIFS implements the fallocate(FALLOCATE_FL_PUNCH_HOLE) with send SMB
ioctl(FSCTL_SET_ZERO_DATA) to server. It just set the range of the
remote file to zero, but local page caches not updated, then the
local page caches inconsistent with server.
Also can be found by xfstests generic/316.
So, we need to remove the page caches before send the SMB
ioctl(FSCTL_SET_ZERO_DATA) to server.
Fixes: 31742c5a33176 ("enable fallocate punch hole ("fallocate -p") for SMB3")
Suggested-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Cc: stable@vger.kernel.org # v3.17
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
CIFS implements the fallocate(FALLOC_FL_ZERO_RANGE) with send SMB
ioctl(FSCTL_SET_ZERO_DATA) to server. It just set the range of the
remote file to zero, but local page cache not update, then the data
inconsistent with server, which leads the xfstest generic/008 failed.
So we need to remove the local page caches before send SMB
ioctl(FSCTL_SET_ZERO_DATA) to server. After next read, it will
re-cache it.
Fixes: 30175628bf7f5 ("[SMB3] Enable fallocate -z support for SMB3 mounts")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Cc: stable@vger.kernel.org # v3.17
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When the user consumes and generates sqe at a fast rate,
io_sqring_entries can always get sqe, and ret will not be equal to -EBUSY,
so that io_sq_thread will never call cond_resched or schedule, and then
we will get the following system error prompt:
rcu: INFO: rcu_sched self-detected stall on CPU
or
watchdog: BUG: soft lockup-CPU#23 stuck for 112s! [io_uring-sq:1863]
This patch checks whether need to call cond_resched() by checking
the need_resched() function every cycle.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When xfstests generic/451, there is an BUG at mm/memcontrol.c:
page:ffffea000560f2c0 refcount:2 mapcount:0 mapping:000000008544e0ea
index:0xf
mapping->aops:cifs_addr_ops dentry name:"tst-aio-dio-cycle-write.451"
flags: 0x2fffff80000001(locked)
raw: 002fffff80000001 ffffc90002023c50 ffffea0005280088 ffff88815cda0210
raw: 000000000000000f 0000000000000000 00000002ffffffff ffff88817287d000
page dumped because: VM_BUG_ON_PAGE(page->mem_cgroup)
page->mem_cgroup:ffff88817287d000
------------[ cut here ]------------
kernel BUG at mm/memcontrol.c:2659!
invalid opcode: 0000 [#1] SMP
CPU: 2 PID: 2038 Comm: xfs_io Not tainted 5.8.0-rc1 #44
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_
073836-buildvm-ppc64le-16.ppc.4
RIP: 0010:commit_charge+0x35/0x50
Code: 0d 48 83 05 54 b2 02 05 01 48 89 77 38 c3 48 c7
c6 78 4a ea ba 48 83 05 38 b2 02 05 01 e8 63 0d9
RSP: 0018:ffffc90002023a50 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88817287d000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88817ac97ea0 RDI: ffff88817ac97ea0
RBP: ffffea000560f2c0 R08: 0000000000000203 R09: 0000000000000005
R10: 0000000000000030 R11: ffffc900020237a8 R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000000001 R15: ffff88815a1272c0
FS: 00007f5071ab0800(0000) GS:ffff88817ac80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055efcd5ca000 CR3: 000000015d312000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
mem_cgroup_charge+0x166/0x4f0
__add_to_page_cache_locked+0x4a9/0x710
add_to_page_cache_locked+0x15/0x20
cifs_readpages+0x217/0x1270
read_pages+0x29a/0x670
page_cache_readahead_unbounded+0x24f/0x390
__do_page_cache_readahead+0x3f/0x60
ondemand_readahead+0x1f1/0x470
page_cache_async_readahead+0x14c/0x170
generic_file_buffered_read+0x5df/0x1100
generic_file_read_iter+0x10c/0x1d0
cifs_strict_readv+0x139/0x170
new_sync_read+0x164/0x250
__vfs_read+0x39/0x60
vfs_read+0xb5/0x1e0
ksys_pread64+0x85/0xf0
__x64_sys_pread64+0x22/0x30
do_syscall_64+0x69/0x150
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5071fcb1af
Code: Bad RIP value.
RSP: 002b:00007ffde2cdb8e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
RAX: ffffffffffffffda RBX: 00007ffde2cdb990 RCX: 00007f5071fcb1af
RDX: 0000000000001000 RSI: 000055efcd5ca000 RDI: 0000000000000003
RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000001000 R11: 0000000000000293 R12: 0000000000000001
R13: 000000000009f000 R14: 0000000000000000 R15: 0000000000001000
Modules linked in:
---[ end trace 725fa14a3e1af65c ]---
Since commit 3fea5a499d57 ("mm: memcontrol: convert page cache to a new
mem_cgroup_charge() API") not cancel the page charge, the pages maybe
double add to pagecache:
thread1 | thread2
cifs_readpages
readpages_get_pages
add_to_page_cache_locked(head,index=n)=0
| readpages_get_pages
| add_to_page_cache_locked(head,index=n+1)=0
add_to_page_cache_locked(head, index=n+1)=-EEXIST
then, will next loop with list head page's
index=n+1 and the page->mapping not NULL
readpages_get_pages
add_to_page_cache_locked(head, index=n+1)
commit_charge
VM_BUG_ON_PAGE
So, we should not do the next loop when any page add to page cache
failed.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A number of fixes, located in two areas, one performance fix and one
fixup for better integration with another patchset.
- bug fixes in nowait aio:
- fix snapshot creation hang after nowait-aio was used
- fix failure to write to prealloc extent past EOF
- don't block when extent range is locked
- block group fixes:
- relocation failure when scrub runs in parallel
- refcount fix when removing fails
- fix race between removal and creation
- space accounting fixes
- reinstante fast path check for log tree at unlink time, fixes
performance drop up to 30% in REAIM
- kzfree/kfree fixup to ease treewide patchset renaming kzfree"
* tag 'for-5.8-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: use kfree() in btrfs_ioctl_get_subvol_info()
btrfs: fix RWF_NOWAIT writes blocking on extent locks and waiting for IO
btrfs: fix RWF_NOWAIT write not failling when we need to cow
btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eof
btrfs: fix hang on snapshot creation after RWF_NOWAIT write
btrfs: check if a log root exists before locking the log_mutex on unlink
btrfs: fix bytes_may_use underflow when running balance and scrub in parallel
btrfs: fix data block group relocation failure due to concurrent scrub
btrfs: fix race between block group removal and block group creation
btrfs: fix a block group ref counter leak after failure to remove block group
|
|
xlog_wait() on the CIL context can reference a freed context if the
waiter doesn't get scheduled before the CIL context is freed. This
can happen when a task is on the hard throttle and the CIL push
aborts due to a shutdown. This was detected by generic/019:
thread 1 thread 2
__xfs_trans_commit
xfs_log_commit_cil
<CIL size over hard throttle limit>
xlog_wait
schedule
xlog_cil_push_work
wake_up_all
<shutdown aborts commit>
xlog_cil_committed
kmem_free
remove_wait_queue
spin_lock_irqsave --> UAF
Fix it by moving the wait queue to the CIL rather than keeping it in
in the CIL context that gets freed on push completion. Because the
wait queue is now independent of the CIL context and we might have
multiple contexts in flight at once, only wake the waiters on the
push throttle when the context we are pushing is over the hard
throttle size threshold.
Fixes: 0e7ab7efe7745 ("xfs: Throttle commits on delayed background CIL push")
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|