summaryrefslogtreecommitdiffstats
path: root/fs/ext4
AgeCommit message (Collapse)AuthorFilesLines
2022-11-07ext4: fix use-after-free in ext4_ext_shift_extentsBaokun Li1-5/+13
If the starting position of our insert range happens to be in the hole between the two ext4_extent_idx, because the lblk of the ext4_extent in the previous ext4_extent_idx is always less than the start, which leads to the "extent" variable access across the boundary, the following UAF is triggered: ================================================================== BUG: KASAN: use-after-free in ext4_ext_shift_extents+0x257/0x790 Read of size 4 at addr ffff88819807a008 by task fallocate/8010 CPU: 3 PID: 8010 Comm: fallocate Tainted: G E 5.10.0+ #492 Call Trace: dump_stack+0x7d/0xa3 print_address_description.constprop.0+0x1e/0x220 kasan_report.cold+0x67/0x7f ext4_ext_shift_extents+0x257/0x790 ext4_insert_range+0x5b6/0x700 ext4_fallocate+0x39e/0x3d0 vfs_fallocate+0x26f/0x470 ksys_fallocate+0x3a/0x70 __x64_sys_fallocate+0x4f/0x60 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ================================================================== For right shifts, we can divide them into the following situations: 1. When the first ee_block of ext4_extent_idx is greater than or equal to start, make right shifts directly from the first ee_block. 1) If it is greater than start, we need to continue searching in the previous ext4_extent_idx. 2) If it is equal to start, we can exit the loop (iterator=NULL). 2. When the first ee_block of ext4_extent_idx is less than start, then traverse from the last extent to find the first extent whose ee_block is less than start. 1) If extent is still the last extent after traversal, it means that the last ee_block of ext4_extent_idx is less than start, that is, start is located in the hole between idx and (idx+1), so we can exit the loop directly (break) without right shifts. 2) Otherwise, make right shifts at the corresponding position of the found extent, and then exit the loop (iterator=NULL). Fixes: 331573febb6a ("ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate") Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20220922120434.1294789-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-06Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds6-7/+21
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix a number of bugs, including some regressions, the most serious of which was one which would cause online resizes to fail with file systems with metadata checksums enabled. Also fix a warning caused by the newly added fortify string checker, plus some bugs that were found using fuzzed file systems" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix fortify warning in fs/ext4/fast_commit.c:1551 ext4: fix wrong return err in ext4_load_and_init_journal() ext4: fix warning in 'ext4_da_release_space' ext4: fix BUG_ON() when directory entry has invalid rec_len ext4: update the backup superblock's at the end of the online resize
2022-11-06ext4: fix fortify warning in fs/ext4/fast_commit.c:1551Theodore Ts'o1-2/+3
With the new fortify string system, rework the memcpy to avoid this warning: memcpy: detected field-spanning write (size 60) of single field "&raw_inode->i_generation" at fs/ext4/fast_commit.c:1551 (size 4) Cc: stable@kernel.org Fixes: 54d9469bc515 ("fortify: Add run-time WARN for cross-field memcpy()") Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-06ext4: fix wrong return err in ext4_load_and_init_journal()Jason Yan1-1/+1
The return value is wrong in ext4_load_and_init_journal(). The local variable 'err' need to be initialized before goto out. The original code in __ext4_fill_super() is fine because it has two return values 'ret' and 'err' and 'ret' is initialized as -EINVAL. After we factor out ext4_load_and_init_journal(), this code is broken. So fix it by directly returning -EINVAL in the error handler path. Cc: stable@kernel.org Fixes: 9c1dd22d7422 ("ext4: factor out ext4_load_and_init_journal()") Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221025040206.3134773-1-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-06ext4: fix warning in 'ext4_da_release_space'Ye Bin1-1/+2
Syzkaller report issue as follows: EXT4-fs (loop0): Free/Dirty block details EXT4-fs (loop0): free_blocks=0 EXT4-fs (loop0): dirty_blocks=0 EXT4-fs (loop0): Block reservation details EXT4-fs (loop0): i_reserved_data_blocks=0 EXT4-fs warning (device loop0): ext4_da_release_space:1527: ext4_da_release_space: ino 18, to_free 1 with only 0 reserved data blocks ------------[ cut here ]------------ WARNING: CPU: 0 PID: 92 at fs/ext4/inode.c:1528 ext4_da_release_space+0x25e/0x370 fs/ext4/inode.c:1524 Modules linked in: CPU: 0 PID: 92 Comm: kworker/u4:4 Not tainted 6.0.0-syzkaller-09423-g493ffd6605b2 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022 Workqueue: writeback wb_workfn (flush-7:0) RIP: 0010:ext4_da_release_space+0x25e/0x370 fs/ext4/inode.c:1528 RSP: 0018:ffffc900015f6c90 EFLAGS: 00010296 RAX: 42215896cd52ea00 RBX: 0000000000000000 RCX: 42215896cd52ea00 RDX: 0000000000000000 RSI: 0000000080000001 RDI: 0000000000000000 RBP: 1ffff1100e907d96 R08: ffffffff816aa79d R09: fffff520002bece5 R10: fffff520002bece5 R11: 1ffff920002bece4 R12: ffff888021fd2000 R13: ffff88807483ecb0 R14: 0000000000000001 R15: ffff88807483e740 FS: 0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005555569ba628 CR3: 000000000c88e000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ext4_es_remove_extent+0x1ab/0x260 fs/ext4/extents_status.c:1461 mpage_release_unused_pages+0x24d/0xef0 fs/ext4/inode.c:1589 ext4_writepages+0x12eb/0x3be0 fs/ext4/inode.c:2852 do_writepages+0x3c3/0x680 mm/page-writeback.c:2469 __writeback_single_inode+0xd1/0x670 fs/fs-writeback.c:1587 writeback_sb_inodes+0xb3b/0x18f0 fs/fs-writeback.c:1870 wb_writeback+0x41f/0x7b0 fs/fs-writeback.c:2044 wb_do_writeback fs/fs-writeback.c:2187 [inline] wb_workfn+0x3cb/0xef0 fs/fs-writeback.c:2227 process_one_work+0x877/0xdb0 kernel/workqueue.c:2289 worker_thread+0xb14/0x1330 kernel/workqueue.c:2436 kthread+0x266/0x300 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 </TASK> Above issue may happens as follows: ext4_da_write_begin ext4_create_inline_data ext4_clear_inode_flag(inode, EXT4_INODE_EXTENTS); ext4_set_inode_flag(inode, EXT4_INODE_INLINE_DATA); __ext4_ioctl ext4_ext_migrate -> will lead to eh->eh_entries not zero, and set extent flag ext4_da_write_begin ext4_da_convert_inline_data_to_extent ext4_da_write_inline_data_begin ext4_da_map_blocks ext4_insert_delayed_block if (!ext4_es_scan_clu(inode, &ext4_es_is_delonly, lblk)) if (!ext4_es_scan_clu(inode, &ext4_es_is_mapped, lblk)) ext4_clu_mapped(inode, EXT4_B2C(sbi, lblk)); -> will return 1 allocated = true; ext4_es_insert_delayed_block(inode, lblk, allocated); ext4_writepages mpage_map_and_submit_extent(handle, &mpd, &give_up_on_write); -> return -ENOSPC mpage_release_unused_pages(&mpd, give_up_on_write); -> give_up_on_write == 1 ext4_es_remove_extent ext4_da_release_space(inode, reserved); if (unlikely(to_free > ei->i_reserved_data_blocks)) -> to_free == 1 but ei->i_reserved_data_blocks == 0 -> then trigger warning as above To solve above issue, forbid inode do migrate which has inline data. Cc: stable@kernel.org Reported-by: syzbot+c740bb18df70ad00952e@syzkaller.appspotmail.com Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221018022701.683489-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-06ext4: fix BUG_ON() when directory entry has invalid rec_lenLuís Henriques1-1/+9
The rec_len field in the directory entry has to be a multiple of 4. A corrupted filesystem image can be used to hit a BUG() in ext4_rec_len_to_disk(), called from make_indexed_dir(). ------------[ cut here ]------------ kernel BUG at fs/ext4/ext4.h:2413! ... RIP: 0010:make_indexed_dir+0x53f/0x5f0 ... Call Trace: <TASK> ? add_dirent_to_buf+0x1b2/0x200 ext4_add_entry+0x36e/0x480 ext4_add_nondir+0x2b/0xc0 ext4_create+0x163/0x200 path_openat+0x635/0xe90 do_filp_open+0xb4/0x160 ? __create_object.isra.0+0x1de/0x3b0 ? _raw_spin_unlock+0x12/0x30 do_sys_openat2+0x91/0x150 __x64_sys_open+0x6c/0xa0 do_syscall_64+0x3c/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 The fix simply adds a call to ext4_check_dir_entry() to validate the directory entry, returning -EFSCORRUPTED if the entry is invalid. CC: stable@kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=216540 Signed-off-by: Luís Henriques <lhenriques@suse.de> Link: https://lore.kernel.org/r/20221012131330.32456-1-lhenriques@suse.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-28fs/ext4/super.c: remove unused `deprecated_msg'Andrew Morton1-4/+0
fs/ext4/super.c:1744:19: warning: 'deprecated_msg' defined but not used [-Wunused-const-variable=] Reported-by: kernel test robot <lkp@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-27ext4: update the backup superblock's at the end of the online resizeTheodore Ts'o2-2/+6
When expanding a file system using online resize, various fields in the superblock (e.g., s_blocks_count, s_inodes_count, etc.) change. To update the backup superblocks, the online resize uses the function update_backups() in fs/ext4/resize.c. This function was not updating the checksum field in the backup superblocks. This wasn't a big deal previously, because e2fsck didn't care about the checksum field in the backup superblock. (And indeed, update_backups() goes all the way back to the ext3 days, well before we had support for metadata checksums.) However, there is an alternate, more general way of updating superblock fields, ext4_update_primary_sb() in fs/ext4/ioctl.c. This function does check the checksum of the backup superblock, and if it doesn't match will mark the file system as corrupted. That was clearly not the intent, so avoid to aborting the resize when a bad superblock is found. In addition, teach update_backups() to properly update the checksum in the backup superblocks. We will eventually want to unify updapte_backups() with the infrasture in ext4_update_primary_sb(), but that's for another day. Note: The problem has been around for a while; it just didn't really matter until ext4_update_primary_sb() was added by commit bbc605cdb1e1 ("ext4: implement support for get/set fs label"). And it became trivially easy to reproduce after commit 827891a38acc ("ext4: update the s_overhead_clusters in the backup sb's when resizing") in v6.0. Cc: stable@kernel.org # 5.17+ Fixes: bbc605cdb1e1 ("ext4: implement support for get/set fs label") Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-10-16Merge tag 'random-6.1-rc1-for-linus' of ↵Linus Torvalds4-11/+9
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull more random number generator updates from Jason Donenfeld: "This time with some large scale treewide cleanups. The intent of this pull is to clean up the way callers fetch random integers. The current rules for doing this right are: - If you want a secure or an insecure random u64, use get_random_u64() - If you want a secure or an insecure random u32, use get_random_u32() The old function prandom_u32() has been deprecated for a while now and is just a wrapper around get_random_u32(). Same for get_random_int(). - If you want a secure or an insecure random u16, use get_random_u16() - If you want a secure or an insecure random u8, use get_random_u8() - If you want secure or insecure random bytes, use get_random_bytes(). The old function prandom_bytes() has been deprecated for a while now and has long been a wrapper around get_random_bytes() - If you want a non-uniform random u32, u16, or u8 bounded by a certain open interval maximum, use prandom_u32_max() I say "non-uniform", because it doesn't do any rejection sampling or divisions. Hence, it stays within the prandom_*() namespace, not the get_random_*() namespace. I'm currently investigating a "uniform" function for 6.2. We'll see what comes of that. By applying these rules uniformly, we get several benefits: - By using prandom_u32_max() with an upper-bound that the compiler can prove at compile-time is ≤65536 or ≤256, internally get_random_u16() or get_random_u8() is used, which wastes fewer batched random bytes, and hence has higher throughput. - By using prandom_u32_max() instead of %, when the upper-bound is not a constant, division is still avoided, because prandom_u32_max() uses a faster multiplication-based trick instead. - By using get_random_u16() or get_random_u8() in cases where the return value is intended to indeed be a u16 or a u8, we waste fewer batched random bytes, and hence have higher throughput. This series was originally done by hand while I was on an airplane without Internet. Later, Kees and I worked on retroactively figuring out what could be done with Coccinelle and what had to be done manually, and then we split things up based on that. So while this touches a lot of files, the actual amount of code that's hand fiddled is comfortably small" * tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: prandom: remove unused functions treewide: use get_random_bytes() when possible treewide: use get_random_u32() when possible treewide: use get_random_{u8,u16}() when possible, part 2 treewide: use get_random_{u8,u16}() when possible, part 1 treewide: use prandom_u32_max() when possible, part 2 treewide: use prandom_u32_max() when possible, part 1
2022-10-14Merge tag 'mm-stable-2022-10-13' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - fix a race which causes page refcounting errors in ZONE_DEVICE pages (Alistair Popple) - fix userfaultfd test harness instability (Peter Xu) - various other patches in MM, mainly fixes * tag 'mm-stable-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (29 commits) highmem: fix kmap_to_page() for kmap_local_page() addresses mm/page_alloc: fix incorrect PGFREE and PGALLOC for high-order page mm/selftest: uffd: explain the write missing fault check mm/hugetlb: use hugetlb_pte_stable in migration race check mm/hugetlb: fix race condition of uffd missing/minor handling zram: always expose rw_page LoongArch: update local TLB if PTE entry exists mm: use update_mmu_tlb() on the second thread kasan: fix array-bounds warnings in tests hmm-tests: add test for migrate_device_range() nouveau/dmem: evict device private memory during release nouveau/dmem: refactor nouveau_dmem_fault_copy_one() mm/migrate_device.c: add migrate_device_range() mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page() mm/memremap.c: take a pgmap reference on page allocation mm: free device private pages have zero refcount mm/memory.c: fix race when faulting a device private page mm/damon: use damon_sz_region() in appropriate place mm/damon: move sz_damon_region to damon_sz_region lib/test_meminit: add checks for the allocation functions ...
2022-10-12ext4,f2fs: fix readahead of verity dataMatthew Wilcox (Oracle)1-1/+2
The recent change of page_cache_ra_unbounded() arguments was buggy in the two callers, causing us to readahead the wrong pages. Move the definition of ractl down to after the index is set correctly. This affected performance on configurations that use fs-verity. Link: https://lkml.kernel.org/r/20221012193419.1453558-1-willy@infradead.org Fixes: 73bb49da50cd ("mm/readahead: make page_cache_ra_unbounded take a readahead_control") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Jintao Yin <nicememory@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-11treewide: use get_random_u32() when possibleJason A. Donenfeld3-4/+4
The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-11treewide: use prandom_u32_max() when possible, part 2Jason A. Donenfeld1-3/+2
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done by hand, covering things that coccinelle could not do on its own. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext2, ext4, and sbitmap Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-11treewide: use prandom_u32_max() when possible, part 1Jason A. Donenfeld1-4/+3
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script: @basic@ expression E; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u64; @@ ( - ((T)get_random_u32() % (E)) + prandom_u32_max(E) | - ((T)get_random_u32() & ((E) - 1)) + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2) | - ((u64)(E) * get_random_u32() >> 32) + prandom_u32_max(E) | - ((T)get_random_u32() & ~PAGE_MASK) + prandom_u32_max(PAGE_SIZE) ) @multi_line@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; identifier RAND; expression E; @@ - RAND = get_random_u32(); ... when != RAND - RAND %= (E); + RAND = prandom_u32_max(E); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Add one to the literal. @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1: print("Skipping 0x%x for cleanup elsewhere" % (value)) cocci.include_match(False) elif value & (value + 1) != 0: print("Skipping 0x%x because it's not a power of two minus one" % (value)) cocci.include_match(False) elif literal.startswith('0x'): coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1)) else: coccinelle.RESULT = cocci.make_expr("%d" % (value + 1)) // Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; expression add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + prandom_u32_max(RESULT) @collapse_ret@ type T; identifier VAR; expression E; @@ { - T VAR; - VAR = (E); - return VAR; + return E; } @drop_var@ type T; identifier VAR; @@ { - T VAR; ... when != VAR } Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-10Merge tag 'pull-tmpfile' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs tmpfile updates from Al Viro: "Miklos' ->tmpfile() signature change; pass an unopened struct file to it, let it open the damn thing. Allows to add tmpfile support to FUSE" * tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fuse: implement ->tmpfile() vfs: open inside ->tmpfile() vfs: move open right after ->tmpfile() vfs: make vfs_tmpfile() static ovl: use vfs_tmpfile_open() helper cachefiles: use vfs_tmpfile_open() helper cachefiles: only pass inode to *mark_inode_inuse() helpers cachefiles: tmpfile error handling cleanup hugetlbfs: cleanup mknod and tmpfile vfs: add vfs_tmpfile_open() helper
2022-10-06Merge tag 'ext4_for_linus' of ↵Linus Torvalds15-743/+903
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "The first two changes involve files outside of fs/ext4: - submit_bh() can never return an error, so change it to return void, and remove the unused checks from its callers - fix I_DIRTY_TIME handling so it will be set even if the inode already has I_DIRTY_INODE Performance: - Always enable i_version counter (as btrfs and xfs already do). Remove some uneeded i_version bumps to avoid unnecessary nfs cache invalidations - Wake up journal waiters in FIFO order, to avoid some journal users from not getting a journal handle for an unfairly long time - In ext4_write_begin() allocate any necessary buffer heads before starting the journal handle - Don't try to prefetch the block allocation bitmaps for a read-only file system Bug Fixes: - Fix a number of fast commit bugs, including resources leaks and out of bound references in various error handling paths and/or if the fast commit log is corrupted - Avoid stopping the online resize early when expanding a file system which is less than 16TiB to a size greater than 16TiB - Fix apparent metadata corruption caused by a race with a metadata buffer head getting migrated while it was trying to be read - Mark the lazy initialization thread freezable to prevent suspend failures - Other miscellaneous bug fixes Cleanups: - Break up the incredibly long ext4_full_super() function by refactoring to move code into more understandable, smaller functions - Remove the deprecated (and ignored) noacl and nouser_attr mount option - Factor out some common code in fast commit handling - Other miscellaneous cleanups" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (53 commits) ext4: fix potential out of bound read in ext4_fc_replay_scan() ext4: factor out ext4_fc_get_tl() ext4: introduce EXT4_FC_TAG_BASE_LEN helper ext4: factor out ext4_free_ext_path() ext4: remove unnecessary drop path references in mext_check_coverage() ext4: update 'state->fc_regions_size' after successful memory allocation ext4: fix potential memory leak in ext4_fc_record_regions() ext4: fix potential memory leak in ext4_fc_record_modified_inode() ext4: remove redundant checking in ext4_ioctl_checkpoint jbd2: add miss release buffer head in fc_do_one_pass() ext4: move DIOREAD_NOLOCK setting to ext4_set_def_opts() ext4: remove useless local variable 'blocksize' ext4: unify the ext4 super block loading operation ext4: factor out ext4_journal_data_mode_check() ext4: factor out ext4_load_and_init_journal() ext4: factor out ext4_group_desc_init() and ext4_group_desc_free() ext4: factor out ext4_geometry_check() ext4: factor out ext4_check_feature_compatibility() ext4: factor out ext4_init_metadata_csum() ext4: factor out ext4_encoding_init() ...
2022-10-03Merge tag 'statx-dioalign-for-linus' of ↵Linus Torvalds3-11/+64
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull STATX_DIOALIGN support from Eric Biggers: "Make statx() support reporting direct I/O (DIO) alignment information. This provides a generic interface for userspace programs to determine whether a file supports DIO, and if so with what alignment restrictions. Specifically, STATX_DIOALIGN works on block devices, and on regular files when their containing filesystem has implemented support. An interface like this has been requested for years, since the conditions for when DIO is supported in Linux have gotten increasingly complex over time. Today, DIO support and alignment requirements can be affected by various filesystem features such as multi-device support, data journalling, inline data, encryption, verity, compression, checkpoint disabling, log-structured mode, etc. Further complicating things, Linux v6.0 relaxed the traditional rule of DIO needing to be aligned to the block device's logical block size; now user buffers (but not file offsets) only need to be aligned to the DMA alignment. The approach of uplifting the XFS specific ioctl XFS_IOC_DIOINFO was discarded in favor of creating a clean new interface with statx(). For more information, see the individual commits and the man page update[1]" Link: https://lore.kernel.org/r/20220722074229.148925-1-ebiggers@kernel.org [1] * tag 'statx-dioalign-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: xfs: support STATX_DIOALIGN f2fs: support STATX_DIOALIGN f2fs: simplify f2fs_force_buffered_io() f2fs: move f2fs_force_buffered_io() into file.c ext4: support STATX_DIOALIGN fscrypt: change fscrypt_dio_supported() to prepare for STATX_DIOALIGN vfs: support STATX_DIOALIGN on block devices statx: add direct I/O alignment information
2022-10-03Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds1-4/+6
Pull fscrypt updates from Eric Biggers: "This release contains some implementation changes, but no new features: - Rework the implementation of the fscrypt filesystem-level keyring to not be as tightly coupled to the keyrings subsystem. This resolves several issues. - Eliminate most direct uses of struct request_queue from fs/crypto/, since struct request_queue is considered to be a block layer implementation detail. - Stop using the PG_error flag to track decryption failures. This is a prerequisite for freeing up PG_error for other uses" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: work on block_devices instead of request_queues fscrypt: stop holding extra request_queue references fscrypt: stop using keyrings subsystem for fscrypt_master_key fscrypt: stop using PG_error to track error status fscrypt: remove fscrypt_set_test_dummy_encryption()
2022-09-30ext4: fix potential out of bound read in ext4_fc_replay_scan()Ye Bin1-2/+36
For scan loop must ensure that at least EXT4_FC_TAG_BASE_LEN space. If remain space less than EXT4_FC_TAG_BASE_LEN which will lead to out of bound read when mounting corrupt file system image. ADD_RANGE/HEAD/TAIL is needed to add extra check when do journal scan, as this three tags will read data during scan, tag length couldn't less than data length which will read. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20220924075233.2315259-4-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: factor out ext4_fc_get_tl()Ye Bin1-21/+25
Factor out ext4_fc_get_tl() to fill 'tl' with host byte order. Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20220924075233.2315259-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: introduce EXT4_FC_TAG_BASE_LEN helperYe Bin2-26/+31
Introduce EXT4_FC_TAG_BASE_LEN helper for calculate length of struct ext4_fc_tl. Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20220924075233.2315259-2-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: factor out ext4_free_ext_path()Ye Bin7-82/+54
Factor out ext4_free_ext_path() to free extent path. As after previous patch 'ext4_ext_drop_refs()' is only used in 'extents.c', so make it static. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220924021211.3831551-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: remove unnecessary drop path references in mext_check_coverage()Ye Bin1-1/+0
According to Jan Kara's suggestion: "The use in mext_check_coverage() can be actually removed - get_ext_path() -> ext4_find_extent() takes care of dropping the references." So remove unnecessary call ext4_ext_drop_refs() in mext_check_coverage(). Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220924021211.3831551-2-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: update 'state->fc_regions_size' after successful memory allocationYe Bin1-4/+5
To avoid to 'state->fc_regions_size' mismatch with 'state->fc_regions' when fail to reallocate 'fc_reqions',only update 'state->fc_regions_size' after 'state->fc_regions' is allocated successfully. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220921064040.3693255-4-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: fix potential memory leak in ext4_fc_record_regions()Ye Bin1-6/+8
As krealloc may return NULL, in this case 'state->fc_regions' may not be freed by krealloc, but 'state->fc_regions' already set NULL. Then will lead to 'state->fc_regions' memory leak. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220921064040.3693255-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: fix potential memory leak in ext4_fc_record_modified_inode()Ye Bin1-3/+5
As krealloc may return NULL, in this case 'state->fc_modified_inodes' may not be freed by krealloc, but 'state->fc_modified_inodes' already set NULL. Then will lead to 'state->fc_modified_inodes' memory leak. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220921064040.3693255-2-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: remove redundant checking in ext4_ioctl_checkpointGuoqing Jiang1-3/+0
It is already checked after comment "check for invalid bits set", so let's remove this one. Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Link: https://lore.kernel.org/r/20220918115219.12407-1-guoqing.jiang@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: move DIOREAD_NOLOCK setting to ext4_set_def_opts()Jason Yan1-3/+3
Now since all preparations is done, we can move the DIOREAD_NOLOCK setting to ext4_set_def_opts(). Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220916141527.1012715-17-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: remove useless local variable 'blocksize'Jason Yan1-24/+21
Since sb->s_blocksize is now initialized at the very beginning, the local variable 'blocksize' in __ext4_fill_super() is not needed now. Remove it and use sb->s_blocksize instead. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220916141527.1012715-16-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: unify the ext4 super block loading operationJason Yan1-80/+106
Now we load the super block from the disk in two steps. First we load the super block with the default block size(EXT4_MIN_BLOCK_SIZE). Second we load the super block with the real block size. The second step is a little far from the first step. This patch move these two steps together in a new function. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220916141527.1012715-15-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: factor out ext4_journal_data_mode_check()Jason Yan1-25/+35
Factor out ext4_journal_data_mode_check(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara<jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-14-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: factor out ext4_load_and_init_journal()Jason Yan1-69/+88
This patch group the journal load and initialize code together and factor out ext4_load_and_init_journal(). This patch also removes the lable 'no_journal' which is not needed after refactor. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-13-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: factor out ext4_group_desc_init() and ext4_group_desc_free()Jason Yan1-59/+84
Factor out ext4_group_desc_init() and ext4_group_desc_free(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-12-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: factor out ext4_geometry_check()Jason Yan1-50/+61
Factor out ext4_geometry_check(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-11-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: factor out ext4_check_feature_compatibility()Jason Yan1-67/+77
Factor out ext4_check_feature_compatibility(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-10-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: factor out ext4_init_metadata_csum()Jason Yan1-37/+46
Factor out ext4_init_metadata_csum(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-9-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: factor out ext4_encoding_init()Jason Yan1-36/+50
Factor out ext4_encoding_init(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220916141527.1012715-8-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: factor out ext4_inode_info_init()Jason Yan1-62/+75
Factor out ext4_inode_info_init(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-7-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: factor out ext4_fast_commit_init()Jason Yan1-18/+25
Factor out ext4_fast_commit_init(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-6-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: factor out ext4_handle_clustersize()Jason Yan1-49/+61
Factor out ext4_handle_clustersize(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-5-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: factor out ext4_set_def_opts()Jason Yan1-49/+56
Factor out ext4_set_def_opts(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-4-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: remove cantfind_ext4 error handlerJason Yan1-16/+13
The 'cantfind_ext4' error handler is just a error msg print and then goto failed_mount. This two level goto makes the code complex and not easy to read. The only benefit is that is saves a little bit code. However some branches can merge and some branches dot not even need it. So do some refactor and remove it. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-3-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: goto right label 'failed_mount3a'Jason Yan1-5/+5
Before these two branches neither loaded the journal nor created the xattr cache. So the right label to goto is 'failed_mount3a'. Although this did not cause any issues because the error handler validated if the pointer is null. However this still made me confused when reading the code. So it's still worth to modify to goto the right label. Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220916141527.1012715-2-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: adjust fast commit disable judgement order in ext4_fc_track_inodeYe Bin1-3/+3
If fastcommit is already disabled, there isn't need to mark inode ineligible. So move 'ext4_fc_disabled()' judgement bofore 'ext4_should_journal_data(inode)' judgement which can avoid to do meaningless judgement. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220916083836.388347-3-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: factor out ext4_fc_disabled()Ye Bin1-23/+15
Factor out ext4_fc_disabled(). No functional change. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220916083836.388347-2-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: fix miss release buffer head in ext4_fc_write_inodeYe Bin1-6/+9
In 'ext4_fc_write_inode' function first call 'ext4_get_inode_loc' get 'iloc', after use it miss release 'iloc.bh'. So just release 'iloc.bh' before 'ext4_fc_write_inode' return. Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220914100859.1415196-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: fix dir corruption when ext4_dx_add_entry() failsZhihao Cheng1-5/+10
Following process may lead to fs corruption: 1. ext4_create(dir/foo) ext4_add_nondir ext4_add_entry ext4_dx_add_entry a. add_dirent_to_buf ext4_mark_inode_dirty ext4_handle_dirty_metadata // dir inode bh is recorded into journal b. ext4_append // dx_get_count(entries) == dx_get_limit(entries) ext4_bread(EXT4_GET_BLOCKS_CREATE) ext4_getblk ext4_map_blocks ext4_ext_map_blocks ext4_mb_new_blocks dquot_alloc_block dquot_alloc_space_nodirty inode_add_bytes // update dir's i_blocks ext4_ext_insert_extent ext4_ext_dirty // record extent bh into journal ext4_handle_dirty_metadata(bh) // record new block into journal inode->i_size += inode->i_sb->s_blocksize // new size(in mem) c. ext4_handle_dirty_dx_node(bh2) // record dir's new block(dx_node) into journal d. ext4_handle_dirty_dx_node((frame - 1)->bh) e. ext4_handle_dirty_dx_node(frame->bh) f. do_split // ret err! g. add_dirent_to_buf ext4_mark_inode_dirty(dir) // update raw_inode on disk(skipped) 2. fsck -a /dev/sdb drop last block(dx_node) which beyonds dir's i_size. /dev/sdb: recovering journal /dev/sdb contains a file system with errors, check forced. /dev/sdb: Inode 12, end of extent exceeds allowed value (logical block 128, physical block 3938, len 1) 3. fsck -fn /dev/sdb dx_node->entry[i].blk > dir->i_size Pass 2: Checking directory structure Problem in HTREE directory inode 12 (/dir): bad block number 128. Clear HTree index? no Problem in HTREE directory inode 12: block #3 has invalid depth (2) Problem in HTREE directory inode 12: block #3 has bad max hash Problem in HTREE directory inode 12: block #3 not referenced Fix it by marking inode dirty directly inside ext4_append(). Fetch a reproducer in [Link]. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216466 Cc: stable@vger.kernel.org Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220911045204.516460-1-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: remove ext4_inline_data_fiemap() declarationGaosheng Cui1-3/+0
ext4_inline_data_fiemap() has been removed since commit d3b6f23f7167 ("ext4: move ext4_fiemap to use iomap framework"), so remove it. Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220909065307.1155201-1-cuigaosheng1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: fix i_version handling in ext4Jeff Layton3-9/+10
ext4 currently updates the i_version counter when the atime is updated during a read. This is less than ideal as it can cause unnecessary cache invalidations with NFSv4 and unnecessary remeasurements for IMA. The increment in ext4_mark_iloc_dirty is also problematic since it can corrupt the i_version counter for ea_inodes. We aren't bumping the file times in ext4_mark_iloc_dirty, so changing the i_version there seems wrong, and is the cause of both problems. Remove that callsite and add increments to the setattr, setxattr and ioctl codepaths, at the same times that we update the ctime. The i_version bump that already happens during timestamp updates should take care of the rest. In ext4_move_extents, increment the i_version on both inodes, and also add in missing ctime updates. [ Some minor updates since we've already enabled the i_version counter unconditionally already via another patch series. -- TYT ] Cc: stable@kernel.org Cc: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20220908172448.208585-3-jlayton@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-30ext4: place buffer head allocation before handle startJinke Han1-0/+7
In our product environment, we encounter some jbd hung waiting handles to stop while several writters were doing memory reclaim for buffer head allocation in delay alloc write path. Ext4 do buffer head allocation with holding transaction handle which may be blocked too long if the reclaim works not so smooth. According to our bcc trace, the reclaim time in buffer head allocation can reach 258s and the jbd transaction commit also take almost the same time meanwhile. Except for these extreme cases, we often see several seconds delays for cgroup memory reclaim on our servers. This is more likely to happen considering docker environment. One thing to note, the allocation of buffer heads is as often as page allocation or more often when blocksize less than page size. Just like page cache allocation, we should also place the buffer head allocation before startting the handle. Cc: stable@kernel.org Signed-off-by: Jinke Han <hanjinke.666@bytedance.com> Link: https://lore.kernel.org/r/20220903012429.22555-1-hanjinke.666@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>