summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2022-05-18fs-verity: remove unused parameter desc_size in fsverity_create_info()Zhang Jianhua4-16/+9
The parameter desc_size in fsverity_create_info() is useless and it is not referenced anywhere. The greatest meaning of desc_size here is to indecate the size of struct fsverity_descriptor and futher calculate the size of signature. However, the desc->sig_size can do it also and it is indeed, so remove it. Therefore, it is no need to acquire desc_size by fsverity_get_descriptor() in ensure_verity_info(), so remove the parameter desc_ret in fsverity_get_descriptor() too. Signed-off-by: Zhang Jianhua <chris.zjh@huawei.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20220518132256.2297655-1-chris.zjh@huawei.com
2022-05-18ext4: fix memory leak in parse_apply_sb_mount_options()Eric Biggers1-2/+4
If processing the on-disk mount options fails after any memory was allocated in the ext4_fs_context, e.g. s_qf_names, then this memory is leaked. Fix this by calling ext4_fc_free() instead of kfree() directly. Reproducer: mkfs.ext4 -F /dev/vdc tune2fs /dev/vdc -E mount_opts=usrjquota=file echo clear > /sys/kernel/debug/kmemleak mount /dev/vdc /vdc echo scan > /sys/kernel/debug/kmemleak sleep 5 echo scan > /sys/kernel/debug/kmemleak cat /sys/kernel/debug/kmemleak Fixes: 7edfd85b1ffd ("ext4: Completely separate options parsing and sb setup") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Ritesh Harjani <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220513231605.175121-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-18ext4: reject the 'commit' option on ext2 filesystemsEric Biggers1-0/+1
The 'commit' option is only applicable for ext3 and ext4 filesystems, and has never been accepted by the ext2 filesystem driver, so the ext4 driver shouldn't allow it on ext2 filesystems. This fixes a failure in xfstest ext4/053. Fixes: 8dc0aa8cf0f7 ("ext4: check incompatible mount options while mounting ext2/3") Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/20220510183232.172615-1-ebiggers@kernel.org
2022-05-18ext4: remove duplicated #include of dax.h in inode.cYang Li1-1/+0
Fix following includecheck warning: ./fs/ext4/inode.c: linux/dax.h is included more than once. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20220504225025.44753-1-yang.lee@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-18io_uring: use rcu_dereference in io_closeChristoph Hellwig1-1/+2
Accessing the file table needs a rcu_dereference_protected(). Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18io_uring: consistently use the EPOLL* definesChristoph Hellwig1-4/+4
POLL* are unannotated values for the userspace ABI, while everything in-kernel should use EPOLL* and the __poll_t type. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18io_uring: make apoll_events a __poll_tChristoph Hellwig1-3/+4
apoll_events is fed to vfs_poll and the poll tables, so it should be a __poll_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18io_uring: drop a spurious inline on a forward declarationChristoph Hellwig1-1/+1
io_file_get_normal isn't marked inline, so don't claim it as such in the forward declaration. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18io_uring: don't use ERR_PTR for user pointersChristoph Hellwig1-46/+37
ERR_PTR abuses the high bits of a pointer to transport error information. This is only safe for kernel pointers and not user pointers. Fix io_buffer_select and its helpers to just return NULL for failure and get rid of this abuse. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18io_uring: use a rwf_t for io_rw.flagsChristoph Hellwig1-1/+1
Use the proper type. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18io_uring: add support for ring mapped supplied buffersJens Axboe1-12/+222
Provided buffers allow an application to supply io_uring with buffers that can then be grabbed for a read/receive request, when the data source is ready to deliver data. The existing scheme relies on using IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use in real world applications. It's pretty efficient if the application is able to supply back batches of provided buffers when they have been consumed and the application is ready to recycle them, but if fragmentation occurs in the buffer space, it can become difficult to supply enough buffers at the time. This hurts efficiency. Add a register op, IORING_REGISTER_PBUF_RING, which allows an application to setup a shared queue for each buffer group of provided buffers. The application can then supply buffers simply by adding them to this ring, and the kernel can consume then just as easily. The ring shares the head with the application, the tail remains private in the kernel. Provided buffers setup with IORING_REGISTER_PBUF_RING cannot use IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to the ring, they must use the mapped ring. Mapped provided buffer rings can co-exist with normal provided buffers, just not within the same group ID. To gauge overhead of the existing scheme and evaluate the mapped ring approach, a simple NOP benchmark was written. It uses a ring of 128 entries, and submits/completes 32 at the time. 'Replenish' is how many buffers are provided back at the time after they have been consumed: Test Replenish NOPs/sec ================================================================ No provided buffers NA ~30M Provided buffers 32 ~16M Provided buffers 1 ~10M Ring buffers 32 ~27M Ring buffers 1 ~27M The ring mapped buffers perform almost as well as not using provided buffers at all, and they don't care if you provided 1 or more back at the same time. This means application can just replenish as they go, rather than need to batch and compact, further reducing overhead in the application. The NOP benchmark above doesn't need to do any compaction, so that overhead isn't even reflected in the above test. Co-developed-by: Dylan Yudaken <dylany@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18io_uring: add io_pin_pages() helperJens Axboe1-27/+50
Abstract this out from io_sqe_buffer_register() so we can use it elsewhere too without duplicating this code. No intended functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18io_uring: add buffer selection support to IORING_OP_NOPJens Axboe1-1/+12
Obviously not really useful since it's not transferring data, but it is helpful in benchmarking overhead of provided buffers. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-18io_uring: fix locking state for empty buffer groupJens Axboe1-11/+14
io_provided_buffer_select() must drop the submit lock, if needed, even in the error handling case. Failure to do so will leave us with the ctx->uring_lock held, causing spew like: ==================================== WARNING: iou-wrk-366/368 still has locks held! 5.18.0-rc6-00294-gdf8dc7004331 #994 Not tainted ------------------------------------ 1 lock held by iou-wrk-366/368: #0: ffff0000c72598a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_ring_submit_lock+0x20/0x48 stack backtrace: CPU: 4 PID: 368 Comm: iou-wrk-366 Not tainted 5.18.0-rc6-00294-gdf8dc7004331 #994 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace.part.0+0xa4/0xd4 show_stack+0x14/0x5c dump_stack_lvl+0x88/0xb0 dump_stack+0x14/0x2c debug_check_no_locks_held+0x84/0x90 try_to_freeze.isra.0+0x18/0x44 get_signal+0x94/0x6ec io_wqe_worker+0x1d8/0x2b4 ret_from_fork+0x10/0x20 and triggering later hangs off get_signal() because we attempt to re-grab the lock. Reported-by: syzbot+987d7bb19195ae45208c@syzkaller.appspotmail.com Fixes: 149c69b04a90 ("io_uring: abstract out provided buffer list selection") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-17io_uring: don't attempt to IOPOLL for MSG_RING requestsJens Axboe1-0/+3
We gate whether to IOPOLL for a request on whether the opcode is allowed on a ring setup for IOPOLL and if it's got a file assigned. MSG_RING is the only one that allows a file yet isn't pollable, it's merely supported to allow communication on an IOPOLL ring, not because we can poll for completion of it. Put the assigned file early and clear it, so we don't attempt to poll for it. Reported-by: syzbot+1a0a53300ce782f8b3ad@syzkaller.appspotmail.com Fixes: 3f1d52abf098 ("io_uring: defer msg-ring file validity check until command issue") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-17ext4: fix race condition between ext4_write and ext4_convert_inline_dataBaokun Li2-13/+6
Hulk Robot reported a BUG_ON: ================================================================== EXT4-fs error (device loop3): ext4_mb_generate_buddy:805: group 0, block bitmap and bg descriptor inconsistent: 25 vs 31513 free clusters kernel BUG at fs/ext4/ext4_jbd2.c:53! invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 0 PID: 25371 Comm: syz-executor.3 Not tainted 5.10.0+ #1 RIP: 0010:ext4_put_nojournal fs/ext4/ext4_jbd2.c:53 [inline] RIP: 0010:__ext4_journal_stop+0x10e/0x110 fs/ext4/ext4_jbd2.c:116 [...] Call Trace: ext4_write_inline_data_end+0x59a/0x730 fs/ext4/inline.c:795 generic_perform_write+0x279/0x3c0 mm/filemap.c:3344 ext4_buffered_write_iter+0x2e3/0x3d0 fs/ext4/file.c:270 ext4_file_write_iter+0x30a/0x11c0 fs/ext4/file.c:520 do_iter_readv_writev+0x339/0x3c0 fs/read_write.c:732 do_iter_write+0x107/0x430 fs/read_write.c:861 vfs_writev fs/read_write.c:934 [inline] do_pwritev+0x1e5/0x380 fs/read_write.c:1031 [...] ================================================================== Above issue may happen as follows: cpu1 cpu2 __________________________|__________________________ do_pwritev vfs_writev do_iter_write ext4_file_write_iter ext4_buffered_write_iter generic_perform_write ext4_da_write_begin vfs_fallocate ext4_fallocate ext4_convert_inline_data ext4_convert_inline_data_nolock ext4_destroy_inline_data_nolock clear EXT4_STATE_MAY_INLINE_DATA ext4_map_blocks ext4_ext_map_blocks ext4_mb_new_blocks ext4_mb_regular_allocator ext4_mb_good_group_nolock ext4_mb_init_group ext4_mb_init_cache ext4_mb_generate_buddy --> error ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) ext4_restore_inline_data set EXT4_STATE_MAY_INLINE_DATA ext4_block_write_begin ext4_da_write_end ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) ext4_write_inline_data_end handle=NULL ext4_journal_stop(handle) __ext4_journal_stop ext4_put_nojournal(handle) ref_cnt = (unsigned long)handle BUG_ON(ref_cnt == 0) ---> BUG_ON The lock held by ext4_convert_inline_data is xattr_sem, but the lock held by generic_perform_write is i_rwsem. Therefore, the two locks can be concurrent. To solve above issue, we add inode_lock() for ext4_convert_inline_data(). At the same time, move ext4_convert_inline_data() in front of ext4_punch_hole(), remove similar handling from ext4_punch_hole(). Fixes: 0c8d414f163f ("ext4: let fallocate handle inline data correctly") Cc: stable@vger.kernel.org Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220428134031.4153381-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-17ext4: convert symlink external data block mapping to bdevZhang Yi3-83/+100
Symlink's external data block is one kind of metadata block, and now that almost all ext4 metadata block's page cache (e.g. directory blocks, quota blocks...) belongs to bdev backing inode except the symlink. It is essentially worked in data=journal mode like other regular file's data block because probably in order to make it simple for generic VFS code handling symlinks or some other historical reasons, but the logic of creating external data block in ext4_symlink() is complicated. and it also make things confused if user do not want to let the filesystem worked in data=journal mode. This patch convert the final exceptional case and make things clean, move the mapping of the symlink's external data block to bdev like any other metadata block does. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20220424140936.1898920-3-yi.zhang@huawei.com
2022-05-17ext4: add nowait mode for ext4_getblk()Zhang Yi2-0/+16
Current ext4_getblk() might sleep if some resources are not valid or could be race with a concurrent extents modifing procedure. So we cannot call ext4_getblk() and ext4_map_blocks() to get map blocks in the atomic context in some fast path (e.g. the upcoming procedure of getting symlink external block in the RCU context), even if the map extents have already been check and cached. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20220424140936.1898920-2-yi.zhang@huawei.com
2022-05-17ext4: fix journal_ioprio mount option handlingOjaswin Mujoo1-5/+10
In __ext4_super() we always overwrote the user specified journal_ioprio value with a default value, expecting parse_apply_sb_mount_options() to later correctly set ctx->journal_ioprio to the user specified value. However, if parse_apply_sb_mount_options() returned early because of empty sbi->es_s->s_mount_opts, the correct journal_ioprio value was never set. This patch fixes __ext4_super() to only use the default value if the user has not specified any value for journal_ioprio. Similarly, the remount behavior was to either use journal_ioprio value specified during initial mount, or use the default value irrespective of the journal_ioprio value specified during remount. This patch modifies this to first check if a new value for ioprio has been passed during remount and apply it. If no new value is passed, use the value specified during initial mount. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20220418083545.45778-1-ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2022-05-17ext4: mark group as trimmed only if it was fully scannedDmitry Monakhov1-6/+12
Otherwise nonaligned fstrim calls will works inconveniently for iterative scanners, for example: // trim [0,16MB] for group-1, but mark full group as trimmed fstrim -o $((1024*1024*128)) -l $((1024*1024*16)) ./m // handle [16MB,16MB] for group-1, do nothing because group already has the flag. fstrim -o $((1024*1024*144)) -l $((1024*1024*16)) ./m [ Update function documentation for ext4_trim_all_free -- TYT ] Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> Link: https://lore.kernel.org/r/1650214995-860245-1-git-send-email-dmtrmonakhov@yandex-team.ru Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2022-05-17ext4: fix use-after-free in ext4_rename_dir_prepareYe Bin1-3/+27
We got issue as follows: EXT4-fs (loop0): mounted filesystem without journal. Opts: ,errors=continue ext4_get_first_dir_block: bh->b_data=0xffff88810bee6000 len=34478 ext4_get_first_dir_block: *parent_de=0xffff88810beee6ae bh->b_data=0xffff88810bee6000 ext4_rename_dir_prepare: [1] parent_de=0xffff88810beee6ae ================================================================== BUG: KASAN: use-after-free in ext4_rename_dir_prepare+0x152/0x220 Read of size 4 at addr ffff88810beee6ae by task rep/1895 CPU: 13 PID: 1895 Comm: rep Not tainted 5.10.0+ #241 Call Trace: dump_stack+0xbe/0xf9 print_address_description.constprop.0+0x1e/0x220 kasan_report.cold+0x37/0x7f ext4_rename_dir_prepare+0x152/0x220 ext4_rename+0xf44/0x1ad0 ext4_rename2+0x11c/0x170 vfs_rename+0xa84/0x1440 do_renameat2+0x683/0x8f0 __x64_sys_renameat+0x53/0x60 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f45a6fc41c9 RSP: 002b:00007ffc5a470218 EFLAGS: 00000246 ORIG_RAX: 0000000000000108 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f45a6fc41c9 RDX: 0000000000000005 RSI: 0000000020000180 RDI: 0000000000000005 RBP: 00007ffc5a470240 R08: 00007ffc5a470160 R09: 0000000020000080 R10: 00000000200001c0 R11: 0000000000000246 R12: 0000000000400bb0 R13: 00007ffc5a470320 R14: 0000000000000000 R15: 0000000000000000 The buggy address belongs to the page: page:00000000440015ce refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x10beee flags: 0x200000000000000() raw: 0200000000000000 ffffea00043ff4c8 ffffea0004325608 0000000000000000 raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88810beee580: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff88810beee600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >ffff88810beee680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff88810beee700: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff88810beee780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ================================================================== Disabling lock debugging due to kernel taint ext4_rename_dir_prepare: [2] parent_de->inode=3537895424 ext4_rename_dir_prepare: [3] dir=0xffff888124170140 ext4_rename_dir_prepare: [4] ino=2 ext4_rename_dir_prepare: ent->dir->i_ino=2 parent=-757071872 Reason is first directory entry which 'rec_len' is 34478, then will get illegal parent entry. Now, we do not check directory entry after read directory block in 'ext4_get_first_dir_block'. To solve this issue, check directory entry in 'ext4_get_first_dir_block'. [ Trigger an ext4_error() instead of just warning if the directory is missing a '.' or '..' entry. Also make sure we return an error code if the file system is corrupted. -TYT ] Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220414025223.4113128-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2022-05-17btrfs: zoned: introduce a minimal zone size 4M and reject mountJohannes Thumshirn1-3/+12
Zoned devices are expected to have zone sizes in the range of 1-2GB for ZNS SSDs and SMR HDDs have zone sizes of 256MB, so there is no need to allow arbitrarily small zone sizes on btrfs. But for testing purposes with emulated devices it is sometimes desirable to create devices with as small as 4MB zone size to uncover errors. So use 4MB as the smallest possible zone size and reject mounts of devices with a smaller zone size. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17btrfs: allow defrag to convert inline extents to regular extentsQu Wenruo1-2/+22
Btrfs defaults to max_inline=2K to make small writes inlined into metadata. The default value is always a win, as even DUP/RAID1/RAID10 doubles the metadata usage, it should still cause less physical space used compared to a 4K regular extents. But since the introduction of RAID1C3 and RAID1C4 it's no longer the case, users may find inlined extents causing too much space wasted, and want to convert those inlined extents back to regular extents. Unfortunately defrag will unconditionally skip all inline extents, no matter if the user is trying to converting them back to regular extents. So this patch will add a small exception for defrag_collect_targets() to allow defragging inline extents, if and only if the inlined extents are larger than max_inline, allowing users to convert them to regular ones. This also allows us to defrag extents like the following: item 6 key (257 EXTENT_DATA 0) itemoff 15794 itemsize 69 generation 7 type 0 (inline) inline extent data size 48 ram_bytes 4096 compression 1 (zlib) item 7 key (257 EXTENT_DATA 4096) itemoff 15741 itemsize 53 generation 7 type 1 (regular) extent data disk byte 13631488 nr 4096 extent data offset 0 nr 16384 ram 16384 extent compression 1 (zlib) Previously we're unable to do any defrag, since the first extent is inlined, and the second one has no extent to merge. Now we can defrag it to just one single extent, saving 48 bytes metadata space. item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 generation 8 type 1 (regular) extent data disk byte 13635584 nr 4096 extent data offset 0 nr 20480 ram 20480 extent compression 1 (zlib) Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17btrfs: add "0x" prefix for unsupported optional featuresQu Wenruo1-2/+2
The following error message lack the "0x" obviously: cannot mount because of unsupported optional features (4000) Add the prefix to make it less confusing. This can happen on older kernels that try to mount a filesystem with newer features so it makes sense to backport to older trees. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17btrfs: do not account twice for inode ref when reserving metadata unitsFilipe Manana1-2/+5
When reserving metadata units for creating an inode, we don't need to reserve one extra unit for the inode ref item because when creating the inode, at btrfs_create_new_inode(), we always insert the inode item and the inode ref item in a single batch (a single btree insert operation, and both ending up in the same leaf). As we have accounted already one unit for the inode item, the extra unit for the inode ref item is superfluous, it only makes us reserve more metadata than necessary and often adding more reclaim pressure if we are low on available metadata space. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointerNaohiro Aota1-1/+1
The block_group->alloc_offset is an offset from the start of the block group. OTOH, the ->meta_write_pointer is an address in the logical space. So, we should compare the alloc_offset shifted with the block_group->start. Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-17btrfs: send: avoid trashing the page cacheFilipe Manana1-3/+82
A send operation reads extent data using the buffered IO path for getting extent data to send in write commands and this is both because it's simple and to make use of the generic readahead infrastructure, which results in a massive speedup. However this fills the page cache with data that, most of the time, is really only used by the send operation - once the write commands are sent, it's not useful to have the data in the page cache anymore. For large snapshots, bringing all data into the page cache eventually leads to the need to evict other data from the page cache that may be more useful for applications (and kernel subsystems). Even if extents are shared with the subvolume on which a snapshot is based on and the data is currently on the page cache due to being read through the subvolume, attempting to read the data through the snapshot will always result in bringing a new copy of the data into another location in the page cache (there's currently no shared memory for shared extents). So make send evict the data it has read before if when it first opened the inode, its mapping had no pages currently loaded: when inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding based on the return value of filemap_range_has_page() before reading an extent because the generic readahead mechanism may read pages beyond the range we request (and it very often does it), which means a call to filemap_range_has_page() will return true due to the readahead that was triggered when processing a previous extent - we don't have a simple way to distinguish this case from the case where the data was brought into the page cache through someone else. So checking for the mapping number of pages being 0 when we first open the inode is simple, cheap and it generally accomplishes the goal of not trashing the page cache - the only exception is if part of data was previously loaded into the page cache through the snapshot by some other process, in that case we end up not evicting any data send brings into the page cache, just like before this change - but that however is not the common case. Example scenario, on a box with 32G of RAM: $ btrfs subvolume create /mnt/sv1 $ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1 $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1 $ free -m total used free shared buff/cache available Mem: 31937 186 26866 0 4883 31297 Swap: 8188 0 8188 # After this we get less 4G of free memory. $ btrfs send /mnt/snap1 >/dev/null $ free -m total used free shared buff/cache available Mem: 31937 186 22814 0 8935 31297 Swap: 8188 0 8188 The same, obviously, applies to an incremental send. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-18erofs: scan devices from device tableJeffle Xu2-39/+72
When "-o device" mount option is not specified, scan the device table and instantiate the devices if there's any in the device table. In this case, the tag field of each device slot uniquely specifies a device. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220512055601.106109-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18erofs: change to use asynchronous io for fscache readpage/readaheadXin Yin1-43/+201
Use asynchronous io to read data from fscache may greatly improve IO bandwidth for sequential buffered read scenario. Change erofs_fscache_read_folios to erofs_fscache_read_folios_async, and read data from fscache asynchronously. Make .readpage()/.readahead() to use this new helper. Signed-off-by: Xin Yin <yinxin.x@bytedance.com> Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20220509074028.74954-23-jefflexu@linux.alibaba.com Acked-by: Chao Yu <chao@kernel.org> [ Gao Xiang: minor styling changes. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18erofs: add 'fsid' mount optionJeffle Xu2-3/+32
Introduce 'fsid' mount option to enable on-demand read sementics, in which case, erofs will be mounted from data blobs. Users could specify the name of primary data blob by this mount option. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220425122143.56815-22-jefflexu@linux.alibaba.com Acked-by: Chao Yu <chao@kernel.org> Tested-by: Zichen Tian <tianzichen@kuaishou.com> Tested-by: Jia Zhu <zhujia.zj@bytedance.com> Tested-by: Yan Song <yansong.ys@antgroup.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18erofs: implement fscache-based data readaheadJeffle Xu2-0/+94
Implement fscache-based data readahead. Also registers an individual bdi for each erofs instance to enable readahead. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220425122143.56815-21-jefflexu@linux.alibaba.com Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18erofs: implement fscache-based data read for inline layoutJeffle Xu1-0/+32
Implement the data plane of reading data from data blobs over fscache for inline layout. For the heading non-inline part, the data plane for non-inline layout is reused, while only the tail packing part needs special handling. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220425122143.56815-20-jefflexu@linux.alibaba.com Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18erofs: implement fscache-based data read for non-inline layoutJeffle Xu3-0/+57
Implement the data plane of reading data from data blobs over fscache for non-inline layout. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220425122143.56815-19-jefflexu@linux.alibaba.com Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18erofs: implement fscache-based metadata readJeffle Xu2-4/+40
Implement the data plane of reading metadata from primary data blob over fscache. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220425122143.56815-18-jefflexu@linux.alibaba.com Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18erofs: register fscache context for extra data blobsJeffle Xu3-1/+12
Similar to the multi-device mode, erofs could be mounted from one primary data blob (mandatory) and multiple extra data blobs (optional). Register fscache context for each extra data blob. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220425122143.56815-17-jefflexu@linux.alibaba.com Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18erofs: register fscache context for primary data blobJeffle Xu2-4/+12
Registers fscache context for primary data blob. Also move the initialization of s_op and related fields forward, since anonymous inode will be allocated under the super block when registering the fscache context. Something worth mentioning about the cleanup routine. 1. The fscache context will instantiate anonymous inodes under the super block. Release these anonymous inodes when .put_super() is called, or we'll get "VFS: Busy inodes after unmount." warning. 2. The fscache context is initialized prior to the root inode. If .kill_sb() is called when mount failed, .put_super() won't be called when root inode has not been initialized yet. Thus .kill_sb() shall also contain the cleanup routine. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220425122143.56815-16-jefflexu@linux.alibaba.com Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18erofs: add erofs_fscache_read_folios() helperJeffle Xu1-0/+54
Add erofs_fscache_read_folios() helper reading from fscache. It supports on-demand read semantics. That is, it will make the backend prepare for the data when cache miss. Once data ready, it will read from the cache. This helper can then be used to implement .readpage()/.readahead() of on-demand read semantics. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220425122143.56815-15-jefflexu@linux.alibaba.com Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18erofs: add anonymous inode caching metadata for data blobsJeffle Xu2-5/+40
Introduce one anonymous inode for data blobs so that erofs can cache metadata directly within such anonymous inode. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220425122143.56815-14-jefflexu@linux.alibaba.com Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18erofs: add fscache context helper functionsJeffle Xu2-0/+60
Introduce a context structure for managing data blobs, and helper functions for initializing and cleaning up this context structure. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220425122143.56815-13-jefflexu@linux.alibaba.com Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18erofs: register fscache volumeJeffle Xu5-0/+69
A new fscache based mode is going to be introduced for erofs, in which case on-demand read semantics is implemented through fscache. As the first step, register fscache volume for each erofs filesystem. That means, data blobs can not be shared among erofs filesystems. In the following iteration, we are going to introduce the domain semantics, in which case several erofs filesystems can belong to one domain, and data blobs can be shared among these erofs filesystems of one domain. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220425122143.56815-12-jefflexu@linux.alibaba.com Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18erofs: add fscache mode check helperJeffle Xu2-15/+34
Until then erofs is exactly blockdev based filesystem. A new fscache-based mode is going to be introduced for erofs to support scenarios where on-demand read semantics is needed, e.g. container image distribution. In this case, erofs could be mounted from data blobs through fscache. Add a helper checking which mode erofs works in, and twist the code in preparation for the upcoming fscache mode. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220425122143.56815-11-jefflexu@linux.alibaba.com Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18erofs: make erofs_map_blocks() generally availableJeffle Xu2-2/+4
... so that it can be used in the following introduced fscache mode. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220425122143.56815-10-jefflexu@linux.alibaba.com Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18cachefiles: add tracepoints for on-demand read modeJeffle Xu1-0/+7
Add tracepoints for on-demand read mode. Currently following tracepoints are added: OPEN request / COPEN reply CLOSE request READ request / CREAD reply write through anonymous fd release of anonymous fd Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20220425122143.56815-8-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18cachefiles: enable on-demand read modeJeffle Xu1-5/+12
Enable on-demand read mode by adding an optional parameter to the "bind" command. On-demand mode will be turned on when this parameter is "ondemand", i.e. "bind ondemand". Otherwise cachefiles will work in the original mode. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20220509074028.74954-7-jefflexu@linux.alibaba.com Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18cachefiles: implement on-demand readJeffle Xu3-2/+99
Implement the data plane of on-demand read mode. The early implementation [1] place the entry to cachefiles_ondemand_read() in fscache_read(). However, fscache_read() can only detect if the requested file range is fully cache miss, whilst we need to notify the user daemon as long as there's a hole inside the requested file range. Thus the entry is now placed in cachefiles_prepare_read(). When working in on-demand read mode, once a hole detected, the read routine will send a READ request to the user daemon. The user daemon needs to fetch the data and write it to the cache file. After sending the READ request, the read routine will hang there, until the READ request is handled by the user daemon. Then it will retry to read from the same file range. If no progress encountered, the read routine will fail then. A new NETFS_SREQ_ONDEMAND flag is introduced to indicate that on-demand read should be done when a cache miss encountered. [1] https://lore.kernel.org/all/20220406075612.60298-6-jefflexu@linux.alibaba.com/ #v8 Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20220425122143.56815-6-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18cachefiles: notify the user daemon when withdrawing cookieJeffle Xu3-0/+45
Notify the user daemon that cookie is going to be withdrawn, providing a hint that the associated anonymous fd can be closed. Be noted that this is only a hint. The user daemon may close the associated anonymous fd when receiving the CLOSE request, then it will receive another anonymous fd when the cookie gets looked up. Or it may ignore the CLOSE request, and keep writing data through the anonymous fd. However the next time the cookie gets looked up, the user daemon will still receive another new anonymous fd. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20220425122143.56815-5-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18cachefiles: unbind cachefiles gracefully in on-demand modeJeffle Xu3-3/+22
Add a refcount to avoid the deadlock in on-demand read mode. The on-demand read mode will pin the corresponding cachefiles object for each anonymous fd. The cachefiles object is unpinned when the anonymous fd gets closed. When the user daemon exits and the fd of "/dev/cachefiles" device node gets closed, it will wait for all cahcefiles objects getting withdrawn. Then if there's any anonymous fd getting closed after the fd of the device node, the user daemon will hang forever, waiting for all objects getting withdrawn. To fix this, add a refcount indicating if there's any object pinned by anonymous fds. The cachefiles cache gets unbound and withdrawn when the refcount is decreased to 0. It won't change the behaviour of the original mode, in which case the cachefiles cache gets unbound and withdrawn as long as the fd of the device node gets closed. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20220509074028.74954-4-jefflexu@linux.alibaba.com Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18cachefiles: notify the user daemon when looking up cookieJeffle Xu6-15/+524
Fscache/CacheFiles used to serve as a local cache for a remote networking fs. A new on-demand read mode will be introduced for CacheFiles, which can boost the scenario where on-demand read semantics are needed, e.g. container image distribution. The essential difference between these two modes is seen when a cache miss occurs: In the original mode, the netfs will fetch the data from the remote server and then write it to the cache file; in on-demand read mode, fetching the data and writing it into the cache is delegated to a user daemon. As the first step, notify the user daemon when looking up cookie. In this case, an anonymous fd is sent to the user daemon, through which the user daemon can write the fetched data to the cache file. Since the user daemon may move the anonymous fd around, e.g. through dup(), an object ID uniquely identifying the cache file is also attached. Also add one advisory flag (FSCACHE_ADV_WANT_CACHE_SIZE) suggesting that the cache file size shall be retrieved at runtime. This helps the scenario where one cache file contains multiple netfs files, e.g. for the purpose of deduplication. In this case, netfs itself has no idea the size of the cache file, whilst the user daemon should give the hint on it. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20220509074028.74954-3-jefflexu@linux.alibaba.com Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18cachefiles: extract write routineJeffle Xu2-26/+45
Extract the generic routine of writing data to cache files, and make it generally available. This will be used by the following patch implementing on-demand read mode. Since it's called inside CacheFiles module, make the interface generic and unrelated to netfs_cache_resources. It is worth noting that, ki->inval_counter is not initialized after this cleanup. It shall not make any visible difference, since inval_counter is no longer used in the write completion routine, i.e. cachefiles_write_complete(). Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20220425122143.56815-2-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-17erofs: support idmapped mountsChao Yu2-2/+2
This patch enables idmapped mounts for erofs, since all dedicated helpers for this functionality existsm, so, in this patch we just pass down the user_namespace argument from the VFS methods to the relevant helpers. Simple idmap example on erofs image: 1. mkdir dir 2. touch dir/file 3. mkfs.erofs erofs.img dir 4. mount -t erofs -o loop erofs.img /mnt/erofs/ 5. ls -ln /mnt/erofs/ total 0 -rw-rw-r-- 1 1000 1000 0 May 17 15:26 file 6. mount-idmapped --map-mount b:1000:1001:1 /mnt/erofs/ /mnt/scratch_erofs/ 7. ls -ln /mnt/scratch_erofs/ total 0 -rw-rw-r-- 1 1001 1001 0 May 17 15:26 file Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Chao Yu <chao.yu@oppo.com> Link: https://lore.kernel.org/r/20220517104103.3570721-1-chao@kernel.org Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>