summaryrefslogtreecommitdiffstats
path: root/fs/erofs
AgeCommit message (Collapse)AuthorFilesLines
2021-05-13erofs: fix 1 lcluster-sized pcluster for big pclusterGao Xiang1-2/+19
If the 1st NONHEAD lcluster of a pcluster isn't CBLKCNT lcluster type rather than a HEAD or PLAIN type instead, which means its pclustersize _must_ be 1 lcluster (since its uncompressed size < 2 lclusters), as illustrated below: HEAD HEAD / PLAIN lcluster type ____________ ____________ |_:__________|_________:__| file data (uncompressed) . . .____________. |____________| pcluster data (compressed) Such on-disk case was explained before [1] but missed to be handled properly in the runtime implementation. It can be observed if manually generating 1 lcluster-sized pcluster with 2 lclusters (thus CBLKCNT doesn't exist.) Let's fix it now. [1] https://lore.kernel.org/r/20210407043927.10623-1-xiang@kernel.org Link: https://lore.kernel.org/r/20210510064715.29123-1-xiang@kernel.org Fixes: cec6e93beadf ("erofs: support parsing big pcluster compress indexes") Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <xiang@kernel.org>
2021-04-10erofs: enable big pcluster featureGao Xiang1-1/+4
Enable COMPR_CFGS and BIG_PCLUSTER since the implementations are all settled properly. Link: https://lore.kernel.org/r/20210407043927.10623-11-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-10erofs: support decompress big pcluster for lz4 backendGao Xiang2-95/+138
Prior to big pcluster, there was only one compressed page so it'd easy to map this. However, when big pcluster is enabled, more work needs to be done to handle multiple compressed pages. In detail, - (maptype 0) if there is only one compressed page + no need to copy inplace I/O, just map it directly what we did before; - (maptype 1) if there are more compressed pages + no need to copy inplace I/O, vmap such compressed pages instead; - (maptype 2) if inplace I/O needs to be copied, use per-CPU buffers for decompression then. Another thing is how to detect inplace decompression is feasable or not (it's still quite easy for non big pclusters), apart from the inplace margin calculation, inplace I/O page reusing order is also needed to be considered for each compressed page. Currently, if the compressed page is the xth page, it shouldn't be reused as [0 ... nrpages_out - nrpages_in + x], otherwise a full copy will be triggered. Although there are some extra optimization ideas for this, I'd like to make big pcluster work correctly first and obviously it can be further optimized later since it has nothing with the on-disk format at all. Link: https://lore.kernel.org/r/20210407043927.10623-10-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-10erofs: support parsing big pcluster compact indexesGao Xiang1-10/+62
Different from non-compact indexes, several lclusters are packed as the compact form at once and an unique base blkaddr is stored for each pack, so each lcluster index would take less space on avarage (e.g. 2 bytes for COMPACT_2B.) btw, that is also why BIG_PCLUSTER switch should be consistent for compact head0/1. Prior to big pcluster, the size of all pclusters was 1 lcluster. Therefore, when a new HEAD lcluster was scanned, blkaddr would be bumped by 1 lcluster. However, that way doesn't work anymore for big pcluster since we actually don't know the compressed size of pclusters in advance (before reading CBLKCNT lcluster). So, instead, let blkaddr of each pack be the first pcluster blkaddr with a valid CBLKCNT, in detail, 1) if CBLKCNT starts at the pack, this first valid pcluster is itself, e.g. _____________________________________________________________ |_CBLKCNT0_|_NONHEAD_| .. |_HEAD_|_CBLKCNT1_| ... |_HEAD_| ... ^ = blkaddr base ^ += CBLKCNT0 ^ += CBLKCNT1 2) if CBLKCNT doesn't start at the pack, the first valid pcluster is the next pcluster, e.g. _________________________________________________________ | NONHEAD_| .. |_HEAD_|_CBLKCNT0_| ... |_HEAD_|_HEAD_| ... ^ = blkaddr base ^ += CBLKCNT0 ^ += 1 When a CBLKCNT is found, blkaddr will be increased by CBLKCNT lclusters, or a new HEAD is found immediately, bump blkaddr by 1 instead (see the picture above.) Also noted if CBLKCNT is the end of the pack, instead of storing delta1 (distance of the next HEAD lcluster) as normal NONHEADs, it still uses the compressed block count (delta0) since delta1 can be calculated indirectly but the block count can't. Adjust decoding logic to fit big pcluster compact indexes as well. Link: https://lore.kernel.org/r/20210407043927.10623-9-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-10erofs: support parsing big pcluster compress indexesGao Xiang1-6/+73
When INCOMPAT_BIG_PCLUSTER sb feature is enabled, legacy compress indexes will also have the same on-disk header compact indexes to keep per-file configurations instead of leaving it zeroed. If ADVISE_BIG_PCLUSTER is set for a file, CBLKCNT will be loaded for each pcluster in this file by parsing 1st non-head lcluster. Link: https://lore.kernel.org/r/20210407043927.10623-8-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-10erofs: adjust per-CPU buffers according to max_pclusterblksGao Xiang2-4/+18
Adjust per-CPU buffers on demand since big pcluster definition is available. Also, bail out unsupported pcluster size according to Z_EROFS_PCLUSTER_MAX_SIZE. Link: https://lore.kernel.org/r/20210407043927.10623-7-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-10erofs: add big physical cluster definitionGao Xiang2-4/+16
Big pcluster indicates the size of compressed data for each physical pcluster is no longer fixed as block size, but could be more than 1 block (more accurately, 1 logical pcluster) When big pcluster feature is enabled for head0/1, delta0 of the 1st non-head lcluster index will keep block count of this pcluster in lcluster size instead of 1. Or, the compressed size of pcluster should be 1 lcluster if pcluster has no non-head lcluster index. Also note that BIG_PCLUSTER feature reuses COMPR_CFGS feature since it depends on COMPR_CFGS and will be released together. Link: https://lore.kernel.org/r/20210407043927.10623-6-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-10erofs: fix up inplace I/O pointer for big pclusterGao Xiang1-14/+14
When picking up inplace I/O pages, it should be traversed in reverse order in aligned with the traversal order of file-backed online pages. Also, index should be updated together when preloading compressed pages. Previously, only page-sized pclustersize was supported so no problem at all. Also rename `compressedpages' to `icpage_ptr' to reflect its functionality. Link: https://lore.kernel.org/r/20210407043927.10623-5-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-10erofs: introduce physical cluster slab poolsGao Xiang5-80/+126
Since multiple pcluster sizes could be used at once, the number of compressed pages will become a variable factor. It's necessary to introduce slab pools rather than a single slab cache now. This limits the pclustersize to 1M (Z_EROFS_PCLUSTER_MAX_SIZE), and get rid of the obsolete EROFS_FS_CLUSTER_PAGE_LIMIT, which has no use now. Link: https://lore.kernel.org/r/20210407043927.10623-4-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-10erofs: introduce multipage per-CPU buffersGao Xiang6-34/+163
To deal the with the cases which inplace decompression is infeasible for some inplace I/O. Per-CPU buffers was introduced to get rid of page allocation latency and thrash for low-latency decompression algorithms such as lz4. For the big pcluster feature, introduce multipage per-CPU buffers to keep such inplace I/O pclusters temporarily as well but note that per-CPU pages are just consecutive virtually. When a new big pcluster fs is mounted, its max pclustersize will be read and per-CPU buffers can be growed if needed. Shrinking adjustable per-CPU buffers is more complex (because we don't know if such size is still be used), so currently just release them all when unloading. Link: https://lore.kernel.org/r/20210409190630.19569-1-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-07erofs: reserve physical_clusterbits[]Gao Xiang4-21/+2
Formal big pcluster design is actually more powerful / flexable than the previous thought whose pclustersize was fixed as power-of-2 blocks, which was obviously inefficient and space-wasting. Instead, pclustersize can now be set independently for each pcluster, so various pcluster sizes can also be used together in one file if mkfs wants (for example, according to data type and/or compression ratio). Let's get rid of previous physical_clusterbits[] setting (also notice that corresponding on-disk fields are still 0 for now). Therefore, head1/2 can be used for at most 2 different algorithms in one file and again pclustersize is now independent of these. Link: https://lore.kernel.org/r/20210407043927.10623-2-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-03erofs: Clean up spelling mistakes found in fs/erofsRuiqi Gong2-2/+2
zmap.c: s/correspoinding/corresponding zdata.c: s/endding/ending Link: https://lore.kernel.org/r/20210331093920.31923-1-gongruiqi1@huawei.com Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Ruiqi Gong <gongruiqi1@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-29erofs: add on-disk compression configurationsGao Xiang4-7/+157
Add a bitmap for available compression algorithms and a variable-sized on-disk table for compression options in preparation for upcoming big pcluster and LZMA algorithm, which follows the end of super block. To parse the compression options, the bitmap is scanned one by one. For each available algorithm, there is data followed by 2-byte `length' correspondingly (it's enough for most cases, or entire fs blocks should be used.) With such available algorithm bitmap, kernel itself can also refuse to mount such filesystem if any unsupported compression algorithm exists. Note that COMPR_CFGS feature will be enabled with BIG_PCLUSTER. Link: https://lore.kernel.org/r/20210329100012.12980-1-hsiangkao@aol.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-29erofs: introduce on-disk lz4 fs configurationsGao Xiang4-6/+25
Introduce z_erofs_lz4_cfgs to store all lz4 configurations. Currently it's only max_distance, but will be used for new features later. Link: https://lore.kernel.org/r/20210329012308.28743-4-hsiangkao@aol.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-29erofs: support adjust lz4 history window sizeHuang Jianan4-6/+42
lz4 uses LZ4_DISTANCE_MAX to record history preservation. When using rolling decompression, a block with a higher compression ratio will cause a larger memory allocation (up to 64k). It may cause a large resource burden in extreme cases on devices with small memory and a large number of concurrent IOs. So appropriately reducing this value can improve performance. Decreasing this value will reduce the compression ratio (except when input_size <LZ4_DISTANCE_MAX). But considering that erofs currently only supports 4k output, reducing this value will not significantly reduce the compression benefits. The maximum value of LZ4_DISTANCE_MAX defined by lz4 is 64k, and we can only reduce this value. For the old kernel, it just can't reduce the memory allocation during rolling decompression without affecting the decompression result. Link: https://lore.kernel.org/r/20210329012308.28743-3-hsiangkao@aol.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Huang Jianan <huangjianan@oppo.com> Signed-off-by: Guo Weichao <guoweichao@oppo.com> [ Gao Xiang: introduce struct erofs_sb_lz4_info for configurations. ] Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-29erofs: introduce erofs_sb_has_xxx() helpersGao Xiang3-3/+11
Introduce erofs_sb_has_xxx() to make long checks short, especially for later big pcluster & LZMA features. Link: https://lore.kernel.org/r/20210329012308.28743-2-hsiangkao@aol.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-29erofs: add unsupported inode i_format checkGao Xiang2-0/+10
If any unknown i_format fields are set (may be of some new incompat inode features), mark such inode as unsupported. Just in case of any new incompat i_format fields added in the future. Link: https://lore.kernel.org/r/20210329003614.6583-1-hsiangkao@aol.com Fixes: 431339ba9042 ("staging: erofs: add inode operations") Cc: <stable@vger.kernel.org> # 4.19+ Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-29erofs: don't use erofs_map_blocks() any moreYue Hu2-21/+4
Currently, erofs_map_blocks() will be called only from erofs_{bmap, read_raw_page} which are all for uncompressed files. So, the compression branch in erofs_map_blocks() is pointless. Let's remove it and use erofs_map_blocks_flatmode() directly. Also update related comments. Link: https://lore.kernel.org/r/20210325071008.573-1-zbestahu@gmail.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Yue Hu <huyue2@yulong.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-29erofs: complete a missing case for inplace I/OGao Xiang1-15/+29
Add a missing case which could cause unnecessary page allocation but not directly use inplace I/O instead, which increases runtime extra memory footprint. The detail is, considering an online file-backed page, the right half of the page is chosen to be cached (e.g. the end page of a readahead request) and some of its data doesn't exist in managed cache, so the pcluster will be definitely kept in the submission chain. (IOWs, it cannot be decompressed without I/O, e.g., due to the bypass queue). Currently, DELAYEDALLOC/TRYALLOC cases can be downgraded as NOINPLACE, and stop online pages from inplace I/O. After this patch, unneeded page allocations won't be observed in pickup_page_for_submission() then. Link: https://lore.kernel.org/r/20210321183227.5182-1-hsiangkao@aol.com Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-29erofs: use sync decompression for atomic contexts onlyHuang Jianan3-2/+9
Sync decompression was introduced to get rid of additional kworker scheduling overhead. But there is no such overhead in non-atomic contexts. Therefore, it should be better to turn off sync decompression to avoid the current thread waiting in z_erofs_runqueue. Link: https://lore.kernel.org/r/20210317035448.13921-3-huangjianan@oppo.com Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Huang Jianan <huangjianan@oppo.com> Signed-off-by: Guo Weichao <guoweichao@oppo.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-29erofs: use workqueue decompression for atomic contexts onlyHuang Jianan1-1/+8
z_erofs_decompressqueue_endio may not be executed in the atomic context, for example, when dm-verity is turned on. In this scenario, data can be decompressed directly to get rid of additional kworker scheduling overhead. Link: https://lore.kernel.org/r/20210317035448.13921-2-huangjianan@oppo.com Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Huang Jianan <huangjianan@oppo.com> Signed-off-by: Guo Weichao <guoweichao@oppo.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-29erofs: avoid memory allocation failure during rolling decompressionHuang Jianan1-3/+2
Currently, err would be treated as io error. Therefore, it'd be better to ensure memory allocation during rolling decompression to avoid such io error. In the long term, we might consider adding another !Uptodate case for such case. Link: https://lore.kernel.org/r/20210316031515.90954-1-huangjianan@oppo.com Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Huang Jianan <huangjianan@oppo.com> Signed-off-by: Guo Weichao <guoweichao@oppo.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-13Merge tag 'erofs-for-5.12-rc3' of ↵Linus Torvalds1-17/+11
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fix from Gao Xiang: "Fix an urgent regression introduced by commit baa2c7c97153 ("block: set .bi_max_vecs as actual allocated vector number"), which could cause unexpected hung since linux 5.12-rc1. Resolve it by avoiding using bio->bi_max_vecs completely" * tag 'erofs-for-5.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix bio->bi_max_vecs behavior change
2021-03-11block: rename BIO_MAX_PAGES to BIO_MAX_VECSChristoph Hellwig1-1/+1
Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop confusing users of the bio API. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-08erofs: fix bio->bi_max_vecs behavior changeGao Xiang1-17/+11
Martin reported an issue that directory read could be hung on the latest -rc kernel with some certain image. The root cause is that commit baa2c7c97153 ("block: set .bi_max_vecs as actual allocated vector number") changes .bi_max_vecs behavior. bio->bi_max_vecs is set as actual allocated vector number rather than the requested number now. Let's avoid using .bi_max_vecs completely instead. Link: https://lore.kernel.org/r/20210306040438.8084-1-hsiangkao@aol.com Reported-by: Martin DEVERA <devik@eaxlabs.cz> Reviewed-by: Chao Yu <yuchao0@huawei.com> [ Gao Xiang: note that <= 5.11 kernels are not impacted. ] Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-02-28Merge tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-blockLinus Torvalds1-3/+1
Pull more block updates from Jens Axboe: "A few stragglers (and one due to me missing it originally), and fixes for changes in this merge window mostly. In particular: - blktrace cleanups (Chaitanya, Greg) - Kill dead blk_pm_* functions (Bart) - Fixes for the bio alloc changes (Christoph) - Fix for the partition changes (Christoph, Ming) - Fix for turning off iopoll with polled IO inflight (Jeffle) - nbd disconnect fix (Josef) - loop fsync error fix (Mauricio) - kyber update depth fix (Yang) - max_sectors alignment fix (Mikulas) - Add bio_max_segs helper (Matthew)" * tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block: (21 commits) block: Add bio_max_segs blktrace: fix documentation for blk_fill_rw() block: memory allocations in bounce_clone_bio must not fail block: remove the gfp_mask argument to bounce_clone_bio block: fix bounce_clone_bio for passthrough bios block-crypto-fallback: use a bio_set for splitting bios block: fix logging on capacity change blk-settings: align max_sectors on "logical_block_size" boundary block: reopen the device in blkdev_reread_part block: don't skip empty device in in disk_uevent blktrace: remove debugfs file dentries from struct blk_trace nbd: handle device refs for DESTROY_ON_DISCONNECT properly kyber: introduce kyber_depth_updated() loop: fix I/O error on fsync() in detached loop devices block: fix potential IO hang when turning off io_poll block: get rid of the trace rq insert wrapper blktrace: fix blk_rq_merge documentation blktrace: fix blk_rq_issue documentation blktrace: add blk_fill_rwbs documentation comment block: remove superfluous param in blk_fill_rwbs() ...
2021-02-26block: Add bio_max_segsMatthew Wilcox (Oracle)1-3/+1
It's often inconvenient to use BIO_MAX_PAGES due to min() requiring the sign to be the same. Introduce bio_max_segs() and change BIO_MAX_PAGES to be unsigned to make it easier for the users. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23Merge tag 'idmapped-mounts-v5.12' of ↵Linus Torvalds2-5/+7
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull idmapped mounts from Christian Brauner: "This introduces idmapped mounts which has been in the making for some time. Simply put, different mounts can expose the same file or directory with different ownership. This initial implementation comes with ports for fat, ext4 and with Christoph's port for xfs with more filesystems being actively worked on by independent people and maintainers. Idmapping mounts handle a wide range of long standing use-cases. Here are just a few: - Idmapped mounts make it possible to easily share files between multiple users or multiple machines especially in complex scenarios. For example, idmapped mounts will be used in the implementation of portable home directories in systemd-homed.service(8) where they allow users to move their home directory to an external storage device and use it on multiple computers where they are assigned different uids and gids. This effectively makes it possible to assign random uids and gids at login time. - It is possible to share files from the host with unprivileged containers without having to change ownership permanently through chown(2). - It is possible to idmap a container's rootfs and without having to mangle every file. For example, Chromebooks use it to share the user's Download folder with their unprivileged containers in their Linux subsystem. - It is possible to share files between containers with non-overlapping idmappings. - Filesystem that lack a proper concept of ownership such as fat can use idmapped mounts to implement discretionary access (DAC) permission checking. - They allow users to efficiently changing ownership on a per-mount basis without having to (recursively) chown(2) all files. In contrast to chown (2) changing ownership of large sets of files is instantenous with idmapped mounts. This is especially useful when ownership of a whole root filesystem of a virtual machine or container is changed. With idmapped mounts a single syscall mount_setattr syscall will be sufficient to change the ownership of all files. - Idmapped mounts always take the current ownership into account as idmappings specify what a given uid or gid is supposed to be mapped to. This contrasts with the chown(2) syscall which cannot by itself take the current ownership of the files it changes into account. It simply changes the ownership to the specified uid and gid. This is especially problematic when recursively chown(2)ing a large set of files which is commong with the aforementioned portable home directory and container and vm scenario. - Idmapped mounts allow to change ownership locally, restricting it to specific mounts, and temporarily as the ownership changes only apply as long as the mount exists. Several userspace projects have either already put up patches and pull-requests for this feature or will do so should you decide to pull this: - systemd: In a wide variety of scenarios but especially right away in their implementation of portable home directories. https://systemd.io/HOME_DIRECTORY/ - container runtimes: containerd, runC, LXD:To share data between host and unprivileged containers, unprivileged and privileged containers, etc. The pull request for idmapped mounts support in containerd, the default Kubernetes runtime is already up for quite a while now: https://github.com/containerd/containerd/pull/4734 - The virtio-fs developers and several users have expressed interest in using this feature with virtual machines once virtio-fs is ported. - ChromeOS: Sharing host-directories with unprivileged containers. I've tightly synced with all those projects and all of those listed here have also expressed their need/desire for this feature on the mailing list. For more info on how people use this there's a bunch of talks about this too. Here's just two recent ones: https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf https://fosdem.org/2021/schedule/event/containers_idmap/ This comes with an extensive xfstests suite covering both ext4 and xfs: https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts It covers truncation, creation, opening, xattrs, vfscaps, setid execution, setgid inheritance and more both with idmapped and non-idmapped mounts. It already helped to discover an unrelated xfs setgid inheritance bug which has since been fixed in mainline. It will be sent for inclusion with the xfstests project should you decide to merge this. In order to support per-mount idmappings vfsmounts are marked with user namespaces. The idmapping of the user namespace will be used to map the ids of vfs objects when they are accessed through that mount. By default all vfsmounts are marked with the initial user namespace. The initial user namespace is used to indicate that a mount is not idmapped. All operations behave as before and this is verified in the testsuite. Based on prior discussions we want to attach the whole user namespace and not just a dedicated idmapping struct. This allows us to reuse all the helpers that already exist for dealing with idmappings instead of introducing a whole new range of helpers. In addition, if we decide in the future that we are confident enough to enable unprivileged users to setup idmapped mounts the permission checking can take into account whether the caller is privileged in the user namespace the mount is currently marked with. The user namespace the mount will be marked with can be specified by passing a file descriptor refering to the user namespace as an argument to the new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern of extensibility. The following conditions must be met in order to create an idmapped mount: - The caller must currently have the CAP_SYS_ADMIN capability in the user namespace the underlying filesystem has been mounted in. - The underlying filesystem must support idmapped mounts. - The mount must not already be idmapped. This also implies that the idmapping of a mount cannot be altered once it has been idmapped. - The mount must be a detached/anonymous mount, i.e. it must have been created by calling open_tree() with the OPEN_TREE_CLONE flag and it must not already have been visible in the filesystem. The last two points guarantee easier semantics for userspace and the kernel and make the implementation significantly simpler. By default vfsmounts are marked with the initial user namespace and no behavioral or performance changes are observed. The manpage with a detailed description can be found here: https://git.kernel.org/brauner/man-pages/c/1d7b902e2875a1ff342e036a9f866a995640aea8 In order to support idmapped mounts, filesystems need to be changed and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The patches to convert individual filesystem are not very large or complicated overall as can be seen from the included fat, ext4, and xfs ports. Patches for other filesystems are actively worked on and will be sent out separately. The xfstestsuite can be used to verify that port has been done correctly. The mount_setattr() syscall is motivated independent of the idmapped mounts patches and it's been around since July 2019. One of the most valuable features of the new mount api is the ability to perform mounts based on file descriptors only. Together with the lookup restrictions available in the openat2() RESOLVE_* flag namespace which we added in v5.6 this is the first time we are close to hardened and race-free (e.g. symlinks) mounting and path resolution. While userspace has started porting to the new mount api to mount proper filesystems and create new bind-mounts it is currently not possible to change mount options of an already existing bind mount in the new mount api since the mount_setattr() syscall is missing. With the addition of the mount_setattr() syscall we remove this last restriction and userspace can now fully port to the new mount api, covering every use-case the old mount api could. We also add the crucial ability to recursively change mount options for a whole mount tree, both removing and adding mount options at the same time. This syscall has been requested multiple times by various people and projects. There is a simple tool available at https://github.com/brauner/mount-idmapped that allows to create idmapped mounts so people can play with this patch series. I'll add support for the regular mount binary should you decide to pull this in the following weeks: Here's an example to a simple idmapped mount of another user's home directory: u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt u1001@f2-vm:/$ ls -al /home/ubuntu/ total 28 drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 . drwxr-xr-x 4 root root 4096 Oct 28 04:00 .. -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ ls -al /mnt/ total 28 drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 . drwxr-xr-x 29 root root 4096 Oct 28 22:01 .. -rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile -rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ touch /mnt/my-file u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file u1001@f2-vm:/$ ls -al /mnt/my-file -rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file u1001@f2-vm:/$ ls -al /home/ubuntu/my-file -rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file u1001@f2-vm:/$ getfacl /mnt/my-file getfacl: Removing leading '/' from absolute path names # file: mnt/my-file # owner: u1001 # group: u1001 user::rw- user:u1001:rwx group::rw- mask::rwx other::r-- u1001@f2-vm:/$ getfacl /home/ubuntu/my-file getfacl: Removing leading '/' from absolute path names # file: home/ubuntu/my-file # owner: ubuntu # group: ubuntu user::rw- user:ubuntu:rwx group::rw- mask::rwx other::r--" * tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits) xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl xfs: support idmapped mounts ext4: support idmapped mounts fat: handle idmapped mounts tests: add mount_setattr() selftests fs: introduce MOUNT_ATTR_IDMAP fs: add mount_setattr() fs: add attr_flags_to_mnt_flags helper fs: split out functions to hold writers namespace: only take read lock in do_reconfigure_mnt() mount: make {lock,unlock}_mount_hash() static namespace: take lock_mount_hash() directly when changing flags nfs: do not export idmapped mounts overlayfs: do not mount on top of idmapped mounts ecryptfs: do not mount on top of idmapped mounts ima: handle idmapped mounts apparmor: handle idmapped mounts fs: make helpers idmap mount aware exec: handle idmapped mounts would_dump: handle idmapped mounts ...
2021-02-22Merge branch 'work.d_name' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull d_name whack-a-mole from Al Viro: "A bunch of places that play with ->d_name in printks instead of using proper formats..." * 'work.d_name' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: orangefs_file_mmap(): use %pD cifs_debug: use %pd instead of messing with ->d_name erofs: use %pd instead of messing with ->d_name cramfs: use %pD instead of messing with file_dentry()->d_name
2021-02-11erofs: initialized fields can only be observed after bit is setGao Xiang2-2/+18
Currently, although set_bit() & test_bit() pairs are used as a fast- path for initialized configurations. However, these atomic ops are actually relaxed forms. Instead, load-acquire & store-release form is needed to make sure uninitialized fields won't be observed in advance here (yet no such corresponding bitops so use full barriers instead.) Link: https://lore.kernel.org/r/20210209130618.15838-1-hsiangkao@aol.com Fixes: 62dc45979f3f ("staging: erofs: fix race of initializing xattrs of a inode at the same time") Fixes: 152a333a5895 ("staging: erofs: add compacted compression indexes support") Cc: <stable@vger.kernel.org> # 5.3+ Reported-by: Huang Jianan <huangjianan@oppo.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-02-11erofs: fix shift-out-of-bounds of blkszbitsGao Xiang1-2/+2
syzbot generated a crafted bitszbits which can be shifted out-of-bounds[1]. So directly print unsupported blkszbits instead of blksize. [1] https://lore.kernel.org/r/000000000000c72ddd05b9444d2f@google.com Link: https://lore.kernel.org/r/20210120013016.14071-1-hsiangkao@aol.com Reported-by: syzbot+c68f467cd7c45860e8d4@syzkaller.appspotmail.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-01-24fs: make helpers idmap mount awareChristian Brauner2-4/+6
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24stat: handle idmapped mountsChristian Brauner1-1/+1
The generic_fillattr() helper fills in the basic attributes associated with an inode. Enable it to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace before we store the uid and gid. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-06erofs: use %pd instead of messing with ->d_nameAl Viro1-2/+2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-10erofs: avoid using generic_block_bmapHuang Jianan1-19/+7
Surprisingly, `block' in sector_t indicates the number of i_blkbits-sized blocks rather than sectors for bmap. In addition, considering buffer_head limits mapped size to 32-bits, should avoid using generic_block_bmap. Link: https://lore.kernel.org/r/20201209115740.18802-1-huangjianan@oppo.com Fixes: 9da681e017a3 ("staging: erofs: support bmap") Reviewed-by: Chao Yu <yuchao0@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Huang Jianan <huangjianan@oppo.com> Signed-off-by: Guo Weichao <guoweichao@oppo.com> [ Gao Xiang: slightly update the commit message description. ] Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-12-09erofs: force inplace I/O under low memory scenarioGao Xiang2-8/+43
Try to forcely switch to inplace I/O under low memory scenario in order to avoid direct memory reclaim due to cached page allocation. Link: https://lore.kernel.org/r/20201209123717.12430-1-hsiangkao@aol.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-12-08erofs: simplify try_to_claim_pcluster()Gao Xiang1-27/+24
simplify try_to_claim_pcluster() by directly using cmpxchg() here (the retry loop caused more overhead.) Also, move the chain loop detection in and rename it to z_erofs_try_to_claim_pcluster(). Link: https://lore.kernel.org/r/20201208095834.3133565-3-hsiangkao@redhat.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-12-08erofs: insert to managed cache after adding to pclGao Xiang1-17/+9
Previously, it could be some concern to call add_to_page_cache_lru() with page->mapping == Z_EROFS_MAPPING_STAGING (!= NULL). In contrast, page->private is used instead now, so partially revert commit 5ddcee1f3a1c ("erofs: get rid of __stagingpage_alloc helper") with some adaption for simplicity. Link: https://lore.kernel.org/r/20201208095834.3133565-2-hsiangkao@redhat.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-12-08erofs: get rid of magical Z_EROFS_MAPPING_STAGINGGao Xiang4-40/+71
Previously, we played around with magical page->mapping for short-lived temporary pages since we need to identify different types of pages in the same pcluster but both invalidated and short-lived temporary pages can have page->mapping == NULL. It was considered as safe because that temporary pages are all non-LRU / non-movable pages. This patch tends to use specific page->private to identify short-lived pages instead so it won't rely on page->mapping anymore. Details are described in "compress.h" as well. Link: https://lore.kernel.org/r/20201208095834.3133565-1-hsiangkao@redhat.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-12-08erofs: remove a void EROFS_VERSION macro set in MakefileVladimir Zapolskiy1-5/+0
Since commit 4f761fa253b4 ("erofs: rename errln/infoln/debugln to erofs_{err, info, dbg}") the defined macro EROFS_VERSION has no affect, therefore removing it from the Makefile is a non-functional change. Link: https://lore.kernel.org/r/20201030122839.25431-1-vladimir@tuxera.com Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Vladimir Zapolskiy <vladimir@tuxera.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-11-04erofs: fix setting up pcluster for temporary pagesGao Xiang1-2/+5
pcluster should be only set up for all managed pages instead of temporary pages. Since it currently uses page->mapping to identify, the impact is minor for now. [ Update: Vladimir reported the kernel log becomes polluted because PAGE_FLAGS_CHECK_AT_FREE flag(s) set if the page allocation debug option is enabled. ] Link: https://lore.kernel.org/r/20201022145724.27284-1-hsiangkao@aol.com Fixes: 5ddcee1f3a1c ("erofs: get rid of __stagingpage_alloc helper") Cc: <stable@vger.kernel.org> # 5.5+ Tested-by: Vladimir Zapolskiy <vladimir@tuxera.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-11-04erofs: derive atime instead of leaving it emptyGao Xiang1-10/+11
EROFS has _only one_ ondisk timestamp (ctime is currently documented and recorded, we might also record mtime instead with a new compat feature if needed) for each extended inode since EROFS isn't mainly for archival purposes so no need to keep all timestamps on disk especially for Android scenarios due to security concerns. Also, romfs/cramfs don't have their own on-disk timestamp, and squashfs only records mtime instead. Let's also derive access time from ondisk timestamp rather than leaving it empty, and if mtime/atime for each file are really needed for specific scenarios as well, we can also use xattrs to record them then. Link: https://lore.kernel.org/r/20201031195102.21221-1-hsiangkao@aol.com [ Gao Xiang: It'd be better to backport for user-friendly concern. ] Fixes: 431339ba9042 ("staging: erofs: add inode operations") Cc: stable <stable@vger.kernel.org> # 4.19+ Reported-by: nl6720 <nl6720@gmail.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-10-24Merge branch 'work.misc' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted stuff all over the place (the largest group here is Christoph's stat cleanups)" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: remove KSTAT_QUERY_FLAGS fs: remove vfs_stat_set_lookup_flags fs: move vfs_fstatat out of line fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat fs: remove vfs_statx_fd fs: omfs: use kmemdup() rather than kmalloc+memcpy [PATCH] reduce boilerplate in fsid handling fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS selftests: mount: add nosymfollow tests Add a "nosymfollow" mount option.
2020-10-09erofs: remove unnecessary enum entriesChengguang Xu1-2/+0
Opt_nouser_xattr and Opt_noacl are useless, so just remove them. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Link: https://lore.kernel.org/r/20201005071550.66193-1-cgxu519@mykernel.net Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-09-19erofs: add REQ_RAHEAD flag to readahead requestsGao Xiang2-7/+12
Let's add REQ_RAHEAD flag so it'd be easier to identify readahead I/O requests in blktrace. Reviewed-by: Chao Yu <yuchao0@huawei.com> Link: https://lore.kernel.org/r/20200919072730.24989-3-hsiangkao@redhat.com Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-09-19erofs: fold in should_decompress_synchronously()Gao Xiang1-9/+3
should_decompress_synchronously() has one single condition for now, so fold it instead. Reviewed-by: Chao Yu <yuchao0@huawei.com> Link: https://lore.kernel.org/r/20200919072730.24989-2-hsiangkao@redhat.com Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-09-19erofs: avoid unnecessary variable `err'Gao Xiang1-3/+1
variable `err' in z_erofs_submit_queue() isn't useful here, remove it instead. Reviewed-by: Chao Yu <yuchao0@huawei.com> Link: https://lore.kernel.org/r/20200919072730.24989-1-hsiangkao@redhat.com Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-09-18[PATCH] reduce boilerplate in fsid handlingAl Viro1-2/+1
Get rid of boilerplate in most of ->statfs() instances... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-18erofs: remove unneeded parameterChao Yu1-9/+6
After commit 0615090c5044 ("erofs: convert compressed files from readpages to readahead"), add_to_page_cache_lru() was moved to mm code, so that in below call path, no page will be cached into @pagepool list or grabbed from @pagepool list: - z_erofs_readpage - z_erofs_do_read_page - preload_compressed_pages - erofs_allocpage Let's get rid of this unneeded @pagepool parameter. Signed-off-by: Chao Yu <yuchao0@huawei.com> Link: https://lore.kernel.org/r/20200917011821.22767-1-yuchao0@huawei.com Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-09-18erofs: avoid duplicated permission check for "trusted." xattrsGao Xiang1-2/+0
Don't recheck it since xattr_permission() already checks CAP_SYS_ADMIN capability. Just follow 5d3ce4f70172 ("f2fs: avoid duplicated permission check for "trusted." xattrs") Reported-by: Hongyu Jin <hongyu.jin@unisoc.com> [ Gao Xiang: since it could cause some complex Android overlay permission issue as well on android-5.4+, it'd be better to backport to 5.4+ rather than pure cleanup on mainline. ] Cc: <stable@vger.kernel.org> # 5.4+ Link: https://lore.kernel.org/r/20200811070020.6339-1-hsiangkao@redhat.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>