From acda97acb2e98c97895d81d20494bf6a4bc67c6c Mon Sep 17 00:00:00 2001 From: Igor Matheus Andrade Torrente Date: Mon, 31 May 2021 10:05:15 -0300 Subject: docs: convert dax.txt to rst Change the file extension and add the rst constructs to integrate this doc to the documentation infrastructure and take advantage of rst features. Signed-off-by: Igor Matheus Andrade Torrente Link: https://lore.kernel.org/r/20210531130515.10309-1-igormtorrente@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/dax.rst | 291 ++++++++++++++++++++++++++++++++++++ Documentation/filesystems/dax.txt | 257 ------------------------------- Documentation/filesystems/index.rst | 1 + 3 files changed, 292 insertions(+), 257 deletions(-) create mode 100644 Documentation/filesystems/dax.rst delete mode 100644 Documentation/filesystems/dax.txt (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/dax.rst b/Documentation/filesystems/dax.rst new file mode 100644 index 000000000000..9a1b8fd9e82b --- /dev/null +++ b/Documentation/filesystems/dax.rst @@ -0,0 +1,291 @@ +======================= +Direct Access for files +======================= + +Motivation +---------- + +The page cache is usually used to buffer reads and writes to files. +It is also used to provide the pages which are mapped into userspace +by a call to mmap. + +For block devices that are memory-like, the page cache pages would be +unnecessary copies of the original storage. The `DAX` code removes the +extra copy by performing reads and writes directly to the storage device. +For file mappings, the storage device is mapped directly into userspace. + + +Usage +----- + +If you have a block device which supports `DAX`, you can make a filesystem +on it as usual. The `DAX` code currently only supports files with a block +size equal to your kernel's `PAGE_SIZE`, so you may need to specify a block +size when creating the filesystem. + +Currently 3 filesystems support `DAX`: ext2, ext4 and xfs. 
Enabling `DAX` on them +is different. + +Enabling DAX on ext2 +-------------------- + +When mounting the filesystem, use the ``-o dax`` option on the command line or +add 'dax' to the options in ``/etc/fstab``. This works to enable `DAX` on all files +within the filesystem. It is equivalent to the ``-o dax=always`` behavior below. + + +Enabling DAX on xfs and ext4 +---------------------------- + +Summary +------- + + 1. There exists an in-kernel file access mode flag `S_DAX` that corresponds to + the statx flag `STATX_ATTR_DAX`. See the manpage for statx(2) for details + about this access mode. + + 2. There exists a persistent flag `FS_XFLAG_DAX` that can be applied to regular + files and directories. This advisory flag can be set or cleared at any + time, but doing so does not immediately affect the `S_DAX` state. + + 3. If the persistent `FS_XFLAG_DAX` flag is set on a directory, this flag will + be inherited by all regular files and subdirectories that are subsequently + created in this directory. Files and subdirectories that exist at the time + this flag is set or cleared on the parent directory are not modified by + this modification of the parent directory. + + 4. There exist dax mount options which can override `FS_XFLAG_DAX` in the + setting of the `S_DAX` flag. Given underlying storage which supports `DAX` the + following hold: + + ``-o dax=inode`` means "follow `FS_XFLAG_DAX`" and is the default. + + ``-o dax=never`` means "never set `S_DAX`, ignore `FS_XFLAG_DAX`." + + ``-o dax=always`` means "always set `S_DAX` ignore `FS_XFLAG_DAX`." + + ``-o dax`` is a legacy option which is an alias for ``dax=always``. + + .. warning:: + + The option ``-o dax`` may be removed in the future so ``-o dax=always`` is + the preferred method for specifying this behavior. + + .. note:: + + Modifications to and the inheritance behavior of `FS_XFLAG_DAX` remain + the same even when the filesystem is mounted with a dax option. 
However, + in-core inode state (`S_DAX`) will be overridden until the filesystem is + remounted with dax=inode and the inode is evicted from kernel memory. + + 5. The `S_DAX` policy can be changed via: + + a) Setting the parent directory `FS_XFLAG_DAX` as needed before files are + created + + b) Setting the appropriate dax="foo" mount option + + c) Changing the `FS_XFLAG_DAX` flag on existing regular files and + directories. This has runtime constraints and limitations that are + described in 6) below. + + 6. When changing the `S_DAX` policy via toggling the persistent `FS_XFLAG_DAX` + flag, the change to existing regular files won't take effect until the + files are closed by all processes. + + +Details +------- + +There are 2 per-file dax flags. One is a persistent inode setting (`FS_XFLAG_DAX`) +and the other is a volatile flag indicating the active state of the feature +(`S_DAX`). + +`FS_XFLAG_DAX` is preserved within the filesystem. This persistent config +setting can be set, cleared and/or queried using the `FS_IOC_FS`[`GS`]`ETXATTR` ioctl +(see ioctl_xfs_fsgetxattr(2)) or an utility such as 'xfs_io'. + +New files and directories automatically inherit `FS_XFLAG_DAX` from +their parent directory **when created**. Therefore, setting `FS_XFLAG_DAX` at +directory creation time can be used to set a default behavior for an entire +sub-tree. + +To clarify inheritance, here are 3 examples: + +Example A: + +.. code-block:: shell + + mkdir -p a/b/c + xfs_io -c 'chattr +x' a + mkdir a/b/c/d + mkdir a/e + + ------[outcome]------ + + dax: a,e + no dax: b,c,d + +Example B: + +.. code-block:: shell + + mkdir a + xfs_io -c 'chattr +x' a + mkdir -p a/b/c/d + + ------[outcome]------ + + dax: a,b,c,d + no dax: + +Example C: + +.. code-block:: shell + + mkdir -p a/b/c + xfs_io -c 'chattr +x' c + mkdir a/b/c/d + + ------[outcome]------ + + dax: c,d + no dax: a,b + +The current enabled state (`S_DAX`) is set when a file inode is instantiated in +memory by the kernel. 
It is set based on the underlying media support, the +value of `FS_XFLAG_DAX` and the filesystem's dax mount option. + +statx can be used to query `S_DAX`. + +.. note:: + + That only regular files will ever have `S_DAX` set and therefore statx + will never indicate that `S_DAX` is set on directories. + +Setting the `FS_XFLAG_DAX` flag (specifically or through inheritance) occurs even +if the underlying media does not support dax and/or the filesystem is +overridden with a mount option. + + +Implementation Tips for Block Driver Writers +-------------------------------------------- + +To support `DAX` in your block driver, implement the 'direct_access' +block device operation. It is used to translate the sector number +(expressed in units of 512-byte sectors) to a page frame number (pfn) +that identifies the physical page for the memory. It also returns a +kernel virtual address that can be used to access the memory. + +The direct_access method takes a 'size' parameter that indicates the +number of bytes being requested. The function should return the number +of bytes that can be contiguously accessed at that offset. It may also +return a negative errno if an error occurs. + +In order to support this method, the storage must be byte-accessible by +the CPU at all times. If your device uses paging techniques to expose +a large amount of memory through a smaller window, then you cannot +implement direct_access. Equally, if your device can occasionally +stall the CPU for an extended period, you should also not attempt to +implement direct_access. 
+ +These block devices may be used for inspiration: +- brd: RAM backed block device driver +- dcssblk: s390 dcss block device driver +- pmem: NVDIMM persistent memory driver + + +Implementation Tips for Filesystem Writers +------------------------------------------ + +Filesystem support consists of: + +* Adding support to mark inodes as being `DAX` by setting the `S_DAX` flag in + i_flags +* Implementing ->read_iter and ->write_iter operations which use + :c:func:`dax_iomap_rw()` when inode has `S_DAX` flag set +* Implementing an mmap file operation for `DAX` files which sets the + `VM_MIXEDMAP` and `VM_HUGEPAGE` flags on the `VMA`, and setting the vm_ops to + include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These + handlers should probably call :c:func:`dax_iomap_fault()` passing the + appropriate fault size and iomap operations. +* Calling :c:func:`iomap_zero_range()` passing appropriate iomap operations + instead of :c:func:`block_truncate_page()` for `DAX` files +* Ensuring that there is sufficient locking between reads, writes, + truncates and page faults + +The iomap handlers for allocating blocks must make sure that allocated blocks +are zeroed out and converted to written extents before being returned to avoid +exposure of uninitialized data through mmap. + +These filesystems may be used for inspiration: + +.. seealso:: + + ext2: see Documentation/filesystems/ext2.rst + +.. seealso:: + + xfs: see Documentation/admin-guide/xfs.rst + +.. seealso:: + + ext4: see Documentation/filesystems/ext4/ + + +Handling Media Errors +--------------------- + +The libnvdimm subsystem stores a record of known media error locations for +each pmem block device (in gendisk->badblocks). If we fault at such location, +or one with a latent error not yet discovered, the application can expect +to receive a `SIGBUS`. 
Libnvdimm also allows clearing of these errors by simply +writing the affected sectors (through the pmem driver, and if the underlying +NVDIMM supports the clear_poison DSM defined by ACPI). + +Since `DAX` IO normally doesn't go through the ``driver/bio`` path, applications or +sysadmins have an option to restore the lost data from a prior ``backup/inbuilt`` +redundancy in the following ways: + +1. Delete the affected file, and restore from a backup (sysadmin route): + This will free the filesystem blocks that were being used by the file, + and the next time they're allocated, they will be zeroed first, which + happens through the driver, and will clear bad sectors. + +2. Truncate or hole-punch the part of the file that has a bad-block (at least + an entire aligned sector has to be hole-punched, but not necessarily an + entire filesystem block). + +These are the two basic paths that allow `DAX` filesystems to continue operating +in the presence of media errors. More robust error recovery mechanisms can be +built on top of this in the future, for example, involving redundancy/mirroring +provided at the block layer through DM, or additionally, at the filesystem +level. These would have to rely on the above two tenets, that error clearing +can happen either by sending an IO through the driver, or zeroing (also through +the driver). + + +Shortcomings +------------ + +Even if the kernel or its modules are stored on a filesystem that supports +`DAX` on a block device that supports `DAX`, they will still be copied into RAM. + +The DAX code does not work correctly on architectures which have virtually +mapped caches such as ARM, MIPS and SPARC. + +Calling :c:func:`get_user_pages()` on a range of user memory that has been +mmaped from a `DAX` file will fail when there are no 'struct page' to describe +those pages. 
This problem has been addressed in some device drivers +by adding optional struct page support for pages under the control of +the driver (see `CONFIG_NVDIMM_PFN` in ``drivers/nvdimm`` for an example of +how to do this). In the non struct page cases `O_DIRECT` reads/writes to +those memory ranges from a non-`DAX` file will fail + + +.. note:: + + `O_DIRECT` reads/writes _of a `DAX` file do work, it is the memory that + is being accessed that is key here). Other things that will not work in + the non struct page case include RDMA, :c:func:`sendfile()` and + :c:func:`splice()`. diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt deleted file mode 100644 index e03c20564f3a..000000000000 --- a/Documentation/filesystems/dax.txt +++ /dev/null @@ -1,257 +0,0 @@ -Direct Access for files ------------------------ - -Motivation ----------- - -The page cache is usually used to buffer reads and writes to files. -It is also used to provide the pages which are mapped into userspace -by a call to mmap. - -For block devices that are memory-like, the page cache pages would be -unnecessary copies of the original storage. The DAX code removes the -extra copy by performing reads and writes directly to the storage device. -For file mappings, the storage device is mapped directly into userspace. - - -Usage ------ - -If you have a block device which supports DAX, you can make a filesystem -on it as usual. The DAX code currently only supports files with a block -size equal to your kernel's PAGE_SIZE, so you may need to specify a block -size when creating the filesystem. - -Currently 3 filesystems support DAX: ext2, ext4 and xfs. Enabling DAX on them -is different. - -Enabling DAX on ext2 ------------------------------ - -When mounting the filesystem, use the "-o dax" option on the command line or -add 'dax' to the options in /etc/fstab. This works to enable DAX on all files -within the filesystem. It is equivalent to the '-o dax=always' behavior below. 
- - -Enabling DAX on xfs and ext4 ----------------------------- - -Summary -------- - - 1. There exists an in-kernel file access mode flag S_DAX that corresponds to - the statx flag STATX_ATTR_DAX. See the manpage for statx(2) for details - about this access mode. - - 2. There exists a persistent flag FS_XFLAG_DAX that can be applied to regular - files and directories. This advisory flag can be set or cleared at any - time, but doing so does not immediately affect the S_DAX state. - - 3. If the persistent FS_XFLAG_DAX flag is set on a directory, this flag will - be inherited by all regular files and subdirectories that are subsequently - created in this directory. Files and subdirectories that exist at the time - this flag is set or cleared on the parent directory are not modified by - this modification of the parent directory. - - 4. There exist dax mount options which can override FS_XFLAG_DAX in the - setting of the S_DAX flag. Given underlying storage which supports DAX the - following hold: - - "-o dax=inode" means "follow FS_XFLAG_DAX" and is the default. - - "-o dax=never" means "never set S_DAX, ignore FS_XFLAG_DAX." - - "-o dax=always" means "always set S_DAX ignore FS_XFLAG_DAX." - - "-o dax" is a legacy option which is an alias for "dax=always". - This may be removed in the future so "-o dax=always" is - the preferred method for specifying this behavior. - - NOTE: Modifications to and the inheritance behavior of FS_XFLAG_DAX remain - the same even when the filesystem is mounted with a dax option. However, - in-core inode state (S_DAX) will be overridden until the filesystem is - remounted with dax=inode and the inode is evicted from kernel memory. - - 5. The S_DAX policy can be changed via: - - a) Setting the parent directory FS_XFLAG_DAX as needed before files are - created - - b) Setting the appropriate dax="foo" mount option - - c) Changing the FS_XFLAG_DAX flag on existing regular files and - directories. 
This has runtime constraints and limitations that are - described in 6) below. - - 6. When changing the S_DAX policy via toggling the persistent FS_XFLAG_DAX - flag, the change to existing regular files won't take effect until the - files are closed by all processes. - - -Details -------- - -There are 2 per-file dax flags. One is a persistent inode setting (FS_XFLAG_DAX) -and the other is a volatile flag indicating the active state of the feature -(S_DAX). - -FS_XFLAG_DAX is preserved within the filesystem. This persistent config -setting can be set, cleared and/or queried using the FS_IOC_FS[GS]ETXATTR ioctl -(see ioctl_xfs_fsgetxattr(2)) or an utility such as 'xfs_io'. - -New files and directories automatically inherit FS_XFLAG_DAX from -their parent directory _when_ _created_. Therefore, setting FS_XFLAG_DAX at -directory creation time can be used to set a default behavior for an entire -sub-tree. - -To clarify inheritance, here are 3 examples: - -Example A: - -mkdir -p a/b/c -xfs_io -c 'chattr +x' a -mkdir a/b/c/d -mkdir a/e - - dax: a,e - no dax: b,c,d - -Example B: - -mkdir a -xfs_io -c 'chattr +x' a -mkdir -p a/b/c/d - - dax: a,b,c,d - no dax: - -Example C: - -mkdir -p a/b/c -xfs_io -c 'chattr +x' c -mkdir a/b/c/d - - dax: c,d - no dax: a,b - - -The current enabled state (S_DAX) is set when a file inode is instantiated in -memory by the kernel. It is set based on the underlying media support, the -value of FS_XFLAG_DAX and the filesystem's dax mount option. - -statx can be used to query S_DAX. NOTE that only regular files will ever have -S_DAX set and therefore statx will never indicate that S_DAX is set on -directories. - -Setting the FS_XFLAG_DAX flag (specifically or through inheritance) occurs even -if the underlying media does not support dax and/or the filesystem is -overridden with a mount option. 
- - - -Implementation Tips for Block Driver Writers --------------------------------------------- - -To support DAX in your block driver, implement the 'direct_access' -block device operation. It is used to translate the sector number -(expressed in units of 512-byte sectors) to a page frame number (pfn) -that identifies the physical page for the memory. It also returns a -kernel virtual address that can be used to access the memory. - -The direct_access method takes a 'size' parameter that indicates the -number of bytes being requested. The function should return the number -of bytes that can be contiguously accessed at that offset. It may also -return a negative errno if an error occurs. - -In order to support this method, the storage must be byte-accessible by -the CPU at all times. If your device uses paging techniques to expose -a large amount of memory through a smaller window, then you cannot -implement direct_access. Equally, if your device can occasionally -stall the CPU for an extended period, you should also not attempt to -implement direct_access. - -These block devices may be used for inspiration: -- brd: RAM backed block device driver -- dcssblk: s390 dcss block device driver -- pmem: NVDIMM persistent memory driver - - -Implementation Tips for Filesystem Writers ------------------------------------------- - -Filesystem support consists of -- adding support to mark inodes as being DAX by setting the S_DAX flag in - i_flags -- implementing ->read_iter and ->write_iter operations which use dax_iomap_rw() - when inode has S_DAX flag set -- implementing an mmap file operation for DAX files which sets the - VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to - include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These - handlers should probably call dax_iomap_fault() passing the appropriate - fault size and iomap operations. 
-- calling iomap_zero_range() passing appropriate iomap operations instead of - block_truncate_page() for DAX files -- ensuring that there is sufficient locking between reads, writes, - truncates and page faults - -The iomap handlers for allocating blocks must make sure that allocated blocks -are zeroed out and converted to written extents before being returned to avoid -exposure of uninitialized data through mmap. - -These filesystems may be used for inspiration: -- ext2: see Documentation/filesystems/ext2.rst -- ext4: see Documentation/filesystems/ext4/ -- xfs: see Documentation/admin-guide/xfs.rst - - -Handling Media Errors ---------------------- - -The libnvdimm subsystem stores a record of known media error locations for -each pmem block device (in gendisk->badblocks). If we fault at such location, -or one with a latent error not yet discovered, the application can expect -to receive a SIGBUS. Libnvdimm also allows clearing of these errors by simply -writing the affected sectors (through the pmem driver, and if the underlying -NVDIMM supports the clear_poison DSM defined by ACPI). - -Since DAX IO normally doesn't go through the driver/bio path, applications or -sysadmins have an option to restore the lost data from a prior backup/inbuilt -redundancy in the following ways: - -1. Delete the affected file, and restore from a backup (sysadmin route): - This will free the filesystem blocks that were being used by the file, - and the next time they're allocated, they will be zeroed first, which - happens through the driver, and will clear bad sectors. - -2. Truncate or hole-punch the part of the file that has a bad-block (at least - an entire aligned sector has to be hole-punched, but not necessarily an - entire filesystem block). - -These are the two basic paths that allow DAX filesystems to continue operating -in the presence of media errors. 
More robust error recovery mechanisms can be -built on top of this in the future, for example, involving redundancy/mirroring -provided at the block layer through DM, or additionally, at the filesystem -level. These would have to rely on the above two tenets, that error clearing -can happen either by sending an IO through the driver, or zeroing (also through -the driver). - - -Shortcomings ------------- - -Even if the kernel or its modules are stored on a filesystem that supports -DAX on a block device that supports DAX, they will still be copied into RAM. - -The DAX code does not work correctly on architectures which have virtually -mapped caches such as ARM, MIPS and SPARC. - -Calling get_user_pages() on a range of user memory that has been mmaped -from a DAX file will fail when there are no 'struct page' to describe -those pages. This problem has been addressed in some device drivers -by adding optional struct page support for pages under the control of -the driver (see CONFIG_NVDIMM_PFN in drivers/nvdimm for an example of -how to do this). In the non struct page cases O_DIRECT reads/writes to -those memory ranges from a non-DAX file will fail (note that O_DIRECT -reads/writes _of a DAX file_ do work, it is the memory that is being -accessed that is key here). Other things that will not work in the -non struct page case include RDMA, sendfile() and splice(). diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index d4853cb919d2..246af51b277a 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -77,6 +77,7 @@ Documentation for filesystem implementations. coda configfs cramfs + dax debugfs dlmfs ecryptfs -- cgit v1.2.3 From a9edc03f13dbd51095b38ef0371d24e7ec7ae693 Mon Sep 17 00:00:00 2001 From: Kir Kolyshkin Date: Thu, 10 Jun 2021 20:00:44 -0700 Subject: docs: fix a cross-ref Commit acda97acb2e98c9 changes dax.txt to dax.rst. Fix the references accordingly. 
Cc: Igor Matheus Andrade Torrente Signed-off-by: Kir Kolyshkin Link: https://lore.kernel.org/r/20210611030044.1982911-4-kolyshkin@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/ext4.rst | 2 +- Documentation/filesystems/ext2.rst | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst index d2795ca6821e..4c559e08d11e 100644 --- a/Documentation/admin-guide/ext4.rst +++ b/Documentation/admin-guide/ext4.rst @@ -392,7 +392,7 @@ When mounting an ext4 filesystem, the following option are accepted: dax Use direct access (no page cache). See - Documentation/filesystems/dax.txt. Note that this option is + Documentation/filesystems/dax.rst. Note that this option is incompatible with data=journal. inlinecrypt diff --git a/Documentation/filesystems/ext2.rst b/Documentation/filesystems/ext2.rst index c2fce22cfd03..154101cf0e4f 100644 --- a/Documentation/filesystems/ext2.rst +++ b/Documentation/filesystems/ext2.rst @@ -25,7 +25,7 @@ check=none, nocheck (*) Don't do extra checking of bitmaps on mount (check=normal and check=strict options removed) dax Use direct access (no page cache). See - Documentation/filesystems/dax.txt. + Documentation/filesystems/dax.rst. debug Extra debugging information is sent to the kernel syslog. Useful for developers. -- cgit v1.2.3 From d9d2c82738b7cacefde30b701d2ddc4879f6c39a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Wed, 16 Jun 2021 08:55:12 +0200 Subject: docs: filesystems: ext4: blockgroup.rst: replace some characters MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The conversion tools used during DocBook/LaTeX/html/Markdown->ReST conversion and some cut-and-pasted text contain some characters that aren't easily reachable on standard keyboards and/or could cause troubles when parsed by the documentation build system. 
Replace the occurrences of the following characters: - U+2217 ('∗'): ASTERISK OPERATOR use ASCII asterisk instead of the ASTERISK OPERATOR Signed-off-by: Mauro Carvalho Chehab Acked-by: Theodore Ts'o Link: https://lore.kernel.org/r/c5c3c384c48779ca7c9dcd90183cefe20ac82928.1623826294.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet --- Documentation/filesystems/ext4/blockgroup.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/ext4/blockgroup.rst b/Documentation/filesystems/ext4/blockgroup.rst index 3da156633339..d5d652addce5 100644 --- a/Documentation/filesystems/ext4/blockgroup.rst +++ b/Documentation/filesystems/ext4/blockgroup.rst @@ -84,7 +84,7 @@ Without the option META\_BG, for safety concerns, all block group descriptors copies are kept in the first block group. Given the default 128MiB(2^27 bytes) block group size and 64-byte group descriptors, ext4 can have at most 2^27/64 = 2^21 block groups. This limits the entire -filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB. +filesystem size to 2^21 * 2^27 = 2^48bytes or 256TiB. The solution to this problem is to use the metablock group feature (META\_BG), which is already in ext3 for all 2.6 releases. With the -- cgit v1.2.3 From 993b892610d159dc16f6556dd0bf111ddc3ce0b9 Mon Sep 17 00:00:00 2001 From: Fox Chen Date: Thu, 27 May 2021 17:16:06 +0800 Subject: docs: path-lookup: update follow_managed() part No follow_managed() anymore, handle_mounts(), traverse_mounts(), will do the job.
see commit 9deed3ebca24 ("new helper: traverse_mounts()") Signed-off-by: Fox Chen Reviewed-by: NeilBrown Link: https://lore.kernel.org/r/20210527091618.287093-2-foxhlchen@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/path-lookup.rst | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index c482e1619e77..751082d469e8 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -448,10 +448,11 @@ described. If it finds a ``LAST_NORM`` component it first calls filesystem to revalidate the result if it is that sort of filesystem. If that doesn't get a good result, it calls "``lookup_slow()``" which takes ``i_rwsem``, rechecks the cache, and then asks the filesystem -to find a definitive answer. Each of these will call -``follow_managed()`` (as described below) to handle any mount points. +to find a definitive answer. -In the absence of symbolic links, ``walk_component()`` creates a new +As the last step of ``walk_component()``, ``step_into()`` will be called either +directly from walk_component() or from handle_dots(). It calls +``handle_mounts()``, to check and handle mount points, in which a new ``struct path`` containing a counted reference to the new dentry and a reference to the new ``vfsmount`` which is only counted if it is different from the previous ``vfsmount``. It then calls @@ -535,8 +536,7 @@ covered in greater detail in autofs.txt in the Linux documentation tree, but a few notes specifically related to path lookup are in order here. -The Linux VFS has a concept of "managed" dentries which is reflected -in function names such as "``follow_managed()``". There are three +The Linux VFS has a concept of "managed" dentries. 
There are three potentially interesting things about these dentries corresponding to three different flags that might be set in ``dentry->d_flags``: -- cgit v1.2.3 From 084c86837a3583c7cf56d74f91fb8e6191f99a8a Mon Sep 17 00:00:00 2001 From: Fox Chen Date: Thu, 27 May 2021 17:16:07 +0800 Subject: docs: path-lookup: update path_to_nameidata() part No path_to_nameidata() anymore, step_into() will be called. Related commit: commit c99687a03a78 ("fold path_to_nameidata() into its only remaining caller") Signed-off-by: Fox Chen Reviewed-by: NeilBrown Link: https://lore.kernel.org/r/20210527091618.287093-3-foxhlchen@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/path-lookup.rst | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 751082d469e8..6ea0880fb982 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -453,11 +453,12 @@ to find a definitive answer. As the last step of ``walk_component()``, ``step_into()`` will be called either directly from walk_component() or from handle_dots(). It calls ``handle_mounts()``, to check and handle mount points, in which a new -``struct path`` containing a counted reference to the new dentry and a -reference to the new ``vfsmount`` which is only counted if it is -different from the previous ``vfsmount``. It then calls -``path_to_nameidata()`` to install the new ``struct path`` in the -``struct nameidata`` and drop the unneeded references. +``struct path`` is created containing a counted reference to the new dentry and +a reference to the new ``vfsmount`` which is only counted if it is +different from the previous ``vfsmount``.
Then if there is +a symbolic link, ``step_into()`` calls ``pick_link()`` to deal with it, +otherwise it installs the new ``struct path`` in the ``struct nameidata``, and +drops the unneeded references. This "hand-over-hand" sequencing of getting a reference to the new dentry before dropping the reference to the previous dentry may -- cgit v1.2.3 From 8593d2cc8c2f09164d674b2318661ede00dd4d0e Mon Sep 17 00:00:00 2001 From: Fox Chen Date: Thu, 27 May 2021 17:16:08 +0800 Subject: docs: path-lookup: update path_mountpoint() part path_mountpoint() doesn't exist anymore. It has been folded into path_lookupat(), used when the LOOKUP_MOUNTPOINT flag is set. Check commit: commit 161aff1d93abf0e ("LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()") Signed-off-by: Fox Chen Reviewed-by: NeilBrown Link: https://lore.kernel.org/r/20210527091618.287093-4-foxhlchen@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/path-lookup.rst | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 6ea0880fb982..652d3284f178 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -472,7 +472,7 @@ Handling the final component ``nd->last_type`` to refer to the final component of the path. It does not call ``walk_component()`` that last time. Handling that final component remains for the caller to sort out. Those callers are -``path_lookupat()``, ``path_parentat()``, ``path_mountpoint()`` and +``path_lookupat()``, ``path_parentat()`` and ``path_openat()`` each of which handles the differing requirements of different system calls. @@ -488,12 +488,10 @@ perform their operation. object is wanted such as by ``stat()`` or ``chmod()``. It essentially just calls ``walk_component()`` on the final component through a call to ``lookup_last()``.
``path_lookupat()`` returns just the final dentry. - -``path_mountpoint()`` handles the special case of unmounting which must -not try to revalidate the mounted filesystem. It effectively -contains, through a call to ``mountpoint_last()``, an alternate -implementation of ``lookup_slow()`` which skips that step. This is -important when unmounting a filesystem that is inaccessible, such as +It is worth noting that when flag ``LOOKUP_MOUNTPOINT`` is set, +``path_lookupat()`` will unset LOOKUP_JUMPED in nameidata so that in the +subsequent path traversal ``d_weak_revalidate()`` won't be called. +This is important when unmounting a filesystem that is inaccessible, such as one provided by a dead NFS server. Finally ``path_openat()`` is used for the ``open()`` system call; it -- cgit v1.2.3 From 71e0a67dc6c26018e27fe0c670e2db023aa72d22 Mon Sep 17 00:00:00 2001 From: Fox Chen Date: Thu, 27 May 2021 17:16:09 +0800 Subject: docs: path-lookup: update do_last() part trailing_symlink() was merged into lookup_last() and do_last(). do_last() has later been split into open_last_lookups() and do_open(). see related commit: commit c5971b8c6354 ("take post-lookup part of do_last() out of loop") Signed-off-by: Fox Chen Reviewed-by: NeilBrown Link: https://lore.kernel.org/r/20210527091618.287093-5-foxhlchen@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/path-lookup.rst | 47 +++++++++++++++---------------- 1 file changed, 22 insertions(+), 25 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 652d3284f178..2b0b33168067 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -495,11 +495,11 @@ This is important when unmounting a filesystem that is inaccessible, such as one provided by a dead NFS server.
Finally ``path_openat()`` is used for the ``open()`` system call; it -contains, in support functions starting with "``do_last()``", all the +contains, in support functions starting with "``open_last_lookups()``", all the complexity needed to handle the different subtleties of O_CREAT (with or without O_EXCL), final "``/``" characters, and trailing symbolic links. We will revisit this in the final part of this series, which -focuses on those symbolic links. "``do_last()``" will sometimes, but +focuses on those symbolic links. "``open_last_lookups()``" will sometimes, but not always, take ``i_rwsem``, depending on what it finds. Each of these, or the functions which call them, need to be alert to @@ -1196,29 +1196,26 @@ potentially need to call ``link_path_walk()`` again and again on successive symlinks until one is found that doesn't point to another symlink. -This case is handled by the relevant caller of ``link_path_walk()``, such as -``path_lookupat()`` using a loop that calls ``link_path_walk()``, and then -handles the final component. If the final component is a symlink -that needs to be followed, then ``trailing_symlink()`` is called to set -things up properly and the loop repeats, calling ``link_path_walk()`` -again. This could loop as many as 40 times if the last component of -each symlink is another symlink. - -The various functions that examine the final component and possibly -report that it is a symlink are ``lookup_last()``, ``mountpoint_last()`` -and ``do_last()``, each of which use the same convention as -``walk_component()`` of returning ``1`` if a symlink was found that needs -to be followed. - -Of these, ``do_last()`` is the most interesting as it is used for -opening a file. Part of ``do_last()`` runs with ``i_rwsem`` held and this -part is in a separate function: ``lookup_open()``. - -Explaining ``do_last()`` completely is beyond the scope of this article, -but a few highlights should help those interested in exploring the -code. - -1. 
Rather than just finding the target file, ``do_last()`` needs to open +This case is handled by relevant callers of ``link_path_walk()``, such as +``path_lookupat()``, ``path_openat()`` using a loop that calls ``link_path_walk()``, +and then handles the final component by calling ``open_last_lookups()`` or +``lookup_last()``. If it is a symlink that needs to be followed, +``open_last_lookups()`` or ``lookup_last()`` will set things up properly and +return the path so that the loop repeats, calling +``link_path_walk()`` again. This could loop as many as 40 times if the last +component of each symlink is another symlink. + +Of the various functions that examine the final component, +``open_last_lookups()`` is the most interesting as it works in tandem +with ``do_open()`` for opening a file. Part of ``open_last_lookups()`` runs +with ``i_rwsem`` held and this part is in a separate function: ``lookup_open()``. + +Explaining ``open_last_lookups()`` and ``do_open()`` completely is beyond the scope +of this article, but a few highlights should help those interested in exploring +the code. + +1. Rather than just finding the target file, ``do_open()`` is used after + ``open_last_lookup()`` to open it. If the file was found in the dcache, then ``vfs_open()`` is used for this. 
If not, then ``lookup_open()`` will either call ``atomic_open()`` (if the filesystem provides it) to combine the final lookup with the open, or -- cgit v1.2.3 From 34ef75ef25c6fdea899acdb0a466f8ed0c365644 Mon Sep 17 00:00:00 2001 From: Fox Chen Date: Thu, 27 May 2021 17:16:10 +0800 Subject: docs: path-lookup: remove filename_mountpoint filename_mountpoint() no longer exists; see commit 161aff1d93ab ("LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()"). Without filename_mountpoint() and path_mountpoint(), the numbers should be four and three: "These four correspond roughly to the three path_*() functions" Signed-off-by: Fox Chen Reviewed-by: NeilBrown Link: https://lore.kernel.org/r/20210527091618.287093-6-foxhlchen@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/path-lookup.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 2b0b33168067..3cbaf30b0f83 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -652,9 +652,9 @@ restarts from the top with REF-walk. This pattern of "try RCU-walk, if that fails try REF-walk" can be clearly seen in functions like ``filename_lookup()``, -``filename_parentat()``, ``filename_mountpoint()``, -``do_filp_open()``, and ``do_file_open_root()``. These five -correspond roughly to the four ``path_*()`` functions we met earlier, +``filename_parentat()``, +``do_filp_open()``, and ``do_file_open_root()``. These four +correspond roughly to the three ``path_*()`` functions we met earlier, each of which calls ``link_path_walk()``. The ``path_*()`` functions are called using different mode flags until a mode is found which works. They are first called with ``LOOKUP_RCU`` set to request "RCU-walk".
If -- cgit v1.2.3 From d2d3dd5ecce11ba560ff024e63ddb1640b7b27b0 Mon Sep 17 00:00:00 2001 From: Fox Chen Date: Thu, 27 May 2021 17:16:11 +0800 Subject: docs: path-lookup: Add macro name to symlink limit description Add macro name MAXSYMLINKS to the symlink limit description, so that it is consistent with path name length description above. Signed-off-by: Fox Chen Reviewed-by: NeilBrown Link: https://lore.kernel.org/r/20210527091618.287093-7-foxhlchen@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/path-lookup.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 3cbaf30b0f83..40b9afec4d60 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -992,8 +992,8 @@ is 4096. There are a number of reasons for this limit; not letting the kernel spend too much time on just one path is one of them. With symbolic links you can effectively generate much longer paths so some sort of limit is needed for the same reason. Linux imposes a limit of -at most 40 symlinks in any one path lookup. It previously imposed a -further limit of eight on the maximum depth of recursion, but that was +at most 40 (MAXSYMLINKS) symlinks in any one path lookup. It previously imposed +a further limit of eight on the maximum depth of recursion, but that was raised to 40 when a separate stack was implemented, so there is now just the one limit. -- cgit v1.2.3 From 4a00e4bd59bbd5eac26f1792eb8d7d60f6cafe9a Mon Sep 17 00:00:00 2001 From: Fox Chen Date: Thu, 27 May 2021 17:16:12 +0800 Subject: docs: path-lookup: i_op->follow_link replaced with i_op->get_link follow_link has been replaced by get_link() which can be called in RCU mode. 
see commit: commit 6b2553918d8b ("replace ->follow_link() with new method that could stay in RCU mode") Signed-off-by: Fox Chen Reviewed-by: NeilBrown Link: https://lore.kernel.org/r/20210527091618.287093-8-foxhlchen@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/path-lookup.rst | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 40b9afec4d60..4650c6427963 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1060,13 +1060,11 @@ filesystem cannot successfully get a reference in RCU-walk mode, it must return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to REF-walk mode in which the filesystem is allowed to sleep. -The place for all this to happen is the ``i_op->follow_link()`` inode -method. In the present mainline code this is never actually called in -RCU-walk mode as the rewrite is not quite complete. It is likely that -in a future release this method will be passed an ``inode`` pointer when -called in RCU-walk mode so it both (1) knows to be careful, and (2) has the -validated pointer. Much like the ``i_op->permission()`` method we -looked at previously, ``->follow_link()`` would need to be careful that +The place for all this to happen is the ``i_op->get_link()`` inode +method. This is called both in RCU-walk and REF-walk. In RCU-walk the +``dentry*`` argument is NULL, ``->get_link()`` can return -ECHILD to drop out of +RCU-walk. Much like the ``i_op->permission()`` method we +looked at previously, ``->get_link()`` would need to be careful that all the data structures it references are safe to be accessed while holding no counted reference, only the RCU lock. 
Though getting a reference with ``->follow_link()`` is not yet done in RCU-walk mode, the -- cgit v1.2.3 From 671f73356f6a2aa2fb1bb71f8fdeeba858b6fec6 Mon Sep 17 00:00:00 2001 From: Fox Chen Date: Thu, 27 May 2021 17:16:13 +0800 Subject: docs: path-lookup: update i_op->put_link and cookie description No inode->put_link operation anymore. We use delayed_call to deal with link destruction. Cookie has been replaced with struct delayed_call. Related commit: commit fceef393a538 ("switch ->get_link() to delayed_call, kill ->put_link()") Signed-off-by: Fox Chen Reviewed-by: NeilBrown Link: https://lore.kernel.org/r/20210527091618.287093-9-foxhlchen@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/path-lookup.rst | 30 ++++++++---------------------- 1 file changed, 8 insertions(+), 22 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 4650c6427963..3855809784cf 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1066,34 +1066,20 @@ method. This is called both in RCU-walk and REF-walk. In RCU-walk the RCU-walk. Much like the ``i_op->permission()`` method we looked at previously, ``->get_link()`` would need to be careful that all the data structures it references are safe to be accessed while -holding no counted reference, only the RCU lock. Though getting a -reference with ``->follow_link()`` is not yet done in RCU-walk mode, the -code is ready to release the reference when that does happen. - -This need to drop the reference to a symlink adds significant -complexity. It requires a reference to the inode so that the -``i_op->put_link()`` inode operation can be called. In REF-walk, that -reference is kept implicitly through a reference to the dentry, so -keeping the ``struct path`` of the symlink is easiest. For RCU-walk, -the pointer to the inode is kept separately. 
To allow switching from -RCU-walk back to REF-walk in the middle of processing nested symlinks -we also need the seq number for the dentry so we can confirm that -switching back was safe. - -Finally, when providing a reference to a symlink, the filesystem also -provides an opaque "cookie" that must be passed to ``->put_link()`` so that it -knows what to free. This might be the allocated memory area, or a -pointer to the ``struct page`` in the page cache, or something else -completely. Only the filesystem knows what it is. +holding no counted reference, only the RCU lock. A callback +``struct delayed_called`` will be passed to ``->get_link()``: +file systems can set their own put_link function and argument through +``set_delayed_call()``. Later on, when VFS wants to put link, it will call +``do_delayed_call()`` to invoke that callback function with the argument. In order for the reference to each symlink to be dropped when the walk completes, whether in RCU-walk or REF-walk, the symlink stack needs to contain, along with the path remnants: -- the ``struct path`` to provide a reference to the inode in REF-walk -- the ``struct inode *`` to provide a reference to the inode in RCU-walk +- the ``struct path`` to provide a reference to the previous path +- the ``const char *`` to provide a reference to the to previous name - the ``seq`` to allow the path to be safely switched from RCU-walk to REF-walk -- the ``cookie`` that tells ``->put_path()`` what to put. +- the ``struct delayed_call`` for later invocation. This means that each entry in the symlink stack needs to hold five pointers and an integer instead of just one pointer (the path -- cgit v1.2.3 From 18edb95a88a947b10536be4dc86b4a190715f816 Mon Sep 17 00:00:00 2001 From: Fox Chen Date: Thu, 27 May 2021 17:16:14 +0800 Subject: docs: path-lookup: no get_link() There is no get_link() anymore; we have step_into() and pick_link() instead. walk_component() calls step_into(), which in turn calls pick_link() and returns the symlink name.
Signed-off-by: Fox Chen Reviewed-by: NeilBrown Link: https://lore.kernel.org/r/20210527091618.287093-10-foxhlchen@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/path-lookup.rst | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 3855809784cf..0a125673a8fe 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1103,12 +1103,10 @@ doesn't need to notice. Getting this ``name`` variable on and off the stack is very straightforward; pushing and popping the references is a little more complex. -When a symlink is found, ``walk_component()`` returns the value ``1`` -(``0`` is returned for any other sort of success, and a negative number -is, as usual, an error indicator). This causes ``get_link()`` to be -called; it then gets the link from the filesystem. Providing that -operation is successful, the old path ``name`` is placed on the stack, -and the new value is used as the ``name`` for a while. When the end of +When a symlink is found, ``walk_component()`` calls ``pick_link()`` via ``step_into()`` +which returns the link from the filesystem. +Providing that operation is successful, the old path ``name`` is placed on the +stack, and the new value is used as the ``name`` for a while. When the end of the path is found (i.e. ``*name`` is ``'\0'``) the old ``name`` is restored off the stack and path walking continues. -- cgit v1.2.3 From de9414adafe4da174212909e054222948aa620fc Mon Sep 17 00:00:00 2001 From: Fox Chen Date: Thu, 27 May 2021 17:16:15 +0800 Subject: docs: path-lookup: update WALK_GET, WALK_PUT desc WALK_GET was changed to WALK_TRAILING with a different meaning; here it should be WALK_NOFOLLOW. WALK_PUT doesn't exist; we have WALK_MORE instead (WALK_PUT == !WALK_MORE). There is also no should_follow_link() anymore.
Related commits: commit 8c4efe22e7c4 ("namei: invert the meaning of WALK_FOLLOW") commit 1c4ff1a87e46 ("namei: invert WALK_PUT logics") Signed-off-by: Fox Chen Reviewed-by: NeilBrown [jc: applied language tweaks suggested by Neil] Link: https://lore.kernel.org/r/20210527091618.287093-11-foxhlchen@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/path-lookup.rst | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 0a125673a8fe..1102252cbc7a 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1123,13 +1123,13 @@ stack in ``walk_component()`` immediately when the symlink is found; old symlink as it walks that last component. So it is quite convenient for ``walk_component()`` to release the old symlink and pop the references just before pushing the reference information for the -new symlink. It is guided in this by two flags; ``WALK_GET``, which -gives it permission to follow a symlink if it finds one, and -``WALK_PUT``, which tells it to release the current symlink after it has been -followed. ``WALK_PUT`` is tested first, leading to a call to -``put_link()``. ``WALK_GET`` is tested subsequently (by -``should_follow_link()``) leading to a call to ``pick_link()`` which sets -up the stack frame. +new symlink. It is guided in this by three flags: ``WALK_NOFOLLOW`` which +forbids it from following a symlink if it finds one, ``WALK_MORE`` +which indicates that it is yet too early to release the +current symlink, and ``WALK_TRAILING`` which indicates that it is on the final +component of the lookup, so we will check userspace flag ``LOOKUP_FOLLOW`` to +decide whether follow it when it is a symlink and call ``may_follow_link()`` to +check if we have privilege to follow it. 
Symlinks with no final component ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -- cgit v1.2.3 From 3c1be84b8d82959a6b7fedb598b8781fa1d09421 Mon Sep 17 00:00:00 2001 From: Fox Chen Date: Thu, 27 May 2021 17:16:16 +0800 Subject: docs: path-lookup: update get_link() ->follow_link description get_link() is merged into pick_link(). i_op->follow_link is replaced with i_op->get_link(). get_link() can return ERR_PTR(0) which equals NULL. Signed-off-by: Fox Chen Reviewed-by: NeilBrown Link: https://lore.kernel.org/r/20210527091618.287093-12-foxhlchen@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/path-lookup.rst | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 1102252cbc7a..c150f076abbf 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1136,10 +1136,10 @@ Symlinks with no final component A pair of special-case symlinks deserve a little further explanation. Both result in a new ``struct path`` (with mount and dentry) being set -up in the ``nameidata``, and result in ``get_link()`` returning ``NULL``. +up in the ``nameidata``, and result in ``pick_link()`` returning ``NULL``. The more obvious case is a symlink to "``/``". All symlinks starting -with "``/``" are detected in ``get_link()`` which resets the ``nameidata`` +with "``/``" are detected in ``pick_link()`` which resets the ``nameidata`` to point to the effective filesystem root. If the symlink only contains "``/``" then there is nothing more to do, no components at all, so ``NULL`` is returned to indicate that the symlink can be released and @@ -1156,12 +1156,11 @@ something that looks like a symlink. It is really a reference to the target file, not just the name of it. 
When you ``readlink`` these objects you get a name that might refer to the same file - unless it has been unlinked or mounted over. When ``walk_component()`` follows -one of these, the ``->follow_link()`` method in "procfs" doesn't return +one of these, the ``->get_link()`` method in "procfs" doesn't return a string name, but instead calls ``nd_jump_link()`` which updates the -``nameidata`` in place to point to that target. ``->follow_link()`` then -returns ``NULL``. Again there is no final component and ``get_link()`` -reports this by leaving the ``last_type`` field of ``nameidata`` as -``LAST_BIND``. +``nameidata`` in place to point to that target. ``->get_link()`` then +returns ``NULL``. Again there is no final component and ``pick_link()`` +returns ``NULL``. Following the symlink in the final component -------------------------------------------- -- cgit v1.2.3 From ef4aa53f36a932e656a3b91cdc8a9a9dcb9cef81 Mon Sep 17 00:00:00 2001 From: Fox Chen Date: Thu, 27 May 2021 17:16:17 +0800 Subject: docs: path-lookup: update symlink description Instead of lookup_real()/vfs_create(), i_op->lookup() and i_op->create() are now called directly. Update the description of the vfs_open() logic. should_follow_link() was merged into lookup_last() and open_last_lookups(), which return the symlink name instead of an integer. Signed-off-by: Fox Chen Reviewed-by: NeilBrown Link: https://lore.kernel.org/r/20210527091618.287093-13-foxhlchen@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/path-lookup.rst | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index c150f076abbf..b746e974393a 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1200,16 +1200,15 @@ the code. it. If the file was found in the dcache, then ``vfs_open()`` is used for this.
If not, then ``lookup_open()`` will either call ``atomic_open()`` (if the filesystem provides it) to combine the final lookup with the open, or - will perform the separate ``lookup_real()`` and ``vfs_create()`` steps + will perform the separate ``i_op->lookup()`` and ``i_op->create()`` steps directly. In the later case the actual "open" of this newly found or created file will be performed by ``vfs_open()``, just as if the name were found in the dcache. 2. ``vfs_open()`` can fail with ``-EOPENSTALE`` if the cached information - wasn't quite current enough. Rather than restarting the lookup from - the top with ``LOOKUP_REVAL`` set, ``lookup_open()`` is called instead, - giving the filesystem a chance to resolve small inconsistencies. - If that doesn't work, only then is the lookup restarted from the top. + wasn't quite current enough. If it's in RCU-walk ``-ECHILD`` will be returned + otherwise ``-ESTALE`` is returned. When ``-ESTALE`` is returned, the caller may + retry with ``LOOKUP_REVAL`` flag set. 3. An open with O_CREAT **does** follow a symlink in the final component, unlike other creation system calls (like ``mkdir``). So the sequence:: @@ -1219,8 +1218,8 @@ the code. will create a file called ``/tmp/bar``. This is not permitted if ``O_EXCL`` is set but otherwise is handled for an O_CREAT open much - like for a non-creating open: ``should_follow_link()`` returns ``1``, and - so does ``do_last()`` so that ``trailing_symlink()`` gets called and the + like for a non-creating open: ``lookup_last()`` or ``open_last_lookup()`` + returns a non ``NULL`` value, and ``link_path_walk()`` gets called and the open process continues on the symlink that was found. 
Updating the access time -- cgit v1.2.3 From 8943474a416c0d2eac2366c22be1458ad0ceb812 Mon Sep 17 00:00:00 2001 From: Fox Chen Date: Thu, 27 May 2021 17:16:18 +0800 Subject: docs: path-lookup: use bare function() rather than literals As suggested by Matthew Wilcox and Jonathan Corbet, drop ``...`` literals around function names of this patchset. Signed-off-by: Fox Chen Reviewed-by: NeilBrown Link: https://lore.kernel.org/r/20210527091618.287093-14-foxhlchen@gmail.com Signed-off-by: Jonathan Corbet --- Documentation/filesystems/path-lookup.rst | 70 +++++++++++++++---------------- 1 file changed, 35 insertions(+), 35 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index b746e974393a..a6fa7619b69e 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -450,13 +450,13 @@ If that doesn't get a good result, it calls "``lookup_slow()``" which takes ``i_rwsem``, rechecks the cache, and then asks the filesystem to find a definitive answer. -As the last step of ``walk_component()``, ``step_into()`` will be called either +As the last step of walk_component(), step_into() will be called either directly from walk_component() or from handle_dots(). It calls -``handle_mounts()``, to check and handle mount points, in which a new +handle_mounts(), to check and handle mount points, in which a new ``struct path`` is created containing a counted reference to the new dentry and a reference to the new ``vfsmount`` which is only counted if it is different from the previous ``vfsmount``. Then if there is -a symbolic link, ``step_into()`` calls ``pick_link()`` to deal with it, +a symbolic link, step_into() calls pick_link() to deal with it, otherwise it installs the new ``struct path`` in the ``struct nameidata``, and drops the unneeded references. 
@@ -472,8 +472,8 @@ Handling the final component ``nd->last_type`` to refer to the final component of the path. It does not call ``walk_component()`` that last time. Handling that final component remains for the caller to sort out. Those callers are -``path_lookupat()``, ``path_parentat()`` and -``path_openat()`` each of which handles the differing requirements of +path_lookupat(), path_parentat() and +path_openat() each of which handles the differing requirements of different system calls. ``path_parentat()`` is clearly the simplest - it just wraps a little bit @@ -489,17 +489,17 @@ object is wanted such as by ``stat()`` or ``chmod()``. It essentially just calls ``walk_component()`` on the final component through a call to ``lookup_last()``. ``path_lookupat()`` returns just the final dentry. It is worth noting that when flag ``LOOKUP_MOUNTPOINT`` is set, -``path_lookupat()`` will unset LOOKUP_JUMPED in nameidata so that in the -subsequent path traversal ``d_weak_revalidate()`` won't be called. +path_lookupat() will unset LOOKUP_JUMPED in nameidata so that in the +subsequent path traversal d_weak_revalidate() won't be called. This is important when unmounting a filesystem that is inaccessible, such as one provided by a dead NFS server. Finally ``path_openat()`` is used for the ``open()`` system call; it -contains, in support functions starting with "``open_last_lookups()``", all the +contains, in support functions starting with "open_last_lookups()", all the complexity needed to handle the different subtleties of O_CREAT (with or without O_EXCL), final "``/``" characters, and trailing symbolic links. We will revisit this in the final part of this series, which -focuses on those symbolic links. "``open_last_lookups()``" will sometimes, but +focuses on those symbolic links. "open_last_lookups()" will sometimes, but not always, take ``i_rwsem``, depending on what it finds. 
Each of these, or the functions which call them, need to be alert to @@ -651,9 +651,9 @@ RCU-walk finds it cannot stop gracefully, it simply gives up and restarts from the top with REF-walk. This pattern of "try RCU-walk, if that fails try REF-walk" can be -clearly seen in functions like ``filename_lookup()``, -``filename_parentat()``, -``do_filp_open()``, and ``do_file_open_root()``. These four +clearly seen in functions like filename_lookup(), +filename_parentat(), +do_filp_open(), and do_file_open_root(). These four correspond roughly to the three ``path_*()`` functions we met earlier, each of which calls ``link_path_walk()``. The ``path_*()`` functions are called using different mode flags until a mode is found which works. @@ -1069,8 +1069,8 @@ all the data structures it references are safe to be accessed while holding no counted reference, only the RCU lock. A callback ``struct delayed_called`` will be passed to ``->get_link()``: file systems can set their own put_link function and argument through -``set_delayed_call()``. Later on, when VFS wants to put link, it will call -``do_delayed_call()`` to invoke that callback function with the argument. +set_delayed_call(). Later on, when VFS wants to put link, it will call +do_delayed_call() to invoke that callback function with the argument. In order for the reference to each symlink to be dropped when the walk completes, whether in RCU-walk or REF-walk, the symlink stack needs to contain, @@ -1103,7 +1103,7 @@ doesn't need to notice. Getting this ``name`` variable on and off the stack is very straightforward; pushing and popping the references is a little more complex. -When a symlink is found, ``walk_component()`` calls ``pick_link()`` via ``step_into()`` +When a symlink is found, walk_component() calls pick_link() via step_into() which returns the link from the filesystem. Providing that operation is successful, the old path ``name`` is placed on the stack, and the new value is used as the ``name`` for a while. 
When the end of @@ -1136,10 +1136,10 @@ Symlinks with no final component A pair of special-case symlinks deserve a little further explanation. Both result in a new ``struct path`` (with mount and dentry) being set -up in the ``nameidata``, and result in ``pick_link()`` returning ``NULL``. +up in the ``nameidata``, and result in pick_link() returning ``NULL``. The more obvious case is a symlink to "``/``". All symlinks starting -with "``/``" are detected in ``pick_link()`` which resets the ``nameidata`` +with "``/``" are detected in pick_link() which resets the ``nameidata`` to point to the effective filesystem root. If the symlink only contains "``/``" then there is nothing more to do, no components at all, so ``NULL`` is returned to indicate that the symlink can be released and @@ -1157,9 +1157,9 @@ target file, not just the name of it. When you ``readlink`` these objects you get a name that might refer to the same file - unless it has been unlinked or mounted over. When ``walk_component()`` follows one of these, the ``->get_link()`` method in "procfs" doesn't return -a string name, but instead calls ``nd_jump_link()`` which updates the +a string name, but instead calls nd_jump_link() which updates the ``nameidata`` in place to point to that target. ``->get_link()`` then -returns ``NULL``. Again there is no final component and ``pick_link()`` +returns ``NULL``. Again there is no final component and pick_link() returns ``NULL``. Following the symlink in the final component @@ -1177,35 +1177,35 @@ potentially need to call ``link_path_walk()`` again and again on successive symlinks until one is found that doesn't point to another symlink. -This case is handled by relevant callers of ``link_path_walk()``, such as -``path_lookupat()``, ``path_openat()`` using a loop that calls ``link_path_walk()``, -and then handles the final component by calling ``open_last_lookups()`` or -``lookup_last()``. 
If it is a symlink that needs to be followed, -``open_last_lookups()`` or ``lookup_last()`` will set things up properly and +This case is handled by relevant callers of link_path_walk(), such as +path_lookupat(), path_openat() using a loop that calls link_path_walk(), +and then handles the final component by calling open_last_lookups() or +lookup_last(). If it is a symlink that needs to be followed, +open_last_lookups() or lookup_last() will set things up properly and return the path so that the loop repeats, calling -``link_path_walk()`` again. This could loop as many as 40 times if the last +link_path_walk() again. This could loop as many as 40 times if the last component of each symlink is another symlink. Of the various functions that examine the final component, -``open_last_lookups()`` is the most interesting as it works in tandem -with ``do_open()`` for opening a file. Part of ``open_last_lookups()`` runs -with ``i_rwsem`` held and this part is in a separate function: ``lookup_open()``. +open_last_lookups() is the most interesting as it works in tandem +with do_open() for opening a file. Part of open_last_lookups() runs +with ``i_rwsem`` held and this part is in a separate function: lookup_open(). -Explaining ``open_last_lookups()`` and ``do_open()`` completely is beyond the scope +Explaining open_last_lookups() and do_open() completely is beyond the scope of this article, but a few highlights should help those interested in exploring the code. -1. Rather than just finding the target file, ``do_open()`` is used after - ``open_last_lookup()`` to open +1. Rather than just finding the target file, do_open() is used after + open_last_lookup() to open it. If the file was found in the dcache, then ``vfs_open()`` is used for this. If not, then ``lookup_open()`` will either call ``atomic_open()`` (if the filesystem provides it) to combine the final lookup with the open, or will perform the separate ``i_op->lookup()`` and ``i_op->create()`` steps directly. 
In the later case the actual "open" of this newly found or - created file will be performed by ``vfs_open()``, just as if the name + created file will be performed by vfs_open(), just as if the name were found in the dcache. -2. ``vfs_open()`` can fail with ``-EOPENSTALE`` if the cached information +2. vfs_open() can fail with ``-EOPENSTALE`` if the cached information wasn't quite current enough. If it's in RCU-walk ``-ECHILD`` will be returned otherwise ``-ESTALE`` is returned. When ``-ESTALE`` is returned, the caller may retry with ``LOOKUP_REVAL`` flag set. @@ -1218,8 +1218,8 @@ the code. will create a file called ``/tmp/bar``. This is not permitted if ``O_EXCL`` is set but otherwise is handled for an O_CREAT open much - like for a non-creating open: ``lookup_last()`` or ``open_last_lookup()`` - returns a non ``NULL`` value, and ``link_path_walk()`` gets called and the + like for a non-creating open: lookup_last() or open_last_lookup() + returns a non ``NULL`` value, and link_path_walk() gets called and the open process continues on the symlink that was found. Updating the access time -- cgit v1.2.3
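
The O_CREAT behaviour described in the updated text (an open with O_CREAT follows a symlink in the final component, but not when ``O_EXCL`` is set) can be observed from userspace. The following is a minimal sketch and not kernel code: the function name ``check_creat_follows_symlink()`` and the scratch directory under ``/tmp`` are assumptions made for illustration, and it assumes a Linux system with a writable ``/tmp``.

```c
/* Userspace sketch: O_CREAT follows a trailing symlink unless O_EXCL is set. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of any kernel or libc API).
 * Returns 0 if the observed behaviour matches the description, 1 otherwise.
 */
int check_creat_follows_symlink(void)
{
    char dir[] = "/tmp/symlink-demo-XXXXXX";  /* assumed scratch location */
    char link_path[256], target_path[256];
    struct stat st;
    int fd, ok = 1;

    if (mkdtemp(dir) == NULL)
        return 1;
    snprintf(link_path, sizeof(link_path), "%s/foo", dir);
    snprintf(target_path, sizeof(target_path), "%s/bar", dir);

    /* foo -> bar, where bar does not exist yet (a dangling symlink). */
    if (symlink("bar", link_path) != 0)
        ok = 0;

    /* With O_EXCL the symlink is not followed: expect failure with EEXIST. */
    fd = open(link_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd >= 0) {
        close(fd);
        ok = 0;
    } else if (errno != EEXIST) {
        ok = 0;
    }

    /* The target must still be absent after the failed exclusive open. */
    if (lstat(target_path, &st) == 0)
        ok = 0;

    /* Without O_EXCL the symlink is followed and bar is created. */
    fd = open(link_path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        ok = 0;
    else
        close(fd);

    /* bar is now a regular file, and foo is still a symlink. */
    if (lstat(target_path, &st) != 0 || !S_ISREG(st.st_mode))
        ok = 0;
    if (lstat(link_path, &st) != 0 || !S_ISLNK(st.st_mode))
        ok = 0;

    /* Clean up the scratch files and directory. */
    unlink(target_path);
    unlink(link_path);
    rmdir(dir);
    return ok ? 0 : 1;
}
```

A caller such as a small ``main()`` can invoke ``check_creat_follows_symlink()`` and treat a return value of 0 as confirmation. The private directory from ``mkdtemp()`` keeps the check independent of whatever already exists in ``/tmp``.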