author | Linus Torvalds <torvalds@linux-foundation.org> | 2018-08-13 22:34:47 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2018-08-13 22:34:47 -0700 |
commit | 10f3e23f07cb0c20f9bcb77a5b5a7eb2a1b2a2fe (patch) | |
tree | 1fcb34309b3542512c6f3345f092f7adb8c3312c /Documentation/filesystems/ext4/ondisk/allocators.rst | |
parent | 3bb37da509e576c80180fa0e4d1cfcaddf0cb82e (diff) | |
parent | 863c37fcb14f8b66ea831b45fb35a53ac4a8d69e (diff) | |
download | linux-10f3e23f07cb0c20f9bcb77a5b5a7eb2a1b2a2fe.tar.bz2 |
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:

 - Convert content from the ext4 wiki to Documentation rst files so it
   is more likely to be updated as we add new features to ext4.

 - Add 64-bit timestamp support to ext4's superblock fields.

 - ... and the usual bug fixes and cleanups, including a Spectre gadget
   fixup and some hardening against maliciously corrupted file systems.

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (34 commits)
ext4: remove unneeded variable "err" in ext4_mb_release_inode_pa()
ext4: improve code readability in ext4_iget()
ext4: fix spectre gadget in ext4_mb_regular_allocator()
ext4: check for NUL characters in extended attribute's name
ext4: use ext4_warning() for sb_getblk failure
ext4: fix race when setting the bitmap corrupted flag
ext4: reset error code in ext4_find_entry in fallback
ext4: handle layout changes to pinned DAX mappings
dax: dax_layout_busy_page() warn on !exceptional
docs: fix up the obviously obsolete bits in the new ext4 documentation
docs: add new ext4 superblock time extension fields
docs: create filesystem internal section
ext4: use swap macro in mext_page_double_lock
ext4: check allocation failure when duplicating "data" in ext4_remount()
ext4: fix warning message in ext4_enable_quotas()
ext4: super: extend timestamps to 40 bits
jbd2: replace current_kernel_time64 with ktime equivalent
ext4: use timespec64 for all inode times
ext4: use ktime_get_real_seconds for i_dtime
ext4: use 64-bit timestamps for mmp_time
...
Diffstat (limited to 'Documentation/filesystems/ext4/ondisk/allocators.rst')
-rw-r--r-- | Documentation/filesystems/ext4/ondisk/allocators.rst | 56 |
1 files changed, 56 insertions, 0 deletions
diff --git a/Documentation/filesystems/ext4/ondisk/allocators.rst b/Documentation/filesystems/ext4/ondisk/allocators.rst
new file mode 100644
index 000000000000..7aa85152ace3
--- /dev/null
+++ b/Documentation/filesystems/ext4/ondisk/allocators.rst
@@ -0,0 +1,56 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Block and Inode Allocation Policy
+---------------------------------
+
+ext4 recognizes (better than ext3, anyway) that data locality is
+generally a desirably quality of a filesystem. On a spinning disk,
+keeping related blocks near each other reduces the amount of movement
+that the head actuator and disk must perform to access a data block,
+thus speeding up disk IO. On an SSD there of course are no moving parts,
+but locality can increase the size of each transfer request while
+reducing the total number of requests. This locality may also have the
+effect of concentrating writes on a single erase block, which can speed
+up file rewrites significantly. Therefore, it is useful to reduce
+fragmentation whenever possible.
+
+The first tool that ext4 uses to combat fragmentation is the multi-block
+allocator. When a file is first created, the block allocator
+speculatively allocates 8KiB of disk space to the file on the assumption
+that the space will get written soon. When the file is closed, the
+unused speculative allocations are of course freed, but if the
+speculation is correct (typically the case for full writes of small
+files) then the file data gets written out in a single multi-block
+extent. A second related trick that ext4 uses is delayed allocation.
+Under this scheme, when a file needs more blocks to absorb file writes,
+the filesystem defers deciding the exact placement on the disk until all
+the dirty buffers are being written out to disk. By not committing to a
+particular placement until it's absolutely necessary (the commit timeout
+is hit, or sync() is called, or the kernel runs out of memory), the hope
+is that the filesystem can make better location decisions.
+
+The third trick that ext4 (and ext3) uses is that it tries to keep a
+file's data blocks in the same block group as its inode. This cuts down
+on the seek penalty when the filesystem first has to read a file's inode
+to learn where the file's data blocks live and then seek over to the
+file's data blocks to begin I/O operations.
+
+The fourth trick is that all the inodes in a directory are placed in the
+same block group as the directory, when feasible. The working assumption
+here is that all the files in a directory might be related, therefore it
+is useful to try to keep them all together.
+
+The fifth trick is that the disk volume is cut up into 128MB block
+groups; these mini-containers are used as outlined above to try to
+maintain data locality. However, there is a deliberate quirk -- when a
+directory is created in the root directory, the inode allocator scans
+the block groups and puts that directory into the least heavily loaded
+block group that it can find. This encourages directories to spread out
+over a disk; as the top-level directory/file blobs fill up one block
+group, the allocators simply move on to the next block group. Allegedly
+this scheme evens out the loading on the block groups, though the author
+suspects that the directories which are so unlucky as to land towards
+the end of a spinning drive get a raw deal performance-wise.
+
+Of course if all of these mechanisms fail, one can always use e4defrag
+to defragment files.
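
The block-group arithmetic and the top-level directory placement described in the new allocators.rst can be sketched roughly as follows. This is an illustration only, not the kernel's allocator code. It assumes the common geometry in which one bitmap block tracks one bit per block (so a group spans 8 * block_size blocks, which gives the 128MB figure for 4KiB blocks), and it models "least heavily loaded" as "most free inodes, then most free blocks", which is a deliberate simplification. All function names are hypothetical.

```python
from typing import Sequence

# Illustrative sketch only; names and heuristics are assumptions, not ext4 code.


def blocks_per_group(block_size: int) -> int:
    """Blocks covered by one bitmap block: one bit per block, 8 bits per byte."""
    return 8 * block_size


def group_size_bytes(block_size: int) -> int:
    """Size in bytes of one block group under the assumed geometry."""
    return blocks_per_group(block_size) * block_size


def group_of_block(block_nr: int, block_size: int, first_data_block: int = 0) -> int:
    """Index of the block group that contains block_nr."""
    return (block_nr - first_data_block) // blocks_per_group(block_size)


def pick_least_loaded_group(free_inodes: Sequence[int], free_blocks: Sequence[int]) -> int:
    """Simplified stand-in for top-level directory placement.

    Picks the group with the most free inodes, breaking ties by free blocks
    and then by the lowest group index.
    """
    return max(
        range(len(free_inodes)),
        key=lambda g: (free_inodes[g], free_blocks[g], -g),
    )


if __name__ == "__main__":
    bs = 4096  # assumed 4KiB block size
    print(blocks_per_group(bs), "blocks per group")
    print(group_size_bytes(bs) // (1024 * 1024), "MiB per group")
    print("block 40000 is in group", group_of_block(40000, bs))
    print("new top-level dir goes to group",
          pick_least_loaded_group([10, 50, 50], [5, 7, 9]))
```

With a 1KiB block size the same arithmetic gives 8192 blocks and 8 MiB per group, so the 128MB figure in the text holds only for 4KiB blocks.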