From 1ad1335dc58646764eda7bb054b350934a1b23ec Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 18 Apr 2018 11:07:49 +0300 Subject: docs/admin-guide/mm: start moving here files from Documentation/vm Several documents in Documentation/vm fit quite well into the "admin/user guide" category. The documents that don't overload the reader with lots of implementation details and provide coherent description of certain feature can be moved to Documentation/admin-guide/mm. Signed-off-by: Mike Rapoport Signed-off-by: Jonathan Corbet --- Documentation/ABI/stable/sysfs-devices-node | 2 +- .../ABI/testing/sysfs-kernel-mm-hugepages | 2 +- Documentation/admin-guide/mm/hugetlbpage.rst | 381 +++++++++++++++++++++ .../admin-guide/mm/idle_page_tracking.rst | 115 +++++++ Documentation/admin-guide/mm/index.rst | 9 + Documentation/admin-guide/mm/pagemap.rst | 197 +++++++++++ Documentation/admin-guide/mm/soft-dirty.rst | 47 +++ Documentation/admin-guide/mm/userfaultfd.rst | 241 +++++++++++++ Documentation/filesystems/proc.txt | 6 +- Documentation/sysctl/vm.txt | 4 +- Documentation/vm/00-INDEX | 10 - Documentation/vm/hugetlbpage.rst | 381 --------------------- Documentation/vm/hwpoison.rst | 2 +- Documentation/vm/idle_page_tracking.rst | 115 ------- Documentation/vm/index.rst | 5 - Documentation/vm/pagemap.rst | 197 ----------- Documentation/vm/soft-dirty.rst | 47 --- Documentation/vm/userfaultfd.rst | 241 ------------- 18 files changed, 999 insertions(+), 1003 deletions(-) create mode 100644 Documentation/admin-guide/mm/hugetlbpage.rst create mode 100644 Documentation/admin-guide/mm/idle_page_tracking.rst create mode 100644 Documentation/admin-guide/mm/pagemap.rst create mode 100644 Documentation/admin-guide/mm/soft-dirty.rst create mode 100644 Documentation/admin-guide/mm/userfaultfd.rst delete mode 100644 Documentation/vm/hugetlbpage.rst delete mode 100644 Documentation/vm/idle_page_tracking.rst delete mode 100644 Documentation/vm/pagemap.rst delete mode 100644 Documentation/vm/soft-dirty.rst delete mode 100644 Documentation/vm/userfaultfd.rst (limited to 'Documentation') diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node index b38f4b734567..3e90e1f3bf0a 100644 --- a/Documentation/ABI/stable/sysfs-devices-node +++ b/Documentation/ABI/stable/sysfs-devices-node @@ -90,4 +90,4 @@ Date: December 2009 Contact: Lee Schermerhorn Description: The node's huge page size control/query attributes. - See Documentation/vm/hugetlbpage.rst \ No newline at end of file + See Documentation/admin-guide/mm/hugetlbpage.rst \ No newline at end of file diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-hugepages b/Documentation/ABI/testing/sysfs-kernel-mm-hugepages index 5140b233356c..fdaa2162fae1 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-hugepages +++ b/Documentation/ABI/testing/sysfs-kernel-mm-hugepages @@ -12,4 +12,4 @@ Description: free_hugepages surplus_hugepages resv_hugepages - See Documentation/vm/hugetlbpage.rst for details. + See Documentation/admin-guide/mm/hugetlbpage.rst for details. diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst new file mode 100644 index 000000000000..2b374d10284d --- /dev/null +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -0,0 +1,381 @@ +.. _hugetlbpage: + +============= +HugeTLB Pages +============= + +Overview +======== + +The intent of this file is to give a brief summary of hugetlbpage support in +the Linux kernel. This support is built on top of multiple page size support +that is provided by most modern architectures. For example, x86 CPUs normally +support 4K and 2M (1G if architecturally supported) page sizes, ia64 +architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M, +256M and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical +translations. Typically this is a very scarce resource on processor. +Operating systems try to make best use of limited number of TLB resources. +This optimization is more critical now as bigger and bigger physical memories +(several GBs) are more readily available. + +Users can use the huge page support in Linux kernel by either using the mmap +system call or standard SYSV shared memory system calls (shmget, shmat). + +First the Linux kernel needs to be built with the CONFIG_HUGETLBFS +(present under "File systems") and CONFIG_HUGETLB_PAGE (selected +automatically when CONFIG_HUGETLBFS is selected) configuration +options. + +The ``/proc/meminfo`` file provides information about the total number of +persistent hugetlb pages in the kernel's huge page pool. It also displays +default huge page size and information about the number of free, reserved +and surplus huge pages in the pool of huge pages of default size. +The huge page size is needed for generating the proper alignment and +size of the arguments to system calls that map huge page regions. + +The output of ``cat /proc/meminfo`` will include lines like:: + + HugePages_Total: uuu + HugePages_Free: vvv + HugePages_Rsvd: www + HugePages_Surp: xxx + Hugepagesize: yyy kB + Hugetlb: zzz kB + +where: + +HugePages_Total + is the size of the pool of huge pages. +HugePages_Free + is the number of huge pages in the pool that are not yet + allocated. +HugePages_Rsvd + is short for "reserved," and is the number of huge pages for + which a commitment to allocate from the pool has been made, + but no allocation has yet been made. Reserved huge pages + guarantee that an application will be able to allocate a + huge page from the pool of huge pages at fault time. +HugePages_Surp + is short for "surplus," and is the number of huge pages in + the pool above the value in ``/proc/sys/vm/nr_hugepages``. The + maximum number of surplus huge pages is controlled by + ``/proc/sys/vm/nr_overcommit_hugepages``. +Hugepagesize + is the default hugepage size (in Kb). +Hugetlb + is the total amount of memory (in kB), consumed by huge + pages of all sizes. + If huge pages of different sizes are in use, this number + will exceed HugePages_Total \* Hugepagesize. To get more + detailed information, please, refer to + ``/sys/kernel/mm/hugepages`` (described below). + + +``/proc/filesystems`` should also show a filesystem of type "hugetlbfs" +configured in the kernel. + +``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge +pages in the kernel's huge page pool. "Persistent" huge pages will be +returned to the huge page pool when freed by a task. A user with root +privileges can dynamically allocate more or free some persistent huge pages +by increasing or decreasing the value of ``nr_hugepages``. + +Pages that are used as huge pages are reserved inside the kernel and cannot +be used for other purposes. Huge pages cannot be swapped out under +memory pressure. + +Once a number of huge pages have been pre-allocated to the kernel huge page +pool, a user with appropriate privilege can use either the mmap system call +or shared memory system calls to use the huge pages. See the discussion of +:ref:`Using Huge Pages `, below. + +The administrator can allocate persistent huge pages on the kernel boot +command line by specifying the "hugepages=N" parameter, where 'N' = the +number of huge pages requested. This is the most reliable method of +allocating huge pages as memory has not yet become fragmented. + +Some platforms support multiple huge page sizes. To allocate huge pages +of a specific size, one must precede the huge pages boot command parameters +with a huge page size selection parameter "hugepagesz=". must +be specified in bytes with optional scale suffix [kKmMgG]. The default huge +page size may be selected with the "default_hugepagesz=" boot parameter. + +When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages`` +indicates the current number of pre-allocated huge pages of the default size. +Thus, one can use the following command to dynamically allocate/deallocate +default sized persistent huge pages:: + + echo 20 > /proc/sys/vm/nr_hugepages + +This command will try to adjust the number of default sized huge pages in the +huge page pool to 20, allocating or freeing huge pages, as required. + +On a NUMA platform, the kernel will attempt to distribute the huge page pool +over all the set of allowed nodes specified by the NUMA memory policy of the +task that modifies ``nr_hugepages``. The default for the allowed nodes--when the +task has default memory policy--is all on-line nodes with memory. Allowed +nodes with insufficient available, contiguous memory for a huge page will be +silently skipped when allocating persistent huge pages. See the +:ref:`discussion below ` +of the interaction of task memory policy, cpusets and per node attributes +with the allocation and freeing of persistent huge pages. + +The success or failure of huge page allocation depends on the amount of +physically contiguous memory that is present in system at the time of the +allocation attempt. If the kernel is unable to allocate huge pages from +some nodes in a NUMA system, it will attempt to make up the difference by +allocating extra pages on other nodes with sufficient available contiguous +memory, if any. + +System administrators may want to put this command in one of the local rc +init files. This will enable the kernel to allocate huge pages early in +the boot process when the possibility of getting physical contiguous pages +is still very high. Administrators can verify the number of huge pages +actually allocated by checking the sysctl or meminfo. To check the per node +distribution of huge pages in a NUMA system, use:: + + cat /sys/devices/system/node/node*/meminfo | fgrep Huge + +``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of +huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are +requested by applications. Writing any non-zero value into this file +indicates that the hugetlb subsystem is allowed to try to obtain that +number of "surplus" huge pages from the kernel's normal page pool, when the +persistent huge page pool is exhausted. As these surplus huge pages become +unused, they are freed back to the kernel's normal page pool. + +When increasing the huge page pool size via ``nr_hugepages``, any existing +surplus pages will first be promoted to persistent huge pages. Then, additional +huge pages will be allocated, if necessary and if possible, to fulfill +the new persistent huge page pool size. + +The administrator may shrink the pool of persistent huge pages for +the default huge page size by setting the ``nr_hugepages`` sysctl to a +smaller value. The kernel will attempt to balance the freeing of huge pages +across all nodes in the memory policy of the task modifying ``nr_hugepages``. +Any free huge pages on the selected nodes will be freed back to the kernel's +normal page pool. + +Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that +it becomes less than the number of huge pages in use will convert the balance +of the in-use huge pages to surplus huge pages. This will occur even if +the number of surplus pages would exceed the overcommit value. As long as +this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is +increased sufficiently, or the surplus huge pages go out of use and are freed-- +no more surplus huge pages will be allowed to be allocated. + +With support for multiple huge page pools at run-time available, much of +the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in +sysfs. +The ``/proc`` interfaces discussed above have been retained for backwards +compatibility. The root huge page control directory in sysfs is:: + + /sys/kernel/mm/hugepages + +For each huge page size supported by the running kernel, a subdirectory +will exist, of the form:: + + hugepages-${size}kB + +Inside each of these directories, the same set of files will exist:: + + nr_hugepages + nr_hugepages_mempolicy + nr_overcommit_hugepages + free_hugepages + resv_hugepages + surplus_hugepages + +which function as described above for the default huge page-sized case. + +.. _mem_policy_and_hp_alloc: + +Interaction of Task Memory Policy with Huge Page Allocation/Freeing +=================================================================== + +Whether huge pages are allocated and freed via the ``/proc`` interface or +the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the +NUMA nodes from which huge pages are allocated or freed are controlled by the +NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy`` +sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy +is ignored. + +The recommended method to allocate or free huge pages to/from the kernel +huge page pool, using the ``nr_hugepages`` example above, is:: + + numactl --interleave echo 20 \ + >/proc/sys/vm/nr_hugepages_mempolicy + +or, more succinctly:: + + numactl -m echo 20 >/proc/sys/vm/nr_hugepages_mempolicy + +This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes +specified in , depending on whether number of persistent huge pages +is initially less than or greater than 20, respectively. No huge pages will be +allocated nor freed on any node not included in the specified . + +When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any +memory policy mode--bind, preferred, local or interleave--may be used. The +resulting effect on persistent huge page allocation is as follows: + +#. Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.rst], + persistent huge pages will be distributed across the node or nodes + specified in the mempolicy as if "interleave" had been specified. + However, if a node in the policy does not contain sufficient contiguous + memory for a huge page, the allocation will not "fallback" to the nearest + neighbor node with sufficient contiguous memory. To do this would cause + undesirable imbalance in the distribution of the huge page pool, or + possibly, allocation of persistent huge pages on nodes not allowed by + the task's memory policy. + +#. One or more nodes may be specified with the bind or interleave policy. + If more than one node is specified with the preferred policy, only the + lowest numeric id will be used. Local policy will select the node where + the task is running at the time the nodes_allowed mask is constructed. + For local policy to be deterministic, the task must be bound to a cpu or + cpus in a single node. Otherwise, the task could be migrated to some + other node at any time after launch and the resulting node will be + indeterminate. Thus, local policy is not very useful for this purpose. + Any of the other mempolicy modes may be used to specify a single node. + +#. The nodes allowed mask will be derived from any non-default task mempolicy, + whether this policy was set explicitly by the task itself or one of its + ancestors, such as numactl. This means that if the task is invoked from a + shell with non-default policy, that policy will be used. One can specify a + node list of "all" with numactl --interleave or --membind [-m] to achieve + interleaving over all nodes in the system or cpuset. + +#. Any task mempolicy specified--e.g., using numactl--will be constrained by + the resource limits of any cpuset in which the task runs. Thus, there will + be no way for a task with non-default policy running in a cpuset with a + subset of the system nodes to allocate huge pages outside the cpuset + without first moving to a cpuset that contains all of the desired nodes. + +#. Boot-time huge page allocation attempts to distribute the requested number + of huge pages over all on-lines nodes with memory. + +Per Node Hugepages Attributes +============================= + +A subset of the contents of the root huge page control directory in sysfs, +described above, will be replicated under each the system device of each +NUMA node with memory in:: + + /sys/devices/system/node/node[0-9]*/hugepages/ + +Under this directory, the subdirectory for each supported huge page size +contains the following attribute files:: + + nr_hugepages + free_hugepages + surplus_hugepages + +The free\_' and surplus\_' attribute files are read-only. They return the number +of free and surplus [overcommitted] huge pages, respectively, on the parent +node. + +The ``nr_hugepages`` attribute returns the total number of huge pages on the +specified node. When this attribute is written, the number of persistent huge +pages on the parent node will be adjusted to the specified value, if sufficient +resources exist, regardless of the task's mempolicy or cpuset constraints. + +Note that the number of overcommit and reserve pages remain global quantities, +as we don't know until fault time, when the faulting task's mempolicy is +applied, from which node the huge page allocation will be attempted. + +.. _using_huge_pages: + +Using Huge Pages +================ + +If the user applications are going to request huge pages using mmap system +call, then it is required that system administrator mount a file system of +type hugetlbfs:: + + mount -t hugetlbfs \ + -o uid=,gid=,mode=,pagesize=,size=,\ + min_size=,nr_inodes= none /mnt/huge + +This command mounts a (pseudo) filesystem of type hugetlbfs on the directory +``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages. + +The ``uid`` and ``gid`` options sets the owner and group of the root of the +file system. By default the ``uid`` and ``gid`` of the current process +are taken. + +The ``mode`` option sets the mode of root of file system to value & 01777. +This value is given in octal. By default the value 0755 is picked. + +If the platform supports multiple huge page sizes, the ``pagesize`` option can +be used to specify the huge page size and associated pool. ``pagesize`` +is specified in bytes. If ``pagesize`` is not specified the platform's +default huge page size and associated pool will be used. + +The ``size`` option sets the maximum value of memory (huge pages) allowed +for that filesystem (``/mnt/huge``). The ``size`` option can be specified +in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``). +The size is rounded down to HPAGE_SIZE boundary. + +The ``min_size`` option sets the minimum value of memory (huge pages) allowed +for the filesystem. ``min_size`` can be specified in the same way as ``size``, +either bytes or a percentage of the huge page pool. +At mount time, the number of huge pages specified by ``min_size`` are reserved +for use by the filesystem. +If there are not enough free huge pages available, the mount will fail. +As huge pages are allocated to the filesystem and freed, the reserve count +is adjusted so that the sum of allocated and reserved huge pages is always +at least ``min_size``. + +The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge`` +can use. + +If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on +command line then no limits are set. + +For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can +use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. +For example, size=2K has the same meaning as size=2048. + +While read system calls are supported on files that reside on hugetlb +file systems, write system calls are not. + +Regular chown, chgrp, and chmod commands (with right permissions) could be +used to change the file attributes on hugetlbfs. + +Also, it is important to note that no such mount command is required if +applications are going to use only shmat/shmget system calls or mmap with +MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see +:ref:`map_hugetlb ` below. + +Users who wish to use hugetlb memory via shared memory segment should be +members of a supplementary group and system admin needs to configure that gid +into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for same or different +applications to use any combination of mmaps and shm* calls, though the mount of +filesystem will be required for using mmap calls without MAP_HUGETLB. + +Syscalls that operate on memory backed by hugetlb pages only have their lengths +aligned to the native page size of the processor; they will normally fail with +errno set to EINVAL or exclude hugetlb pages that extend beyond the length if +not hugepage aligned. For example, munmap(2) will fail if memory is backed by +a hugetlb page and the length is smaller than the hugepage size. + + +Examples +======== + +.. _map_hugetlb: + +``map_hugetlb`` + see tools/testing/selftests/vm/map_hugetlb.c + +``hugepage-shm`` + see tools/testing/selftests/vm/hugepage-shm.c + +``hugepage-mmap`` + see tools/testing/selftests/vm/hugepage-mmap.c + +The `libhugetlbfs`_ library provides a wide range of userspace tools +to help with huge page usability, environment setup, and control. + +.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst new file mode 100644 index 000000000000..92e3a25d2deb --- /dev/null +++ b/Documentation/admin-guide/mm/idle_page_tracking.rst @@ -0,0 +1,115 @@ +.. _idle_page_tracking: + +================== +Idle Page Tracking +================== + +Motivation +========== + +The idle page tracking feature allows to track which memory pages are being +accessed by a workload and which are idle. This information can be useful for +estimating the workload's working set size, which, in turn, can be taken into +account when configuring the workload parameters, setting memory cgroup limits, +or deciding where to place the workload within a compute cluster. + +It is enabled by CONFIG_IDLE_PAGE_TRACKING=y. + +.. _user_api: + +User API +======== + +The idle page tracking API is located at ``/sys/kernel/mm/page_idle``. +Currently, it consists of the only read-write file, +``/sys/kernel/mm/page_idle/bitmap``. + +The file implements a bitmap where each bit corresponds to a memory page. The +bitmap is represented by an array of 8-byte integers, and the page at PFN #i is +mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is +set, the corresponding page is idle. + +A page is considered idle if it has not been accessed since it was marked idle +(for more details on what "accessed" actually means see the :ref:`Implementation +Details ` section). +To mark a page idle one has to set the bit corresponding to +the page by writing to the file. A value written to the file is OR-ed with the +current bitmap value. + +Only accesses to user memory pages are tracked. These are pages mapped to a +process address space, page cache and buffer pages, swap cache pages. For other +page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored, +and hence such pages are never reported idle. + +For huge pages the idle flag is set only on the head page, so one has to read +``/proc/kpageflags`` in order to correctly count idle huge pages. + +Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return +-EINVAL if you are not starting the read/write on an 8-byte boundary, or +if the size of the read/write is not a multiple of 8 bytes. Writing to +this file beyond max PFN will return -ENXIO. + +That said, in order to estimate the amount of pages that are not used by a +workload one should: + + 1. Mark all the workload's pages as idle by setting corresponding bits in + ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading + ``/proc/pid/pagemap`` if the workload is represented by a process, or by + filtering out alien pages using ``/proc/kpagecgroup`` in case the workload + is placed in a memory cgroup. + + 2. Wait until the workload accesses its working set. + + 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set. + If one wants to ignore certain types of pages, e.g. mlocked pages since they + are not reclaimable, he or she can filter them out using + ``/proc/kpageflags``. + +See Documentation/admin-guide/mm/pagemap.rst for more information about +``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``. + +.. _impl_details: + +Implementation Details +====================== + +The kernel internally keeps track of accesses to user memory pages in order to +reclaim unreferenced pages first on memory shortage conditions. A page is +considered referenced if it has been recently accessed via a process address +space, in which case one or more PTEs it is mapped to will have the Accessed bit +set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The +latter happens when: + + - a userspace process reads or writes a page using a system call (e.g. read(2) + or write(2)) + + - a page that is used for storing filesystem buffers is read or written, + because a process needs filesystem metadata stored in it (e.g. lists a + directory tree) + + - a page is accessed by a device driver using get_user_pages() + +When a dirty page is written to swap or disk as a result of memory reclaim or +exceeding the dirty memory limit, it is not marked referenced. + +The idle memory tracking feature adds a new page flag, the Idle flag. This flag +is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the +:ref:`User API ` +section), and cleared automatically whenever a page is referenced as defined +above. + +When a page is marked idle, the Accessed bit must be cleared in all PTEs it is +mapped to, otherwise we will not be able to detect accesses to the page coming +from a process address space. To avoid interference with the reclaimer, which, +as noted above, uses the Accessed bit to promote actively referenced pages, one +more page flag is introduced, the Young flag. When the PTE Accessed bit is +cleared as a result of setting or updating a page's Idle flag, the Young flag +is set on the page. The reclaimer treats the Young flag as an extra PTE +Accessed bit and therefore will consider such a page as referenced. + +Since the idle memory tracking feature is based on the memory reclaimer logic, +it only works with pages that are on an LRU list, other pages are silently +ignored. That means it will ignore a user memory page if it is isolated, but +since there are usually not many of them, it should not affect the overall +result noticeably. In order not to stall scanning of the idle page bitmap, +locked pages may be skipped too. diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index c47c16e13a18..6c8b554464bb 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -17,3 +17,12 @@ are described in Documentation/sysctl/vm.txt and in `man 5 proc`_. Here we document in detail how to interact with various mechanisms in the Linux memory management. + +.. toctree:: + :maxdepth: 1 + + hugetlbpage + idle_page_tracking + pagemap + soft-dirty + userfaultfd diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst new file mode 100644 index 000000000000..053ca64fd47a --- /dev/null +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -0,0 +1,197 @@ +.. _pagemap: + +============================= +Examining Process Page Tables +============================= + +pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow +userspace programs to examine the page tables and related information by +reading files in ``/proc``. + +There are four components to pagemap: + + * ``/proc/pid/pagemap``. This file lets a userspace process find out which + physical frame each virtual page is mapped to. It contains one 64-bit + value for each virtual page, containing the following data (from + ``fs/proc/task_mmu.c``, above pagemap_read): + + * Bits 0-54 page frame number (PFN) if present + * Bits 0-4 swap type if swapped + * Bits 5-54 swap offset if swapped + * Bit 55 pte is soft-dirty (see Documentation/admin-guide/mm/soft-dirty.rst) + * Bit 56 page exclusively mapped (since 4.2) + * Bits 57-60 zero + * Bit 61 page is file-page or shared-anon (since 3.5) + * Bit 62 page swapped + * Bit 63 page present + + Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs. + In 4.0 and 4.1 opens by unprivileged fail with -EPERM. Starting from + 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN. + Reason: information about PFNs helps in exploiting Rowhammer vulnerability. + + If the page is not present but in swap, then the PFN contains an + encoding of the swap file number and the page's offset into the + swap. Unmapped pages return a null PFN. This allows determining + precisely which pages are mapped (or in swap) and comparing mapped + pages between processes. + + Efficient users of this interface will use ``/proc/pid/maps`` to + determine which areas of memory are actually mapped and llseek to + skip over unmapped regions. + + * ``/proc/kpagecount``. This file contains a 64-bit count of the number of + times each page is mapped, indexed by PFN. + + * ``/proc/kpageflags``. This file contains a 64-bit set of flags for each + page, indexed by PFN. + + The flags are (from ``fs/proc/page.c``, above kpageflags_read): + + 0. LOCKED + 1. ERROR + 2. REFERENCED + 3. UPTODATE + 4. DIRTY + 5. LRU + 6. ACTIVE + 7. SLAB + 8. WRITEBACK + 9. RECLAIM + 10. BUDDY + 11. MMAP + 12. ANON + 13. SWAPCACHE + 14. SWAPBACKED + 15. COMPOUND_HEAD + 16. COMPOUND_TAIL + 17. HUGE + 18. UNEVICTABLE + 19. HWPOISON + 20. NOPAGE + 21. KSM + 22. THP + 23. BALLOON + 24. ZERO_PAGE + 25. IDLE + + * ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the + memory cgroup each page is charged to, indexed by PFN. Only available when + CONFIG_MEMCG is set. + +Short descriptions to the page flags +==================================== + +0 - LOCKED + page is being locked for exclusive access, e.g. by undergoing read/write IO +7 - SLAB + page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator + When compound page is used, SLUB/SLQB will only set this flag on the head + page; SLOB will not flag it at all. +10 - BUDDY + a free memory block managed by the buddy system allocator + The buddy system organizes free memory in blocks of various orders. + An order N block has 2^N physically contiguous pages, with the BUDDY flag + set for and _only_ for the first page. +15 - COMPOUND_HEAD + A compound page with order N consists of 2^N physically contiguous pages. + A compound page with order 2 takes the form of "HTTT", where H donates its + head page and T donates its tail page(s). The major consumers of compound + pages are hugeTLB pages (Documentation/admin-guide/mm/hugetlbpage.rst), the SLUB etc. + memory allocators and various device drivers. However in this interface, + only huge/giga pages are made visible to end users. +16 - COMPOUND_TAIL + A compound page tail (see description above). +17 - HUGE + this is an integral part of a HugeTLB page +19 - HWPOISON + hardware detected memory corruption on this page: don't touch the data! +20 - NOPAGE + no page frame exists at the requested address +21 - KSM + identical memory pages dynamically shared between one or more processes +22 - THP + contiguous pages which construct transparent hugepages +23 - BALLOON + balloon compaction page +24 - ZERO_PAGE + zero page for pfn_zero or huge_zero page +25 - IDLE + page has not been accessed since it was marked idle (see + Documentation/admin-guide/mm/idle_page_tracking.rst). Note that this flag may be + stale in case the page was accessed via a PTE. To make sure the flag + is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first. + +IO related page flags +--------------------- + +1 - ERROR + IO error occurred +3 - UPTODATE + page has up-to-date data + ie. for file backed page: (in-memory data revision >= on-disk one) +4 - DIRTY + page has been written to, hence contains new data + i.e. for file backed page: (in-memory data revision > on-disk one) +8 - WRITEBACK + page is being synced to disk + +LRU related page flags +---------------------- + +5 - LRU + page is in one of the LRU lists +6 - ACTIVE + page is in the active LRU list +18 - UNEVICTABLE + page is in the unevictable (non-)LRU list It is somehow pinned and + not a candidate for LRU page reclaims, e.g. ramfs pages, + shmctl(SHM_LOCK) and mlock() memory segments +2 - REFERENCED + page has been referenced since last LRU list enqueue/requeue +9 - RECLAIM + page will be reclaimed soon after its pageout IO completed +11 - MMAP + a memory mapped page +12 - ANON + a memory mapped page that is not part of a file +13 - SWAPCACHE + page is mapped to swap space, i.e. has an associated swap entry +14 - SWAPBACKED + page is backed by swap/RAM + +The page-types tool in the tools/vm directory can be used to query the +above flags. + +Using pagemap to do something useful +==================================== + +The general procedure for using pagemap to find out about a process' memory +usage goes like this: + + 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are + mapped to what. + 2. Select the maps you are interested in -- all of them, or a particular + library, or the stack or the heap, etc. + 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine. + 4. Read a u64 for each page from pagemap. + 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you + just read, seek to that entry in the file, and read the data you want. + +For example, to find the "unique set size" (USS), which is the amount of +memory that a process is using that is not shared with any other process, +you can go through every map in the process, find the PFNs, look those up +in kpagecount, and tally up the number of pages that are only referenced +once. + +Other notes +=========== + +Reading from any of the files will return -EINVAL if you are not starting +the read on an 8-byte boundary (e.g., if you sought an odd number of bytes +into the file), or if the size of the read is not a multiple of 8 bytes. + +Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is +always 12 at most architectures). Since Linux 3.11 their meaning changes +after first clear of soft-dirty bits. Since Linux 4.2 they are used for +flags unconditionally. diff --git a/Documentation/admin-guide/mm/soft-dirty.rst b/Documentation/admin-guide/mm/soft-dirty.rst new file mode 100644 index 000000000000..cb0cfd6672fa --- /dev/null +++ b/Documentation/admin-guide/mm/soft-dirty.rst @@ -0,0 +1,47 @@ +.. _soft_dirty: + +=============== +Soft-Dirty PTEs +=============== + +The soft-dirty is a bit on a PTE which helps to track which pages a task +writes to. In order to do this tracking one should + + 1. Clear soft-dirty bits from the task's PTEs. + + This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the + task in question. + + 2. Wait some time. + + 3. Read soft-dirty bits from the PTEs. + + This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the + 64-bit qword is the soft-dirty one. If set, the respective PTE was + written to since step 1. + + +Internally, to do this tracking, the writable bit is cleared from PTEs +when the soft-dirty bit is cleared. So, after this, when the task tries to +modify a page at some virtual address the #PF occurs and the kernel sets +the soft-dirty bit on the respective PTE. + +Note, that although all the task's address space is marked as r/o after the +soft-dirty bits clear, the #PF-s that occur after that are processed fast. +This is so, since the pages are still mapped to physical memory, and thus all +the kernel does is finds this fact out and puts both writable and soft-dirty +bits on the PTE. + +While in most cases tracking memory changes by #PF-s is more than enough +there is still a scenario when we can lose soft dirty bits -- a task +unmaps a previously mapped memory region and then maps a new one at exactly +the same place. When unmap is called, the kernel internally clears PTE values +including soft dirty bits. To notify user space application about such +memory region renewal the kernel always marks new memory regions (and +expanded regions) as soft dirty. + +This feature is actively used by the checkpoint-restore project. You +can find more details about it on http://criu.org + + +-- Pavel Emelyanov, Apr 9, 2013 diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst new file mode 100644 index 000000000000..5048cf661a8a --- /dev/null +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -0,0 +1,241 @@ +.. _userfaultfd: + +=========== +Userfaultfd +=========== + +Objective +========= + +Userfaults allow the implementation of on-demand paging from userland +and more generally they allow userland to take control of various +memory page faults, something otherwise only the kernel code could do. + +For example userfaults allows a proper and more optimal implementation +of the PROT_NONE+SIGSEGV trick. + +Design +====== + +Userfaults are delivered and resolved through the userfaultfd syscall. + +The userfaultfd (aside from registering and unregistering virtual +memory ranges) provides two primary functionalities: + +1) read/POLLIN protocol to notify a userland thread of the faults + happening + +2) various UFFDIO_* ioctls that can manage the virtual memory regions + registered in the userfaultfd that allows userland to efficiently + resolve the userfaults it receives via 1) or to manage the virtual + memory in the background + +The real advantage of userfaults if compared to regular virtual memory +management of mremap/mprotect is that the userfaults in all their +operations never involve heavyweight structures like vmas (in fact the +userfaultfd runtime load never takes the mmap_sem for writing). + +Vmas are not suitable for page- (or hugepage) granular fault tracking +when dealing with virtual address spaces that could span +Terabytes. Too many vmas would be needed for that. + +The userfaultfd once opened by invoking the syscall, can also be +passed using unix domain sockets to a manager process, so the same +manager process could handle the userfaults of a multitude of +different processes without them being aware about what is going on +(well of course unless they later try to use the userfaultfd +themselves on the same region the manager is already tracking, which +is a corner case that would currently return -EBUSY). + +API +=== + +When first opened the userfaultfd must be enabled invoking the +UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or +a later API version) which will specify the read/POLLIN protocol +userland intends to speak on the UFFD and the uffdio_api.features +userland requires. The UFFDIO_API ioctl if successful (i.e. if the +requested uffdio_api.api is spoken also by the running kernel and the +requested features are going to be enabled) will return into +uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of +respectively all the available features of the read(2) protocol and +the generic ioctl available. + +The uffdio_api.features bitmask returned by the UFFDIO_API ioctl +defines what memory types are supported by the userfaultfd and what +events, except page fault notifications, may be generated. + +If the kernel supports registering userfaultfd ranges on hugetlbfs +virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in +uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be +set if the kernel supports registering userfaultfd ranges on shared +memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero +MAP_SHARED, memfd_create, etc). + +The userland application that wants to use userfaultfd with hugetlbfs +or shared memory need to set the corresponding flag in +uffdio_api.features to enable those features. + +If the userland desires to receive notifications for events other than +page faults, it has to verify that uffdio_api.features has appropriate +UFFD_FEATURE_EVENT_* bits set. These events are described in more +detail below in "Non-cooperative userfaultfd" section. + +Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should +be invoked (if present in the returned uffdio_api.ioctls bitmask) to +register a memory range in the userfaultfd by setting the +uffdio_register structure accordingly. The uffdio_register.mode +bitmask will specify to the kernel which kind of faults to track for +the range (UFFDIO_REGISTER_MODE_MISSING would track missing +pages). The UFFDIO_REGISTER ioctl will return the +uffdio_register.ioctls bitmask of ioctls that are suitable to resolve +userfaults on the range registered. Not all ioctls will necessarily be +supported for all memory types depending on the underlying virtual +memory backend (anonymous memory vs tmpfs vs real filebacked +mappings). + +Userland can use the uffdio_register.ioctls to manage the virtual +address space in the background (to add or potentially also remove +memory from the userfaultfd registered range). This means a userfault +could be triggering just before userland maps in the background the +user-faulted page. + +The primary ioctl to resolve userfaults is UFFDIO_COPY. That +atomically copies a page into the userfault registered range and wakes +up the blocked userfaults (unless uffdio_copy.mode & +UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to +UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an +half copied page since it'll keep userfaulting until the copy has +finished. + +QEMU/KVM +======== + +QEMU/KVM is using the userfaultfd syscall to implement postcopy live +migration. Postcopy live migration is one form of memory +externalization consisting of a virtual machine running with part or +all of its memory residing on a different node in the cloud. The +userfaultfd abstraction is generic enough that not a single line of +KVM kernel code had to be modified in order to add postcopy live +migration to QEMU. + +Guest async page faults, FOLL_NOWAIT and all other GUP features work +just fine in combination with userfaults. Userfaults trigger async +page faults in the guest scheduler so those guest processes that +aren't waiting for userfaults (i.e. network bound) can keep running in +the guest vcpus. + +It is generally beneficial to run one pass of precopy live migration +just before starting postcopy live migration, in order to avoid +generating userfaults for readonly guest regions. + +The implementation of postcopy live migration currently uses one +single bidirectional socket but in the future two different sockets +will be used (to reduce the latency of the userfaults to the minimum +possible without having to decrease /proc/sys/net/ipv4/tcp_wmem). + +The QEMU in the source node writes all pages that it knows are missing +in the destination node, into the socket, and the migration thread of +the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE +ioctls on the userfaultfd in order to map the received pages into the +guest (UFFDIO_ZEROCOPY is used if the source page was a zero page). + +A different postcopy thread in the destination node listens with +poll() to the userfaultfd in parallel. When a POLLIN event is +generated after a userfault triggers, the postcopy thread read() from +the userfaultfd and receives the fault address (or -EAGAIN in case the +userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run +by the parallel QEMU migration thread). + +After the QEMU postcopy thread (running in the destination node) gets +the userfault address it writes the information about the missing page +into the socket. The QEMU source node receives the information and +roughly "seeks" to that page address and continues sending all +remaining missing pages from that new page offset. Soon after that +(just the time to flush the tcp_wmem queue through the network) the +migration thread in the QEMU running in the destination node will +receive the page that triggered the userfault and it'll map it as +usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it +was spontaneously sent by the source or if it was an urgent page +requested through a userfault). + +By the time the userfaults start, the QEMU in the destination node +doesn't need to keep any per-page state bitmap relative to the live +migration around and a single per-page bitmap has to be maintained in +the QEMU running in the source node to know which pages are still +missing in the destination node. The bitmap in the source node is +checked to find which missing pages to send in round robin and we seek +over it when receiving incoming userfaults. After sending each page of +course the bitmap is updated accordingly. It's also useful to avoid +sending the same page twice (in case the userfault is read by the +postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration +thread). + +Non-cooperative userfaultfd +=========================== + +When the userfaultfd is monitored by an external manager, the manager +must be able to track changes in the process virtual memory +layout. Userfaultfd can notify the manager about such changes using +the same read(2) protocol as for the page fault notifications. The +manager has to explicitly enable these events by setting appropriate +bits in uffdio_api.features passed to UFFDIO_API ioctl: + +UFFD_FEATURE_EVENT_FORK + enable userfaultfd hooks for fork(). When this feature is + enabled, the userfaultfd context of the parent process is + duplicated into the newly created process. The manager + receives UFFD_EVENT_FORK with file descriptor of the new + userfaultfd context in the uffd_msg.fork. + +UFFD_FEATURE_EVENT_REMAP + enable notifications about mremap() calls. When the + non-cooperative process moves a virtual memory area to a + different location, the manager will receive + UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and + new addresses of the area and its original length. + +UFFD_FEATURE_EVENT_REMOVE + enable notifications about madvise(MADV_REMOVE) and + madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will + be generated upon these calls to madvise. The uffd_msg.remove + will contain start and end addresses of the removed area. + +UFFD_FEATURE_EVENT_UNMAP + enable notifications about memory unmapping. The manager will + get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and + end addresses of the unmapped area. + +Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP +are pretty similar, they quite differ in the action expected from the +userfaultfd manager. In the former case, the virtual memory is +removed, but the area is not, the area remains monitored by the +userfaultfd, and if a page fault occurs in that area it will be +delivered to the manager. The proper resolution for such page fault is +to zeromap the faulting address. However, in the latter case, when an +area is unmapped, either explicitly (with munmap() system call), or +implicitly (e.g. during mremap()), the area is removed and in turn the +userfaultfd context for such area disappears too and the manager will +not get further userland page faults from the removed area. Still, the +notification is required in order to prevent manager from using +UFFDIO_COPY on the unmapped area. + +Unlike userland page faults which have to be synchronous and require +explicit or implicit wakeup, all the events are delivered +asynchronously and the non-cooperative process resumes execution as +soon as manager executes read(). The userfaultfd manager should +carefully synchronize calls to UFFDIO_COPY with the events +processing. To aid the synchronization, the UFFDIO_COPY ioctl will +return -ENOSPC when the monitored process exits at the time of +UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed +its virtual memory layout simultaneously with outstanding UFFDIO_COPY +operation. + +The current asynchronous model of the event delivery is optimal for +single threaded non-cooperative userfaultfd manager implementations. A +synchronous event delivery model can be added later as a new +userfaultfd feature to facilitate multithreading enhancements of the +non cooperative manager, for example to allow UFFDIO_COPY ioctls to +run in parallel to the event reception. Single threaded +implementations should continue to use the current async event +delivery model instead. diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 2d3984c70feb..ef53f808288d 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -515,7 +515,8 @@ guarantees: The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG bits on both physical and virtual pages associated with a process, and the -soft-dirty bit on pte (see Documentation/vm/soft-dirty.rst for details). +soft-dirty bit on pte (see Documentation/admin-guide/mm/soft-dirty.rst +for details). To clear the bits for all the pages associated with the process > echo 1 > /proc/PID/clear_refs @@ -536,7 +537,8 @@ Any other value written to /proc/PID/clear_refs will have no effect. The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags using /proc/kpageflags and number of times a page is mapped using -/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.rst. +/proc/kpagecount. For detailed explanation, see +Documentation/admin-guide/mm/pagemap.rst. The /proc/pid/numa_maps is an extension based on maps, showing the memory locality and binding policy, as well as the memory usage (in pages) of diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index c8e6d5b031e4..697ef8c225df 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -515,7 +515,7 @@ nr_hugepages Change the minimum size of the hugepage pool. -See Documentation/vm/hugetlbpage.rst +See Documentation/admin-guide/mm/hugetlbpage.rst ============================================================== @@ -524,7 +524,7 @@ nr_overcommit_hugepages Change the maximum size of the hugepage pool. The maximum is nr_hugepages + nr_overcommit_hugepages. -See Documentation/vm/hugetlbpage.rst +See Documentation/admin-guide/mm/hugetlbpage.rst ============================================================== diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX index cda564d55b3c..f8a96ca16b7a 100644 --- a/Documentation/vm/00-INDEX +++ b/Documentation/vm/00-INDEX @@ -12,14 +12,10 @@ highmem.rst - Outline of highmem and common issues. hmm.rst - Documentation of heterogeneous memory management -hugetlbpage.rst - - a brief summary of hugetlbpage support in the Linux kernel. hugetlbfs_reserv.rst - A brief overview of hugetlbfs reservation design/implementation. hwpoison.rst - explains what hwpoison is -idle_page_tracking.rst - - description of the idle page tracking feature. ksm.rst - how to use the Kernel Samepage Merging feature. mmu_notifier.rst @@ -34,16 +30,12 @@ page_frags.rst - description of page fragments allocator page_migration.rst - description of page migration in NUMA systems. -pagemap.rst - - pagemap, from the userspace perspective page_owner.rst - tracking about who allocated each page remap_file_pages.rst - a note about remap_file_pages() system call slub.rst - a short users guide for SLUB. -soft-dirty.rst - - short explanation for soft-dirty PTEs split_page_table_lock.rst - Separate per-table lock to improve scalability of the old page_table_lock. swap_numa.rst @@ -52,8 +44,6 @@ transhuge.rst - Transparent Hugepage Support, alternative way of using hugepages. unevictable-lru.rst - Unevictable LRU infrastructure -userfaultfd.rst - - description of userfaultfd system call z3fold.txt - outline of z3fold allocator for storing compressed pages zsmalloc.rst diff --git a/Documentation/vm/hugetlbpage.rst b/Documentation/vm/hugetlbpage.rst deleted file mode 100644 index 2b374d10284d..000000000000 --- a/Documentation/vm/hugetlbpage.rst +++ /dev/null @@ -1,381 +0,0 @@ -.. _hugetlbpage: - -============= -HugeTLB Pages -============= - -Overview -======== - -The intent of this file is to give a brief summary of hugetlbpage support in -the Linux kernel. This support is built on top of multiple page size support -that is provided by most modern architectures. For example, x86 CPUs normally -support 4K and 2M (1G if architecturally supported) page sizes, ia64 -architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M, -256M and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical -translations. Typically this is a very scarce resource on processor. -Operating systems try to make best use of limited number of TLB resources. -This optimization is more critical now as bigger and bigger physical memories -(several GBs) are more readily available. - -Users can use the huge page support in Linux kernel by either using the mmap -system call or standard SYSV shared memory system calls (shmget, shmat). - -First the Linux kernel needs to be built with the CONFIG_HUGETLBFS -(present under "File systems") and CONFIG_HUGETLB_PAGE (selected -automatically when CONFIG_HUGETLBFS is selected) configuration -options. - -The ``/proc/meminfo`` file provides information about the total number of -persistent hugetlb pages in the kernel's huge page pool. It also displays -default huge page size and information about the number of free, reserved -and surplus huge pages in the pool of huge pages of default size. -The huge page size is needed for generating the proper alignment and -size of the arguments to system calls that map huge page regions. - -The output of ``cat /proc/meminfo`` will include lines like:: - - HugePages_Total: uuu - HugePages_Free: vvv - HugePages_Rsvd: www - HugePages_Surp: xxx - Hugepagesize: yyy kB - Hugetlb: zzz kB - -where: - -HugePages_Total - is the size of the pool of huge pages. -HugePages_Free - is the number of huge pages in the pool that are not yet - allocated. -HugePages_Rsvd - is short for "reserved," and is the number of huge pages for - which a commitment to allocate from the pool has been made, - but no allocation has yet been made. Reserved huge pages - guarantee that an application will be able to allocate a - huge page from the pool of huge pages at fault time. -HugePages_Surp - is short for "surplus," and is the number of huge pages in - the pool above the value in ``/proc/sys/vm/nr_hugepages``. The - maximum number of surplus huge pages is controlled by - ``/proc/sys/vm/nr_overcommit_hugepages``. -Hugepagesize - is the default hugepage size (in Kb). -Hugetlb - is the total amount of memory (in kB), consumed by huge - pages of all sizes. - If huge pages of different sizes are in use, this number - will exceed HugePages_Total \* Hugepagesize. To get more - detailed information, please, refer to - ``/sys/kernel/mm/hugepages`` (described below). - - -``/proc/filesystems`` should also show a filesystem of type "hugetlbfs" -configured in the kernel. - -``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge -pages in the kernel's huge page pool. "Persistent" huge pages will be -returned to the huge page pool when freed by a task. A user with root -privileges can dynamically allocate more or free some persistent huge pages -by increasing or decreasing the value of ``nr_hugepages``. - -Pages that are used as huge pages are reserved inside the kernel and cannot -be used for other purposes. Huge pages cannot be swapped out under -memory pressure. - -Once a number of huge pages have been pre-allocated to the kernel huge page -pool, a user with appropriate privilege can use either the mmap system call -or shared memory system calls to use the huge pages. See the discussion of -:ref:`Using Huge Pages `, below. - -The administrator can allocate persistent huge pages on the kernel boot -command line by specifying the "hugepages=N" parameter, where 'N' = the -number of huge pages requested. This is the most reliable method of -allocating huge pages as memory has not yet become fragmented. - -Some platforms support multiple huge page sizes. To allocate huge pages -of a specific size, one must precede the huge pages boot command parameters -with a huge page size selection parameter "hugepagesz=". must -be specified in bytes with optional scale suffix [kKmMgG]. The default huge -page size may be selected with the "default_hugepagesz=" boot parameter. - -When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages`` -indicates the current number of pre-allocated huge pages of the default size. -Thus, one can use the following command to dynamically allocate/deallocate -default sized persistent huge pages:: - - echo 20 > /proc/sys/vm/nr_hugepages - -This command will try to adjust the number of default sized huge pages in the -huge page pool to 20, allocating or freeing huge pages, as required. - -On a NUMA platform, the kernel will attempt to distribute the huge page pool -over all the set of allowed nodes specified by the NUMA memory policy of the -task that modifies ``nr_hugepages``. The default for the allowed nodes--when the -task has default memory policy--is all on-line nodes with memory. Allowed -nodes with insufficient available, contiguous memory for a huge page will be -silently skipped when allocating persistent huge pages. See the -:ref:`discussion below ` -of the interaction of task memory policy, cpusets and per node attributes -with the allocation and freeing of persistent huge pages. - -The success or failure of huge page allocation depends on the amount of -physically contiguous memory that is present in system at the time of the -allocation attempt. If the kernel is unable to allocate huge pages from -some nodes in a NUMA system, it will attempt to make up the difference by -allocating extra pages on other nodes with sufficient available contiguous -memory, if any. - -System administrators may want to put this command in one of the local rc -init files. This will enable the kernel to allocate huge pages early in -the boot process when the possibility of getting physical contiguous pages -is still very high. Administrators can verify the number of huge pages -actually allocated by checking the sysctl or meminfo. To check the per node -distribution of huge pages in a NUMA system, use:: - - cat /sys/devices/system/node/node*/meminfo | fgrep Huge - -``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of -huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are -requested by applications. Writing any non-zero value into this file -indicates that the hugetlb subsystem is allowed to try to obtain that -number of "surplus" huge pages from the kernel's normal page pool, when the -persistent huge page pool is exhausted. As these surplus huge pages become -unused, they are freed back to the kernel's normal page pool. - -When increasing the huge page pool size via ``nr_hugepages``, any existing -surplus pages will first be promoted to persistent huge pages. Then, additional -huge pages will be allocated, if necessary and if possible, to fulfill -the new persistent huge page pool size. - -The administrator may shrink the pool of persistent huge pages for -the default huge page size by setting the ``nr_hugepages`` sysctl to a -smaller value. The kernel will attempt to balance the freeing of huge pages -across all nodes in the memory policy of the task modifying ``nr_hugepages``. -Any free huge pages on the selected nodes will be freed back to the kernel's -normal page pool. - -Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that -it becomes less than the number of huge pages in use will convert the balance -of the in-use huge pages to surplus huge pages. This will occur even if -the number of surplus pages would exceed the overcommit value. As long as -this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is -increased sufficiently, or the surplus huge pages go out of use and are freed-- -no more surplus huge pages will be allowed to be allocated. - -With support for multiple huge page pools at run-time available, much of -the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in -sysfs. -The ``/proc`` interfaces discussed above have been retained for backwards -compatibility. The root huge page control directory in sysfs is:: - - /sys/kernel/mm/hugepages - -For each huge page size supported by the running kernel, a subdirectory -will exist, of the form:: - - hugepages-${size}kB - -Inside each of these directories, the same set of files will exist:: - - nr_hugepages - nr_hugepages_mempolicy - nr_overcommit_hugepages - free_hugepages - resv_hugepages - surplus_hugepages - -which function as described above for the default huge page-sized case. - -.. _mem_policy_and_hp_alloc: - -Interaction of Task Memory Policy with Huge Page Allocation/Freeing -=================================================================== - -Whether huge pages are allocated and freed via the ``/proc`` interface or -the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the -NUMA nodes from which huge pages are allocated or freed are controlled by the -NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy`` -sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy -is ignored. - -The recommended method to allocate or free huge pages to/from the kernel -huge page pool, using the ``nr_hugepages`` example above, is:: - - numactl --interleave echo 20 \ - >/proc/sys/vm/nr_hugepages_mempolicy - -or, more succinctly:: - - numactl -m echo 20 >/proc/sys/vm/nr_hugepages_mempolicy - -This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes -specified in , depending on whether number of persistent huge pages -is initially less than or greater than 20, respectively. No huge pages will be -allocated nor freed on any node not included in the specified . - -When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any -memory policy mode--bind, preferred, local or interleave--may be used. The -resulting effect on persistent huge page allocation is as follows: - -#. Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.rst], - persistent huge pages will be distributed across the node or nodes - specified in the mempolicy as if "interleave" had been specified. - However, if a node in the policy does not contain sufficient contiguous - memory for a huge page, the allocation will not "fallback" to the nearest - neighbor node with sufficient contiguous memory. To do this would cause - undesirable imbalance in the distribution of the huge page pool, or - possibly, allocation of persistent huge pages on nodes not allowed by - the task's memory policy. - -#. One or more nodes may be specified with the bind or interleave policy. - If more than one node is specified with the preferred policy, only the - lowest numeric id will be used. Local policy will select the node where - the task is running at the time the nodes_allowed mask is constructed. - For local policy to be deterministic, the task must be bound to a cpu or - cpus in a single node. Otherwise, the task could be migrated to some - other node at any time after launch and the resulting node will be - indeterminate. Thus, local policy is not very useful for this purpose. - Any of the other mempolicy modes may be used to specify a single node. - -#. The nodes allowed mask will be derived from any non-default task mempolicy, - whether this policy was set explicitly by the task itself or one of its - ancestors, such as numactl. This means that if the task is invoked from a - shell with non-default policy, that policy will be used. One can specify a - node list of "all" with numactl --interleave or --membind [-m] to achieve - interleaving over all nodes in the system or cpuset. - -#. Any task mempolicy specified--e.g., using numactl--will be constrained by - the resource limits of any cpuset in which the task runs. Thus, there will - be no way for a task with non-default policy running in a cpuset with a - subset of the system nodes to allocate huge pages outside the cpuset - without first moving to a cpuset that contains all of the desired nodes. - -#. Boot-time huge page allocation attempts to distribute the requested number - of huge pages over all on-lines nodes with memory. - -Per Node Hugepages Attributes -============================= - -A subset of the contents of the root huge page control directory in sysfs, -described above, will be replicated under each the system device of each -NUMA node with memory in:: - - /sys/devices/system/node/node[0-9]*/hugepages/ - -Under this directory, the subdirectory for each supported huge page size -contains the following attribute files:: - - nr_hugepages - free_hugepages - surplus_hugepages - -The free\_' and surplus\_' attribute files are read-only. They return the number -of free and surplus [overcommitted] huge pages, respectively, on the parent -node. - -The ``nr_hugepages`` attribute returns the total number of huge pages on the -specified node. When this attribute is written, the number of persistent huge -pages on the parent node will be adjusted to the specified value, if sufficient -resources exist, regardless of the task's mempolicy or cpuset constraints. - -Note that the number of overcommit and reserve pages remain global quantities, -as we don't know until fault time, when the faulting task's mempolicy is -applied, from which node the huge page allocation will be attempted. - -.. _using_huge_pages: - -Using Huge Pages -================ - -If the user applications are going to request huge pages using mmap system -call, then it is required that system administrator mount a file system of -type hugetlbfs:: - - mount -t hugetlbfs \ - -o uid=,gid=,mode=,pagesize=,size=,\ - min_size=,nr_inodes= none /mnt/huge - -This command mounts a (pseudo) filesystem of type hugetlbfs on the directory -``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages. - -The ``uid`` and ``gid`` options sets the owner and group of the root of the -file system. By default the ``uid`` and ``gid`` of the current process -are taken. - -The ``mode`` option sets the mode of root of file system to value & 01777. -This value is given in octal. By default the value 0755 is picked. - -If the platform supports multiple huge page sizes, the ``pagesize`` option can -be used to specify the huge page size and associated pool. ``pagesize`` -is specified in bytes. If ``pagesize`` is not specified the platform's -default huge page size and associated pool will be used. - -The ``size`` option sets the maximum value of memory (huge pages) allowed -for that filesystem (``/mnt/huge``). The ``size`` option can be specified -in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``). -The size is rounded down to HPAGE_SIZE boundary. - -The ``min_size`` option sets the minimum value of memory (huge pages) allowed -for the filesystem. ``min_size`` can be specified in the same way as ``size``, -either bytes or a percentage of the huge page pool. -At mount time, the number of huge pages specified by ``min_size`` are reserved -for use by the filesystem. -If there are not enough free huge pages available, the mount will fail. -As huge pages are allocated to the filesystem and freed, the reserve count -is adjusted so that the sum of allocated and reserved huge pages is always -at least ``min_size``. - -The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge`` -can use. - -If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on -command line then no limits are set. - -For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can -use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. -For example, size=2K has the same meaning as size=2048. - -While read system calls are supported on files that reside on hugetlb -file systems, write system calls are not. - -Regular chown, chgrp, and chmod commands (with right permissions) could be -used to change the file attributes on hugetlbfs. - -Also, it is important to note that no such mount command is required if -applications are going to use only shmat/shmget system calls or mmap with -MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see -:ref:`map_hugetlb ` below. - -Users who wish to use hugetlb memory via shared memory segment should be -members of a supplementary group and system admin needs to configure that gid -into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for same or different -applications to use any combination of mmaps and shm* calls, though the mount of -filesystem will be required for using mmap calls without MAP_HUGETLB. - -Syscalls that operate on memory backed by hugetlb pages only have their lengths -aligned to the native page size of the processor; they will normally fail with -errno set to EINVAL or exclude hugetlb pages that extend beyond the length if -not hugepage aligned. For example, munmap(2) will fail if memory is backed by -a hugetlb page and the length is smaller than the hugepage size. - - -Examples -======== - -.. _map_hugetlb: - -``map_hugetlb`` - see tools/testing/selftests/vm/map_hugetlb.c - -``hugepage-shm`` - see tools/testing/selftests/vm/hugepage-shm.c - -``hugepage-mmap`` - see tools/testing/selftests/vm/hugepage-mmap.c - -The `libhugetlbfs`_ library provides a wide range of userspace tools -to help with huge page usability, environment setup, and control. - -.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs diff --git a/Documentation/vm/hwpoison.rst b/Documentation/vm/hwpoison.rst index 070aa1e716b7..09bd24a92784 100644 --- a/Documentation/vm/hwpoison.rst +++ b/Documentation/vm/hwpoison.rst @@ -155,7 +155,7 @@ Testing value). This allows stress testing of many kinds of pages. The page_flags are the same as in /proc/kpageflags. The flag bits are defined in include/linux/kernel-page-flags.h and - documented in Documentation/vm/pagemap.rst + documented in Documentation/admin-guide/mm/pagemap.rst * Architecture specific MCE injector diff --git a/Documentation/vm/idle_page_tracking.rst b/Documentation/vm/idle_page_tracking.rst deleted file mode 100644 index d1c4609a5220..000000000000 --- a/Documentation/vm/idle_page_tracking.rst +++ /dev/null @@ -1,115 +0,0 @@ -.. _idle_page_tracking: - -================== -Idle Page Tracking -================== - -Motivation -========== - -The idle page tracking feature allows to track which memory pages are being -accessed by a workload and which are idle. This information can be useful for -estimating the workload's working set size, which, in turn, can be taken into -account when configuring the workload parameters, setting memory cgroup limits, -or deciding where to place the workload within a compute cluster. - -It is enabled by CONFIG_IDLE_PAGE_TRACKING=y. - -.. _user_api: - -User API -======== - -The idle page tracking API is located at ``/sys/kernel/mm/page_idle``. -Currently, it consists of the only read-write file, -``/sys/kernel/mm/page_idle/bitmap``. - -The file implements a bitmap where each bit corresponds to a memory page. The -bitmap is represented by an array of 8-byte integers, and the page at PFN #i is -mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is -set, the corresponding page is idle. - -A page is considered idle if it has not been accessed since it was marked idle -(for more details on what "accessed" actually means see the :ref:`Implementation -Details ` section). -To mark a page idle one has to set the bit corresponding to -the page by writing to the file. A value written to the file is OR-ed with the -current bitmap value. - -Only accesses to user memory pages are tracked. These are pages mapped to a -process address space, page cache and buffer pages, swap cache pages. For other -page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored, -and hence such pages are never reported idle. - -For huge pages the idle flag is set only on the head page, so one has to read -``/proc/kpageflags`` in order to correctly count idle huge pages. - -Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return --EINVAL if you are not starting the read/write on an 8-byte boundary, or -if the size of the read/write is not a multiple of 8 bytes. Writing to -this file beyond max PFN will return -ENXIO. - -That said, in order to estimate the amount of pages that are not used by a -workload one should: - - 1. Mark all the workload's pages as idle by setting corresponding bits in - ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading - ``/proc/pid/pagemap`` if the workload is represented by a process, or by - filtering out alien pages using ``/proc/kpagecgroup`` in case the workload - is placed in a memory cgroup. - - 2. Wait until the workload accesses its working set. - - 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set. - If one wants to ignore certain types of pages, e.g. mlocked pages since they - are not reclaimable, he or she can filter them out using - ``/proc/kpageflags``. - -See Documentation/vm/pagemap.rst for more information about -``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``. - -.. _impl_details: - -Implementation Details -====================== - -The kernel internally keeps track of accesses to user memory pages in order to -reclaim unreferenced pages first on memory shortage conditions. A page is -considered referenced if it has been recently accessed via a process address -space, in which case one or more PTEs it is mapped to will have the Accessed bit -set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The -latter happens when: - - - a userspace process reads or writes a page using a system call (e.g. read(2) - or write(2)) - - - a page that is used for storing filesystem buffers is read or written, - because a process needs filesystem metadata stored in it (e.g. lists a - directory tree) - - - a page is accessed by a device driver using get_user_pages() - -When a dirty page is written to swap or disk as a result of memory reclaim or -exceeding the dirty memory limit, it is not marked referenced. - -The idle memory tracking feature adds a new page flag, the Idle flag. This flag -is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the -:ref:`User API ` -section), and cleared automatically whenever a page is referenced as defined -above. - -When a page is marked idle, the Accessed bit must be cleared in all PTEs it is -mapped to, otherwise we will not be able to detect accesses to the page coming -from a process address space. To avoid interference with the reclaimer, which, -as noted above, uses the Accessed bit to promote actively referenced pages, one -more page flag is introduced, the Young flag. When the PTE Accessed bit is -cleared as a result of setting or updating a page's Idle flag, the Young flag -is set on the page. The reclaimer treats the Young flag as an extra PTE -Accessed bit and therefore will consider such a page as referenced. - -Since the idle memory tracking feature is based on the memory reclaimer logic, -it only works with pages that are on an LRU list, other pages are silently -ignored. That means it will ignore a user memory page if it is isolated, but -since there are usually not many of them, it should not affect the overall -result noticeably. In order not to stall scanning of the idle page bitmap, -locked pages may be skipped too. diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index 6c451421a01e..ed58cb9f9675 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -13,15 +13,10 @@ various features of the Linux memory management .. toctree:: :maxdepth: 1 - hugetlbpage - idle_page_tracking ksm numa_memory_policy - pagemap transhuge - soft-dirty swap_numa - userfaultfd zswap Kernel developers MM documentation diff --git a/Documentation/vm/pagemap.rst b/Documentation/vm/pagemap.rst deleted file mode 100644 index 7ba8cbd57ad3..000000000000 --- a/Documentation/vm/pagemap.rst +++ /dev/null @@ -1,197 +0,0 @@ -.. _pagemap: - -============================= -Examining Process Page Tables -============================= - -pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow -userspace programs to examine the page tables and related information by -reading files in ``/proc``. - -There are four components to pagemap: - - * ``/proc/pid/pagemap``. This file lets a userspace process find out which - physical frame each virtual page is mapped to. It contains one 64-bit - value for each virtual page, containing the following data (from - ``fs/proc/task_mmu.c``, above pagemap_read): - - * Bits 0-54 page frame number (PFN) if present - * Bits 0-4 swap type if swapped - * Bits 5-54 swap offset if swapped - * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.rst) - * Bit 56 page exclusively mapped (since 4.2) - * Bits 57-60 zero - * Bit 61 page is file-page or shared-anon (since 3.5) - * Bit 62 page swapped - * Bit 63 page present - - Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs. - In 4.0 and 4.1 opens by unprivileged fail with -EPERM. Starting from - 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN. - Reason: information about PFNs helps in exploiting Rowhammer vulnerability. - - If the page is not present but in swap, then the PFN contains an - encoding of the swap file number and the page's offset into the - swap. Unmapped pages return a null PFN. This allows determining - precisely which pages are mapped (or in swap) and comparing mapped - pages between processes. - - Efficient users of this interface will use ``/proc/pid/maps`` to - determine which areas of memory are actually mapped and llseek to - skip over unmapped regions. - - * ``/proc/kpagecount``. This file contains a 64-bit count of the number of - times each page is mapped, indexed by PFN. - - * ``/proc/kpageflags``. This file contains a 64-bit set of flags for each - page, indexed by PFN. - - The flags are (from ``fs/proc/page.c``, above kpageflags_read): - - 0. LOCKED - 1. ERROR - 2. REFERENCED - 3. UPTODATE - 4. DIRTY - 5. LRU - 6. ACTIVE - 7. SLAB - 8. WRITEBACK - 9. RECLAIM - 10. BUDDY - 11. MMAP - 12. ANON - 13. SWAPCACHE - 14. SWAPBACKED - 15. COMPOUND_HEAD - 16. COMPOUND_TAIL - 17. HUGE - 18. UNEVICTABLE - 19. HWPOISON - 20. NOPAGE - 21. KSM - 22. THP - 23. BALLOON - 24. ZERO_PAGE - 25. IDLE - - * ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the - memory cgroup each page is charged to, indexed by PFN. Only available when - CONFIG_MEMCG is set. - -Short descriptions to the page flags -==================================== - -0 - LOCKED - page is being locked for exclusive access, e.g. by undergoing read/write IO -7 - SLAB - page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator - When compound page is used, SLUB/SLQB will only set this flag on the head - page; SLOB will not flag it at all. -10 - BUDDY - a free memory block managed by the buddy system allocator - The buddy system organizes free memory in blocks of various orders. - An order N block has 2^N physically contiguous pages, with the BUDDY flag - set for and _only_ for the first page. -15 - COMPOUND_HEAD - A compound page with order N consists of 2^N physically contiguous pages. - A compound page with order 2 takes the form of "HTTT", where H donates its - head page and T donates its tail page(s). The major consumers of compound - pages are hugeTLB pages (Documentation/vm/hugetlbpage.rst), the SLUB etc. - memory allocators and various device drivers. However in this interface, - only huge/giga pages are made visible to end users. -16 - COMPOUND_TAIL - A compound page tail (see description above). -17 - HUGE - this is an integral part of a HugeTLB page -19 - HWPOISON - hardware detected memory corruption on this page: don't touch the data! -20 - NOPAGE - no page frame exists at the requested address -21 - KSM - identical memory pages dynamically shared between one or more processes -22 - THP - contiguous pages which construct transparent hugepages -23 - BALLOON - balloon compaction page -24 - ZERO_PAGE - zero page for pfn_zero or huge_zero page -25 - IDLE - page has not been accessed since it was marked idle (see - Documentation/vm/idle_page_tracking.rst). Note that this flag may be - stale in case the page was accessed via a PTE. To make sure the flag - is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first. - -IO related page flags ---------------------- - -1 - ERROR - IO error occurred -3 - UPTODATE - page has up-to-date data - ie. for file backed page: (in-memory data revision >= on-disk one) -4 - DIRTY - page has been written to, hence contains new data - i.e. for file backed page: (in-memory data revision > on-disk one) -8 - WRITEBACK - page is being synced to disk - -LRU related page flags ----------------------- - -5 - LRU - page is in one of the LRU lists -6 - ACTIVE - page is in the active LRU list -18 - UNEVICTABLE - page is in the unevictable (non-)LRU list It is somehow pinned and - not a candidate for LRU page reclaims, e.g. ramfs pages, - shmctl(SHM_LOCK) and mlock() memory segments -2 - REFERENCED - page has been referenced since last LRU list enqueue/requeue -9 - RECLAIM - page will be reclaimed soon after its pageout IO completed -11 - MMAP - a memory mapped page -12 - ANON - a memory mapped page that is not part of a file -13 - SWAPCACHE - page is mapped to swap space, i.e. has an associated swap entry -14 - SWAPBACKED - page is backed by swap/RAM - -The page-types tool in the tools/vm directory can be used to query the -above flags. - -Using pagemap to do something useful -==================================== - -The general procedure for using pagemap to find out about a process' memory -usage goes like this: - - 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are - mapped to what. - 2. Select the maps you are interested in -- all of them, or a particular - library, or the stack or the heap, etc. - 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine. - 4. Read a u64 for each page from pagemap. - 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you - just read, seek to that entry in the file, and read the data you want. - -For example, to find the "unique set size" (USS), which is the amount of -memory that a process is using that is not shared with any other process, -you can go through every map in the process, find the PFNs, look those up -in kpagecount, and tally up the number of pages that are only referenced -once. - -Other notes -=========== - -Reading from any of the files will return -EINVAL if you are not starting -the read on an 8-byte boundary (e.g., if you sought an odd number of bytes -into the file), or if the size of the read is not a multiple of 8 bytes. - -Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is -always 12 at most architectures). Since Linux 3.11 their meaning changes -after first clear of soft-dirty bits. Since Linux 4.2 they are used for -flags unconditionally. diff --git a/Documentation/vm/soft-dirty.rst b/Documentation/vm/soft-dirty.rst deleted file mode 100644 index cb0cfd6672fa..000000000000 --- a/Documentation/vm/soft-dirty.rst +++ /dev/null @@ -1,47 +0,0 @@ -.. _soft_dirty: - -=============== -Soft-Dirty PTEs -=============== - -The soft-dirty is a bit on a PTE which helps to track which pages a task -writes to. In order to do this tracking one should - - 1. Clear soft-dirty bits from the task's PTEs. - - This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the - task in question. - - 2. Wait some time. - - 3. Read soft-dirty bits from the PTEs. - - This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the - 64-bit qword is the soft-dirty one. If set, the respective PTE was - written to since step 1. - - -Internally, to do this tracking, the writable bit is cleared from PTEs -when the soft-dirty bit is cleared. So, after this, when the task tries to -modify a page at some virtual address the #PF occurs and the kernel sets -the soft-dirty bit on the respective PTE. - -Note, that although all the task's address space is marked as r/o after the -soft-dirty bits clear, the #PF-s that occur after that are processed fast. -This is so, since the pages are still mapped to physical memory, and thus all -the kernel does is finds this fact out and puts both writable and soft-dirty -bits on the PTE. - -While in most cases tracking memory changes by #PF-s is more than enough -there is still a scenario when we can lose soft dirty bits -- a task -unmaps a previously mapped memory region and then maps a new one at exactly -the same place. When unmap is called, the kernel internally clears PTE values -including soft dirty bits. To notify user space application about such -memory region renewal the kernel always marks new memory regions (and -expanded regions) as soft dirty. - -This feature is actively used by the checkpoint-restore project. You -can find more details about it on http://criu.org - - --- Pavel Emelyanov, Apr 9, 2013 diff --git a/Documentation/vm/userfaultfd.rst b/Documentation/vm/userfaultfd.rst deleted file mode 100644 index 5048cf661a8a..000000000000 --- a/Documentation/vm/userfaultfd.rst +++ /dev/null @@ -1,241 +0,0 @@ -.. _userfaultfd: - -=========== -Userfaultfd -=========== - -Objective -========= - -Userfaults allow the implementation of on-demand paging from userland -and more generally they allow userland to take control of various -memory page faults, something otherwise only the kernel code could do. - -For example userfaults allows a proper and more optimal implementation -of the PROT_NONE+SIGSEGV trick. - -Design -====== - -Userfaults are delivered and resolved through the userfaultfd syscall. - -The userfaultfd (aside from registering and unregistering virtual -memory ranges) provides two primary functionalities: - -1) read/POLLIN protocol to notify a userland thread of the faults - happening - -2) various UFFDIO_* ioctls that can manage the virtual memory regions - registered in the userfaultfd that allows userland to efficiently - resolve the userfaults it receives via 1) or to manage the virtual - memory in the background - -The real advantage of userfaults if compared to regular virtual memory -management of mremap/mprotect is that the userfaults in all their -operations never involve heavyweight structures like vmas (in fact the -userfaultfd runtime load never takes the mmap_sem for writing). - -Vmas are not suitable for page- (or hugepage) granular fault tracking -when dealing with virtual address spaces that could span -Terabytes. Too many vmas would be needed for that. - -The userfaultfd once opened by invoking the syscall, can also be -passed using unix domain sockets to a manager process, so the same -manager process could handle the userfaults of a multitude of -different processes without them being aware about what is going on -(well of course unless they later try to use the userfaultfd -themselves on the same region the manager is already tracking, which -is a corner case that would currently return -EBUSY). - -API -=== - -When first opened the userfaultfd must be enabled invoking the -UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or -a later API version) which will specify the read/POLLIN protocol -userland intends to speak on the UFFD and the uffdio_api.features -userland requires. The UFFDIO_API ioctl if successful (i.e. if the -requested uffdio_api.api is spoken also by the running kernel and the -requested features are going to be enabled) will return into -uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of -respectively all the available features of the read(2) protocol and -the generic ioctl available. - -The uffdio_api.features bitmask returned by the UFFDIO_API ioctl -defines what memory types are supported by the userfaultfd and what -events, except page fault notifications, may be generated. - -If the kernel supports registering userfaultfd ranges on hugetlbfs -virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in -uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be -set if the kernel supports registering userfaultfd ranges on shared -memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero -MAP_SHARED, memfd_create, etc). - -The userland application that wants to use userfaultfd with hugetlbfs -or shared memory need to set the corresponding flag in -uffdio_api.features to enable those features. - -If the userland desires to receive notifications for events other than -page faults, it has to verify that uffdio_api.features has appropriate -UFFD_FEATURE_EVENT_* bits set. These events are described in more -detail below in "Non-cooperative userfaultfd" section. - -Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should -be invoked (if present in the returned uffdio_api.ioctls bitmask) to -register a memory range in the userfaultfd by setting the -uffdio_register structure accordingly. The uffdio_register.mode -bitmask will specify to the kernel which kind of faults to track for -the range (UFFDIO_REGISTER_MODE_MISSING would track missing -pages). The UFFDIO_REGISTER ioctl will return the -uffdio_register.ioctls bitmask of ioctls that are suitable to resolve -userfaults on the range registered. Not all ioctls will necessarily be -supported for all memory types depending on the underlying virtual -memory backend (anonymous memory vs tmpfs vs real filebacked -mappings). - -Userland can use the uffdio_register.ioctls to manage the virtual -address space in the background (to add or potentially also remove -memory from the userfaultfd registered range). This means a userfault -could be triggering just before userland maps in the background the -user-faulted page. - -The primary ioctl to resolve userfaults is UFFDIO_COPY. That -atomically copies a page into the userfault registered range and wakes -up the blocked userfaults (unless uffdio_copy.mode & -UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to -UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an -half copied page since it'll keep userfaulting until the copy has -finished. - -QEMU/KVM -======== - -QEMU/KVM is using the userfaultfd syscall to implement postcopy live -migration. Postcopy live migration is one form of memory -externalization consisting of a virtual machine running with part or -all of its memory residing on a different node in the cloud. The -userfaultfd abstraction is generic enough that not a single line of -KVM kernel code had to be modified in order to add postcopy live -migration to QEMU. - -Guest async page faults, FOLL_NOWAIT and all other GUP features work -just fine in combination with userfaults. Userfaults trigger async -page faults in the guest scheduler so those guest processes that -aren't waiting for userfaults (i.e. network bound) can keep running in -the guest vcpus. - -It is generally beneficial to run one pass of precopy live migration -just before starting postcopy live migration, in order to avoid -generating userfaults for readonly guest regions. - -The implementation of postcopy live migration currently uses one -single bidirectional socket but in the future two different sockets -will be used (to reduce the latency of the userfaults to the minimum -possible without having to decrease /proc/sys/net/ipv4/tcp_wmem). - -The QEMU in the source node writes all pages that it knows are missing -in the destination node, into the socket, and the migration thread of -the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE -ioctls on the userfaultfd in order to map the received pages into the -guest (UFFDIO_ZEROCOPY is used if the source page was a zero page). - -A different postcopy thread in the destination node listens with -poll() to the userfaultfd in parallel. When a POLLIN event is -generated after a userfault triggers, the postcopy thread read() from -the userfaultfd and receives the fault address (or -EAGAIN in case the -userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run -by the parallel QEMU migration thread). - -After the QEMU postcopy thread (running in the destination node) gets -the userfault address it writes the information about the missing page -into the socket. The QEMU source node receives the information and -roughly "seeks" to that page address and continues sending all -remaining missing pages from that new page offset. Soon after that -(just the time to flush the tcp_wmem queue through the network) the -migration thread in the QEMU running in the destination node will -receive the page that triggered the userfault and it'll map it as -usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it -was spontaneously sent by the source or if it was an urgent page -requested through a userfault). - -By the time the userfaults start, the QEMU in the destination node -doesn't need to keep any per-page state bitmap relative to the live -migration around and a single per-page bitmap has to be maintained in -the QEMU running in the source node to know which pages are still -missing in the destination node. The bitmap in the source node is -checked to find which missing pages to send in round robin and we seek -over it when receiving incoming userfaults. After sending each page of -course the bitmap is updated accordingly. It's also useful to avoid -sending the same page twice (in case the userfault is read by the -postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration -thread). - -Non-cooperative userfaultfd -=========================== - -When the userfaultfd is monitored by an external manager, the manager -must be able to track changes in the process virtual memory -layout. Userfaultfd can notify the manager about such changes using -the same read(2) protocol as for the page fault notifications. The -manager has to explicitly enable these events by setting appropriate -bits in uffdio_api.features passed to UFFDIO_API ioctl: - -UFFD_FEATURE_EVENT_FORK - enable userfaultfd hooks for fork(). When this feature is - enabled, the userfaultfd context of the parent process is - duplicated into the newly created process. The manager - receives UFFD_EVENT_FORK with file descriptor of the new - userfaultfd context in the uffd_msg.fork. - -UFFD_FEATURE_EVENT_REMAP - enable notifications about mremap() calls. When the - non-cooperative process moves a virtual memory area to a - different location, the manager will receive - UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and - new addresses of the area and its original length. - -UFFD_FEATURE_EVENT_REMOVE - enable notifications about madvise(MADV_REMOVE) and - madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will - be generated upon these calls to madvise. The uffd_msg.remove - will contain start and end addresses of the removed area. - -UFFD_FEATURE_EVENT_UNMAP - enable notifications about memory unmapping. The manager will - get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and - end addresses of the unmapped area. - -Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP -are pretty similar, they quite differ in the action expected from the -userfaultfd manager. In the former case, the virtual memory is -removed, but the area is not, the area remains monitored by the -userfaultfd, and if a page fault occurs in that area it will be -delivered to the manager. The proper resolution for such page fault is -to zeromap the faulting address. However, in the latter case, when an -area is unmapped, either explicitly (with munmap() system call), or -implicitly (e.g. during mremap()), the area is removed and in turn the -userfaultfd context for such area disappears too and the manager will -not get further userland page faults from the removed area. Still, the -notification is required in order to prevent manager from using -UFFDIO_COPY on the unmapped area. - -Unlike userland page faults which have to be synchronous and require -explicit or implicit wakeup, all the events are delivered -asynchronously and the non-cooperative process resumes execution as -soon as manager executes read(). The userfaultfd manager should -carefully synchronize calls to UFFDIO_COPY with the events -processing. To aid the synchronization, the UFFDIO_COPY ioctl will -return -ENOSPC when the monitored process exits at the time of -UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed -its virtual memory layout simultaneously with outstanding UFFDIO_COPY -operation. - -The current asynchronous model of the event delivery is optimal for -single threaded non-cooperative userfaultfd manager implementations. A -synchronous event delivery model can be added later as a new -userfaultfd feature to facilitate multithreading enhancements of the -non cooperative manager, for example to allow UFFDIO_COPY ioctls to -run in parallel to the event reception. Single threaded -implementations should continue to use the current async event -delivery model instead. -- cgit v1.2.3