diff options
Diffstat (limited to 'Documentation/vm/userfaultfd.rst')
-rw-r--r-- | Documentation/vm/userfaultfd.rst | 241 |
1 files changed, 0 insertions, 241 deletions
diff --git a/Documentation/vm/userfaultfd.rst b/Documentation/vm/userfaultfd.rst deleted file mode 100644 index 5048cf661a8a..000000000000 --- a/Documentation/vm/userfaultfd.rst +++ /dev/null @@ -1,241 +0,0 @@ -.. _userfaultfd: - -=========== -Userfaultfd -=========== - -Objective -========= - -Userfaults allow the implementation of on-demand paging from userland -and more generally they allow userland to take control of various -memory page faults, something otherwise only the kernel code could do. - -For example userfaults allows a proper and more optimal implementation -of the PROT_NONE+SIGSEGV trick. - -Design -====== - -Userfaults are delivered and resolved through the userfaultfd syscall. - -The userfaultfd (aside from registering and unregistering virtual -memory ranges) provides two primary functionalities: - -1) read/POLLIN protocol to notify a userland thread of the faults - happening - -2) various UFFDIO_* ioctls that can manage the virtual memory regions - registered in the userfaultfd that allows userland to efficiently - resolve the userfaults it receives via 1) or to manage the virtual - memory in the background - -The real advantage of userfaults if compared to regular virtual memory -management of mremap/mprotect is that the userfaults in all their -operations never involve heavyweight structures like vmas (in fact the -userfaultfd runtime load never takes the mmap_sem for writing). - -Vmas are not suitable for page- (or hugepage) granular fault tracking -when dealing with virtual address spaces that could span -Terabytes. Too many vmas would be needed for that. - -The userfaultfd once opened by invoking the syscall, can also be -passed using unix domain sockets to a manager process, so the same -manager process could handle the userfaults of a multitude of -different processes without them being aware about what is going on -(well of course unless they later try to use the userfaultfd -themselves on the same region the manager is already tracking, which -is a corner case that would currently return -EBUSY). - -API -=== - -When first opened the userfaultfd must be enabled invoking the -UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or -a later API version) which will specify the read/POLLIN protocol -userland intends to speak on the UFFD and the uffdio_api.features -userland requires. The UFFDIO_API ioctl if successful (i.e. if the -requested uffdio_api.api is spoken also by the running kernel and the -requested features are going to be enabled) will return into -uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of -respectively all the available features of the read(2) protocol and -the generic ioctl available. - -The uffdio_api.features bitmask returned by the UFFDIO_API ioctl -defines what memory types are supported by the userfaultfd and what -events, except page fault notifications, may be generated. - -If the kernel supports registering userfaultfd ranges on hugetlbfs -virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in -uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be -set if the kernel supports registering userfaultfd ranges on shared -memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero -MAP_SHARED, memfd_create, etc). - -The userland application that wants to use userfaultfd with hugetlbfs -or shared memory need to set the corresponding flag in -uffdio_api.features to enable those features. - -If the userland desires to receive notifications for events other than -page faults, it has to verify that uffdio_api.features has appropriate -UFFD_FEATURE_EVENT_* bits set. These events are described in more -detail below in "Non-cooperative userfaultfd" section. - -Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should -be invoked (if present in the returned uffdio_api.ioctls bitmask) to -register a memory range in the userfaultfd by setting the -uffdio_register structure accordingly. The uffdio_register.mode -bitmask will specify to the kernel which kind of faults to track for -the range (UFFDIO_REGISTER_MODE_MISSING would track missing -pages). The UFFDIO_REGISTER ioctl will return the -uffdio_register.ioctls bitmask of ioctls that are suitable to resolve -userfaults on the range registered. Not all ioctls will necessarily be -supported for all memory types depending on the underlying virtual -memory backend (anonymous memory vs tmpfs vs real filebacked -mappings). - -Userland can use the uffdio_register.ioctls to manage the virtual -address space in the background (to add or potentially also remove -memory from the userfaultfd registered range). This means a userfault -could be triggering just before userland maps in the background the -user-faulted page. - -The primary ioctl to resolve userfaults is UFFDIO_COPY. That -atomically copies a page into the userfault registered range and wakes -up the blocked userfaults (unless uffdio_copy.mode & -UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to -UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an -half copied page since it'll keep userfaulting until the copy has -finished. - -QEMU/KVM -======== - -QEMU/KVM is using the userfaultfd syscall to implement postcopy live -migration. Postcopy live migration is one form of memory -externalization consisting of a virtual machine running with part or -all of its memory residing on a different node in the cloud. The -userfaultfd abstraction is generic enough that not a single line of -KVM kernel code had to be modified in order to add postcopy live -migration to QEMU. - -Guest async page faults, FOLL_NOWAIT and all other GUP features work -just fine in combination with userfaults. Userfaults trigger async -page faults in the guest scheduler so those guest processes that -aren't waiting for userfaults (i.e. network bound) can keep running in -the guest vcpus. - -It is generally beneficial to run one pass of precopy live migration -just before starting postcopy live migration, in order to avoid -generating userfaults for readonly guest regions. - -The implementation of postcopy live migration currently uses one -single bidirectional socket but in the future two different sockets -will be used (to reduce the latency of the userfaults to the minimum -possible without having to decrease /proc/sys/net/ipv4/tcp_wmem). - -The QEMU in the source node writes all pages that it knows are missing -in the destination node, into the socket, and the migration thread of -the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE -ioctls on the userfaultfd in order to map the received pages into the -guest (UFFDIO_ZEROCOPY is used if the source page was a zero page). - -A different postcopy thread in the destination node listens with -poll() to the userfaultfd in parallel. When a POLLIN event is -generated after a userfault triggers, the postcopy thread read() from -the userfaultfd and receives the fault address (or -EAGAIN in case the -userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run -by the parallel QEMU migration thread). - -After the QEMU postcopy thread (running in the destination node) gets -the userfault address it writes the information about the missing page -into the socket. The QEMU source node receives the information and -roughly "seeks" to that page address and continues sending all -remaining missing pages from that new page offset. Soon after that -(just the time to flush the tcp_wmem queue through the network) the -migration thread in the QEMU running in the destination node will -receive the page that triggered the userfault and it'll map it as -usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it -was spontaneously sent by the source or if it was an urgent page -requested through a userfault). - -By the time the userfaults start, the QEMU in the destination node -doesn't need to keep any per-page state bitmap relative to the live -migration around and a single per-page bitmap has to be maintained in -the QEMU running in the source node to know which pages are still -missing in the destination node. The bitmap in the source node is -checked to find which missing pages to send in round robin and we seek -over it when receiving incoming userfaults. After sending each page of -course the bitmap is updated accordingly. It's also useful to avoid -sending the same page twice (in case the userfault is read by the -postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration -thread). - -Non-cooperative userfaultfd -=========================== - -When the userfaultfd is monitored by an external manager, the manager -must be able to track changes in the process virtual memory -layout. Userfaultfd can notify the manager about such changes using -the same read(2) protocol as for the page fault notifications. The -manager has to explicitly enable these events by setting appropriate -bits in uffdio_api.features passed to UFFDIO_API ioctl: - -UFFD_FEATURE_EVENT_FORK - enable userfaultfd hooks for fork(). When this feature is - enabled, the userfaultfd context of the parent process is - duplicated into the newly created process. The manager - receives UFFD_EVENT_FORK with file descriptor of the new - userfaultfd context in the uffd_msg.fork. - -UFFD_FEATURE_EVENT_REMAP - enable notifications about mremap() calls. When the - non-cooperative process moves a virtual memory area to a - different location, the manager will receive - UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and - new addresses of the area and its original length. - -UFFD_FEATURE_EVENT_REMOVE - enable notifications about madvise(MADV_REMOVE) and - madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will - be generated upon these calls to madvise. The uffd_msg.remove - will contain start and end addresses of the removed area. - -UFFD_FEATURE_EVENT_UNMAP - enable notifications about memory unmapping. The manager will - get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and - end addresses of the unmapped area. - -Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP -are pretty similar, they quite differ in the action expected from the -userfaultfd manager. In the former case, the virtual memory is -removed, but the area is not, the area remains monitored by the -userfaultfd, and if a page fault occurs in that area it will be -delivered to the manager. The proper resolution for such page fault is -to zeromap the faulting address. However, in the latter case, when an -area is unmapped, either explicitly (with munmap() system call), or -implicitly (e.g. during mremap()), the area is removed and in turn the -userfaultfd context for such area disappears too and the manager will -not get further userland page faults from the removed area. Still, the -notification is required in order to prevent manager from using -UFFDIO_COPY on the unmapped area. - -Unlike userland page faults which have to be synchronous and require -explicit or implicit wakeup, all the events are delivered -asynchronously and the non-cooperative process resumes execution as -soon as manager executes read(). The userfaultfd manager should -carefully synchronize calls to UFFDIO_COPY with the events -processing. To aid the synchronization, the UFFDIO_COPY ioctl will -return -ENOSPC when the monitored process exits at the time of -UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed -its virtual memory layout simultaneously with outstanding UFFDIO_COPY -operation. - -The current asynchronous model of the event delivery is optimal for -single threaded non-cooperative userfaultfd manager implementations. A -synchronous event delivery model can be added later as a new -userfaultfd feature to facilitate multithreading enhancements of the -non cooperative manager, for example to allow UFFDIO_COPY ioctls to -run in parallel to the event reception. Single threaded -implementations should continue to use the current async event -delivery model instead. |