From bffc33ec539699f045a9254144de3d4eace05f07 Mon Sep 17 00:00:00 2001 From: Jérôme Glisse Date: Fri, 8 Sep 2017 16:11:19 -0700 Subject: hmm: heterogeneous memory management documentation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "HMM (Heterogeneous Memory Management)", v25. Heterogeneous Memory Management (HMM) (description and justification) Today device driver expose dedicated memory allocation API through their device file, often relying on a combination of IOCTL and mmap calls. The device can only access and use memory allocated through this API. This effectively split the program address space into object allocated for the device and useable by the device and other regular memory (malloc, mmap of a file, share memory, â) only accessible by CPU (or in a very limited way by a device by pinning memory). Allowing different isolated component of a program to use a device thus require duplication of the input data structure using device memory allocator. This is reasonable for simple data structure (array, grid, image, â) but this get extremely complex with advance data structure (list, tree, graph, â) that rely on a web of memory pointers. This is becoming a serious limitation on the kind of work load that can be offloaded to device like GPU. New industry standard like C++, OpenCL or CUDA are pushing to remove this barrier. This require a shared address space between GPU device and CPU so that GPU can access any memory of a process (while still obeying memory protection like read only). This kind of feature is also appearing in various other operating systems. HMM is a set of helpers to facilitate several aspects of address space sharing and device memory management. Unlike existing sharing mechanism that rely on pining pages use by a device, HMM relies on mmu_notifier to propagate CPU page table update to device page table. Duplicating CPU page table is only one aspect necessary for efficiently using device like GPU. GPU local memory have bandwidth in the TeraBytes/ second range but they are connected to main memory through a system bus like PCIE that is limited to 32GigaBytes/second (PCIE 4.0 16x). Thus it is necessary to allow migration of process memory from main system memory to device memory. Issue is that on platform that only have PCIE the device memory is not accessible by the CPU with the same properties as main memory (cache coherency, atomic operations, ...). To allow migration from main memory to device memory HMM provides a set of helper to hotplug device memory as a new type of ZONE_DEVICE memory which is un-addressable by CPU but still has struct page representing it. This allow most of the core kernel logic that deals with a process memory to stay oblivious of the peculiarity of device memory. When page backing an address of a process is migrated to device memory the CPU page table entry is set to a new specific swap entry. CPU access to such address triggers a migration back to system memory, just like if the page was swap on disk. HMM also blocks any one from pinning a ZONE_DEVICE page so that it can always be migrated back to system memory if CPU access it. Conversely HMM does not migrate to device memory any page that is pin in system memory. To allow efficient migration between device memory and main memory a new migrate_vma() helpers is added with this patchset. It allows to leverage device DMA engine to perform the copy operation. This feature will be use by upstream driver like nouveau mlx5 and probably other in the future (amdgpu is next suspect in line). We are actively working on nouveau and mlx5 support. To test this patchset we also worked with NVidia close source driver team, they have more resources than us to test this kind of infrastructure and also a bigger and better userspace eco-system with various real industry workload they can be use to test and profile HMM. The expected workload is a program builds a data set on the CPU (from disk, from network, from sensors, â). Program uses GPU API (OpenCL, CUDA, ...) to give hint on memory placement for the input data and also for the output buffer. Program call GPU API to schedule a GPU job, this happens using device driver specific ioctl. All this is hidden from programmer point of view in case of C++ compiler that transparently offload some part of a program to GPU. Program can keep doing other stuff on the CPU while the GPU is crunching numbers. It is expected that CPU will not access the same data set as the GPU while GPU is working on it, but this is not mandatory. In fact we expect some small memory object to be actively access by both GPU and CPU concurrently as synchronization channel and/or for monitoring purposes. Such object will stay in system memory and should not be bottlenecked by system bus bandwidth (rare write and read access from both CPU and GPU). As we are relying on device driver API, HMM does not introduce any new syscall nor does it modify any existing ones. It does not change any POSIX semantics or behaviors. For instance the child after a fork of a process that is using HMM will not be impacted in anyway, nor is there any data hazard between child COW or parent COW of memory that was migrated to device prior to fork. HMM assume a numbers of hardware features. Device must allow device page table to be updated at any time (ie device job must be preemptable). Device page table must provides memory protection such as read only. Device must track write access (dirty bit). Device must have a minimum granularity that match PAGE_SIZE (ie 4k). Reviewer (just hint): Patch 1 HMM documentation Patch 2 introduce core infrastructure and definition of HMM, pretty small patch and easy to review Patch 3 introduce the mirror functionality of HMM, it relies on mmu_notifier and thus someone familiar with that part would be in better position to review Patch 4 is an helper to snapshot CPU page table while synchronizing with concurrent page table update. Understanding mmu_notifier makes review easier. Patch 5 is mostly a wrapper around handle_mm_fault() Patch 6 add new add_pages() helper to avoid modifying each arch memory hot plug function Patch 7 add a new memory type for ZONE_DEVICE and also add all the logic in various core mm to support this new type. Dan Williams and any core mm contributor are best people to review each half of this patchset Patch 8 special case HMM ZONE_DEVICE pages inside put_page() Kirill and Dan Williams are best person to review this Patch 9 allow to uncharge a page from memory group without using the lru list field of struct page (best reviewer: Johannes Weiner or Vladimir Davydov or Michal Hocko) Patch 10 Add support to uncharge ZONE_DEVICE page from a memory cgroup (best reviewer: Johannes Weiner or Vladimir Davydov or Michal Hocko) Patch 11 add helper to hotplug un-addressable device memory as new type of ZONE_DEVICE memory (new type introducted in patch 3 of this serie). This is boiler plate code around memory hotplug and it also pick a free range of physical address for the device memory. Note that the physical address do not point to anything (at least as far as the kernel knows). Patch 12 introduce a new hmm_device class as an helper for device driver that want to expose multiple device memory under a common fake device driver. This is usefull for multi-gpu configuration. Anyone familiar with device driver infrastructure can review this. Boiler plate code really. Patch 13 add a new migrate mode. Any one familiar with page migration is welcome to review. Patch 14 introduce a new migration helper (migrate_vma()) that allow to migrate a range of virtual address of a process using device DMA engine to perform the copy. It is not limited to do copy from and to device but can also do copy between any kind of source and destination memory. Again anyone familiar with migration code should be able to verify the logic. Patch 15 optimize the new migrate_vma() by unmapping pages while we are collecting them. This can be review by any mm folks. Patch 16 add unaddressable memory migration to helper introduced in patch 7, this can be review by anyone familiar with migration code Patch 17 add a feature that allow device to allocate non-present page on the GPU when migrating a range of address to device memory. This is an helper for device driver to avoid having to first allocate system memory before migration to device memory Patch 18 add a new kind of ZONE_DEVICE memory for cache coherent device memory (CDM) Patch 19 add an helper to hotplug CDM memory Previous patchset posting : v1 http://lwn.net/Articles/597289/ v2 https://lkml.org/lkml/2014/6/12/559 v3 https://lkml.org/lkml/2014/6/13/633 v4 https://lkml.org/lkml/2014/8/29/423 v5 https://lkml.org/lkml/2014/11/3/759 v6 http://lwn.net/Articles/619737/ v7 http://lwn.net/Articles/627316/ v8 https://lwn.net/Articles/645515/ v9 https://lwn.net/Articles/651553/ v10 https://lwn.net/Articles/654430/ v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424 v12 http://www.kernelhub.org/?msg=972982&p=2 v13 https://lwn.net/Articles/706856/ v14 https://lkml.org/lkml/2016/12/8/344 v15 http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1304107.html v16 http://www.spinics.net/lists/linux-mm/msg119814.html v17 https://lkml.org/lkml/2017/1/27/847 v18 https://lkml.org/lkml/2017/3/16/596 v19 https://lkml.org/lkml/2017/4/5/831 v20 https://lwn.net/Articles/720715/ v21 https://lkml.org/lkml/2017/4/24/747 v22 http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05176.html v23 https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1404788.html v24 https://lwn.net/Articles/726691/ This patch (of 19): This adds documentation for HMM (Heterogeneous Memory Management). It presents the motivation behind it, the features necessary for it to be useful and and gives an overview of how this is implemented. Link: http://lkml.kernel.org/r/20170817000548.32038-2-jglisse@redhat.com Signed-off-by: Jérôme Glisse Cc: John Hubbard Cc: Dan Williams Cc: David Nellans Cc: Balbir Singh Cc: Aneesh Kumar Cc: Benjamin Herrenschmidt Cc: Evgeny Baskakov Cc: Johannes Weiner Cc: Kirill A. Shutemov Cc: Mark Hairgrove Cc: Michal Hocko Cc: Paul E. McKenney Cc: Ross Zwisler Cc: Sherry Cheung Cc: Subhash Gutti Cc: Vladimir Davydov Cc: Bob Liu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/hmm.txt | 384 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 384 insertions(+) create mode 100644 Documentation/vm/hmm.txt (limited to 'Documentation') diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt new file mode 100644 index 000000000000..4d3aac9f4a5d --- /dev/null +++ b/Documentation/vm/hmm.txt @@ -0,0 +1,384 @@ +Heterogeneous Memory Management (HMM) + +Transparently allow any component of a program to use any memory region of said +program with a device without using device specific memory allocator. This is +becoming a requirement to simplify the use of advance heterogeneous computing +where GPU, DSP or FPGA are use to perform various computations. + +This document is divided as follow, in the first section i expose the problems +related to the use of a device specific allocator. The second section i expose +the hardware limitations that are inherent to many platforms. The third section +gives an overview of HMM designs. The fourth section explains how CPU page- +table mirroring works and what is HMM purpose in this context. Fifth section +deals with how device memory is represented inside the kernel. Finaly the last +section present the new migration helper that allow to leverage the device DMA +engine. + + +1) Problems of using device specific memory allocator: +2) System bus, device memory characteristics +3) Share address space and migration +4) Address space mirroring implementation and API +5) Represent and manage device memory from core kernel point of view +6) Migrate to and from device memory +7) Memory cgroup (memcg) and rss accounting + + +------------------------------------------------------------------------------- + +1) Problems of using device specific memory allocator: + +Device with large amount of on board memory (several giga bytes) like GPU have +historically manage their memory through dedicated driver specific API. This +creates a disconnect between memory allocated and managed by device driver and +regular application memory (private anonymous, share memory or regular file +back memory). From here on i will refer to this aspect as split address space. +I use share address space to refer to the opposite situation ie one in which +any memory region can be use by device transparently. + +Split address space because device can only access memory allocated through the +device specific API. This imply that all memory object in a program are not +equal from device point of view which complicate large program that rely on a +wide set of libraries. + +Concretly this means that code that wants to leverage device like GPU need to +copy object between genericly allocated memory (malloc, mmap private/share/) +and memory allocated through the device driver API (this still end up with an +mmap but of the device file). + +For flat dataset (array, grid, image, ...) this isn't too hard to achieve but +complex data-set (list, tree, ...) are hard to get right. Duplicating a complex +data-set need to re-map all the pointer relations between each of its elements. +This is error prone and program gets harder to debug because of the duplicate +data-set. + +Split address space also means that library can not transparently use data they +are getting from core program or other library and thus each library might have +to duplicate its input data-set using specific memory allocator. Large project +suffer from this and waste resources because of the various memory copy. + +Duplicating each library API to accept as input or output memory allocted by +each device specific allocator is not a viable option. It would lead to a +combinatorial explosions in the library entry points. + +Finaly with the advance of high level language constructs (in C++ but in other +language too) it is now possible for compiler to leverage GPU or other devices +without even the programmer knowledge. Some of compiler identified patterns are +only do-able with a share address. It is as well more reasonable to use a share +address space for all the other patterns. + + +------------------------------------------------------------------------------- + +2) System bus, device memory characteristics + +System bus cripple share address due to few limitations. Most system bus only +allow basic memory access from device to main memory, even cache coherency is +often optional. Access to device memory from CPU is even more limited, most +often than not it is not cache coherent. + +If we only consider the PCIE bus than device can access main memory (often +through an IOMMU) and be cache coherent with the CPUs. However it only allows +a limited set of atomic operation from device on main memory. This is worse +in the other direction the CPUs can only access a limited range of the device +memory and can not perform atomic operations on it. Thus device memory can not +be consider like regular memory from kernel point of view. + +Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0 +and 16 lanes). This is 33 times less that fastest GPU memory (1 TBytes/s). +The final limitation is latency, access to main memory from the device has an +order of magnitude higher latency than when the device access its own memory. + +Some platform are developing new system bus or additions/modifications to PCIE +to address some of those limitations (OpenCAPI, CCIX). They mainly allow two +way cache coherency between CPU and device and allow all atomic operations the +architecture supports. Saddly not all platform are following this trends and +some major architecture are left without hardware solutions to those problems. + +So for share address space to make sense not only we must allow device to +access any memory memory but we must also permit any memory to be migrated to +device memory while device is using it (blocking CPU access while it happens). + + +------------------------------------------------------------------------------- + +3) Share address space and migration + +HMM intends to provide two main features. First one is to share the address +space by duplication the CPU page table into the device page table so same +address point to same memory and this for any valid main memory address in +the process address space. + +To achieve this, HMM offer a set of helpers to populate the device page table +while keeping track of CPU page table updates. Device page table updates are +not as easy as CPU page table updates. To update the device page table you must +allow a buffer (or use a pool of pre-allocated buffer) and write GPU specifics +commands in it to perform the update (unmap, cache invalidations and flush, +...). This can not be done through common code for all device. Hence why HMM +provides helpers to factor out everything that can be while leaving the gory +details to the device driver. + +The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that does +allow to allocate a struct page for each page of the device memory. Those page +are special because the CPU can not map them. They however allow to migrate +main memory to device memory using exhisting migration mechanism and everything +looks like if page was swap out to disk from CPU point of view. Using a struct +page gives the easiest and cleanest integration with existing mm mechanisms. +Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory +for the device memory and second to perform migration. Policy decision of what +and when to migrate things is left to the device driver. + +Note that any CPU access to a device page trigger a page fault and a migration +back to main memory ie when a page backing an given address A is migrated from +a main memory page to a device page then any CPU access to address A trigger a +page fault and initiate a migration back to main memory. + + +With this two features, HMM not only allow a device to mirror a process address +space and keeps both CPU and device page table synchronize, but also allow to +leverage device memory by migrating part of data-set that is actively use by a +device. + + +------------------------------------------------------------------------------- + +4) Address space mirroring implementation and API + +Address space mirroring main objective is to allow to duplicate range of CPU +page table into a device page table and HMM helps keeping both synchronize. A +device driver that want to mirror a process address space must start with the +registration of an hmm_mirror struct: + + int hmm_mirror_register(struct hmm_mirror *mirror, + struct mm_struct *mm); + int hmm_mirror_register_locked(struct hmm_mirror *mirror, + struct mm_struct *mm); + +The locked variant is to be use when the driver is already holding the mmap_sem +of the mm in write mode. The mirror struct has a set of callback that are use +to propagate CPU page table: + + struct hmm_mirror_ops { + /* sync_cpu_device_pagetables() - synchronize page tables + * + * @mirror: pointer to struct hmm_mirror + * @update_type: type of update that occurred to the CPU page table + * @start: virtual start address of the range to update + * @end: virtual end address of the range to update + * + * This callback ultimately originates from mmu_notifiers when the CPU + * page table is updated. The device driver must update its page table + * in response to this callback. The update argument tells what action + * to perform. + * + * The device driver must not return from this callback until the device + * page tables are completely updated (TLBs flushed, etc); this is a + * synchronous call. + */ + void (*update)(struct hmm_mirror *mirror, + enum hmm_update action, + unsigned long start, + unsigned long end); + }; + +Device driver must perform update to the range following action (turn range +read only, or fully unmap, ...). Once driver callback returns the device must +be done with the update. + + +When device driver wants to populate a range of virtual address it can use +either: + int hmm_vma_get_pfns(struct vm_area_struct *vma, + struct hmm_range *range, + unsigned long start, + unsigned long end, + hmm_pfn_t *pfns); + int hmm_vma_fault(struct vm_area_struct *vma, + struct hmm_range *range, + unsigned long start, + unsigned long end, + hmm_pfn_t *pfns, + bool write, + bool block); + +First one (hmm_vma_get_pfns()) will only fetch present CPU page table entry and +will not trigger a page fault on missing or non present entry. The second one +do trigger page fault on missing or read only entry if write parameter is true. +Page fault use the generic mm page fault code path just like a CPU page fault. + +Both function copy CPU page table into their pfns array argument. Each entry in +that array correspond to an address in the virtual range. HMM provide a set of +flags to help driver identify special CPU page table entries. + +Locking with the update() callback is the most important aspect the driver must +respect in order to keep things properly synchronize. The usage pattern is : + + int driver_populate_range(...) + { + struct hmm_range range; + ... + again: + ret = hmm_vma_get_pfns(vma, &range, start, end, pfns); + if (ret) + return ret; + take_lock(driver->update); + if (!hmm_vma_range_done(vma, &range)) { + release_lock(driver->update); + goto again; + } + + // Use pfns array content to update device page table + + release_lock(driver->update); + return 0; + } + +The driver->update lock is the same lock that driver takes inside its update() +callback. That lock must be call before hmm_vma_range_done() to avoid any race +with a concurrent CPU page table update. + +HMM implements all this on top of the mmu_notifier API because we wanted to a +simpler API and also to be able to perform optimization latter own like doing +concurrent device update in multi-devices scenario. + +HMM also serve as an impedence missmatch between how CPU page table update are +done (by CPU write to the page table and TLB flushes) from how device update +their own page table. Device update is a multi-step process, first appropriate +commands are write to a buffer, then this buffer is schedule for execution on +the device. It is only once the device has executed commands in the buffer that +the update is done. Creating and scheduling update command buffer can happen +concurrently for multiple devices. Waiting for each device to report commands +as executed is serialize (there is no point in doing this concurrently). + + +------------------------------------------------------------------------------- + +5) Represent and manage device memory from core kernel point of view + +Several differents design were try to support device memory. First one use +device specific data structure to keep information about migrated memory and +HMM hooked itself in various place of mm code to handle any access to address +that were back by device memory. It turns out that this ended up replicating +most of the fields of struct page and also needed many kernel code path to be +updated to understand this new kind of memory. + +Thing is most kernel code path never try to access the memory behind a page +but only care about struct page contents. Because of this HMM switchted to +directly using struct page for device memory which left most kernel code path +un-aware of the difference. We only need to make sure that no one ever try to +map those page from the CPU side. + +HMM provide a set of helpers to register and hotplug device memory as a new +region needing struct page. This is offer through a very simple API: + + struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, + struct device *device, + unsigned long size); + void hmm_devmem_remove(struct hmm_devmem *devmem); + +The hmm_devmem_ops is where most of the important things are: + + struct hmm_devmem_ops { + void (*free)(struct hmm_devmem *devmem, struct page *page); + int (*fault)(struct hmm_devmem *devmem, + struct vm_area_struct *vma, + unsigned long addr, + struct page *page, + unsigned flags, + pmd_t *pmdp); + }; + +The first callback (free()) happens when the last reference on a device page is +drop. This means the device page is now free and no longer use by anyone. The +second callback happens whenever CPU try to access a device page which it can +not do. This second callback must trigger a migration back to system memory. + + +------------------------------------------------------------------------------- + +6) Migrate to and from device memory + +Because CPU can not access device memory, migration must use device DMA engine +to perform copy from and to device memory. For this we need a new migration +helper: + + int migrate_vma(const struct migrate_vma_ops *ops, + struct vm_area_struct *vma, + unsigned long mentries, + unsigned long start, + unsigned long end, + unsigned long *src, + unsigned long *dst, + void *private); + +Unlike other migration function it works on a range of virtual address, there +is two reasons for that. First device DMA copy has a high setup overhead cost +and thus batching multiple pages is needed as otherwise the migration overhead +make the whole excersie pointless. The second reason is because driver trigger +such migration base on range of address the device is actively accessing. + +The migrate_vma_ops struct define two callbacks. First one (alloc_and_copy()) +control destination memory allocation and copy operation. Second one is there +to allow device driver to perform cleanup operation after migration. + + struct migrate_vma_ops { + void (*alloc_and_copy)(struct vm_area_struct *vma, + const unsigned long *src, + unsigned long *dst, + unsigned long start, + unsigned long end, + void *private); + void (*finalize_and_map)(struct vm_area_struct *vma, + const unsigned long *src, + const unsigned long *dst, + unsigned long start, + unsigned long end, + void *private); + }; + +It is important to stress that this migration helpers allow for hole in the +virtual address range. Some pages in the range might not be migrated for all +the usual reasons (page is pin, page is lock, ...). This helper does not fail +but just skip over those pages. + +The alloc_and_copy() might as well decide to not migrate all pages in the +range (for reasons under the callback control). For those the callback just +have to leave the corresponding dst entry empty. + +Finaly the migration of the struct page might fails (for file back page) for +various reasons (failure to freeze reference, or update page cache, ...). If +that happens then the finalize_and_map() can catch any pages that was not +migrated. Note those page were still copied to new page and thus we wasted +bandwidth but this is considered as a rare event and a price that we are +willing to pay to keep all the code simpler. + + +------------------------------------------------------------------------------- + +7) Memory cgroup (memcg) and rss accounting + +For now device memory is accounted as any regular page in rss counters (either +anonymous if device page is use for anonymous, file if device page is use for +file back page or shmem if device page is use for share memory). This is a +deliberate choice to keep existing application that might start using device +memory without knowing about it to keep runing unimpacted. + +Drawbacks is that OOM killer might kill an application using a lot of device +memory and not a lot of regular system memory and thus not freeing much system +memory. We want to gather more real world experience on how application and +system react under memory pressure in the presence of device memory before +deciding to account device memory differently. + + +Same decision was made for memory cgroup. Device memory page are accounted +against same memory cgroup a regular page would be accounted to. This does +simplify migration to and from device memory. This also means that migration +back from device memory to regular memory can not fail because it would +go above memory cgroup limit. We might revisit this choice latter on once we +get more experience in how device memory is use and its impact on memory +resource control. + + +Note that device memory can never be pin nor by device driver nor through GUP +and thus such memory is always free upon process exit. Or when last reference +is drop in case of share memory or file back memory. -- cgit v1.2.3 From cd9e61ed1eebbcd5dfad59475d41ec58d9b64b6a Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Fri, 8 Sep 2017 16:14:36 -0700 Subject: rbtree: cache leftmost node internally Patch series "rbtree: Cache leftmost node internally", v4. A series to extending rbtrees to internally cache the leftmost node such that we can have fast overlap check optimization for all interval tree users[1]. The benefits of this series are that: (i) Unify users that do internal leftmost node caching. (ii) Optimize all interval tree users. (iii) Convert at least two new users (epoll and procfs) to the new interface. This patch (of 16): Red-black tree semantics imply that nodes with smaller or greater (or equal for duplicates) keys always be to the left and right, respectively. For the kernel this is extremely evident when considering our rb_first() semantics. Enabling lookups for the smallest node in the tree in O(1) can save a good chunk of cycles in not having to walk down the tree each time. To this end there are a few core users that explicitly do this, such as the scheduler and rtmutexes. There is also the desire for interval trees to have this optimization allowing faster overlap checking. This patch introduces a new 'struct rb_root_cached' which is just the root with a cached pointer to the leftmost node. The reason why the regular rb_root was not extended instead of adding a new structure was that this allows the user to have the choice between memory footprint and actual tree performance. The new wrappers on top of the regular rb_root calls are: - rb_first_cached(cached_root) -- which is a fast replacement for rb_first. - rb_insert_color_cached(node, cached_root, new) - rb_erase_cached(node, cached_root) In addition, augmented cached interfaces are also added for basic insertion and deletion operations; which becomes important for the interval tree changes. With the exception of the inserts, which adds a bool for updating the new leftmost, the interfaces are kept the same. To this end, porting rb users to the cached version becomes really trivial, and keeping current rbtree semantics for users that don't care about the optimization requires zero overhead. Link: http://lkml.kernel.org/r/20170719014603.19029-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso Reviewed-by: Jan Kara Acked-by: Peter Zijlstra (Intel) Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/rbtree.txt | 33 +++++++++++++++++++++++++++++++++ include/linux/rbtree.h | 21 +++++++++++++++++++++ include/linux/rbtree_augmented.h | 33 ++++++++++++++++++++++++++++++--- lib/rbtree.c | 34 +++++++++++++++++++++++++++++----- 4 files changed, 113 insertions(+), 8 deletions(-) (limited to 'Documentation') diff --git a/Documentation/rbtree.txt b/Documentation/rbtree.txt index b8a8c70b0188..c42a21b99046 100644 --- a/Documentation/rbtree.txt +++ b/Documentation/rbtree.txt @@ -193,6 +193,39 @@ Example:: for (node = rb_first(&mytree); node; node = rb_next(node)) printk("key=%s\n", rb_entry(node, struct mytype, node)->keystring); +Cached rbtrees +-------------- + +Computing the leftmost (smallest) node is quite a common task for binary +search trees, such as for traversals or users relying on a the particular +order for their own logic. To this end, users can use 'struct rb_root_cached' +to optimize O(logN) rb_first() calls to a simple pointer fetch avoiding +potentially expensive tree iterations. This is done at negligible runtime +overhead for maintanence; albeit larger memory footprint. + +Similar to the rb_root structure, cached rbtrees are initialized to be +empty via: + + struct rb_root_cached mytree = RB_ROOT_CACHED; + +Cached rbtree is simply a regular rb_root with an extra pointer to cache the +leftmost node. This allows rb_root_cached to exist wherever rb_root does, +which permits augmented trees to be supported as well as only a few extra +interfaces: + + struct rb_node *rb_first_cached(struct rb_root_cached *tree); + void rb_insert_color_cached(struct rb_node *, struct rb_root_cached *, bool); + void rb_erase_cached(struct rb_node *node, struct rb_root_cached *); + +Both insert and erase calls have their respective counterpart of augmented +trees: + + void rb_insert_augmented_cached(struct rb_node *node, struct rb_root_cached *, + bool, struct rb_augment_callbacks *); + void rb_erase_augmented_cached(struct rb_node *, struct rb_root_cached *, + struct rb_augment_callbacks *); + + Support for Augmented rbtrees ----------------------------- diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h index e585018498d5..d574361943ea 100644 --- a/include/linux/rbtree.h +++ b/include/linux/rbtree.h @@ -44,10 +44,25 @@ struct rb_root { struct rb_node *rb_node; }; +/* + * Leftmost-cached rbtrees. + * + * We do not cache the rightmost node based on footprint + * size vs number of potential users that could benefit + * from O(1) rb_last(). Just not worth it, users that want + * this feature can always implement the logic explicitly. + * Furthermore, users that want to cache both pointers may + * find it a bit asymmetric, but that's ok. + */ +struct rb_root_cached { + struct rb_root rb_root; + struct rb_node *rb_leftmost; +}; #define rb_parent(r) ((struct rb_node *)((r)->__rb_parent_color & ~3)) #define RB_ROOT (struct rb_root) { NULL, } +#define RB_ROOT_CACHED (struct rb_root_cached) { {NULL, }, NULL } #define rb_entry(ptr, type, member) container_of(ptr, type, member) #define RB_EMPTY_ROOT(root) (READ_ONCE((root)->rb_node) == NULL) @@ -69,6 +84,12 @@ extern struct rb_node *rb_prev(const struct rb_node *); extern struct rb_node *rb_first(const struct rb_root *); extern struct rb_node *rb_last(const struct rb_root *); +extern void rb_insert_color_cached(struct rb_node *, + struct rb_root_cached *, bool); +extern void rb_erase_cached(struct rb_node *node, struct rb_root_cached *); +/* Same as rb_first(), but O(1) */ +#define rb_first_cached(root) (root)->rb_leftmost + /* Postorder iteration - always visit the parent after its children */ extern struct rb_node *rb_first_postorder(const struct rb_root *); extern struct rb_node *rb_next_postorder(const struct rb_node *); diff --git a/include/linux/rbtree_augmented.h b/include/linux/rbtree_augmented.h index 9702b6e183bc..6bfd2b581f75 100644 --- a/include/linux/rbtree_augmented.h +++ b/include/linux/rbtree_augmented.h @@ -41,7 +41,9 @@ struct rb_augment_callbacks { void (*rotate)(struct rb_node *old, struct rb_node *new); }; -extern void __rb_insert_augmented(struct rb_node *node, struct rb_root *root, +extern void __rb_insert_augmented(struct rb_node *node, + struct rb_root *root, + bool newleft, struct rb_node **leftmost, void (*augment_rotate)(struct rb_node *old, struct rb_node *new)); /* * Fixup the rbtree and update the augmented information when rebalancing. @@ -57,7 +59,16 @@ static inline void rb_insert_augmented(struct rb_node *node, struct rb_root *root, const struct rb_augment_callbacks *augment) { - __rb_insert_augmented(node, root, augment->rotate); + __rb_insert_augmented(node, root, false, NULL, augment->rotate); +} + +static inline void +rb_insert_augmented_cached(struct rb_node *node, + struct rb_root_cached *root, bool newleft, + const struct rb_augment_callbacks *augment) +{ + __rb_insert_augmented(node, &root->rb_root, + newleft, &root->rb_leftmost, augment->rotate); } #define RB_DECLARE_CALLBACKS(rbstatic, rbname, rbstruct, rbfield, \ @@ -150,6 +161,7 @@ extern void __rb_erase_color(struct rb_node *parent, struct rb_root *root, static __always_inline struct rb_node * __rb_erase_augmented(struct rb_node *node, struct rb_root *root, + struct rb_node **leftmost, const struct rb_augment_callbacks *augment) { struct rb_node *child = node->rb_right; @@ -157,6 +169,9 @@ __rb_erase_augmented(struct rb_node *node, struct rb_root *root, struct rb_node *parent, *rebalance; unsigned long pc; + if (leftmost && node == *leftmost) + *leftmost = rb_next(node); + if (!tmp) { /* * Case 1: node to erase has no more than 1 child (easy!) @@ -256,9 +271,21 @@ static __always_inline void rb_erase_augmented(struct rb_node *node, struct rb_root *root, const struct rb_augment_callbacks *augment) { - struct rb_node *rebalance = __rb_erase_augmented(node, root, augment); + struct rb_node *rebalance = __rb_erase_augmented(node, root, + NULL, augment); if (rebalance) __rb_erase_color(rebalance, root, augment->rotate); } +static __always_inline void +rb_erase_augmented_cached(struct rb_node *node, struct rb_root_cached *root, + const struct rb_augment_callbacks *augment) +{ + struct rb_node *rebalance = __rb_erase_augmented(node, &root->rb_root, + &root->rb_leftmost, + augment); + if (rebalance) + __rb_erase_color(rebalance, &root->rb_root, augment->rotate); +} + #endif /* _LINUX_RBTREE_AUGMENTED_H */ diff --git a/lib/rbtree.c b/lib/rbtree.c index 4ba2828a67c0..d102d9d2ffaa 100644 --- a/lib/rbtree.c +++ b/lib/rbtree.c @@ -95,10 +95,14 @@ __rb_rotate_set_parents(struct rb_node *old, struct rb_node *new, static __always_inline void __rb_insert(struct rb_node *node, struct rb_root *root, + bool newleft, struct rb_node **leftmost, void (*augment_rotate)(struct rb_node *old, struct rb_node *new)) { struct rb_node *parent = rb_red_parent(node), *gparent, *tmp; + if (newleft) + *leftmost = node; + while (true) { /* * Loop invariant: node is red @@ -434,19 +438,38 @@ static const struct rb_augment_callbacks dummy_callbacks = { void rb_insert_color(struct rb_node *node, struct rb_root *root) { - __rb_insert(node, root, dummy_rotate); + __rb_insert(node, root, false, NULL, dummy_rotate); } EXPORT_SYMBOL(rb_insert_color); void rb_erase(struct rb_node *node, struct rb_root *root) { struct rb_node *rebalance; - rebalance = __rb_erase_augmented(node, root, &dummy_callbacks); + rebalance = __rb_erase_augmented(node, root, + NULL, &dummy_callbacks); if (rebalance) ____rb_erase_color(rebalance, root, dummy_rotate); } EXPORT_SYMBOL(rb_erase); +void rb_insert_color_cached(struct rb_node *node, + struct rb_root_cached *root, bool leftmost) +{ + __rb_insert(node, &root->rb_root, leftmost, + &root->rb_leftmost, dummy_rotate); +} +EXPORT_SYMBOL(rb_insert_color_cached); + +void rb_erase_cached(struct rb_node *node, struct rb_root_cached *root) +{ + struct rb_node *rebalance; + rebalance = __rb_erase_augmented(node, &root->rb_root, + &root->rb_leftmost, &dummy_callbacks); + if (rebalance) + ____rb_erase_color(rebalance, &root->rb_root, dummy_rotate); +} +EXPORT_SYMBOL(rb_erase_cached); + /* * Augmented rbtree manipulation functions. * @@ -455,9 +478,10 @@ EXPORT_SYMBOL(rb_erase); */ void __rb_insert_augmented(struct rb_node *node, struct rb_root *root, + bool newleft, struct rb_node **leftmost, void (*augment_rotate)(struct rb_node *old, struct rb_node *new)) { - __rb_insert(node, root, augment_rotate); + __rb_insert(node, root, newleft, leftmost, augment_rotate); } EXPORT_SYMBOL(__rb_insert_augmented); @@ -502,7 +526,7 @@ struct rb_node *rb_next(const struct rb_node *node) * as we can. */ if (node->rb_right) { - node = node->rb_right; + node = node->rb_right; while (node->rb_left) node=node->rb_left; return (struct rb_node *)node; @@ -534,7 +558,7 @@ struct rb_node *rb_prev(const struct rb_node *node) * as we can. */ if (node->rb_left) { - node = node->rb_left; + node = node->rb_left; while (node->rb_right) node=node->rb_right; return (struct rb_node *)node; -- cgit v1.2.3 From a2d818030135c293f878fbb772cf40e7a14c5acc Mon Sep 17 00:00:00 2001 From: "Robert P. J. Day" Date: Fri, 8 Sep 2017 16:17:19 -0700 Subject: drivers/pps: aesthetic tweaks to PPS-related content Collection of aesthetic adjustments to various PPS-related files, directories and Documentation, some quite minor just for the sake of consistency, including: * Updated example of pps device tree node (courtesy Rodolfo G.) * "PPS-API" -> "PPS API" * "pps_source_info_s" -> "pps_source_info" * "ktimer driver" -> "pps-ktimer driver" * "ppstest /dev/pps0" -> "ppstest /dev/pps1" to match example * Add missing PPS-related entries to MAINTAINERS file * Other trivialities Link: http://lkml.kernel.org/r/alpine.LFD.2.20.1708261048220.8106@localhost.localdomain Signed-off-by: Robert P. J. Day Acked-by: Rodolfo Giometti Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/devicetree/bindings/pps/pps-gpio.txt | 8 +++- Documentation/pps/pps.txt | 44 +++++++++++----------- MAINTAINERS | 3 ++ include/linux/pps-gpio.h | 2 +- include/linux/pps_kernel.h | 16 ++++---- include/uapi/linux/pps.h | 4 +- kernel/time/timekeeping.c | 2 +- 7 files changed, 43 insertions(+), 36 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/pps/pps-gpio.txt b/Documentation/devicetree/bindings/pps/pps-gpio.txt index 40bf9c3564a5..0de23b793657 100644 --- a/Documentation/devicetree/bindings/pps/pps-gpio.txt +++ b/Documentation/devicetree/bindings/pps/pps-gpio.txt @@ -13,8 +13,12 @@ Optional properties: Example: pps { - compatible = "pps-gpio"; - gpios = <&gpio2 6 0>; + pinctrl-names = "default"; + pinctrl-0 = <&pinctrl_pps>; + gpios = <&gpio1 26 GPIO_ACTIVE_HIGH>; assert-falling-edge; + + compatible = "pps-gpio"; + status = "okay"; }; diff --git a/Documentation/pps/pps.txt b/Documentation/pps/pps.txt index 1fdbd5447216..99f5d8c4c652 100644 --- a/Documentation/pps/pps.txt +++ b/Documentation/pps/pps.txt @@ -48,12 +48,12 @@ problem: time_pps_create(). This implies that the source has a /dev/... entry. This assumption is -ok for the serial and parallel port, where you can do something +OK for the serial and parallel port, where you can do something useful besides(!) the gathering of timestamps as it is the central -task for a PPS-API. But this assumption does not work for a single +task for a PPS API. But this assumption does not work for a single purpose GPIO line. In this case even basic file-related functionality (like read() and write()) makes no sense at all and should not be a -precondition for the use of a PPS-API. +precondition for the use of a PPS API. The problem can be simply solved if you consider that a PPS source is not always connected with a GPS data source. @@ -88,13 +88,13 @@ Coding example -------------- To register a PPS source into the kernel you should define a struct -pps_source_info_s as follows: +pps_source_info as follows: static struct pps_source_info pps_ktimer_info = { .name = "ktimer", .path = "", - .mode = PPS_CAPTUREASSERT | PPS_OFFSETASSERT | \ - PPS_ECHOASSERT | \ + .mode = PPS_CAPTUREASSERT | PPS_OFFSETASSERT | + PPS_ECHOASSERT | PPS_CANWAIT | PPS_TSFMT_TSPEC, .echo = pps_ktimer_echo, .owner = THIS_MODULE, @@ -108,13 +108,13 @@ initialization routine as follows: The pps_register_source() prototype is: - int pps_register_source(struct pps_source_info_s *info, int default_params) + int pps_register_source(struct pps_source_info *info, int default_params) where "info" is a pointer to a structure that describes a particular PPS source, "default_params" tells the system what the initial default parameters for the device should be (it is obvious that these parameters must be a subset of ones defined in the struct -pps_source_info_s which describe the capabilities of the driver). +pps_source_info which describe the capabilities of the driver). Once you have registered a new PPS source into the system you can signal an assert event (for example in the interrupt handler routine) @@ -142,8 +142,10 @@ If the SYSFS filesystem is enabled in the kernel it provides a new class: Every directory is the ID of a PPS sources defined in the system and inside you find several files: - $ ls /sys/class/pps/pps0/ - assert clear echo mode name path subsystem@ uevent + $ ls -F /sys/class/pps/pps0/ + assert dev mode path subsystem@ + clear echo name power/ uevent + Inside each "assert" and "clear" file you can find the timestamp and a sequence number: @@ -154,32 +156,32 @@ sequence number: Where before the "#" is the timestamp in seconds; after it is the sequence number. Other files are: -* echo: reports if the PPS source has an echo function or not; + * echo: reports if the PPS source has an echo function or not; -* mode: reports available PPS functioning modes; + * mode: reports available PPS functioning modes; -* name: reports the PPS source's name; + * name: reports the PPS source's name; -* path: reports the PPS source's device path, that is the device the - PPS source is connected to (if it exists). + * path: reports the PPS source's device path, that is the device the + PPS source is connected to (if it exists). Testing the PPS support ----------------------- In order to test the PPS support even without specific hardware you can use -the ktimer driver (see the client subsection in the PPS configuration menu) +the pps-ktimer driver (see the client subsection in the PPS configuration menu) and the userland tools available in your distribution's pps-tools package, -http://linuxpps.org , or https://github.com/ago/pps-tools . +http://linuxpps.org , or https://github.com/redlab-i/pps-tools. -Once you have enabled the compilation of ktimer just modprobe it (if +Once you have enabled the compilation of pps-ktimer just modprobe it (if not statically compiled): - # modprobe ktimer + # modprobe pps-ktimer and the run ppstest as follow: - $ ./ppstest /dev/pps0 + $ ./ppstest /dev/pps1 trying PPS source "/dev/pps1" found PPS source "/dev/pps1" ok, found 1 source(s), now start fetching data... @@ -187,7 +189,7 @@ and the run ppstest as follow: source 0 - assert 1186592700.388931295, sequence: 365 - clear 0.000000000, sequence: 0 source 0 - assert 1186592701.389032765, sequence: 366 - clear 0.000000000, sequence: 0 -Please, note that to compile userland programs you need the file timepps.h . +Please note that to compile userland programs, you need the file timepps.h. This is available in the pps-tools repository mentioned above. diff --git a/MAINTAINERS b/MAINTAINERS index ff3a349f24e4..109c5d9a04c4 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10725,8 +10725,11 @@ W: http://wiki.enneenne.com/index.php/LinuxPPS_support L: linuxpps@ml.enneenne.com (subscribers-only) S: Maintained F: Documentation/pps/ +F: Documentation/devicetree/bindings/pps/pps-gpio.txt +F: Documentation/ABI/testing/sysfs-pps F: drivers/pps/ F: include/linux/pps*.h +F: include/uapi/linux/pps.h PPTP DRIVER M: Dmitry Kozlov diff --git a/include/linux/pps-gpio.h b/include/linux/pps-gpio.h index 0035abe41b9a..56f35dd3d01d 100644 --- a/include/linux/pps-gpio.h +++ b/include/linux/pps-gpio.h @@ -29,4 +29,4 @@ struct pps_gpio_platform_data { const char *gpio_label; }; -#endif +#endif /* _PPS_GPIO_H */ diff --git a/include/linux/pps_kernel.h b/include/linux/pps_kernel.h index 35ac903956c7..80a980cc8d95 100644 --- a/include/linux/pps_kernel.h +++ b/include/linux/pps_kernel.h @@ -22,7 +22,6 @@ #define LINUX_PPS_KERNEL_H #include - #include #include #include @@ -35,9 +34,9 @@ struct pps_device; /* The specific PPS source info */ struct pps_source_info { - char name[PPS_MAX_NAME_LEN]; /* simbolic name */ + char name[PPS_MAX_NAME_LEN]; /* symbolic name */ char path[PPS_MAX_NAME_LEN]; /* path of connected device */ - int mode; /* PPS's allowed mode */ + int mode; /* PPS allowed mode */ void (*echo)(struct pps_device *pps, int event, void *data); /* PPS echo function */ @@ -57,10 +56,10 @@ struct pps_event_time { struct pps_device { struct pps_source_info info; /* PSS source info */ - struct pps_kparams params; /* PPS's current params */ + struct pps_kparams params; /* PPS current params */ - __u32 assert_sequence; /* PPS' assert event seq # */ - __u32 clear_sequence; /* PPS' clear event seq # */ + __u32 assert_sequence; /* PPS assert event seq # */ + __u32 clear_sequence; /* PPS clear event seq # */ struct pps_ktime assert_tu; struct pps_ktime clear_tu; int current_mode; /* PPS mode at event time */ @@ -69,7 +68,7 @@ struct pps_device { wait_queue_head_t queue; /* PPS event queue */ unsigned int id; /* PPS source unique ID */ - void const *lookup_cookie; /* pps_lookup_dev only */ + void const *lookup_cookie; /* For pps_lookup_dev() only */ struct cdev cdev; struct device *dev; struct fasync_struct *async_queue; /* fasync method */ @@ -101,7 +100,7 @@ extern struct pps_device *pps_register_source( extern void pps_unregister_source(struct pps_device *pps); extern void pps_event(struct pps_device *pps, struct pps_event_time *ts, int event, void *data); -/* Look up a pps device by magic cookie */ +/* Look up a pps_device by magic cookie */ struct pps_device *pps_lookup_dev(void const *cookie); static inline void timespec_to_pps_ktime(struct pps_ktime *kt, @@ -132,4 +131,3 @@ static inline void pps_sub_ts(struct pps_event_time *ts, struct timespec64 delta } #endif /* LINUX_PPS_KERNEL_H */ - diff --git a/include/uapi/linux/pps.h b/include/uapi/linux/pps.h index c1cb3825a8bc..c29d6b791c08 100644 --- a/include/uapi/linux/pps.h +++ b/include/uapi/linux/pps.h @@ -95,8 +95,8 @@ struct pps_kparams { #define PPS_CAPTURECLEAR 0x02 /* capture clear events */ #define PPS_CAPTUREBOTH 0x03 /* capture assert and clear events */ -#define PPS_OFFSETASSERT 0x10 /* apply compensation for assert ev. */ -#define PPS_OFFSETCLEAR 0x20 /* apply compensation for clear ev. */ +#define PPS_OFFSETASSERT 0x10 /* apply compensation for assert event */ +#define PPS_OFFSETCLEAR 0x20 /* apply compensation for clear event */ #define PPS_CANWAIT 0x100 /* can we wait for an event? */ #define PPS_CANPOLL 0x200 /* bit reserved for future use */ diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 8ea4fb315719..2cafb49aa65e 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -2316,7 +2316,7 @@ void hardpps(const struct timespec64 *phase_ts, const struct timespec64 *raw_ts) raw_spin_unlock_irqrestore(&timekeeper_lock, flags); } EXPORT_SYMBOL(hardpps); -#endif +#endif /* CONFIG_NTP_PPS */ /** * xtime_update() - advances the timekeeping infrastructure -- cgit v1.2.3