author		Linus Torvalds <torvalds@linux-foundation.org>	2017-09-09 10:30:07 -0700
committer	Linus Torvalds <torvalds@linux-foundation.org>	2017-09-09 10:30:07 -0700
commit		fbf4432ff71b7a25bef993a5312906946d27f446 (patch)
tree		cf3e0024af4b8f9376eff75743f1fa1526e40900 /Documentation
parent		c054be10ffdbd5507a1fd738067d76acfb4808fd (diff)
parent		0cfb6aee70bddbef6ec796b255f588ce0e126766 (diff)
download	linux-fbf4432ff71b7a25bef993a5312906946d27f446.tar.bz2
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
- most of the rest of MM
- a small number of misc things
- lib/ updates
- checkpatch
- autofs updates
- ipc/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (126 commits)
ipc: optimize semget/shmget/msgget for lots of keys
ipc/sem: play nicer with large nsops allocations
ipc/sem: drop sem_checkid helper
ipc: convert kern_ipc_perm.refcount from atomic_t to refcount_t
ipc: convert sem_undo_list.refcnt from atomic_t to refcount_t
ipc: convert ipc_namespace.count from atomic_t to refcount_t
kcov: support compat processes
sh: defconfig: cleanup from old Kconfig options
mn10300: defconfig: cleanup from old Kconfig options
m32r: defconfig: cleanup from old Kconfig options
drivers/pps: use surrounding "if PPS" to remove numerous dependency checks
drivers/pps: aesthetic tweaks to PPS-related content
cpumask: make cpumask_next() out-of-line
kmod: move #ifdef CONFIG_MODULES wrapper to Makefile
kmod: split off umh headers into its own file
MAINTAINERS: clarify kmod is just a kernel module loader
kmod: split out umh code into its own file
test_kmod: flip INT checks to be consistent
test_kmod: remove paranoid UINT_MAX check on uint range processing
vfat: deduplicate hex2bin()
...
Diffstat (limited to 'Documentation')
-rw-r--r--	Documentation/devicetree/bindings/pps/pps-gpio.txt	  8
-rw-r--r--	Documentation/pps/pps.txt				 44
-rw-r--r--	Documentation/rbtree.txt				 33
-rw-r--r--	Documentation/vm/hmm.txt				384
4 files changed, 446 insertions, 23 deletions
diff --git a/Documentation/devicetree/bindings/pps/pps-gpio.txt b/Documentation/devicetree/bindings/pps/pps-gpio.txt
index 40bf9c3564a5..0de23b793657 100644
--- a/Documentation/devicetree/bindings/pps/pps-gpio.txt
+++ b/Documentation/devicetree/bindings/pps/pps-gpio.txt
@@ -13,8 +13,12 @@ Optional properties:
 Example:
 	pps {
-		compatible = "pps-gpio";
-		gpios = <&gpio2 6 0>;
+		pinctrl-names = "default";
+		pinctrl-0 = <&pinctrl_pps>;
+		gpios = <&gpio1 26 GPIO_ACTIVE_HIGH>;
 		assert-falling-edge;
+
+		compatible = "pps-gpio";
+		status = "okay";
 	};
diff --git a/Documentation/pps/pps.txt b/Documentation/pps/pps.txt
index 1fdbd5447216..99f5d8c4c652 100644
--- a/Documentation/pps/pps.txt
+++ b/Documentation/pps/pps.txt
@@ -48,12 +48,12 @@ problem:
 time_pps_create().
 
 This implies that the source has a /dev/... entry. This assumption is
-ok for the serial and parallel port, where you can do something
+OK for the serial and parallel port, where you can do something
 useful besides(!) the gathering of timestamps as it is the central
-task for a PPS-API. But this assumption does not work for a single
+task for a PPS API. But this assumption does not work for a single
 purpose GPIO line. In this case even basic file-related functionality
 (like read() and write()) makes no sense at all and should not be a
-precondition for the use of a PPS-API.
+precondition for the use of a PPS API.
 
 The problem can be simply solved if you consider that a PPS source is
 not always connected with a GPS data source.
@@ -88,13 +88,13 @@ Coding example
 --------------
 
 To register a PPS source into the kernel you should define a struct
-pps_source_info_s as follows:
+pps_source_info as follows:
 
     static struct pps_source_info pps_ktimer_info = {
            .name   = "ktimer",
           .path   = "",
-           .mode   = PPS_CAPTUREASSERT | PPS_OFFSETASSERT | \
-                     PPS_ECHOASSERT | \
+           .mode   = PPS_CAPTUREASSERT | PPS_OFFSETASSERT |
+                     PPS_ECHOASSERT |
                      PPS_CANWAIT | PPS_TSFMT_TSPEC,
            .echo   = pps_ktimer_echo,
            .owner  = THIS_MODULE,
@@ -108,13 +108,13 @@ initialization routine as follows:
 
 The pps_register_source() prototype is:
 
-    int pps_register_source(struct pps_source_info_s *info, int default_params)
+    int pps_register_source(struct pps_source_info *info, int default_params)
 
 where "info" is a pointer to a structure that describes a particular
 PPS source, "default_params" tells the system what the initial default
 parameters for the device should be (it is obvious that these parameters
 must be a subset of ones defined in the struct
-pps_source_info_s which describe the capabilities of the driver).
+pps_source_info which describe the capabilities of the driver).
 
 Once you have registered a new PPS source into the system you can
 signal an assert event (for example in the interrupt handler routine)
@@ -142,8 +142,10 @@ If the SYSFS filesystem is enabled in the kernel it provides a new class:
 
 Every directory is the ID of a PPS sources defined in the system and
 inside you find several files:
 
-    $ ls /sys/class/pps/pps0/
-    assert  clear  echo  mode  name  path  subsystem@  uevent
+    $ ls -F /sys/class/pps/pps0/
+    assert  dev   mode  path    subsystem@
+    clear   echo  name  power/  uevent
+
 
 Inside each "assert" and "clear" file you can find the timestamp and a
 sequence number:
@@ -154,32 +156,32 @@ sequence number:
 Where before the "#" is the timestamp in seconds; after it is the
 sequence number.
 
 Other files are:
 
-* echo: reports if the PPS source has an echo function or not;
+  * echo: reports if the PPS source has an echo function or not;
 
-* mode: reports available PPS functioning modes;
+  * mode: reports available PPS functioning modes;
 
-* name: reports the PPS source's name;
+  * name: reports the PPS source's name;
 
-* path: reports the PPS source's device path, that is the device the
-  PPS source is connected to (if it exists).
+  * path: reports the PPS source's device path, that is the device the
+    PPS source is connected to (if it exists).
 
 Testing the PPS support
 -----------------------
 
 In order to test the PPS support even without specific hardware you can use
-the ktimer driver (see the client subsection in the PPS configuration menu)
+the pps-ktimer driver (see the client subsection in the PPS configuration menu)
 and the userland tools available in your distribution's pps-tools package,
-http://linuxpps.org , or https://github.com/ago/pps-tools .
+http://linuxpps.org , or https://github.com/redlab-i/pps-tools.
 
-Once you have enabled the compilation of ktimer just modprobe it (if
+Once you have enabled the compilation of pps-ktimer just modprobe it (if
 not statically compiled):
 
-    # modprobe ktimer
+    # modprobe pps-ktimer
 
 and the run ppstest as follow:
 
-    $ ./ppstest /dev/pps0
+    $ ./ppstest /dev/pps1
     trying PPS source "/dev/pps1"
     found PPS source "/dev/pps1"
     ok, found 1 source(s), now start fetching data...
@@ -187,7 +189,7 @@ and the run ppstest as follow:
     source 0 - assert 1186592700.388931295, sequence: 365 - clear  0.000000000, sequence: 0
     source 0 - assert 1186592701.389032765, sequence: 366 - clear  0.000000000, sequence: 0
 
-Please, note that to compile userland programs you need the file timepps.h .
+Please note that to compile userland programs, you need the file timepps.h.
 This is available in the pps-tools repository mentioned above.
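The hunks above modernize the registration example but stop short of the code that actually reports an event. For orientation, here is a minimal sketch of how a driver might signal an assert edge from its interrupt handler; pps_get_ts() and pps_event() are the in-kernel PPS API, while the my_pps_device structure and the handler itself are hypothetical scaffolding:

    #include <linux/interrupt.h>
    #include <linux/pps_kernel.h>

    struct my_pps_device {                  /* hypothetical driver state */
            struct pps_device *pps;         /* from pps_register_source() */
    };

    static irqreturn_t my_pps_irq_handler(int irq, void *data)
    {
            struct my_pps_device *dev = data;
            struct pps_event_time ts;

            /* Capture the timestamp as early as possible. */
            pps_get_ts(&ts);

            /* Report the assert edge to the PPS layer. */
            pps_event(dev->pps, &ts, PPS_CAPTUREASSERT, NULL);

            return IRQ_HANDLED;
    }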
diff --git a/Documentation/rbtree.txt b/Documentation/rbtree.txt
index b8a8c70b0188..c42a21b99046 100644
--- a/Documentation/rbtree.txt
+++ b/Documentation/rbtree.txt
@@ -193,6 +193,39 @@ Example::
   for (node = rb_first(&mytree); node; node = rb_next(node))
 	printk("key=%s\n", rb_entry(node, struct mytype, node)->keystring);
 
+Cached rbtrees
+--------------
+
+Computing the leftmost (smallest) node is quite a common task for binary
+search trees, such as for traversals or users relying on a particular order
+for their own logic. To this end, users can use 'struct rb_root_cached' to
+optimize O(logN) rb_first() calls to a simple pointer fetch, avoiding
+potentially expensive tree iterations. This is done at negligible runtime
+overhead for maintenance, albeit with a larger memory footprint.
+
+Similar to the rb_root structure, cached rbtrees are initialized to be
+empty via:
+
+  struct rb_root_cached mytree = RB_ROOT_CACHED;
+
+A cached rbtree is simply a regular rb_root with an extra pointer to cache
+the leftmost node. This allows rb_root_cached to exist wherever rb_root
+does, and permits augmented trees to be supported with only a few extra
+interfaces:
+
+  struct rb_node *rb_first_cached(struct rb_root_cached *tree);
+  void rb_insert_color_cached(struct rb_node *, struct rb_root_cached *, bool);
+  void rb_erase_cached(struct rb_node *node, struct rb_root_cached *);
+
+Both insert and erase calls have their respective counterpart for augmented
+trees:
+
+  void rb_insert_augmented_cached(struct rb_node *node, struct rb_root_cached *,
+				  bool, struct rb_augment_callbacks *);
+  void rb_erase_augmented_cached(struct rb_node *, struct rb_root_cached *,
+				 struct rb_augment_callbacks *);
+
+
 Support for Augmented rbtrees
 -----------------------------
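To see how the cached variants slot into the usual insertion pattern, here is a sketch adapting the insertion example from earlier in rbtree.txt; struct mytype and my_insert() are hypothetical names, with the leftmost hint threaded through to rb_insert_color_cached():

    struct mytype {
            struct rb_node node;
            int key;
    };

    void my_insert(struct rb_root_cached *root, struct mytype *data)
    {
            struct rb_node **new = &root->rb_root.rb_node, *parent = NULL;
            bool leftmost = true;

            while (*new) {
                    struct mytype *this = rb_entry(*new, struct mytype, node);

                    parent = *new;
                    if (data->key < this->key) {
                            new = &(*new)->rb_left;
                    } else {
                            new = &(*new)->rb_right;
                            /* Went right at least once: not the new leftmost. */
                            leftmost = false;
                    }
            }

            rb_link_node(&data->node, parent, new);
            rb_insert_color_cached(&data->node, root, leftmost);
    }

After this, rb_first_cached(root) is a simple pointer fetch instead of a walk down the left spine.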
diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
new file mode 100644
index 000000000000..4d3aac9f4a5d
--- /dev/null
+++ b/Documentation/vm/hmm.txt
@@ -0,0 +1,384 @@
+Heterogeneous Memory Management (HMM)
+
+Transparently allow any component of a program to use any memory region of
+said program with a device, without using a device specific memory allocator.
+This is becoming a requirement to simplify the use of advanced heterogeneous
+computing, where GPUs, DSPs or FPGAs are used to perform various computations.
+
+This document is divided as follows: in the first section I expose the
+problems related to the use of a device specific memory allocator. The second
+section exposes the hardware limitations that are inherent to many platforms.
+The third section gives an overview of the HMM design. The fourth section
+explains how CPU page table mirroring works and what HMM's purpose is in this
+context. The fifth section deals with how device memory is represented inside
+the kernel. Finally, the last section presents the new migration helper that
+allows leveraging the device DMA engine.
+
+
+1) Problems of using a device specific memory allocator
+2) System bus, device memory characteristics
+3) Shared address space and migration
+4) Address space mirroring implementation and API
+5) Represent and manage device memory from core kernel point of view
+6) Migrate to and from device memory
+7) Memory cgroup (memcg) and rss accounting
+
+
+-------------------------------------------------------------------------------
+
+1) Problems of using a device specific memory allocator
+
+Devices with a large amount of on-board memory (several gigabytes), like GPUs,
+have historically managed their memory through a dedicated driver specific
+API. This creates a disconnect between memory allocated and managed by the
+device driver and regular application memory (private anonymous memory, shared
+memory, or regular file backed memory). From here on I will refer to this
+aspect as split address space. I use shared address space to refer to the
+opposite situation, i.e. one in which any memory region can be used by the
+device transparently.
+
+The address space is split because the device can only access memory allocated
+through the device specific API. This implies that all memory objects in a
+program are not equal from the device point of view, which complicates large
+programs that rely on a wide set of libraries.
+
+Concretely, this means that code that wants to leverage a device like a GPU
+needs to copy objects between generically allocated memory (malloc, private or
+shared mmap) and memory allocated through the device driver API (this still
+ends up with an mmap, but of the device file).
+
+For flat data sets (array, grid, image, ...) this isn't too hard to achieve,
+but complex data sets (list, tree, ...) are hard to get right. Duplicating a
+complex data set needs to re-map all the pointer relations between each of its
+elements. This is error prone, and programs get harder to debug because of the
+duplicated data set.
+
+Split address space also means that libraries cannot transparently use data
+they are getting from the core program or from another library, and thus each
+library might have to duplicate its input data set using a device specific
+memory allocator. Large projects suffer from this and waste resources because
+of the various memory copies.
+
+Duplicating each library API to accept as input or output memory allocated by
+each device specific allocator is not a viable option. It would lead to a
+combinatorial explosion in the library entry points.
+
+Finally, with the advance of high level language constructs (in C++ but in
+other languages too) it is now possible for the compiler to leverage GPUs or
+other devices without programmer knowledge. Some compiler identified patterns
+are only doable with a shared address space. It is also more reasonable to use
+a shared address space for all the other patterns.
+
+
+-------------------------------------------------------------------------------
+
+2) System bus, device memory characteristics
+
+System buses cripple shared address spaces due to a few limitations. Most
+system buses only allow basic memory access from the device to main memory;
+even cache coherency is often optional. Access to device memory from a CPU is
+even more limited; more often than not, it is not cache coherent.
+
+If we only consider the PCIE bus, then a device can access main memory (often
+through an IOMMU) and be cache coherent with the CPUs. However, it only allows
+a limited set of atomic operations from the device on main memory. This is
+worse in the other direction: the CPUs can only access a limited range of the
+device memory and cannot perform atomic operations on it. Thus device memory
+cannot be considered the same as regular memory from the kernel point of view.
+
+Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
+and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
+The final limitation is latency: access to main memory from the device has an
+order of magnitude higher latency than when the device accesses its own
+memory.
+
+Some platforms are developing new system buses or additions/modifications to
+PCIE to address some of these limitations (OpenCAPI, CCIX). They mainly allow
+two-way cache coherency between CPU and device and allow all the atomic
+operations the architecture supports. Sadly, not all platforms are following
+this trend, and some major architectures are left without hardware solutions
+to these problems.
+
+So for a shared address space to make sense, not only must we allow the device
+to access any memory, but we must also permit any memory to be migrated to
+device memory while the device is using it (blocking CPU access while it
+happens).
+
+
+-------------------------------------------------------------------------------
+
+3) Shared address space and migration
+
+HMM intends to provide two main features. The first one is to share the
+address space by duplicating the CPU page table into the device page table, so
+that the same address points to the same memory, for any valid main memory
+address in the process address space.
+
+To achieve this, HMM offers a set of helpers to populate the device page table
+while keeping track of CPU page table updates. Device page table updates are
+not as easy as CPU page table updates. To update the device page table, you
+must allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
+specific commands into it to perform the update (unmap, cache invalidations,
+flush, ...). This cannot be done through common code for all devices. Hence
+why HMM provides helpers to factor out everything that can be, while leaving
+the gory details to the device driver.
+
+The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
+allows allocating a struct page for each page of the device memory. Those
+pages are special because the CPU cannot map them. However, they allow
+migrating main memory to device memory using the existing migration mechanism,
+and from the CPU point of view everything looks as if the page was swapped out
+to disk. Using a struct page gives the easiest and cleanest integration with
+existing mm mechanisms. Here again, HMM only provides helpers: first to
+hotplug new ZONE_DEVICE memory for the device memory, and second to perform
+the migration. Policy decisions of what and when to migrate are left to the
+device driver.
+
+Note that any CPU access to a device page triggers a page fault and a
+migration back to main memory; i.e., when a page backing a given address A is
+migrated from a main memory page to a device page, then any CPU access to
+address A triggers a page fault and initiates a migration back to main memory.
+
+
+With these two features, HMM not only allows a device to mirror a process
+address space, keeping both the CPU and device page tables synchronized, but
+also leverages device memory by migrating the parts of a data set that are
+actively used by a device.
+
+
+-------------------------------------------------------------------------------
+
+4) Address space mirroring implementation and API
+
+Address space mirroring's main objective is to allow duplicating a range of
+the CPU page table into a device page table; HMM helps keep both synchronized.
+A device driver that wants to mirror a process address space must start with
+the registration of an hmm_mirror struct:
+
+ int hmm_mirror_register(struct hmm_mirror *mirror,
+                         struct mm_struct *mm);
+ int hmm_mirror_register_locked(struct hmm_mirror *mirror,
+                                struct mm_struct *mm);
+
+The locked variant is to be used when the driver is already holding the
+mmap_sem of the mm in write mode. The mirror struct has a set of callbacks
+that are used to propagate CPU page table updates:
+
+ struct hmm_mirror_ops {
+     /* sync_cpu_device_pagetables() - synchronize page tables
+      *
+      * @mirror: pointer to struct hmm_mirror
+      * @update_type: type of update that occurred to the CPU page table
+      * @start: virtual start address of the range to update
+      * @end: virtual end address of the range to update
+      *
+      * This callback ultimately originates from mmu_notifiers when the CPU
+      * page table is updated. The device driver must update its page table
+      * in response to this callback. The update argument tells what action
+      * to perform.
+      *
+      * The device driver must not return from this callback until the device
+      * page tables are completely updated (TLBs flushed, etc); this is a
+      * synchronous call.
+      */
+     void (*update)(struct hmm_mirror *mirror,
+                    enum hmm_update action,
+                    unsigned long start,
+                    unsigned long end);
+ };
+
+The device driver must perform the update to the range following the action
+(turn the range read only, fully unmap it, ...). Once the driver callback
+returns, the device must be done with the update.
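A rough sketch of a driver-side update() callback may help; the my_mirror container and the device invalidation helpers are hypothetical placeholders for the "gory details" the text leaves to the driver:

    struct my_mirror {                       /* hypothetical driver state */
            struct hmm_mirror mirror;
            struct my_device *dev;
    };

    static void my_update(struct hmm_mirror *mirror,
                          enum hmm_update action,
                          unsigned long start, unsigned long end)
    {
            struct my_mirror *m = container_of(mirror, struct my_mirror,
                                               mirror);

            /*
             * Conservative strategy: drop the device mappings for the
             * whole range whatever the action, then flush the device TLB.
             * Do not return before the device is done (synchronous call).
             * The driver's update lock is also taken here; see the locking
             * pattern below.
             */
            my_device_unmap_range(m->dev, start, end);   /* hypothetical */
            my_device_tlb_flush(m->dev);                 /* hypothetical */
    }

    static const struct hmm_mirror_ops my_mirror_ops = {
            .update = my_update,
    };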
+
+
+When the device driver wants to populate a range of virtual addresses, it can
+use either:
+
+ int hmm_vma_get_pfns(struct vm_area_struct *vma,
+                      struct hmm_range *range,
+                      unsigned long start,
+                      unsigned long end,
+                      hmm_pfn_t *pfns);
+ int hmm_vma_fault(struct vm_area_struct *vma,
+                   struct hmm_range *range,
+                   unsigned long start,
+                   unsigned long end,
+                   hmm_pfn_t *pfns,
+                   bool write,
+                   bool block);
+
+The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
+entries and will not trigger a page fault on missing or non-present entries.
+The second one does trigger a page fault on missing or read only entries, if
+the write parameter is true. Page faults use the generic mm page fault code
+path, just like a CPU page fault.
+
+Both functions copy CPU page table entries into their pfns array argument.
+Each entry in that array corresponds to an address in the virtual range. HMM
+provides a set of flags to help the driver identify special CPU page table
+entries.
+
+Locking with the update() callback is the most important aspect the driver
+must respect in order to keep things properly synchronized. The usage pattern
+is:
+
+ int driver_populate_range(...)
+ {
+      struct hmm_range range;
+      ...
+ again:
+      ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
+      if (ret)
+          return ret;
+      take_lock(driver->update);
+      if (!hmm_vma_range_done(vma, &range)) {
+          release_lock(driver->update);
+          goto again;
+      }
+
+      // Use pfns array content to update device page table
+
+      release_lock(driver->update);
+      return 0;
+ }
+
+The driver->update lock is the same lock that the driver takes inside its
+update() callback. That lock must be held before calling hmm_vma_range_done()
+to avoid any race with a concurrent CPU page table update.
+
+HMM implements all of this on top of the mmu_notifier API because we wanted a
+simpler API and also to be able to perform optimizations later on, like doing
+concurrent device updates in multi-device scenarios.
+
+HMM also serves as an impedance mismatch between how CPU page table updates
+are done (by the CPU writing to the page table and flushing TLBs) and how
+devices update their own page tables. A device update is a multi-step process:
+first, appropriate commands are written to a buffer; then this buffer is
+scheduled for execution on the device. Only once the device has executed the
+commands in the buffer is the update done. Creating and scheduling the update
+command buffers can happen concurrently for multiple devices. Waiting for
+each device to report commands as executed is serialized (there is no point
+in doing this concurrently).
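For the case where the device needs write access and missing entries should be faulted in, the same retry loop can presumably be built around hmm_vma_fault(); this is a speculative variant of the example above, using only the functions already named in this document, with take_lock()/release_lock() again standing in for the driver's real locking primitives:

    int driver_populate_range_and_fault(...)
    {
         struct hmm_range range;
         ...
    again:
         /* Fault in missing/read-only entries; may sleep (block == true). */
         ret = hmm_vma_fault(vma, &range, start, end, pfns, true, true);
         if (ret)
             return ret;
         take_lock(driver->update);
         if (!hmm_vma_range_done(vma, &range)) {
             /* The CPU page table changed under us; retry. */
             release_lock(driver->update);
             goto again;
         }

         // pfns now reflects writable entries; program the device page table

         release_lock(driver->update);
         return 0;
    }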
+
+
+-------------------------------------------------------------------------------
+
+5) Represent and manage device memory from core kernel point of view
+
+Several different designs were tried to support device memory. The first one
+used a device specific data structure to keep information about migrated
+memory, and HMM hooked itself into various places of mm code to handle any
+access to addresses that were backed by device memory. It turns out that this
+ended up replicating most of the fields of struct page and also needed many
+kernel code paths to be updated to understand this new kind of memory.
+
+The thing is, most kernel code paths never try to access the memory behind a
+page; they only care about struct page contents. Because of this, HMM
+switched to directly using struct page for device memory, which left most
+kernel code paths unaware of the difference. We only need to make sure that
+no one ever tries to map those pages from the CPU side.
+
+HMM provides a set of helpers to register and hotplug device memory as a new
+region needing struct pages. This is offered through a very simple API:
+
+ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+                                   struct device *device,
+                                   unsigned long size);
+ void hmm_devmem_remove(struct hmm_devmem *devmem);
+
+The hmm_devmem_ops is where most of the important things are:
+
+ struct hmm_devmem_ops {
+     void (*free)(struct hmm_devmem *devmem, struct page *page);
+     int (*fault)(struct hmm_devmem *devmem,
+                  struct vm_area_struct *vma,
+                  unsigned long addr,
+                  struct page *page,
+                  unsigned flags,
+                  pmd_t *pmdp);
+ };
+
+The first callback (free()) happens when the last reference on a device page
+is dropped. This means the device page is now free and no longer used by
+anyone. The second callback happens whenever the CPU tries to access a device
+page, which it cannot do. This second callback must trigger a migration back
+to system memory.
+
+
+-------------------------------------------------------------------------------
+
+6) Migrate to and from device memory
+
+Because the CPU cannot access device memory, migration must use the device
+DMA engine to perform copies from and to device memory. For this we need a
+new migration helper:
+
+ int migrate_vma(const struct migrate_vma_ops *ops,
+                 struct vm_area_struct *vma,
+                 unsigned long mentries,
+                 unsigned long start,
+                 unsigned long end,
+                 unsigned long *src,
+                 unsigned long *dst,
+                 void *private);
+
+Unlike other migration functions, it works on a range of virtual addresses.
+There are two reasons for that. First, device DMA copies have a high setup
+overhead cost, and thus batching multiple pages is needed, as otherwise the
+migration overhead makes the whole exercise pointless. The second reason is
+that the driver triggers such migrations based on the range of addresses the
+device is actively accessing.
+
+The migrate_vma_ops struct defines two callbacks. The first one
+(alloc_and_copy()) controls the destination memory allocation and the copy
+operation. The second one is there to allow the device driver to perform
+cleanup operations after migration.
+
+ struct migrate_vma_ops {
+     void (*alloc_and_copy)(struct vm_area_struct *vma,
+                            const unsigned long *src,
+                            unsigned long *dst,
+                            unsigned long start,
+                            unsigned long end,
+                            void *private);
+     void (*finalize_and_map)(struct vm_area_struct *vma,
+                              const unsigned long *src,
+                              const unsigned long *dst,
+                              unsigned long start,
+                              unsigned long end,
+                              void *private);
+ };
+
+It is important to stress that these migration helpers allow for holes in the
+virtual address range. Some pages in the range might not be migrated for all
+the usual reasons (page is pinned, page is locked, ...). This helper does not
+fail but just skips over those pages.
+
+The alloc_and_copy() might also decide not to migrate all pages in the range
+(for reasons under the callback's control). For those, the callback just has
+to leave the corresponding dst entry empty.
+
+Finally, the migration of a struct page might fail (for file backed pages)
+for various reasons (failure to freeze the reference, or to update the page
+cache, ...). If that happens, then finalize_and_map() can catch any pages
+that were not migrated. Note that those pages were still copied to a new page
+and thus we wasted bandwidth, but this is considered a rare event and a price
+we are willing to pay to keep all the code simpler.
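A hedged sketch of a driver entry point built on this helper follows; the my_* callbacks and device handle are hypothetical, and only the behavior spelled out above (batched range, src/dst arrays, skipped holes) is assumed:

    /* Hypothetical callbacks doing device page allocation and DMA copy. */
    static void my_alloc_and_copy(struct vm_area_struct *vma,
                                  const unsigned long *src, unsigned long *dst,
                                  unsigned long start, unsigned long end,
                                  void *private);
    static void my_finalize_and_map(struct vm_area_struct *vma,
                                    const unsigned long *src,
                                    const unsigned long *dst,
                                    unsigned long start, unsigned long end,
                                    void *private);

    static const struct migrate_vma_ops my_migrate_ops = {
            .alloc_and_copy         = my_alloc_and_copy,
            .finalize_and_map       = my_finalize_and_map,
    };

    int my_migrate_range_to_device(struct my_device *dev,
                                   struct vm_area_struct *vma,
                                   unsigned long start, unsigned long end)
    {
            unsigned long npages = (end - start) >> PAGE_SHIFT;
            unsigned long *src, *dst;
            int ret = -ENOMEM;

            src = kcalloc(npages, sizeof(*src), GFP_KERNEL);
            dst = kcalloc(npages, sizeof(*dst), GFP_KERNEL);
            if (!src || !dst)
                    goto out;

            /*
             * migrate_vma() collects the source pages into src[], calls
             * alloc_and_copy() so the driver can allocate device pages and
             * DMA the data, then calls finalize_and_map() for cleanup.
             * Holes and unmigratable pages are simply skipped.
             */
            ret = migrate_vma(&my_migrate_ops, vma, npages, start, end,
                              src, dst, dev);
    out:
            kfree(dst);
            kfree(src);
            return ret;
    }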
+
+
+-------------------------------------------------------------------------------
+
+7) Memory cgroup (memcg) and rss accounting
+
+For now, device memory is accounted as any regular page would be in the rss
+counters (either anonymous, if the device page is used for anonymous memory;
+file, if the device page is used for a file backed page; or shmem, if the
+device page is used for shared memory). This is a deliberate choice to keep
+existing applications, which might start using device memory without knowing
+about it, running unimpacted.
+
+A drawback is that the OOM killer might kill an application using a lot of
+device memory and not a lot of regular system memory, and thus not free much
+system memory. We want to gather more real world experience on how
+applications and the system react under memory pressure in the presence of
+device memory before deciding to account device memory differently.
+
+
+The same decision was made for memory cgroup. Device memory pages are
+accounted against the same memory cgroup a regular page would be accounted
+to. This does simplify migration to and from device memory. This also means
+that migration back from device memory to regular memory cannot fail because
+it would go above the memory cgroup limit. We might revisit this choice later
+on, once we get more experience in how device memory is used and its impact
+on memory resource control.
+
+
+Note that device memory can never be pinned, neither by the device driver nor
+through GUP, and thus such memory is always freed upon process exit. Or, in
+the case of shared memory or file backed memory, when the last reference is
+dropped.