summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2020-06-02x86: mm: ptdump: calculate effective permissions correctlySteven Price1-1/+16
Patch series "Fix W+X debug feature on x86" Jan alerted me[1] that the W+X detection debug feature was broken in x86 by my change[2] to switch x86 to use the generic ptdump infrastructure. Fundamentally the approach of trying to move the calculation of effective permissions into note_page() was broken because note_page() is only called for 'leaf' entries and the effective permissions are passed down via the internal nodes of the page tree. The solution I've taken here is to create a new (optional) callback which is called for all nodes of the page tree and therefore can calculate the effective permissions. Secondly on some configurations (32 bit with PAE) "unsigned long" is not large enough to store the table entries. The fix here is simple - let's just use a u64. [1] https://lore.kernel.org/lkml/d573dc7e-e742-84de-473d-f971142fa319@suse.com/ [2] 2ae27137b2db ("x86: mm: convert dump_pagetables to use walk_page_range") This patch (of 2): By switching the x86 page table dump code to use the generic code the effective permissions are no longer calculated correctly because the note_page() function is only called for *leaf* entries. To calculate the actual effective permissions it is necessary to observe the full hierarchy of the page tree. Introduce a new callback for ptdump which is called for every entry and can therefore update the prot_levels array correctly. note_page() can then simply access the appropriate element in the array. [steven.price@arm.com: make the assignment conditional on val != 0] Link: http://lkml.kernel.org/r/430c8ab4-e7cd-6933-dde6-087fac6db872@arm.com Fixes: 2ae27137b2db ("x86: mm: convert dump_pagetables to use walk_page_range") Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Steven Price <steven.price@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Qian Cai <cai@lca.pw> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200521152308.33096-1-steven.price@arm.com Link: http://lkml.kernel.org/r/20200521152308.33096-2-steven.price@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02memcg: fix memcg_kmem_bypass() for remote memcg chargingZefan Li1-1/+6
While trying to use remote memcg charging in an out-of-tree kernel module I found it's not working, because the current thread is a workqueue thread. As we will probably encounter this issue in the future as the users of memalloc_use_memcg() grow, and it's nothing wrong for this usage, it's better we fix it now. Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/1d202a12-26fe-0012-ea14-f025ddcd044a@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/memcg: automatically penalize tasks with high swap useJakub Kicinski1-7/+81
Add a memory.swap.high knob, which can be used to protect the system from SWAP exhaustion. The mechanism used for penalizing is similar to memory.high penalty (sleep on return to user space). That is not to say that the knob itself is equivalent to memory.high. The objective is more to protect the system from potentially buggy tasks consuming a lot of swap and impacting other tasks, or even bringing the whole system to stand still with complete SWAP exhaustion. Hopefully without the need to find per-task hard limits. Slowing misbehaving tasks down gradually allows user space oom killers or other protection mechanisms to react. oomd and earlyoom already do killing based on swap exhaustion, and memory.swap.high protection will help implement such userspace oom policies more reliably. We can use one counter for number of pages allocated under pressure to save struct task space and avoid two separate hierarchy walks on the hot path. The exact overage is calculated on return to user space, anyway. Take the new high limit into account when determining if swap is "full". Borrowing the explanation from Johannes: The idea behind "swap full" is that as long as the workload has plenty of swap space available and it's not changing its memory contents, it makes sense to generously hold on to copies of data in the swap device, even after the swapin. A later reclaim cycle can drop the page without any IO. Trading disk space for IO. But the only two ways to reclaim a swap slot is when they're faulted in and the references go away, or by scanning the virtual address space like swapoff does - which is very expensive (one could argue it's too expensive even for swapoff, it's often more practical to just reboot). So at some point in the fill level, we have to start freeing up swap slots on fault/swapin. Otherwise we could eventually run out of swap slots while they're filled with copies of data that is also in RAM. We don't want to OOM a workload because its available swap space is filled with redundant cache. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Chris Down <chris@chrisdown.name> Cc: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200527195846.102707-5-kuba@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/memcg: move cgroup high memory limit setting into struct page_counterJakub Kicinski1-8/+11
High memory limit is currently recorded directly in struct mem_cgroup. We are about to add a high limit for swap, move the field to struct page_counter and add some helpers. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Chris Down <chris@chrisdown.name> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200527195846.102707-4-kuba@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/memcg: move penalty delay clamping out of calculate_high_delay()Jakub Kicinski1-8/+8
We will want to call calculate_high_delay() twice - once for memory and once for swap, and we should apply the clamp value to sum of the penalties. Clamping has to be applied outside of calculate_high_delay(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Chris Down <chris@chrisdown.name> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200527195846.102707-3-kuba@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/memcg: prepare for swap over-high accounting and penalty calculationJakub Kicinski1-27/+35
Patch series "memcg: Slow down swap allocation as the available space gets depleted", v6. Tejun describes the problem as follows: When swap runs out, there's an abrupt change in system behavior - the anonymous memory suddenly becomes unmanageable which readily breaks any sort of memory isolation and can bring down the whole system. To avoid that, oomd [1] monitors free swap space and triggers kills when it drops below the specific threshold (e.g. 15%). While this works, it's far from ideal: - Depending on IO performance and total swap size, a given headroom might not be enough or too much. - oomd has to monitor swap depletion in addition to the usual pressure metrics and it currently doesn't consider memory.swap.max. Solve this by adapting parts of the approach that memory.high uses - slow down allocation as the resource gets depleted turning the depletion behavior from abrupt cliff one to gradual degradation observable through memory pressure metric. [1] https://github.com/facebookincubator/oomd This patch (of 4): Slice the memory overage calculation logic a little bit so we can reuse it to apply a similar penalty to the swap. The logic which accesses the memory-specific fields (use and high values) has to be taken out of calculate_high_delay(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Down <chris@chrisdown.name> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200527195846.102707-1-kuba@kernel.org Link: http://lkml.kernel.org/r/20200527195846.102707-2-kuba@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02memcg: expose root cgroup's memory.statShakeel Butt1-1/+0
One way to measure the efficiency of memory reclaim is to look at the ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are not updated consistently at the system level and the ratio of these are not very meaningful. The pgsteal and pgscan are updated for only global reclaim while pgrefill gets updated for global as well as cgroup reclaim. Please note that this difference is only for system level vmstats. The cgroup stats returned by memory.stat are actually consistent. The cgroup's pgsteal contains number of reclaimed pages for global as well as cgroup reclaim. So, one way to get the system level stats is to get these stats from root's memory.stat, so, expose memory.stat for the root cgroup. From Johannes Weiner: There are subtle differences between /proc/vmstat and memory.stat, and cgroup-aware code that wants to watch the full hierarchy currently has to know about these intricacies and translate semantics back and forth. Generally having the fully recursive memory.stat at the root level could help a broader range of usecases. Why not fix the stats by including both the global and cgroup reclaim activity instead of exposing root cgroup's memory.stat? The reason is the benefit of having metrics exposing the activity that happens purely due to machine capacity rather than localized activity that happens due to the limits throughout the cgroup tree. Additionally there are userspace tools like sysstat(sar) which reads these stats to inform about the system level reclaim activity. So, we should not break such use-cases. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Chris Down <chris@chrisdown.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200508170630.94406-1-shakeelb@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: memcontrol: simplify value comparison between count and limitKaixu Xia1-1/+1
When the variables count and limit have the same value(count == limit), the result of min(margin, limit - count) statement should be 0 and the variable margin is set to 0. So in this case, the min() statement is not necessary and we can directly set the variable margin to 0. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/1587479661-27237-1-git-send-email-kaixuxia@tencent.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm, memcg: add workingset_restore in memory.statYafang Shao1-0/+2
There's a new workingset counter introduced in commit 1899ad18c607 ("mm: workingset: tell cache transitions from workingset thrashing"). With the help of this counter we can know the workingset is transitioning or thrashing. To leverage the benifit of this counter to memcg, we should introduce it into memory.stat. Then we could know the workingset of the workload inside a memcg better. Bellow is the verification of this new counter in memory.stat. Read a file into the memory and then read it again to make these pages be active. The size of this file is 1G. (memory.max is greater than file size) The counters in memory.stat will be inactive_file 0 active_file 1073639424 workingset_refault 0 workingset_activate 0 workingset_restore 0 workingset_nodereclaim 0 Trigger the memcg reclaim by setting a lower value to memory.high, and then some pages will be demoted into inactive list, and then some pages in the inactive list will be evicted into the storage. inactive_file 498094080 active_file 310063104 workingset_refault 0 workingset_activate 0 workingset_restore 0 workingset_nodereclaim 0 Then recover the memory.high and read the file into memory again. As a result of it, the transitioning will occur. Bellow is the result of this transitioning, inactive_file 498094080 active_file 575397888 workingset_refault 64746 workingset_activate 64746 workingset_restore 64746 workingset_nodereclaim 0 Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Chris Down <chris@chrisdown.name> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Shakeel Butt <shakeelb@google.com> Link: http://lkml.kernel.org/r/20200504153522.11553-1-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: swapfile: fix /proc/swaps heading and Size/Used/Priority alignmentRandy Dunlap1-4/+8
Fix the heading and Size/Used/Priority field alignments in /proc/swaps. If the Size and/or Used value is >= 10000000 (8 bytes), then the alignment by using tab characters is broken. This patch maintains the use of tabs for alignment. If spaces are preferred, we can just use a Field Width specifier for the bytes and inuse fields. That way those fields don't have to be a multiple of 8 bytes in width. E.g., with a field width of 12, both Size and Used would always fit on the first line of an 80-column wide terminal (only Priority would be on the second line). There are actually 2 problems: heading alignment and field width. On an xterm, if Used is 7 bytes in length, the tab does nothing, and the display is like this, with no space/tab between the Used and Priority fields. (ugh) Filename Type Size Used Priority /dev/sda8 partition 16779260 2023012-1 To be clear, if one does 'cat /proc/swaps >/tmp/proc.swaps', it does look different, like so: Filename Type Size Used Priority /dev/sda8 partition 16779260 2086988 -1 Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/c0ffb41a-81ac-ddfa-d452-a9229ecc0387@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02swap: reduce lock contention on swap cache from swap slots allocationHuang Ying1-5/+56
In some swap scalability test, it is found that there are heavy lock contention on swap cache even if we have split one swap cache radix tree per swap device to one swap cache radix tree every 64 MB trunk in commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks"). The reason is as follow. After the swap device becomes fragmented so that there's no free swap cluster, the swap device will be scanned linearly to find the free swap slots. swap_info_struct->cluster_next is the next scanning base that is shared by all CPUs. So nearby free swap slots will be allocated for different CPUs. The probability for multiple CPUs to operate on the same 64 MB trunk is high. This causes the lock contention on the swap cache. To solve the issue, in this patch, for SSD swap device, a percpu version next scanning base (cluster_next_cpu) is added. Every CPU will use its own per-cpu next scanning base. And after finishing scanning a 64MB trunk, the per-cpu scanning base will be changed to the beginning of another randomly selected 64MB trunk. In this way, the probability for multiple CPUs to operate on the same 64 MB trunk is reduced greatly. Thus the lock contention is reduced too. For HDD, because sequential access is more important for IO performance, the original shared next scanning base is used. To test the patch, we have run 16-process pmbench memory benchmark on a 2-socket server machine with 48 cores. One ram disk is configured as the swap device per socket. The pmbench working-set size is much larger than the available memory so that swapping is triggered. The memory read/write ratio is 80/20 and the accessing pattern is random. In the original implementation, the lock contention on the swap cache is heavy. The perf profiling data of the lock contention code path is as following, _raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list: 7.91 _raw_spin_lock_irqsave.__remove_mapping.shrink_page_list: 7.11 _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51 _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 1.66 _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 1.29 _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.03 _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 0.93 After applying this patch, it becomes, _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58 _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 2.3 _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 2.26 _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 1.8 _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.19 The lock contention on the swap cache is almost eliminated. And the pmbench score increases 18.5%. The swapin throughput increases 18.7% from 2.96 GB/s to 3.51 GB/s. While the swapout throughput increases 18.5% from 2.99 GB/s to 3.54 GB/s. We need really fast disk to show the benefit. I have tried this on 2 Intel P3600 NVMe disks. The performance improvement is only about 1%. The improvement should be better on the faster disks, such as Intel Optane disk. [ying.huang@intel.com: fix cluster_next_cpu allocation and freeing, per Daniel] Link: http://lkml.kernel.org/r/20200525002648.336325-1-ying.huang@intel.com [ying.huang@intel.com: v4] Link: http://lkml.kernel.org/r/20200529010840.928819-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200520031502.175659-1-ying.huang@intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: use prandom_u32_max()Huang Ying1-1/+1
To improve the code readability and take advantage of the common implementation. Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200512081013.520201-1-ying.huang@intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: __swap_entry_free() always free 1 entryWei Yang1-4/+5
__swap_entry_free() always frees 1 entry. Let's remove the usage. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200501015259.32237-2-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02swap: try to scan more free slots even when fragmentedHuang Ying1-0/+22
Now, the scalability of swap code will drop much when the swap device becomes fragmented, because the swap slots allocation batching stops working. To solve the problem, in this patch, we will try to scan a little more swap slots with restricted effort to batch the swap slots allocation even if the swap device is fragmented. Test shows that the benchmark score can increase up to 37.1% with the patch. Details are as follows. The swap code has a per-cpu cache of swap slots. These batch swap space allocations to improve swap subsystem scaling. In the following code path, add_to_swap() get_swap_page() refill_swap_slots_cache() get_swap_pages() scan_swap_map_slots() scan_swap_map_slots() and get_swap_pages() can return multiple swap slots for each call. These slots will be cached in the per-CPU swap slots cache, so that several following swap slot requests will be fulfilled there to avoid the lock contention in the lower level swap space allocation/freeing code path. But this only works when there are free swap clusters. If a swap device becomes so fragmented that there's no free swap clusters, scan_swap_map_slots() and get_swap_pages() will return only one swap slot for each call in the above code path. Effectively, this falls back to the situation before the swap slots cache was introduced, the heavy lock contention on the swap related locks kills the scalability. Why does it work in this way? Because the swap device could be large, and the free swap slot scanning could be quite time consuming, to avoid taking too much time to scanning free swap slots, the conservative method was used. In fact, this can be improved via scanning a little more free slots with strictly restricted effort. Which is implemented in this patch. In scan_swap_map_slots(), after the first free swap slot is gotten, we will try to scan a little more, but only if we haven't scanned too many slots (< LATENCY_LIMIT). That is, the added scanning latency is strictly restricted. To test the patch, we have run 16-process pmbench memory benchmark on a 2-socket server machine with 48 cores. Multiple ram disks are configured as the swap devices. The pmbench working-set size is much larger than the available memory so that swapping is triggered. The memory read/write ratio is 80/20 and the accessing pattern is random, so the swap space becomes highly fragmented during the test. In the original implementation, the lock contention on swap related locks is very heavy. The perf profiling data of the lock contention code path is as following, _raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap: 21.03 _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 1.92 _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 1.72 _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 0.69 While after applying this patch, it becomes, _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node: 4.89 _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 3.85 _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.1 _raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.do_swap_page: 0.88 That is, the lock contention on the swap locks is eliminated. And the pmbench score increases 37.1%. The swapin throughput increases 45.7% from 2.02 GB/s to 2.94 GB/s. While the swapout throughput increases 45.3% from 2.04 GB/s to 2.97 GB/s. Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200427030023.264780-1-ying.huang@intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: omit a duplicate code by compare tmp and max firstWei Yang1-10/+8
There are two duplicate code to handle the case when there is no available swap entry. To avoid this, we can compare tmp and max first and let the second guard do its job. No functional change is expected. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200421213824.8099-3-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: tmp is always smaller than maxWei Yang1-1/+1
If tmp is bigger or equal to max, we would jump to new_cluster. Return true directly. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200421213824.8099-2-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: found_free could be represented by (tmp < max)Wei Yang1-8/+3
This is not necessary to use the variable found_free to record the status. Just check tmp and max is enough. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200421213824.8099-1-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: remove the extra check in scan_swap_map_slots()Wei Yang1-3/+0
scan_swap_map_slots() is only called by scan_swap_map() and get_swap_pages(). Both ensure nr would not exceed SWAP_BATCH. Just remove it. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200325220309.9803-2-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: simplify the calculation of n_goalWei Yang1-5/+1
Use min3() to simplify the comparison and make it more self-explaining. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200325220309.9803-1-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: remove the unnecessary goto for SSD caseWei Yang1-5/+1
Now we can see there is redundant goto for SSD case. In these two places, we can just let the code walk through to the correct tag instead of explicitly jump to it. Let's remove them for better readability. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/20200328060520.31449-4-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: explicitly show ssd/non-ssd is handled mutually exclusiveWei Yang1-7/+3
The code shows if this is ssd, it will jump to specific tag and skip the following code for non-ssd. Let's use "else if" to explicitly show the mutually exclusion for ssd/non-ssd to reduce ambiguity. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/20200328060520.31449-3-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile.c: offset is only used when there is more slotsWei Yang1-3/+1
scan_swap_map_slots() is used to iterate swap_map[] array for an available swap entry. While after several optimizations, e.g. for ssd case, the logic of this function is a little not easy to catch. This patchset tries to clean up the logic a little: * shows the ssd/non-ssd case is handled mutually exclusively * remove some unnecessary goto for ssd case This patch (of 3): When si->cluster_nr is zero, function would reach done and return. The increased offset would not be used any more. This means we can move the offset increment into the if clause. This brings a further code cleanup possibility. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/20200328060520.31449-1-richard.weiyang@gmail.com Link: http://lkml.kernel.org/r/20200328060520.31449-2-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: swap: properly update readahead statistics in unuse_pte_range()Andrea Righi1-4/+8
In unuse_pte_range() we blindly swap-in pages without checking if the swap entry is already present in the swap cache. By doing this, the hit/miss ratio used by the swap readahead heuristic is not properly updated and this leads to non-optimal performance during swapoff. Tracing the distribution of the readahead size returned by the swap readahead heuristic during swapoff shows that a small readahead size is used most of the time as if we had only misses (this happens both with cluster and vma readahead), for example: r::swapin_nr_pages(unsigned long offset):unsigned long:$retval COUNT EVENT 36948 $retval = 8 44151 $retval = 4 49290 $retval = 1 527771 $retval = 2 Checking if the swap entry is present in the swap cache, instead, allows to properly update the readahead statistics and the heuristic behaves in a better way during swapoff, selecting a bigger readahead size: r::swapin_nr_pages(unsigned long offset):unsigned long:$retval COUNT EVENT 1618 $retval = 1 4960 $retval = 2 41315 $retval = 4 103521 $retval = 8 In terms of swapoff performance the result is the following: Testing environment =================== - Host: CPU: 1.8GHz Intel Core i7-8565U (quad-core, 8MB cache) HDD: PC401 NVMe SK hynix 512GB MEM: 16GB - Guest (kvm): 8GB of RAM virtio block driver 16GB swap file on ext4 (/swapfile) Test case ========= - allocate 85% of memory - `systemctl hibernate` to force all the pages to be swapped-out to the swap file - resume the system - measure the time that swapoff takes to complete: # /usr/bin/time swapoff /swapfile Result (swapoff time) ====== 5.6 vanilla 5.6 w/ this patch ----------- ----------------- cluster-readahead 22.09s 12.19s vma-readahead 18.20s 15.33s Conclusion ========== The specific use case this patch is addressing is to improve swapoff performance in cloud environments when a VM has been hibernated, resumed and all the memory needs to be forced back to RAM by disabling swap. This change allows to better exploits the advantages of the readahead heuristic during swapoff and this improvement allows to to speed up the resume process of such VMs. [andrea.righi@canonical.com: update changelog] Link: http://lkml.kernel.org/r/20200418084705.GA147642@xps-13 Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Anchal Agarwal <anchalag@amazon.com> Cc: Hugh Dickins <hughd@google.com> Cc: Vineeth Remanan Pillai <vpillai@digitalocean.com> Cc: Kelley Nielsen <kelleynnn@gmail.com> Link: http://lkml.kernel.org/r/20200416180132.GB3352@xps-13 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swap_state: fix a data race in swapin_nr_pagesQian Cai1-2/+3
"prev_offset" is a static variable in swapin_nr_pages() that can be accessed concurrently with only mmap_sem held in read mode as noticed by KCSAN, BUG: KCSAN: data-race in swap_cluster_readahead / swap_cluster_readahead write to 0xffffffff92763830 of 8 bytes by task 14795 on cpu 17: swap_cluster_readahead+0x2a6/0x5e0 swapin_readahead+0x92/0x8dc do_swap_page+0x49b/0xf20 __handle_mm_fault+0xcfb/0xd70 handle_mm_fault+0xfc/0x2f0 do_page_fault+0x263/0x715 page_fault+0x34/0x40 1 lock held by (dnf)/14795: #0: ffff897bd2e98858 (&mm->mmap_sem#2){++++}-{3:3}, at: do_page_fault+0x143/0x715 do_user_addr_fault at arch/x86/mm/fault.c:1405 (inlined by) do_page_fault at arch/x86/mm/fault.c:1535 irq event stamp: 83493 count_memcg_event_mm+0x1a6/0x270 count_memcg_event_mm+0x119/0x270 __do_softirq+0x365/0x589 irq_exit+0xa2/0xc0 read to 0xffffffff92763830 of 8 bytes by task 1 on cpu 22: swap_cluster_readahead+0xfd/0x5e0 swapin_readahead+0x92/0x8dc do_swap_page+0x49b/0xf20 __handle_mm_fault+0xcfb/0xd70 handle_mm_fault+0xfc/0x2f0 do_page_fault+0x263/0x715 page_fault+0x34/0x40 1 lock held by systemd/1: #0: ffff897c38f14858 (&mm->mmap_sem#2){++++}-{3:3}, at: do_page_fault+0x143/0x715 irq event stamp: 43530289 count_memcg_event_mm+0x1a6/0x270 count_memcg_event_mm+0x119/0x270 __do_softirq+0x365/0x589 irq_exit+0xa2/0xc0 Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200402213748.2237-1-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/swapfile: use list_{prev,next}_entry() instead of open-codingchenqiwu1-10/+6
Use list_{prev,next}_entry() instead of list_entry() for better code readability. Signed-off-by: chenqiwu <chenqiwu@xiaomi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Qian Cai <cai@lca.pw> Cc: Baoquan He <bhe@redhat.com> Link: http://lkml.kernel.org/r/1586599916-15456-2-git-send-email-qiwuchen55@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/gup.c: further document vma_permits_fault()Miles Chen1-1/+2
Describe the caller's responsibilities when passing FAULT_FLAG_ALLOW_RETRY. Link: http://lkml.kernel.org/r/1586915606.5647.5.camel@mtkswgap22 Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/gup: introduce pin_user_pages_unlockedJohn Hubbard1-0/+17
Introduce pin_user_pages_unlocked(), which is nearly identical to the get_user_pages_unlocked() that it wraps, except that it sets FOLL_PIN and rejects FOLL_GET. Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Walls <awalls@md.metrocast.net> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Link: http://lkml.kernel.org/r/20200518012157.1178336-2-jhubbard@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/gup.c: update the documentationSouptick Joarder1-18/+39
This patch is an attempt to update the documentation. - Add/ remove extra * based on type of function static/global. - Add description for functions and their input arguments. [akpm@linux-foundation.org: s@/*@/**@] Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1588013630-4497-1-git-send-email-jrdr.linux@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/writeback: discard NR_UNSTABLE_NFS, use NR_WRITEBACK insteadNeilBrown4-20/+14
After an NFS page has been written it is considered "unstable" until a COMMIT request succeeds. If the COMMIT fails, the page will be re-written. These "unstable" pages are currently accounted as "reclaimable", either in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a 'reclaimable' count. This might have made sense when sending the COMMIT required a separate action by the VFS/MM (e.g. releasepage() used to send a COMMIT). However now that all writes generated by ->writepages() will automatically be followed by a COMMIT (since commit 919e3bd9a875 ("NFS: Ensure we commit after writeback is complete")) it makes more sense to treat them as writeback pages. So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in NR_WRITEBACK and WB_WRITEBACK. A particular effect of this change is that when wb_check_background_flush() calls wb_over_bg_threshold(), the latter will report 'true' a lot less often as the 'unstable' pages are no longer considered 'dirty' (as there is nothing that writeback can do about them anyway). Currently wb_check_background_flush() will trigger writeback to NFS even when there are relatively few dirty pages (if there are lots of unstable pages), this can result in small writes going to the server (10s of Kilobytes rather than a Megabyte) which hurts throughput. With this patch, there are fewer writes which are each larger on average. Where the NR_UNSTABLE_NFS count was included in statistics virtual-files, the entry is retained, but the value is hard-coded as zero. static trace points and warning printks which mentioned this counter no longer report it. [akpm@linux-foundation.org: re-layout comment] [akpm@linux-foundation.org: fix printk warning] Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com> Acked-by: Michal Hocko <mhocko@suse.com> [mm] Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLENeilBrown2-10/+35
PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a daemon needs to write to one bdi (the final bdi) in order to free up writes queued to another bdi (the client bdi). The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty pages, so that it can still dirty pages after other processses have been throttled. The purpose of this is to avoid deadlock that happen when the PF_LESS_THROTTLE process must write for any dirty pages to be freed, but it is being thottled and cannot write. This approach was designed when all threads were blocked equally, independently on which device they were writing to, or how fast it was. Since that time the writeback algorithm has changed substantially with different threads getting different allowances based on non-trivial heuristics. This means the simple "add 25%" heuristic is no longer reliable. The important issue is not that the daemon needs a *larger* dirty page allowance, but that it needs a *private* dirty page allowance, so that dirty pages for the "client" bdi that it is helping to clear (the bdi for an NFS filesystem or loop block device etc) do not affect the throttling of the daemon writing to the "final" bdi. This patch changes the heuristic so that the task is not throttled when the bdi it is writing to has a dirty page count below below (or equal to) the free-run threshold for that bdi. This ensures it will always be able to have some pages in flight, and so will not deadlock. In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might still be throttled by global threshold, but that is acceptable as it is only the deadlock state that is interesting for this flag. This approach of "only throttle when target bdi is busy" is consistent with the other use of PF_LESS_THROTTLE in current_may_throttle(), were it causes attention to be focussed only on the target bdi. So this patch - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE, - removes the 25% bonus that that flag gives, and - If PF_LOCAL_THROTTLE is set, don't delay at all unless the global and the local free-run thresholds are exceeded. Note that previously realtime threads were treated the same as PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for real-time threads, so it is now different from the behaviour of nfsd and loop tasks. I don't know what is wanted for realtime. [akpm@linux-foundation.org: coding style fixes] Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Chuck Lever <chuck.lever@oracle.com> [nfsd] Cc: Christoph Hellwig <hch@lst.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/page-writeback.c: remove unused variableChao Yu1-3/+1
Commit 64081362e8ff ("mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock") left unused variable, remove it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200528033740.17269-1-yuchao0@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/filemap.c: remove misleading commentMatthew Wilcox (Oracle)1-1/+0
We no longer return 0 here and the comment doesn't tell us anything that we don't already know (SIGBUS is a pretty good indicator that things didn't work out). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Link: http://lkml.kernel.org/r/20200529123243.20640-1-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/migrate.c: call detach_page_private to cleanup codeGuoqing Jiang1-6/+1
We can cleanup code a little by call detach_page_private here. [akpm@linux-foundation.org: use attach_page_private(), per Dave] http://lkml.kernel.org/r/20200521225220.GV2005@dread.disaster.area [akpm@linux-foundation.org: clear PagePrivate] Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200519214049.15179-1-guoqing.jiang@cloud.ionos.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02fs: convert mpage_readpages to mpage_readaheadMatthew Wilcox (Oracle)1-1/+1
Implement the new readahead aop and convert all callers (block_dev, exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6, reiserfs & udf). The callers are all trivial except for GFS2 & OCFS2. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> # ocfs2 Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> # ocfs2 Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-17-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: use memalloc_nofs_save in readahead pathMatthew Wilcox (Oracle)1-0/+14
Ensure that memory allocations in the readahead path do not attempt to reclaim file-backed pages, which could lead to a deadlock. It is possible, though unlikely this is the root cause of a problem observed by Cong Wang. Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Suggested-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-16-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: document why we don't set PageReadaheadMatthew Wilcox (Oracle)1-3/+6
If the page is already in cache, we don't set PageReadahead on it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-15-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: add page_cache_readahead_unboundedMatthew Wilcox (Oracle)1-21/+47
ext4 and f2fs have duplicated the guts of the readahead code so they can read past i_size. Instead, separate out the guts of the readahead code so they can call it directly. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-14-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: move end_index check out of readahead loopMatthew Wilcox (Oracle)1-6/+8
By reducing nr_to_read, we can eliminate this check from inside the loop. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-13-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: add readahead address space operationMatthew Wilcox (Oracle)1-2/+10
This replaces ->readpages with a saner interface: - Return void instead of an ignored error code. - Page cache is already populated with locked pages when ->readahead is called. - New arguments can be passed to the implementation without changing all the filesystems that use a common helper function like mpage_readahead(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-12-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: put readahead pages in cache earlierMatthew Wilcox (Oracle)1-18/+28
When populating the page cache for readahead, mappings that use ->readpages must populate the page cache themselves as the pages are passed on a linked list which would normally be used for the page cache's LRU. For mappings that use ->readpage or the upcoming ->readahead method, we can put the pages into the page cache as soon as they're allocated, which solves a race between readahead and direct IO. It also lets us remove the gfp argument from read_pages(). Use the new readahead_page() API to implement the repeated calls to ->readpage(), just like most filesystems will. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-11-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: remove 'page_offset' from readahead loopMatthew Wilcox (Oracle)1-5/+3
Replace the page_offset variable with 'index + i'. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-10-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: rename readahead loop variable to 'i'Matthew Wilcox (Oracle)1-4/+4
Change the type of page_idx to unsigned long, and rename it -- it's just a loop counter, not a page index. Suggested-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-9-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: rename various 'offset' parameters to 'index'Matthew Wilcox (Oracle)1-44/+42
The word 'offset' is used ambiguously to mean 'byte offset within a page', 'byte offset from the start of the file' and 'page offset from the start of the file'. Use 'index' to mean 'page offset from the start of the file' throughout the readahead code. [ We should probably rename the 'pgoff_t' type to 'pgidx_t' too - Linus ] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-8-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: use readahead_control to pass argumentsMatthew Wilcox (Oracle)1-14/+19
In this patch, only between __do_page_cache_readahead() and read_pages(), but it will be extended in upcoming patches. The read_pages() function becomes aops centric, as this makes the most sense by the end of the patchset. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-7-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: move readahead nr_pages check into read_pagesMatthew Wilcox (Oracle)1-5/+7
Simplify the callers by moving the check for nr_pages and the BUG_ON into read_pages(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-5-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: ignore return value of ->readpagesMatthew Wilcox (Oracle)1-6/+2
We used to assign the return value to a variable, which we then ignored. Remove the pretence of caring. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-4-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: return void from various readahead functionsMatthew Wilcox (Oracle)3-28/+19
ondemand_readahead has two callers, neither of which use the return value. That means that both ra_submit and __do_page_cache_readahead() can return void, and we don't need to worry that a present page in the readahead window causes us to return a smaller nr_pages than we ought to have. Similarly, no caller uses the return value from force_page_cache_readahead(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-3-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm: move readahead prototypes from mm.hMatthew Wilcox (Oracle)2-0/+4
Patch series "Change readahead API", v11. This series adds a readahead address_space operation to replace the readpages operation. The key difference is that pages are added to the page cache as they are allocated (and then looked up by the filesystem) instead of passing them on a list to the readpages operation and having the filesystem add them to the page cache. It's a net reduction in code for each implementation, more efficient than walking a list, and solves the direct-write vs buffered-read problem reported by yu kuai at http://lkml.kernel.org/r/20200116063601.39201-1-yukuai3@huawei.com The only unconverted filesystems are those which use fscache. Their conversion is pending Dave Howells' rewrite which will make the conversion substantially easier. This should be completed by the end of the year. I want to thank the reviewers/testers; Dave Chinner, John Hubbard, Eric Biggers, Johannes Thumshirn, Dave Sterba, Zi Yan, Christoph Hellwig and Miklos Szeredi have done a marvellous job of providing constructive criticism. These patches pass an xfstests run on ext4, xfs & btrfs with no regressions that I can tell (some of the tests seem a little flaky before and remain flaky afterwards). This patch (of 25): The readahead code is part of the page cache so should be found in the pagemap.h file. force_page_cache_readahead is only used within mm, so move it to mm/internal.h instead. Remove the parameter names where they add no value, and rename the ones which were actively misleading. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-1-willy@infradead.org Link: http://lkml.kernel.org/r/20200414150233.24495-2-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm, dump_page(): do not crash with invalid mapping pointerVlastimil Babka1-6/+50
We have seen a following problem on a RPi4 with 1G RAM: BUG: Bad page state in process systemd-hwdb pfn:35601 page:ffff7e0000d58040 refcount:15 mapcount:131221 mapping:efd8fe765bc80080 index:0x1 compound_mapcount: -32767 Unable to handle kernel paging request at virtual address efd8fe765bc80080 Mem abort info: ESR = 0x96000004 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 [efd8fe765bc80080] address between user and kernel address ranges Internal error: Oops: 96000004 [#1] SMP Modules linked in: btrfs libcrc32c xor xor_neon zlib_deflate raid6_pq mmc_block xhci_pci xhci_hcd usbcore sdhci_iproc sdhci_pltfm sdhci mmc_core clk_raspberrypi gpio_raspberrypi_exp pcie_brcmstb bcm2835_dma gpio_regulator phy_generic fixed sg scsi_mod efivarfs Supported: No, Unreleased kernel CPU: 3 PID: 408 Comm: systemd-hwdb Not tainted 5.3.18-8-default #1 SLE15-SP2 (unreleased) Hardware name: raspberrypi rpi/rpi, BIOS 2020.01 02/21/2020 pstate: 40000085 (nZcv daIf -PAN -UAO) pc : __dump_page+0x268/0x368 lr : __dump_page+0xc4/0x368 sp : ffff000012563860 x29: ffff000012563860 x28: ffff80003ddc4300 x27: 0000000000000010 x26: 000000000000003f x25: ffff7e0000d58040 x24: 000000000000000f x23: efd8fe765bc80080 x22: 0000000000020095 x21: efd8fe765bc80080 x20: ffff000010ede8b0 x19: ffff7e0000d58040 x18: ffffffffffffffff x17: 0000000000000001 x16: 0000000000000007 x15: ffff000011689708 x14: 3030386362353637 x13: 6566386466653a67 x12: 6e697070616d2031 x11: 32323133313a746e x10: 756f6370616d2035 x9 : ffff00001168a840 x8 : ffff00001077a670 x7 : 000000000000013d x6 : ffff0000118a43b5 x5 : 0000000000000001 x4 : ffff80003dd9e2c8 x3 : ffff80003dd9e2c8 x2 : 911c8d7c2f483500 x1 : dead000000000100 x0 : efd8fe765bc80080 Call trace: __dump_page+0x268/0x368 bad_page+0xd4/0x168 check_new_page_bad+0x80/0xb8 rmqueue_bulk.constprop.26+0x4d8/0x788 get_page_from_freelist+0x4d4/0x1228 __alloc_pages_nodemask+0x134/0xe48 alloc_pages_vma+0x198/0x1c0 do_anonymous_page+0x1a4/0x4d8 __handle_mm_fault+0x4e8/0x560 handle_mm_fault+0x104/0x1e0 do_page_fault+0x1e8/0x4c0 do_translation_fault+0xb0/0xc0 do_mem_abort+0x50/0xb0 el0_da+0x24/0x28 Code: f9401025 8b8018a0 9a851005 17ffffca (f94002a0) Besides the underlying issue with page->mapping containing a bogus value for some reason, we can see that __dump_page() crashed by trying to read the pointer at mapping->host, turning a recoverable warning into full Oops. It can be expected that when page is reported as bad state for some reason, the pointers there should not be trusted blindly. So this patch treats all data in __dump_page() that depends on page->mapping as lava, using probe_kernel_read_strict(). Ideally this would include the dentry->d_parent recursively, but that would mean changing printk handler for %pd. Chances of reaching the dentry printing part with an initially bogus mapping pointer should be rather low, though. Also prefix printing mapping->a_ops with a description of what is being printed. In case the value is bogus, %ps will print raw value instead of the symbol name and then it's not obvious at all that it's printing a_ops. Reported-by: Petr Tesarik <ptesarik@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: John Hubbard <jhubbard@nvidia.com> Link: http://lkml.kernel.org/r/20200331165454.12263-1-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02mm/slub: fix stack overruns with SLUB_STATSQian Cai1-1/+2
There is no need to copy SLUB_STATS items from root memcg cache to new memcg cache copies. Doing so could result in stack overruns because the store function only accepts 0 to clear the stat and returns an error for everything else while the show method would print out the whole stat. Then, the mismatch of the lengths returns from show and store methods happens in memcg_propagate_slab_attrs(): else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf)) buf = mbuf; max_attr_size is only 2 from slab_attr_store(), then, it uses mbuf[64] in show_stat() later where a bounch of sprintf() would overrun the stack variable. Fix it by always allocating a page of buffer to be used in show_stat() if SLUB_STATS=y which should only be used for debug purpose. # echo 1 > /sys/kernel/slab/fs_cache/shrink BUG: KASAN: stack-out-of-bounds in number+0x421/0x6e0 Write of size 1 at addr ffffc900256cfde0 by task kworker/76:0/53251 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func Call Trace: number+0x421/0x6e0 vsnprintf+0x451/0x8e0 sprintf+0x9e/0xd0 show_stat+0x124/0x1d0 alloc_slowpath_show+0x13/0x20 __kmem_cache_create+0x47a/0x6b0 addr ffffc900256cfde0 is located in stack of task kworker/76:0/53251 at offset 0 in frame: process_one_work+0x0/0xb90 this frame has 1 object: [32, 72) 'lockdep_map' Memory state around the buggy address: ffffc900256cfc80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffffc900256cfd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffffc900256cfd80: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 ^ ffffc900256cfe00: 00 00 00 00 00 f2 f2 f2 00 00 00 00 00 00 00 00 ffffc900256cfe80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: __kmem_cache_create+0x6ac/0x6b0 Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func Call Trace: __kmem_cache_create+0x6ac/0x6b0 Fixes: 107dab5c92d5 ("slub: slub-specific propagation changes") Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Glauber Costa <glauber@scylladb.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200429222356.4322-1-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>