summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2021-09-04mm, slub: discard slabs in unfreeze_partials() without irqs disabledVlastimil Babka1-1/+2
No need for disabled irqs when discarding slabs, so restore them before discarding. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: move irq control into unfreeze_partials()Vlastimil Babka1-6/+7
unfreeze_partials() can be optimized so that it doesn't need irqs disabled for the whole time. As the first step, move irq control into the function and remove it from the put_cpu_partial() caller. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: call deactivate_slab() without disabling irqsVlastimil Babka1-5/+19
The function is now safe to be called with irqs enabled, so move the calls outside of irq disabled sections. When called from ___slab_alloc() -> flush_slab() we have irqs disabled, so to reenable them before deactivate_slab() we need to open-code flush_slab() in ___slab_alloc() and reenable irqs after modifying the kmem_cache_cpu fields. But that means a IRQ handler meanwhile might have assigned a new page to kmem_cache_cpu.page so we have to retry the whole check. The remaining callers of flush_slab() are the IPI handler which has disabled irqs anyway, and slub_cpu_dead() which will be dealt with in the following patch. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: make locking in deactivate_slab() irq-safeVlastimil Babka1-4/+5
dectivate_slab() now no longer touches the kmem_cache_cpu structure, so it will be possible to call it with irqs enabled. Just convert the spin_lock calls to their irq saving/restoring variants to make it irq-safe. Note we now have to use cmpxchg_double_slab() for irq-safe slab_lock(), because in some situations we don't take the list_lock, which would disable irqs. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: move reset of c->page and freelist out of deactivate_slab()Vlastimil Babka1-13/+18
deactivate_slab() removes the cpu slab by merging the cpu freelist with slab's freelist and putting the slab on the proper node's list. It also sets the respective kmem_cache_cpu pointers to NULL. By extracting the kmem_cache_cpu operations from the function, we can make it not dependent on disabled irqs. Also if we return a single free pointer from ___slab_alloc, we no longer have to assign kmem_cache_cpu.page before deactivation or care if somebody preempted us and assigned a different page to our kmem_cache_cpu in the process. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: stop disabling irqs around get_partial()Vlastimil Babka1-14/+8
The function get_partial() does not need to have irqs disabled as a whole. It's sufficient to convert spin_lock operations to their irq saving/restoring versions. As a result, it's now possible to reach the page allocator from the slab allocator without disabling and re-enabling interrupts on the way. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: check new pages with restored irqsVlastimil Babka1-5/+3
Building on top of the previous patch, re-enable irqs before checking new pages. alloc_debug_processing() is now called with enabled irqs so we need to remove VM_BUG_ON(!irqs_disabled()); in check_slab() - there doesn't seem to be a need for it anyway. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: validate slab from partial list or page allocator before making it ↵Vlastimil Babka1-8/+9
cpu slab When we obtain a new slab page from node partial list or page allocator, we assign it to kmem_cache_cpu, perform some checks, and if they fail, we undo the assignment. In order to allow doing the checks without irq disabled, restructure the code so that the checks are done first, and kmem_cache_cpu.page assignment only after they pass. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: restore irqs around calling new_slab()Vlastimil Babka1-6/+2
allocate_slab() currently re-enables irqs before calling to the page allocator. It depends on gfpflags_allow_blocking() to determine if it's safe to do so. Now we can instead simply restore irq before calling it through new_slab(). The other caller early_kmem_cache_node_alloc() is unaffected by this. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc()Vlastimil Babka1-9/+25
Continue reducing the irq disabled scope. Check for per-cpu partial slabs with first with irqs enabled and then recheck with irqs disabled before grabbing the slab page. Mostly preparatory for the following patches. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: do initial checks in ___slab_alloc() with irqs enabledVlastimil Babka1-9/+45
As another step of shortening irq disabled sections in ___slab_alloc(), delay disabling irqs until we pass the initial checks if there is a cached percpu slab and it's suitable for our allocation. Now we have to recheck c->page after actually disabling irqs as an allocation in irq handler might have replaced it. Because we call pfmemalloc_match() as one of the checks, we might hit VM_BUG_ON_PAGE(!PageSlab(page)) in PageSlabPfmemalloc in case we get interrupted and the page is freed. Thus introduce a pfmemalloc_match_unsafe() variant that lacks the PageSlab check. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net>
2021-09-04mm, slub: move disabling/enabling irqs to ___slab_alloc()Vlastimil Babka1-12/+24
Currently __slab_alloc() disables irqs around the whole ___slab_alloc(). This includes cases where this is not needed, such as when the allocation ends up in the page allocator and has to awkwardly enable irqs back based on gfp flags. Also the whole kmem_cache_alloc_bulk() is executed with irqs disabled even when it hits the __slab_alloc() slow path, and long periods with disabled interrupts are undesirable. As a first step towards reducing irq disabled periods, move irq handling into ___slab_alloc(). Callers will instead prevent the s->cpu_slab percpu pointer from becoming invalid via get_cpu_ptr(), thus preempt_disable(). This does not protect against modification by an irq handler, which is still done by disabled irq for most of ___slab_alloc(). As a small immediate benefit, slab_out_of_memory() from ___slab_alloc() is now called with irqs enabled. kmem_cache_alloc_bulk() disables irqs for its fastpath and then re-enables them before calling ___slab_alloc(), which then disables them at its discretion. The whole kmem_cache_alloc_bulk() operation also disables preemption. When ___slab_alloc() calls new_slab() to allocate a new page, re-enable preemption, because new_slab() will re-enable interrupts in contexts that allow blocking (this will be improved by later patches). The patch itself will thus increase overhead a bit due to disabled preemption (on configs where it matters) and increased disabling/enabling irqs in kmem_cache_alloc_bulk(), but that will be gradually improved in the following patches. Note in __slab_alloc() we need to change the #ifdef CONFIG_PREEMPT guard to CONFIG_PREEMPT_COUNT to make sure preempt disable/enable is properly paired in all configurations. On configs without involuntary preemption and debugging the re-read of kmem_cache_cpu pointer is still compiled out as it was before. [ Mike Galbraith <efault@gmx.de>: Fix kmem_cache_alloc_bulk() error path ] Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: simplify kmem_cache_cpu and tid setupVlastimil Babka1-13/+9
In slab_alloc_node() and do_slab_free() fastpaths we need to guarantee that our kmem_cache_cpu pointer is from the same cpu as the tid value. Currently that's done by reading the tid first using this_cpu_read(), then the kmem_cache_cpu pointer and verifying we read the same tid using the pointer and plain READ_ONCE(). This can be simplified to just fetching kmem_cache_cpu pointer and then reading tid using the pointer. That guarantees they are from the same cpu. We don't need to read the tid using this_cpu_read() because the value will be validated by this_cpu_cmpxchg_double(), making sure we are on the correct cpu and the freelist didn't change by anyone preempting us since reading the tid. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net>
2021-09-04mm, slub: restructure new page checks in ___slab_alloc()Vlastimil Babka1-6/+22
When we allocate slab object from a newly acquired page (from node's partial list or page allocator), we usually also retain the page as a new percpu slab. There are two exceptions - when pfmemalloc status of the page doesn't match our gfp flags, or when the cache has debugging enabled. The current code for these decisions is not easy to follow, so restructure it and add comments. The new structure will also help with the following changes. No functional change. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net>
2021-09-04mm, slub: return slab page from get_partial() and set c->page afterwardsVlastimil Babka1-10/+11
The function get_partial() finds a suitable page on a partial list, acquires and returns its freelist and assigns the page pointer to kmem_cache_cpu. In later patch we will need more control over the kmem_cache_cpu.page assignment, so instead of passing a kmem_cache_cpu pointer, pass a pointer to a pointer to a page that get_partial() can fill and the caller can assign the kmem_cache_cpu.page pointer. No functional change as all of this still happens with disabled IRQs. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04mm, slub: dissolve new_slab_objects() into ___slab_alloc()Vlastimil Babka1-32/+18
The later patches will need more fine grained control over individual actions in ___slab_alloc(), the only caller of new_slab_objects(), so dissolve it there. This is a preparatory step with no functional change. The only minor change is moving WARN_ON_ONCE() for using a constructor together with __GFP_ZERO to new_slab(), which makes it somewhat less frequent, but still able to catch a development change introducing a systematic misuse. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Mel Gorman <mgorman@techsingularity.net>
2021-09-04mm, slub: extract get_partial() from new_slab_objects()Vlastimil Babka1-6/+6
The later patches will need more fine grained control over individual actions in ___slab_alloc(), the only caller of new_slab_objects(), so this is a first preparatory step with no functional change. This adds a goto label that appears unnecessary at this point, but will be useful for later changes. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com>
2021-09-03Merge tag 'kbuild-v5.15' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Add -s option (strict mode) to merge_config.sh to make it fail when any symbol is redefined. - Show a warning if a different compiler is used for building external modules. - Infer --target from ARCH for CC=clang to let you cross-compile the kernel without CROSS_COMPILE. - Make the integrated assembler default (LLVM_IAS=1) for CC=clang. - Add <linux/stdarg.h> to the kernel source instead of borrowing <stdarg.h> from the compiler. - Add Nick Desaulniers as a Kbuild reviewer. - Drop stale cc-option tests. - Fix the combination of CONFIG_TRIM_UNUSED_KSYMS and CONFIG_LTO_CLANG to handle symbols in inline assembly. - Show a warning if 'FORCE' is missing for if_changed rules. - Various cleanups * tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (39 commits) kbuild: redo fake deps at include/ksym/*.h kbuild: clean up objtool_args slightly modpost: get the *.mod file path more simply checkkconfigsymbols.py: Fix the '--ignore' option kbuild: merge vmlinux_link() between ARCH=um and other architectures kbuild: do not remove 'linux' link in scripts/link-vmlinux.sh kbuild: merge vmlinux_link() between the ordinary link and Clang LTO kbuild: remove stale *.symversions kbuild: remove unused quiet_cmd_update_lto_symversions gen_compile_commands: extract compiler command from a series of commands x86: remove cc-option-yn test for -mtune= arc: replace cc-option-yn uses with cc-option s390: replace cc-option-yn uses with cc-option ia64: move core-y in arch/ia64/Makefile to arch/ia64/Kbuild sparc: move the install rule to arch/sparc/Makefile security: remove unneeded subdir-$(CONFIG_...) kbuild: sh: remove unused install script kbuild: Fix 'no symbols' warning when CONFIG_TRIM_UNUSD_KSYMS=y kbuild: Switch to 'f' variants of integrated assembler flag kbuild: Shuffle blank line to improve comment meaning ...
2021-09-03mm, slub: remove redundant unfreeze_partials() from put_cpu_partial()Vlastimil Babka1-7/+0
Commit d6e0b7fa1186 ("slub: make dead caches discard free slabs immediately") introduced cpu partial flushing for kmemcg caches, based on setting the target cpu_partial to 0 and adding a flushing check in put_cpu_partial(). This code that sets cpu_partial to 0 was later moved by c9fc586403e7 ("slab: introduce __kmemcg_cache_deactivate()") and ultimately removed by 9855609bde03 ("mm: memcg/slab: use a single set of kmem_caches for all accounted allocations"). However the check and flush in put_cpu_partial() was never removed, although it's effectively a dead code. So this patch removes it. Note that d6e0b7fa1186 also added preempt_disable()/enable() to unfreeze_partials() which could be thus also considered unnecessary. But further patches will rely on it, so keep it. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-03mm, slub: don't disable irq for debug_check_no_locks_freed()Vlastimil Babka1-13/+1
In slab_free_hook() we disable irqs around the debug_check_no_locks_freed() call, which is unnecessary, as irqs are already being disabled inside the call. This seems to be leftover from the past where there were more calls inside the irq disabled sections. Remove the irq disable/enable operations. Mel noted: > Looks like it was needed for kmemcheck which went away back in 4.15 Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net>
2021-09-03mm, slub: allocate private object map for validate_slab_cache()Vlastimil Babka1-9/+15
validate_slab_cache() is called either to handle a sysfs write, or from a self-test context. In both situations it's straightforward to preallocate a private object bitmap instead of grabbing the shared static one meant for critical sections, so let's do that. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Mel Gorman <mgorman@techsingularity.net>
2021-09-03mm, slub: allocate private object map for debugfs listingsVlastimil Babka1-15/+29
Slub has a static spinlock protected bitmap for marking which objects are on freelist when it wants to list them, for situations where dynamically allocating such map can lead to recursion or locking issues, and on-stack bitmap would be too large. The handlers of debugfs files alloc_traces and free_traces also currently use this shared bitmap, but their syscall context makes it straightforward to allocate a private map before entering locked sections, so switch these processing paths to use a private bitmap. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Mel Gorman <mgorman@techsingularity.net>
2021-09-03mm, slub: don't call flush_all() from slab_debug_trace_open()Vlastimil Babka1-3/+0
slab_debug_trace_open() can only be called on caches with SLAB_STORE_USER flag and as with all slub debugging flags, such caches avoid cpu or percpu partial slabs altogether, so there's nothing to flush. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com>
2021-09-03Merge branch 'akpm' (patches from Andrew)Linus Torvalds39-1137/+2072
Merge misc updates from Andrew Morton: "173 patches. Subsystems affected by this series: ia64, ocfs2, block, and mm (debug, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock, oom-kill, migration, ksm, percpu, vmstat, and madvise)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits) mm/madvise: add MADV_WILLNEED to process_madvise() mm/vmstat: remove unneeded return value mm/vmstat: simplify the array size calculation mm/vmstat: correct some wrong comments mm/percpu,c: remove obsolete comments of pcpu_chunk_populated() selftests: vm: add COW time test for KSM pages selftests: vm: add KSM merging time test mm: KSM: fix data type selftests: vm: add KSM merging across nodes test selftests: vm: add KSM zero page merging test selftests: vm: add KSM unmerge test selftests: vm: add KSM merge test mm/migrate: correct kernel-doc notation mm: wire up syscall process_mrelease mm: introduce process_mrelease system call memblock: make memblock_find_in_range method private mm/mempolicy.c: use in_task() in mempolicy_slab_node() mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies mm/mempolicy: advertise new MPOL_PREFERRED_MANY mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY ...
2021-09-03mm/madvise: add MADV_WILLNEED to process_madvise()zhangkui1-0/+1
There is a usecase in Android that an app process's memory is swapped out by process_madvise() with MADV_PAGEOUT, such as the memory is swapped to zram or a backing device. When the process is scheduled to running, like switch to foreground, multiple page faults may cause the app dropped frames. To reduce the problem, System Management Software can read-ahead memory of the process immediately when the app switches to forground. Calling process_madvise() with MADV_WILLNEED can meet this need. Link: https://lkml.kernel.org/r/20210804082010.12482-1-zhangkui@oppo.com Signed-off-by: zhangkui <zhangkui@oppo.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/vmstat: remove unneeded return valueMiaohe Lin1-6/+2
The return value of pagetypeinfo_showfree and pagetypeinfo_showblockcount are unused now. Remove them. Link: https://lkml.kernel.org/r/20210715122911.15700-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/vmstat: simplify the array size calculationMiaohe Lin1-5/+3
We can replace the array_num * sizeof(array[0]) with sizeof(array) to simplify the code. Link: https://lkml.kernel.org/r/20210715122911.15700-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/vmstat: correct some wrong commentsMiaohe Lin1-6/+1
Patch series "Cleanup for vmstat". This series contains cleanups to remove unneeded return value, correct wrong comment and simplify the array size calculation. More details can be found in the respective changelogs. This patch (of 3): Correct wrong fls(mem+1) to fls(mem)+1 and remove the duplicated comment with quiet_vmstat(). Link: https://lkml.kernel.org/r/20210715122911.15700-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210715122911.15700-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()Jing Xiangfeng1-3/+0
Commit b239f7daf553 ("percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE") removed the parameter 'for_alloc', so remove this comment. Link: https://lkml.kernel.org/r/1630576043-21367-1-git-send-email-jingxiangfeng@huawei.com Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm: KSM: fix data typeZhansaya Bagdauletkyzy1-4/+4
ksm_stable_node_chains_prune_millisecs is declared as int, but in stable__node_chains_prune_millisecs_store(), it can store values up to UINT_MAX. Change its type to unsigned int. Link: https://lkml.kernel.org/r/20210806111351.GA71845@asus Signed-off-by: Zhansaya Bagdauletkyzy <zhansayabagdaulet@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/migrate: correct kernel-doc notationRandy Dunlap1-1/+1
Use the expected "Return:" format to prevent a kernel-doc warning. mm/migrate.c:1157: warning: Excess function parameter 'returns' description in 'next_demotion_node' Link: https://lkml.kernel.org/r/20210808203151.10632-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm: introduce process_mrelease system callSuren Baghdasaryan1-0/+70
In modern systems it's not unusual to have a system component monitoring memory conditions of the system and tasked with keeping system memory pressure under control. One way to accomplish that is to kill non-essential processes to free up memory for more important ones. Examples of this are Facebook's OOM killer daemon called oomd and Android's low memory killer daemon called lmkd. For such system component it's important to be able to free memory quickly and efficiently. Unfortunately the time process takes to free up its memory after receiving a SIGKILL might vary based on the state of the process (uninterruptible sleep), size and OPP level of the core the process is running. A mechanism to free resources of the target process in a more predictable way would improve system's ability to control its memory pressure. Introduce process_mrelease system call that releases memory of a dying process from the context of the caller. This way the memory is freed in a more controllable way with CPU affinity and priority of the caller. The workload of freeing the memory will also be charged to the caller. The operation is allowed only on a dying process. After previous discussions [1, 2, 3] the decision was made [4] to introduce a dedicated system call to cover this use case. The API is as follows, int process_mrelease(int pidfd, unsigned int flags); DESCRIPTION The process_mrelease() system call is used to free the memory of an exiting process. The pidfd selects the process referred to by the PID file descriptor. (See pidfd_open(2) for further information) The flags argument is reserved for future use; currently, this argument must be specified as 0. RETURN VALUE On success, process_mrelease() returns 0. On error, -1 is returned and errno is set to indicate the error. ERRORS EBADF pidfd is not a valid PID file descriptor. EAGAIN Failed to release part of the address space. EINTR The call was interrupted by a signal; see signal(7). EINVAL flags is not 0. EINVAL The memory of the task cannot be released because the process is not exiting, the address space is shared with another live process or there is a core dump in progress. ENOSYS This system call is not supported, for example, without MMU support built into Linux. ESRCH The target process does not exist (i.e., it has terminated and been waited on). [1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/ [2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/ [3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/ [4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/ Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03memblock: make memblock_find_in_range method privateMike Rapoport1-1/+1
There are a lot of uses of memblock_find_in_range() along with memblock_reserve() from the times memblock allocation APIs did not exist. memblock_find_in_range() is the very core of memblock allocations, so any future changes to its internal behaviour would mandate updates of all the users outside memblock. Replace the calls to memblock_find_in_range() with an equivalent calls to memblock_phys_alloc() and memblock_phys_alloc_range() and make memblock_find_in_range() private method of memblock. This simplifies the callers, ensures that (unlikely) errors in memblock_reserve() are handled and improves maintainability of memblock_find_in_range(). Link: https://lkml.kernel.org/r/20210816122622.30279-1-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Kirill A. Shutemov <kirill.shtuemov@linux.intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [ACPI] Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Nick Kossifidis <mick@ics.forth.gr> [riscv] Tested-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/mempolicy.c: use in_task() in mempolicy_slab_node()Vasily Averin1-1/+1
Obsoleted in_intrrupt() include task context with disabled BH, it's better to use in_task() instead. Link: https://lkml.kernel.org/r/984ee771-4834-21da-801f-c15c18ddf4d1@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/mempolicy: unify the create() func for bind/interleave/prefer-many policiesFeng Tang1-20/+4
As they all do the same thing: sanity check and save nodemask info, create one mpol_new_nodemask() to reduce redundancy. Link: https://lkml.kernel.org/r/1627970362-61305-6-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/mempolicy: advertise new MPOL_PREFERRED_MANYBen Widawsky1-6/+1
Adds a new mode to the existing mempolicy modes, MPOL_PREFERRED_MANY. MPOL_PREFERRED_MANY will be adequately documented in the internal admin-guide with this patch. Eventually, the man pages for mbind(2), get_mempolicy(2), set_mempolicy(2) and numactl(8) will also have text about this mode. Those shall contain the canonical reference. NUMA systems continue to become more prevalent. New technologies like PMEM make finer grain control over memory access patterns increasingly desirable. MPOL_PREFERRED_MANY allows userspace to specify a set of nodes that will be tried first when performing allocations. If those allocations fail, all remaining nodes will be tried. It's a straight forward API which solves many of the presumptive needs of system administrators wanting to optimize workloads on such machines. The mode will work either per VMA, or per thread. [Michal Hocko: refine kernel doc for MPOL_PREFERRED_MANY] Link: https://lore.kernel.org/r/20200630212517.308045-13-ben.widawsky@intel.com Link: https://lkml.kernel.org/r/1627970362-61305-5-git-send-email-feng.tang@intel.com Signed-off-by: Ben Widawsky <ben.widawsky@intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANYBen Widawsky1-5/+25
Implement the missing huge page allocation functionality while obeying the preferred node semantics. This is similar to the implementation for general page allocation, as it uses a fallback mechanism to try multiple preferred nodes first, and then all other nodes. To avoid adding too many "#ifdef CONFIG_NUMA" check, add a helper function in mempolicy.h to check whether a mempolicy is MPOL_PREFERRED_MANY. [akpm@linux-foundation.org: fix compiling issue when merging with other hugetlb patch] [Thanks to 0day bot for catching the !CONFIG_NUMA compiling issue] [mhocko@suse.com: suggest to remove the #ifdef CONFIG_NUMA check] [ben.widawsky@intel.com: add helpers to avoid ifdefs] Link: https://lore.kernel.org/r/20200630212517.308045-12-ben.widawsky@intel.com Link: https://lkml.kernel.org/r/1627970362-61305-4-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/20210809024430.GA46432@shbuild999.sh.intel.com [nathan@kernel.org: initialize page to NULL in alloc_buddy_huge_page_with_mpol()] Link: https://lkml.kernel.org/r/20210810200632.3812797-1-nathan@kernel.org Link: https://lore.kernel.org/r/20200630212517.308045-12-ben.widawsky@intel.com Link: https://lkml.kernel.org/r/1627970362-61305-4-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/20210809024430.GA46432@shbuild999.sh.intel.com Signed-off-by: Ben Widawsky <ben.widawsky@intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Co-developed-by: Feng Tang <feng.tang@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/memplicy: add page allocation function for MPOL_PREFERRED_MANY policyFeng Tang1-0/+30
The semantics of MPOL_PREFERRED_MANY is similar to MPOL_PREFERRED, that it will first try to allocate memory from the preferred node(s), and fallback to all nodes in system when first try fails. Add a dedicated function alloc_pages_preferred_many() for it just like for 'interleave' policy, which will be used by 2 general memoory allocation APIs: alloc_pages() and alloc_pages_vma() Link: https://lore.kernel.org/r/20200630212517.308045-9-ben.widawsky@intel.com Link: https://lkml.kernel.org/r/1627970362-61305-3-git-send-email-feng.tang@intel.com Suggested-by: Michal Hocko <mhocko@suse.com> Originally-by: Ben Widawsky <ben.widawsky@intel.com> Co-developed-by: Ben Widawsky <ben.widawsky@intel.com> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodesDave Hansen1-14/+59
Patch series "Introduce multi-preference mempolicy", v7. This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy. This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2) interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a preference for nodes which will fulfil memory allocation requests. Unlike the MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or invoke the OOM killer if those preferred nodes are not available. Along with these patches are patches for libnuma, numactl, numademo, and memhog. They still need some polish, but can be found here: https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many It allows new usage: `numactl -P 0,3,4` The goal of the new mode is to enable some use-cases when using tiered memory usage models which I've lovingly named. 1a. The Hare - The interconnect is fast enough to meet bandwidth and latency requirements allowing preference to be given to all nodes with "fast" memory. 1b. The Indiscriminate Hare - An application knows it wants fast memory (or perhaps slow memory), but doesn't care which node it runs on. The application can prefer a set of nodes and then xpu bind to the local node (cpu, accelerator, etc). This reverses the nodes are chosen today where the kernel attempts to use local memory to the CPU whenever possible. This will attempt to use the local accelerator to the memory. 2. The Tortoise - The administrator (or the application itself) is aware it only needs slow memory, and so can prefer that. Much of this is almost achievable with the bind interface, but the bind interface suffers from an inability to fallback to another set of nodes if binding fails to all nodes in the nodemask. Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the preference. > /* Set first two nodes as preferred in an 8 node system. */ > const unsigned long nodes = 0x3 > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8); > /* Mimic interleave policy, but have fallback *. > const unsigned long nodes = 0xaa > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8); Some internal discussion took place around the interface. There are two alternatives which we have discussed, plus one I stuck in: 1. Ordered list of nodes. Currently it's believed that the added complexity is nod needed for expected usecases. 2. A flag for bind to allow falling back to other nodes. This confuses the notion of binding and is less flexible than the current solution. 3. Create flags or new modes that helps with some ordering. This offers both a friendlier API as well as a solution for more customized usage. It's unknown if it's worth the complexity to support this. Here is sample code for how this might work: > // Prefer specific nodes for some something wacky > set_mempolicy(MPOL_PREFER_MANY, 0x17c, 1024); > > // Default > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0); > // which is the same as > set_mempolicy(MPOL_DEFAULT, NULL, 0); > > // The Hare > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0); > > // The Tortoise > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0); > > // Prefer the fast memory of the first two sockets > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2); > This patch (of 5): The NUMA APIs currently allow passing in a "preferred node" as a single bit set in a nodemask. If more than one bit it set, bits after the first are ignored. This single node is generally OK for location-based NUMA where memory being allocated will eventually be operated on by a single CPU. However, in systems with multiple memory types, folks want to target a *type* of memory instead of a location. For instance, someone might want some high-bandwidth memory but do not care about the CPU next to which it is allocated. Or, they want a cheap, high capacity allocation and want to target all NUMA nodes which have persistent memory in volatile mode. In both of these cases, the application wants to target a *set* of nodes, but does not want strict MPOL_BIND behavior as that could lead to OOM killer or SIGSEGV. So add MPOL_PREFERRED_MANY policy to support the multiple preferred nodes requirement. This is not a pie-in-the-sky dream for an API. This was a response to a specific ask of more than one group at Intel. Specifically: 1. There are existing libraries that target memory types such as https://github.com/memkind/memkind. These are known to suffer from SIGSEGV's when memory is low on targeted memory "kinds" that span more than one node. The MCDRAM on a Xeon Phi in "Cluster on Die" mode is an example of this. 2. Volatile-use persistent memory users want to have a memory policy which is targeted at either "cheap and slow" (PMEM) or "expensive and fast" (DRAM). However, they do not want to experience allocation failures when the targeted type is unavailable. 3. Allocate-then-run. Generally, we let the process scheduler decide on which physical CPU to run a task. That location provides a default allocation policy, and memory availability is not generally considered when placing tasks. For situations where memory is valuable and constrained, some users want to allocate memory first, *then* allocate close compute resources to the allocation. This is the reverse of the normal (CPU) model. Accelerators such as GPUs that operate on core-mm-managed memory are interested in this model. A check is added in sanitize_mpol_flags() to not permit 'prefer_many' policy to be used for now, and will be removed in later patch after all implementations for 'prefer_many' are ready, as suggested by Michal Hocko. [mhocko@kernel.org: suggest to refine policy_node/policy_nodemask handling] Link: https://lkml.kernel.org/r/1627970362-61305-1-git-send-email-feng.tang@intel.com Link: https://lore.kernel.org/r/20200630212517.308045-4-ben.widawsky@intel.com Link: https://lkml.kernel.org/r/1627970362-61305-2-git-send-email-feng.tang@intel.com Co-developed-by: Ben Widawsky <ben.widawsky@intel.com> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com>b Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/mempolicy: use readable NUMA_NO_NODE macro instead of magic numberBaolin Wang1-3/+3
The caller of mpol_misplaced() already use NUMA_NO_NODE to check whether current page node is misplaced, thus using NUMA_NO_NODE in mpol_misplaced() instead of magic number is more readable. Link: https://lkml.kernel.org/r/1b77c0ce21183fa86f4db250b115cf5e27396528.1627558356.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm: compaction: support triggering of proactive compaction by userCharan Teja Reddy1-2/+36
The proactive compaction[1] gets triggered for every 500msec and run compaction on the node for COMPACTION_HPAGE_ORDER (usually order-9) pages based on the value set to sysctl.compaction_proactiveness. Triggering the compaction for every 500msec in search of COMPACTION_HPAGE_ORDER pages is not needed for all applications, especially on the embedded system usecases which may have few MB's of RAM. Enabling the proactive compaction in its state will endup in running almost always on such systems. Other side, proactive compaction can still be very much useful for getting a set of higher order pages in some controllable manner(controlled by using the sysctl.compaction_proactiveness). So, on systems where enabling the proactive compaction always may proove not required, can trigger the same from user space on write to its sysctl interface. As an example, say app launcher decide to launch the memory heavy application which can be launched fast if it gets more higher order pages thus launcher can prepare the system in advance by triggering the proactive compaction from userspace. This triggering of proactive compaction is done on a write to sysctl.compaction_proactiveness by user. [1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=facdaa917c4d5a376d09d25865f5a863f906234a [akpm@linux-foundation.org: tweak vm.rst, per Mike] Link: https://lkml.kernel.org/r/1627653207-12317-1-git-send-email-charante@codeaurora.org Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Nitin Gupta <nigupta@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm: compaction: optimize proactive compaction deferralsCharan Teja Reddy1-10/+19
Vlastimil Babka figured out that when fragmentation score didn't go down across the proactive compaction i.e. when no progress is made, next wake up for proactive compaction is deferred for 1 << COMPACT_MAX_DEFER_SHIFT, i.e. 64 times, with each wakeup interval of HPAGE_FRAG_CHECK_INTERVAL_MSEC(=500). In each of this wakeup, it just decrement 'proactive_defer' counter and goes sleep i.e. it is getting woken to just decrement a counter. The same deferral time can also achieved by simply doing the HPAGE_FRAG_CHECK_INTERVAL_MSEC << COMPACT_MAX_DEFER_SHIFT thus unnecessary wakeup of kcompact thread is avoided thus also removes the need of 'proactive_defer' thread counter. [akpm@linux-foundation.org: tweak comment] Link: https://lore.kernel.org/linux-fsdevel/88abfdb6-2c13-b5a6-5b46-742d12d1c910@suse.cz/ Link: https://lkml.kernel.org/r/1626869599-25412-1-git-send-email-charante@codeaurora.org Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Nitin Gupta <nigupta@nvidia.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm, vmscan: guarantee drop_slab_node() terminationVlastimil Babka1-1/+2
drop_slab_node() is called as part of echo 2>/proc/sys/vm/drop_caches operation. It iterates over all memcgs and calls shrink_slab() which in turn iterates over all slab shrinkers. Freed objects are counted and as long as the total number of freed objects from all memcgs and shrinkers is higher than 10, drop_slab_node() loops for another full memcgs*shrinkers iteration. This arbitrary constant threshold of 10 can result in effectively an infinite loop on a system with large number of memcgs and/or parallel activity that allocates new objects. This has been reported previously by Chunxin Zang [1] and recently by our customer. The previous report [1] has resulted in commit 069c411de40a ("mm/vmscan: fix infinite loop in drop_slab_node") which added a check for signals allowing the user to terminate the command writing to drop_caches. At the time it was also considered to make the threshold grow with each iteration to guarantee termination, but such patch hasn't been formally proposed yet. This patch implements the dynamically growing threshold. At first iteration it's enough to free one object to continue, and this threshold effectively doubles with each iteration. Our customer's feedback was positive. There is always a risk that this change will result on some system in a previously terminating drop_caches operation to terminate sooner and free fewer objects. Ideally the semantics would guarantee freeing all freeable objects that existed at the moment of starting the operation, while not looping forever for newly allocated objects, but that's not feasible to track. In the less ideal solution based on thresholds, arguably the termination guarantee is more important than the exhaustiveness guarantee. If there are reports of large regression wrt being exhaustive, we can tune how fast the threshold grows. [1] https://lore.kernel.org/lkml/20200909152047.27905-1-zangchunxin@bytedance.com/T/#u [vbabka@suse.cz: avoid undefined shift behaviour] Link: https://lkml.kernel.org/r/2f034e6f-a753-550a-f374-e4e23899d3d5@suse.cz Link: https://lkml.kernel.org/r/20210818152239.25502-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Chunxin Zang <zangchunxin@bytedance.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Chris Down <chris@chrisdown.name> Cc: Michal Hocko <mhocko@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/vmscan: add 'else' to remove check_pending labelMiaohe Lin1-9/+5
We could add 'else' to remove the somewhat odd check_pending label to make code core succinct. Link: https://lkml.kernel.org/r/20210717065911.61497-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alex Shi <alexs@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/vmscan: remove unneeded return value of kswapd_run()Miaohe Lin1-5/+2
The return value of kswapd_run() is unused now. Clean it up. Link: https://lkml.kernel.org/r/20210717065911.61497-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alex Shi <alexs@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/vmscan: remove misleading setting to sc->priorityMiaohe Lin1-2/+0
The priority field of sc is used to control how many pages we should scan at once while we always traverse the list to shrink the pages in these functions. So these settings are unneeded and misleading. Link: https://lkml.kernel.org/r/20210717065911.61497-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alex Shi <alexs@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/vmscan: remove the PageDirty check after MADV_FREE pages are page_ref_freezedMiaohe Lin1-5/+8
Patch series "Cleanups for vmscan", v2. This series contains cleanups to remove unneeded return value, misleading setting and so on. Also this remove the PageDirty check after MADV_FREE pages are page_ref_freezed. More details can be found in the respective changelogs. This patch (of 4): If the MADV_FREE pages are redirtied before they could be reclaimed, put the pages back to anonymous LRU list by setting SwapBacked flag and the pages will be reclaimed in normal swapout way. But as Yu Zhao pointed out, "The page has only one reference left, which is from the isolation. After the caller puts the page back on lru and drops the reference, the page will be freed anyway. It doesn't matter which lru it goes." So we don't bother checking PageDirty here. [Yu Zhao's comment is also quoted in the code.] Link: https://lkml.kernel.org/r/20210717065911.61497-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210717065911.61497-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alexs@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/vmpressure: replace vmpressure_to_css() with vmpressure_to_memcg()Hui Su2-4/+3
We can get memcg directly form vmpr instead of vmpr->memcg->css->memcg, so add a new func helper vmpressure_to_memcg(). And no code will use vmpressure_to_css(), so delete it. Link: https://lkml.kernel.org/r/20210630112146.455103-1-suhui@zeku.com Signed-off-by: Hui Su <suhui@zeku.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Chris Down <chris@chrisdown.name> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/migrate: add sysfs interface to enable reclaim migrationHuang Ying2-2/+64
Some method is obviously needed to enable reclaim-based migration. Just like traditional autonuma, there will be some workloads that will benefit like workloads with more "static" configurations where hot pages stay hot and cold pages stay cold. If pages come and go from the hot and cold sets, the benefits of this approach will be more limited. The benefits are truly workload-based and *not* hardware-based. We do not believe that there is a viable threshold where certain hardware configurations should have this mechanism enabled while others do not. To be conservative, earlier work defaulted to disable reclaim- based migration and did not include a mechanism to enable it. This proposes add a new sysfs file /sys/kernel/mm/numa/demotion_enabled as a method to enable it. We are open to any alternative that allows end users to enable this mechanism or disable it if workload harm is detected (just like traditional autonuma). Once this is enabled page demotion may move data to a NUMA node that does not fall into the cpuset of the allocating process. This could be construed to violate the guarantees of cpusets. However, since this is an opt-in mechanism, the assumption is that anyone enabling it is content to relax the guarantees. Link: https://lkml.kernel.org/r/20210721063926.3024591-9-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-10-ying.huang@intel.com Signed-off-by: Huang Ying <ying.huang@intel.com> Originally-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03mm/vmscan: never demote for memcg reclaimDave Hansen1-2/+7
Global reclaim aims to reduce the amount of memory used on a given node or set of nodes. Migrating pages to another node serves this purpose. memcg reclaim is different. Its goal is to reduce the total memory consumption of the entire memcg, across all nodes. Migration does not assist memcg reclaim because it just moves page contents between nodes rather than actually reducing memory consumption. Link: https://lkml.kernel.org/r/20210715055145.195411-9-ying.huang@intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>