From 3806b04144e5e030aa17835ac1bb42473af4b957 Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Fri, 31 May 2019 22:30:03 -0700 Subject: mm/vmalloc.c: fix typo in comment Reported-by: Nicholas Joll Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 233af6936c93..7350a124524b 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -815,7 +815,7 @@ find_vmap_lowest_match(unsigned long size, } /* - * OK. We roll back and find the fist right sub-tree, + * OK. We roll back and find the first right sub-tree, * that will satisfy the search criteria. It can happen * only once due to "vstart" restriction. */ -- cgit v1.2.3 From bc81426f5beef7da863d3365bc9d45e820448745 Mon Sep 17 00:00:00 2001 From: Michal Koutný Date: Fri, 31 May 2019 22:30:19 -0700 Subject: prctl_set_mm: downgrade mmap_sem to read lock MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The commit a3b609ef9f8b ("proc read mm's {arg,env}_{start,end} with mmap semaphore taken.") added synchronization of reading argument/environment boundaries under mmap_sem. Later commit 88aa7cc688d4 ("mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct") avoided the coarse use of mmap_sem in similar situations. But there still remained two places that (mis)use mmap_sem. get_cmdline should also use arg_lock instead of mmap_sem when it reads the boundaries. The second place that should use arg_lock is in prctl_set_mm. By protecting the boundaries fields with the arg_lock, we can downgrade mmap_sem to reader lock (analogous to what we already do in prctl_set_mm_map). [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20190502125203.24014-3-mkoutny@suse.com Fixes: 88aa7cc688d4 ("mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct") Signed-off-by: Michal Koutný Signed-off-by: Laurent Dufour Co-developed-by: Laurent Dufour Reviewed-by: Cyrill Gorcunov Acked-by: Michal Hocko Cc: Yang Shi Cc: Mateusz Guzik Cc: Kirill Tkhai Cc: Konstantin Khlebnikov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 11 +++++++++-- mm/util.c | 4 ++-- 2 files changed, 11 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/kernel/sys.c b/kernel/sys.c index 775bf8d18d03..2969304c29fe 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2127,9 +2127,15 @@ static int prctl_set_mm(int opt, unsigned long addr, error = -EINVAL; - down_write(&mm->mmap_sem); + /* + * arg_lock protects concurent updates of arg boundaries, we need + * mmap_sem for a) concurrent sys_brk, b) finding VMA for addr + * validation. + */ + down_read(&mm->mmap_sem); vma = find_vma(mm, addr); + spin_lock(&mm->arg_lock); prctl_map.start_code = mm->start_code; prctl_map.end_code = mm->end_code; prctl_map.start_data = mm->start_data; @@ -2217,7 +2223,8 @@ static int prctl_set_mm(int opt, unsigned long addr, error = 0; out: - up_write(&mm->mmap_sem); + spin_unlock(&mm->arg_lock); + up_read(&mm->mmap_sem); return error; } diff --git a/mm/util.c b/mm/util.c index 91682a2090ee..9834c4ab7d8e 100644 --- a/mm/util.c +++ b/mm/util.c @@ -718,12 +718,12 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen) if (!mm->arg_end) goto out_mm; /* Shh! No looking before we're done */ - down_read(&mm->mmap_sem); + spin_lock(&mm->arg_lock); arg_start = mm->arg_start; arg_end = mm->arg_end; env_start = mm->env_start; env_end = mm->env_end; - up_read(&mm->mmap_sem); + spin_unlock(&mm->arg_lock); len = arg_end - arg_start; -- cgit v1.2.3 From 3e8589963773a5c23e2f1fe4bcad0e9a90b7f471 Mon Sep 17 00:00:00 2001 From: Jiri Slaby Date: Fri, 31 May 2019 22:30:26 -0700 Subject: memcg: make it work on sparse non-0-node systems We have a single node system with node 0 disabled: Scanning NUMA topology in Northbridge 24 Number of physical nodes 2 Skipping disabled node 0 Node 1 MemBase 0000000000000000 Limit 00000000fbff0000 NODE_DATA(1) allocated [mem 0xfbfda000-0xfbfeffff] This causes crashes in memcg when system boots: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 #PF error: [normal kernel read fault] ... RIP: 0010:list_lru_add+0x94/0x170 ... Call Trace: d_lru_add+0x44/0x50 dput.part.34+0xfc/0x110 __fput+0x108/0x230 task_work_run+0x9f/0xc0 exit_to_usermode_loop+0xf5/0x100 It is reproducible as far as 4.12. I did not try older kernels. You have to have a new enough systemd, e.g. 241 (the reason is unknown -- was not investigated). Cannot be reproduced with systemd 234. The system crashes because the size of lru array is never updated in memcg_update_all_list_lrus and the reads are past the zero-sized array, causing dereferences of random memory. The root cause are list_lru_memcg_aware checks in the list_lru code. The test in list_lru_memcg_aware is broken: it assumes node 0 is always present, but it is not true on some systems as can be seen above. So fix this by avoiding checks on node 0. Remember the memcg-awareness by a bool flag in struct list_lru. Link: http://lkml.kernel.org/r/20190522091940.3615-1-jslaby@suse.cz Fixes: 60d3fd32a7a9 ("list_lru: introduce per-memcg lists") Signed-off-by: Jiri Slaby Acked-by: Michal Hocko Suggested-by: Vladimir Davydov Acked-by: Vladimir Davydov Reviewed-by: Shakeel Butt Cc: Johannes Weiner Cc: Raghavendra K T Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/list_lru.h | 1 + mm/list_lru.c | 8 +++----- 2 files changed, 4 insertions(+), 5 deletions(-) (limited to 'mm') diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h index aa5efd9351eb..d5ceb2839a2d 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -54,6 +54,7 @@ struct list_lru { #ifdef CONFIG_MEMCG_KMEM struct list_head list; int shrinker_id; + bool memcg_aware; #endif }; diff --git a/mm/list_lru.c b/mm/list_lru.c index 0bdf3152735e..e4709fdaa8e6 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -38,11 +38,7 @@ static int lru_shrinker_id(struct list_lru *lru) static inline bool list_lru_memcg_aware(struct list_lru *lru) { - /* - * This needs node 0 to be always present, even - * in the systems supporting sparse numa ids. - */ - return !!lru->node[0].memcg_lrus; + return lru->memcg_aware; } static inline struct list_lru_one * @@ -452,6 +448,8 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) { int i; + lru->memcg_aware = memcg_aware; + if (!memcg_aware) return 0; -- cgit v1.2.3 From df17277b2a85c00f5710e33ce238ba4114687a28 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Fri, 31 May 2019 22:30:33 -0700 Subject: mm/gup: continue VM_FAULT_RETRY processing even for pre-faults When get_user_pages*() is called with pages = NULL, the processing of VM_FAULT_RETRY terminates early without actually retrying to fault-in all the pages. If the pages in the requested range belong to a VMA that has userfaultfd registered, handle_userfault() returns VM_FAULT_RETRY *after* user space has populated the page, but for the gup pre-fault case there's no actual retry and the caller will get no pages although they are present. This issue was uncovered when running post-copy memory restore in CRIU after d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails"). After this change, the copying of FPU state to the sigframe switched from copy_to_user() variants which caused a real page fault to get_user_pages() with pages parameter set to NULL. In post-copy mode of CRIU, the destination memory is managed with userfaultfd and lack of the retry for pre-fault case in get_user_pages() causes a crash of the restored process. Making the pre-fault behavior of get_user_pages() the same as the "normal" one fixes the issue. Link: http://lkml.kernel.org/r/1557844195-18882-1-git-send-email-rppt@linux.ibm.com Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails") Signed-off-by: Mike Rapoport Tested-by: Andrei Vagin [https://travis-ci.org/avagin/linux/builds/533184940] Tested-by: Hugh Dickins Cc: Andrea Arcangeli Cc: Sebastian Andrzej Siewior Cc: Borislav Petkov Cc: Pavel Machek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/gup.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) (limited to 'mm') diff --git a/mm/gup.c b/mm/gup.c index f173fcbaf1b2..ddde097cf9e4 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1042,10 +1042,6 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, BUG_ON(ret >= nr_pages); } - if (!pages) - /* If it's a prefault don't insist harder */ - return ret; - if (ret > 0) { nr_pages -= ret; pages_done += ret; @@ -1061,8 +1057,12 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, pages_done = ret; break; } - /* VM_FAULT_RETRY triggered, so seek to the faulting offset */ - pages += ret; + /* + * VM_FAULT_RETRY triggered, so seek to the faulting offset. + * For the prefault case (!pages) we only update counts. + */ + if (likely(pages)) + pages += ret; start += ret << PAGE_SHIFT; /* @@ -1085,7 +1085,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, pages_done++; if (!nr_pages) break; - pages++; + if (likely(pages)) + pages++; start += PAGE_SIZE; } if (lock_dropped && *locked) { -- cgit v1.2.3 From bb9f6f63f32da40ca34a921e377ad3181a4f9023 Mon Sep 17 00:00:00 2001 From: Vitaly Wool Date: Fri, 31 May 2019 22:30:39 -0700 Subject: z3fold: fix sheduling while atomic kmem_cache_alloc() may be called from z3fold_alloc() in atomic context, so we need to pass correct gfp flags to avoid "scheduling while atomic" bug. Link: http://lkml.kernel.org/r/20190523153245.119dfeed55927e8755250ddd@gmail.com Fixes: 7c2b8baa61fe5 ("mm/z3fold.c: add structure for buddy handles") Signed-off-by: Vitaly Wool Reviewed-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/z3fold.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) (limited to 'mm') diff --git a/mm/z3fold.c b/mm/z3fold.c index 99be52c5ca45..985732c8b025 100644 --- a/mm/z3fold.c +++ b/mm/z3fold.c @@ -190,10 +190,11 @@ static int size_to_chunks(size_t size) static void compact_page_work(struct work_struct *w); -static inline struct z3fold_buddy_slots *alloc_slots(struct z3fold_pool *pool) +static inline struct z3fold_buddy_slots *alloc_slots(struct z3fold_pool *pool, + gfp_t gfp) { struct z3fold_buddy_slots *slots = kmem_cache_alloc(pool->c_handle, - GFP_KERNEL); + gfp); if (slots) { memset(slots->slot, 0, sizeof(slots->slot)); @@ -295,10 +296,10 @@ static void z3fold_unregister_migration(struct z3fold_pool *pool) /* Initializes the z3fold header of a newly allocated z3fold page */ static struct z3fold_header *init_z3fold_page(struct page *page, - struct z3fold_pool *pool) + struct z3fold_pool *pool, gfp_t gfp) { struct z3fold_header *zhdr = page_address(page); - struct z3fold_buddy_slots *slots = alloc_slots(pool); + struct z3fold_buddy_slots *slots = alloc_slots(pool, gfp); if (!slots) return NULL; @@ -912,7 +913,7 @@ retry: if (!page) return -ENOMEM; - zhdr = init_z3fold_page(page, pool); + zhdr = init_z3fold_page(page, pool, gfp); if (!zhdr) { __free_page(page); return -ENOMEM; -- cgit v1.2.3 From 0600597c854e53d2f9b7a6a718c1da2b8b4cb4db Mon Sep 17 00:00:00 2001 From: Nathan Chancellor Date: Fri, 31 May 2019 22:30:42 -0700 Subject: kasan: initialize tag to 0xff in __kasan_kmalloc When building with -Wuninitialized and CONFIG_KASAN_SW_TAGS unset, Clang warns: mm/kasan/common.c:484:40: warning: variable 'tag' is uninitialized when used here [-Wuninitialized] kasan_unpoison_shadow(set_tag(object, tag), size); ^~~ set_tag ignores tag in this configuration but clang doesn't realize it at this point in its pipeline, as it points to arch_kasan_set_tag as being the point where it is used, which will later be expanded to (void *)(object) without a use of tag. Initialize tag to 0xff, as it removes this warning and doesn't change the meaning of the code. Link: https://github.com/ClangBuiltLinux/linux/issues/465 Link: http://lkml.kernel.org/r/20190502163057.6603-1-natechancellor@gmail.com Fixes: 7f94ffbc4c6a ("kasan: add hooks implementation for tag-based mode") Signed-off-by: Nathan Chancellor Reviewed-by: Andrey Konovalov Reviewed-by: Andrey Ryabinin Cc: Alexander Potapenko Cc: Dmitry Vyukov Cc: Nick Desaulniers Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/kasan/common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/kasan/common.c b/mm/kasan/common.c index 36afcf64e016..242fdc01aaa9 100644 --- a/mm/kasan/common.c +++ b/mm/kasan/common.c @@ -464,7 +464,7 @@ static void *__kasan_kmalloc(struct kmem_cache *cache, const void *object, { unsigned long redzone_start; unsigned long redzone_end; - u8 tag; + u8 tag = 0xff; if (gfpflags_allow_blocking(flags)) quarantine_reduce(); -- cgit v1.2.3 From e577c8b64d58fe307ea4d5149d31615df2d90861 Mon Sep 17 00:00:00 2001 From: Suzuki K Poulose Date: Fri, 31 May 2019 22:30:59 -0700 Subject: mm, compaction: make sure we isolate a valid PFN When we have holes in a normal memory zone, we could endup having cached_migrate_pfns which may not necessarily be valid, under heavy memory pressure with swapping enabled ( via __reset_isolation_suitable(), triggered by kswapd). Later if we fail to find a page via fast_isolate_freepages(), we may end up using the migrate_pfn we started the search with, as valid page. This could lead to accessing NULL pointer derefernces like below, due to an invalid mem_section pointer. Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 [47/1825] Mem abort info: ESR = 0x96000004 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000082f94ae9 [0000000000000008] pgd=0000000000000000 Internal error: Oops: 96000004 [#1] SMP ... CPU: 10 PID: 6080 Comm: qemu-system-aar Not tainted 510-rc1+ #6 Hardware name: AmpereComputing(R) OSPREY EV-883832-X3-0001/OSPREY, BIOS 4819 09/25/2018 pstate: 60000005 (nZCv daif -PAN -UAO) pc : set_pfnblock_flags_mask+0x58/0xe8 lr : compaction_alloc+0x300/0x950 [...] Process qemu-system-aar (pid: 6080, stack limit = 0x0000000095070da5) Call trace: set_pfnblock_flags_mask+0x58/0xe8 compaction_alloc+0x300/0x950 migrate_pages+0x1a4/0xbb0 compact_zone+0x750/0xde8 compact_zone_order+0xd8/0x118 try_to_compact_pages+0xb4/0x290 __alloc_pages_direct_compact+0x84/0x1e0 __alloc_pages_nodemask+0x5e0/0xe18 alloc_pages_vma+0x1cc/0x210 do_huge_pmd_anonymous_page+0x108/0x7c8 __handle_mm_fault+0xdd4/0x1190 handle_mm_fault+0x114/0x1c0 __get_user_pages+0x198/0x3c0 get_user_pages_unlocked+0xb4/0x1d8 __gfn_to_pfn_memslot+0x12c/0x3b8 gfn_to_pfn_prot+0x4c/0x60 kvm_handle_guest_abort+0x4b0/0xcd8 handle_exit+0x140/0x1b8 kvm_arch_vcpu_ioctl_run+0x260/0x768 kvm_vcpu_ioctl+0x490/0x898 do_vfs_ioctl+0xc4/0x898 ksys_ioctl+0x8c/0xa0 __arm64_sys_ioctl+0x28/0x38 el0_svc_common+0x74/0x118 el0_svc_handler+0x38/0x78 el0_svc+0x8/0xc Code: f8607840 f100001f 8b011401 9a801020 (f9400400) ---[ end trace af6a35219325a9b6 ]--- The issue was reported on an arm64 server with 128GB with holes in the zone (e.g, [32GB@4GB, 96GB@544GB]), with a swap device enabled, while running 100 KVM guest instances. This patch fixes the issue by ensuring that the page belongs to a valid PFN when we fallback to using the lower limit of the scan range upon failure in fast_isolate_freepages(). Link: http://lkml.kernel.org/r/1558711908-15688-1-git-send-email-suzuki.poulose@arm.com Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a migration target") Signed-off-by: Suzuki K Poulose Reported-by: Marc Zyngier Reviewed-by: Mel Gorman Reviewed-by: Anshuman Khandual Cc: Michal Hocko Cc: Qian Cai Cc: Marc Zyngier Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/compaction.c b/mm/compaction.c index 9febc8cc84e7..9e1b9acb116b 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1399,7 +1399,7 @@ fast_isolate_freepages(struct compact_control *cc) page = pfn_to_page(highest); cc->free_pfn = highest; } else { - if (cc->direct_compaction) { + if (cc->direct_compaction && pfn_valid(min_pfn)) { page = pfn_to_page(min_pfn); cc->free_pfn = min_pfn; } -- cgit v1.2.3