From 71abdc15adf8c702a1dd535f8e30df50758848d2 Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Fri, 6 Jun 2014 14:35:35 -0700
Subject: mm: vmscan: clear kswapd's special reclaim powers before exiting

When kswapd exits, it can end up taking locks that were previously held
by allocating tasks while they waited for reclaim.  Lockdep currently
warns about this:

On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
> inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
> kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
>  (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130
> {RECLAIM_FS-ON-W} state was registered at:
>   mark_held_locks+0xb9/0x140
>   lockdep_trace_alloc+0x7a/0xe0
>   kmem_cache_alloc_trace+0x37/0x240
>   flex_array_alloc+0x99/0x1a0
>   cgroup_attach_task+0x63/0x430
>   attach_task_by_pid+0x210/0x280
>   cgroup_procs_write+0x16/0x20
>   cgroup_file_write+0x120/0x2c0
>   vfs_write+0xc0/0x1f0
>   SyS_write+0x4c/0xa0
>   tracesys+0xdd/0xe2
> irq event stamp: 49
> hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70
> hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0
> softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0
> softirqs last disabled at (0): (null)
>
> other info that might help us debug this:
>  Possible unsafe locking scenario:
>
>        CPU0
>        ----
>   lock(&sig->group_rwsem);
>
>   lock(&sig->group_rwsem);
>
>  *** DEADLOCK ***
>
> no locks held by kswapd2/1151.
>
> stack backtrace:
> CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
> Call Trace:
>   dump_stack+0x19/0x1b
>   print_usage_bug+0x1f7/0x208
>   mark_lock+0x21d/0x2a0
>   __lock_acquire+0x52a/0xb60
>   lock_acquire+0xa2/0x140
>   down_read+0x51/0xa0
>   exit_signals+0x24/0x130
>   do_exit+0xb5/0xa50
>   kthread+0xdb/0x100
>   ret_from_fork+0x7c/0xb0

This is because the kswapd thread is still marked as a reclaimer at the
time of exit.  But because it is exiting, nobody is actually waiting on
it to make reclaim progress anymore, and it's nothing but a regular
thread at this point.  Be tidy and strip it of all its powers
(PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
before returning from the thread function.

Signed-off-by: Johannes Weiner
Reported-by: Gu Zheng
Cc: Yasuaki Ishimatsu
Cc: Tang Chen
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/vmscan.c | 3 +++
 1 file changed, 3 insertions(+)

(limited to 'mm/vmscan.c')

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9149444f947d..05d41c0d7f6c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3372,7 +3372,10 @@ static int kswapd(void *p)
 		}
 	}
 
+	tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
 	current->reclaim_state = NULL;
+	lockdep_clear_current_reclaim_state();
+
 	return 0;
 }
 
-- cgit v1.2.3
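For illustration only, below is a minimal sketch (not part of the patch above) of the pattern the fix restores: a reclaimer kthread that grants itself the special flags for the lifetime of its main loop and strips them again before returning. The flags, the reclaim_state field, and lockdep_clear_current_reclaim_state() are the real kernel symbols touched by the patch; the thread function, its loop, and the wakeup interval are made up for the example.

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/swap.h>

static int example_reclaimer_thread(void *dummy)
{
	struct reclaim_state reclaim_state = { .reclaimed_slab = 0 };
	struct task_struct *tsk = current;

	/* Act as a reclaimer for the lifetime of the main loop. */
	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
	tsk->reclaim_state = &reclaim_state;

	while (!kthread_should_stop()) {
		/* ... reclaim work would go here ... */
		schedule_timeout_interruptible(HZ);
	}

	/*
	 * Drop the reclaimer powers before do_exit() so locks taken on
	 * the exit path are no longer attributed to a reclaim context.
	 */
	tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
	tsk->reclaim_state = NULL;
	lockdep_clear_current_reclaim_state();

	return 0;
}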
From 688eb988d15af55c1d1b70b1ca9f6ce58f277c20 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Fri, 6 Jun 2014 14:38:15 -0700
Subject: vmscan: memcg: always use swappiness of the reclaimed memcg

Memory reclaim always uses swappiness of the reclaim target memcg
(origin of the memory pressure) or vm_swappiness for global memory
reclaim.  This behavior was consistent (except for difference between
global and hard limit reclaim) because swappiness was enforced to be
consistent within each memcg hierarchy.

After "mm: memcontrol: remove hierarchy restrictions for swappiness and
oom_control" each memcg can have its own swappiness independent of
hierarchical parents, though, so the consistency guarantee is gone.
This can lead to an unexpected behavior.  Say that a group is
explicitly configured to not swap out by memory.swappiness=0 but its
memory gets swapped out anyway when the memory pressure comes from its
parent with a non-zero swappiness.

It is also unexpected that the knob is meaningless without setting the
hard limit which would trigger the reclaim and enforce the swappiness.
There are setups where the hard limit is configured higher in the
hierarchy by an administrator and children groups are under control of
somebody else who is interested in the swapout behavior but not
necessarily in the memory limit.

From a semantic point of view swappiness is an attribute defining anon
vs. file proportional scanning of LRU which is memcg specific (unlike
charges which are propagated up the hierarchy), so it should be applied
to the particular memcg's LRU regardless of where the memory pressure
comes from.

This patch removes vmscan_swappiness() and stores the swappiness into
the scan_control structure.  mem_cgroup_swappiness is then used to
provide the correct value before shrink_lruvec is called.  The global
vm_swappiness is used for the root memcg.

[hughd@google.com: oopses immediately when booted with cgroup_disable=memory]
Signed-off-by: Michal Hocko
Acked-by: Johannes Weiner
Cc: Tejun Heo
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/cgroups/memory.txt | 15 +++++++--------
 mm/memcontrol.c                  |  2 +-
 mm/vmscan.c                      | 18 ++++++++----------
 3 files changed, 16 insertions(+), 19 deletions(-)

(limited to 'mm/vmscan.c')

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 4937e6fff9b4..b3429aec444c 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -540,14 +540,13 @@ Note:
 
 5.3 swappiness
 
-Similar to /proc/sys/vm/swappiness, but only affecting reclaim that is
-triggered by this cgroup's hard limit.  The tunable in the root cgroup
-corresponds to the global swappiness setting.
-
-Please note that unlike the global swappiness, memcg knob set to 0
-really prevents from any swapping even if there is a swap storage
-available. This might lead to memcg OOM killer if there are no file
-pages to reclaim.
+Overrides /proc/sys/vm/swappiness for the particular group. The tunable
+in the root cgroup corresponds to the global swappiness setting.
+
+Please note that unlike during the global reclaim, limit reclaim
+enforces that 0 swappiness really prevents from any swapping even if
+there is a swap storage available. This might lead to memcg OOM killer
+if there are no file pages to reclaim.
 
 5.4 failcnt
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a500cb0594c4..9bf8a84bcaae 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1550,7 +1550,7 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
 int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
 	/* root ? */
-	if (!css_parent(&memcg->css))
+	if (mem_cgroup_disabled() || !css_parent(&memcg->css))
 		return vm_swappiness;
 
 	return memcg->swappiness;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 05d41c0d7f6c..f44476a41544 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -83,6 +83,9 @@ struct scan_control {
 	/* Scan (total_size >> priority) pages at once */
 	int priority;
 
+	/* anon vs. file LRUs scanning "ratio" */
+	int swappiness;
+
 	/*
 	 * The memory cgroup that hit its limit and as a result is the
 	 * primary target of this reclaim invocation.
@@ -1845,13 +1848,6 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 	return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
 }
 
-static int vmscan_swappiness(struct scan_control *sc)
-{
-	if (global_reclaim(sc))
-		return vm_swappiness;
-	return mem_cgroup_swappiness(sc->target_mem_cgroup);
-}
-
 enum scan_balance {
 	SCAN_EQUAL,
 	SCAN_FRACT,
@@ -1912,7 +1908,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * using the memory controller's swap limit feature would be
 	 * too expensive.
 	 */
-	if (!global_reclaim(sc) && !vmscan_swappiness(sc)) {
+	if (!global_reclaim(sc) && !sc->swappiness) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
@@ -1922,7 +1918,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * system is close to OOM, scan both anon and file equally
 	 * (unless the swappiness setting disagrees with swapping).
 	 */
-	if (!sc->priority && vmscan_swappiness(sc)) {
+	if (!sc->priority && sc->swappiness) {
 		scan_balance = SCAN_EQUAL;
 		goto out;
 	}
@@ -1965,7 +1961,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * With swappiness at 100, anonymous and file have the same priority.
 	 * This scanning priority is essentially the inverse of IO cost.
 	 */
-	anon_prio = vmscan_swappiness(sc);
+	anon_prio = sc->swappiness;
 	file_prio = 200 - anon_prio;
 
 	/*
@@ -2265,6 +2261,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
+			sc->swappiness = mem_cgroup_swappiness(memcg);
 			shrink_lruvec(lruvec, sc);
 
 			/*
@@ -2731,6 +2728,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
 		.may_swap = !noswap,
 		.order = 0,
 		.priority = 0,
+		.swappiness = mem_cgroup_swappiness(memcg),
 		.target_mem_cgroup = memcg,
 	};
 	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
-- cgit v1.2.3
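To make the new flow concrete, here is a minimal sketch of the balance decision that get_scan_count() derives from the swappiness now carried in struct scan_control. The enum values and the 200-based anon/file weighting mirror the hunks above; the helper function and its simplified inputs are invented for illustration, and the real code additionally weighs LRU sizes and recent rotation statistics.

#include <stdbool.h>

enum scan_balance { SCAN_EQUAL, SCAN_FRACT, SCAN_FILE, SCAN_ANON };

/* Hypothetical helper mirroring the swappiness checks in get_scan_count(). */
static enum scan_balance balance_from_swappiness(int swappiness, int priority,
						 bool global_reclaim)
{
	/* Limit reclaim of a memcg with swappiness 0 must not touch anon. */
	if (!global_reclaim && !swappiness)
		return SCAN_FILE;

	/* Near OOM (priority 0): scan anon and file equally if swap is allowed. */
	if (!priority && swappiness)
		return SCAN_EQUAL;

	/*
	 * Otherwise scan proportionally: anon_prio = swappiness and
	 * file_prio = 200 - swappiness, so 100 weighs both lists the same.
	 */
	return SCAN_FRACT;
}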
From b1de0d139c97a6078bbada6cf2d27c30ce127a97 Mon Sep 17 00:00:00 2001
From: Mitchel Humpherys
Date: Fri, 6 Jun 2014 14:38:30 -0700
Subject: mm: convert some level-less printks to pr_*

printk is meant to be used with an associated log level.  There are
some instances of printk scattered around the mm code where the log
level is missing.  Add a log level and adhere to suggestions by
scripts/checkpatch.pl by moving to the pr_* macros.

Also add the typical pr_fmt definition so that print statements can be
easily traced back to the modules where they occur, correlated one with
another, etc.  This will require the removal of some (now redundant)
prefixes on a few print statements.

Signed-off-by: Mitchel Humpherys
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 block/bounce.c |  7 +++++--
 mm/mempolicy.c |  5 ++++-
 mm/mmap.c      | 21 ++++++++++++---------
 mm/nommu.c     |  5 ++++-
 mm/vmscan.c    |  5 ++++-
 5 files changed, 29 insertions(+), 14 deletions(-)

(limited to 'mm/vmscan.c')

diff --git a/block/bounce.c b/block/bounce.c
index 523918b8c6dc..ab21ba203d5c 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -3,6 +3,8 @@
  *  - Split from highmem.c
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include
 #include
 #include
@@ -15,6 +17,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include
@@ -34,7 +37,7 @@ static __init int init_emergency_pool(void)
 	page_pool = mempool_create_page_pool(POOL_SIZE, 0);
 	BUG_ON(!page_pool);
-	printk("bounce pool size: %d pages\n", POOL_SIZE);
+	pr_info("pool size: %d pages\n", POOL_SIZE);
 
 	return 0;
 }
 
@@ -86,7 +89,7 @@ int init_emergency_isa_pool(void)
 				       mempool_free_pages, (void *) 0);
 	BUG_ON(!isa_page_pool);
 
-	printk("isa bounce pool size: %d pages\n", ISA_POOL_SIZE);
+	pr_info("isa pool size: %d pages\n", ISA_POOL_SIZE);
 	return 0;
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 16bc9fa42998..1c16c228f35a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -65,6 +65,8 @@
    kernel is not always grateful with that.
 */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include
 #include
 #include
@@ -91,6 +93,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include
@@ -2645,7 +2648,7 @@ void __init numa_policy_init(void)
 		node_set(prefer, interleave_nodes);
 
 	if (do_set_mempolicy(MPOL_INTERLEAVE, 0, &interleave_nodes))
-		printk("numa_policy_init: interleaving failed\n");
+		pr_err("%s: interleaving failed\n", __func__);
 
 	check_numabalancing_enable();
 }
diff --git a/mm/mmap.c b/mm/mmap.c
index ced5efcdd4b6..129b847d30cc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -6,6 +6,8 @@
  * Address space accounting code
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include
 #include
 #include
@@ -37,6 +39,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include
@@ -361,20 +364,20 @@ static int browse_rb(struct rb_root *root)
 		struct vm_area_struct *vma;
 		vma = rb_entry(nd, struct vm_area_struct, vm_rb);
 		if (vma->vm_start < prev) {
-			printk("vm_start %lx prev %lx\n", vma->vm_start, prev);
+			pr_info("vm_start %lx prev %lx\n", vma->vm_start, prev);
 			bug = 1;
 		}
 		if (vma->vm_start < pend) {
-			printk("vm_start %lx pend %lx\n", vma->vm_start, pend);
+			pr_info("vm_start %lx pend %lx\n", vma->vm_start, pend);
 			bug = 1;
 		}
 		if (vma->vm_start > vma->vm_end) {
-			printk("vm_end %lx < vm_start %lx\n",
+			pr_info("vm_end %lx < vm_start %lx\n",
 				vma->vm_end, vma->vm_start);
 			bug = 1;
 		}
 		if (vma->rb_subtree_gap != vma_compute_subtree_gap(vma)) {
-			printk("free gap %lx, correct %lx\n",
+			pr_info("free gap %lx, correct %lx\n",
 			       vma->rb_subtree_gap,
 			       vma_compute_subtree_gap(vma));
 			bug = 1;
@@ -388,7 +391,7 @@ static int browse_rb(struct rb_root *root)
 	for (nd = pn; nd; nd = rb_prev(nd))
 		j++;
 	if (i != j) {
-		printk("backwards %d, forwards %d\n", j, i);
+		pr_info("backwards %d, forwards %d\n", j, i);
 		bug = 1;
 	}
 	return bug ? -1 : i;
@@ -423,17 +426,17 @@ static void validate_mm(struct mm_struct *mm)
 		i++;
 	}
 	if (i != mm->map_count) {
-		printk("map_count %d vm_next %d\n", mm->map_count, i);
+		pr_info("map_count %d vm_next %d\n", mm->map_count, i);
 		bug = 1;
 	}
 	if (highest_address != mm->highest_vm_end) {
-		printk("mm->highest_vm_end %lx, found %lx\n",
+		pr_info("mm->highest_vm_end %lx, found %lx\n",
 			mm->highest_vm_end, highest_address);
 		bug = 1;
 	}
 	i = browse_rb(&mm->mm_rb);
 	if (i != mm->map_count) {
-		printk("map_count %d rb %d\n", mm->map_count, i);
+		pr_info("map_count %d rb %d\n", mm->map_count, i);
 		bug = 1;
 	}
 	BUG_ON(bug);
@@ -3280,7 +3283,7 @@ static struct notifier_block reserve_mem_nb = {
 static int __meminit init_reserve_notifier(void)
 {
 	if (register_hotmemory_notifier(&reserve_mem_nb))
-		printk("Failed registering memory add/remove notifier for admin reserve");
+		pr_err("Failed registering memory add/remove notifier for admin reserve\n");
 
 	return 0;
 }
diff --git a/mm/nommu.c b/mm/nommu.c
index 85f8d6698d48..b78e3a8f5ee7 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -13,6 +13,8 @@
  *  Copyright (c) 2007-2010 Paul Mundt
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include
 #include
 #include
@@ -32,6 +34,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include
@@ -1246,7 +1249,7 @@ error_free:
 	return ret;
 
 enomem:
-	printk("Allocation of length %lu from process %d (%s) failed\n",
+	pr_err("Allocation of length %lu from process %d (%s) failed\n",
 	       len, current->pid, current->comm);
 	show_free_areas(0);
 	return -ENOMEM;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f44476a41544..71f23c0c1090 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -11,6 +11,8 @@
  *  Multiqueue VM started 5.8.00, Rik van Riel.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include
 #include
 #include
@@ -43,6 +45,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include
@@ -480,7 +483,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 		if (page_has_private(page)) {
 			if (try_to_free_buffers(page)) {
 				ClearPageDirty(page);
-				printk("%s: orphaned page\n", __func__);
+				pr_info("%s: orphaned page\n", __func__);
 				return PAGE_CLEAN;
 			}
 		}
-- cgit v1.2.3
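As a closing illustration of the convention this last patch adopts, here is a small made-up kernel source file using the same pr_fmt/pr_* pattern. Only the pr_fmt macro and the pr_info/pr_err calls are the real printk.h interface; the file name, functions, and message strings are hypothetical.

/* example.c - hypothetical module demonstrating the pr_fmt convention. */

/* Must be defined before printk.h is pulled in so the pr_* macros pick it up. */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/printk.h>

static int __init example_init(void)
{
	/* Emitted at KERN_INFO as e.g. "example: pool size: 64 pages". */
	pr_info("pool size: %d pages\n", 64);

	/* Explicit level plus the automatic module-name prefix. */
	pr_err("failed to register notifier\n");

	return 0;
}

static void __exit example_exit(void)
{
	pr_info("unloaded\n");
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");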