summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2014-02-08Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds1-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Peter Anvin: "Quite a varied little collection of fixes. Most of them are relatively small or isolated; the biggest one is Mel Gorman's fixes for TLB range flushing. A couple of AMD-related fixes (including not crashing when given an invalid microcode image) and fix a crash when compiled with gcov" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, microcode, AMD: Unify valid container checks x86, hweight: Fix BUG when booting with CONFIG_GCOV_PROFILE_ALL=y x86/efi: Allow mapping BGRT on x86-32 x86: Fix the initialization of physnode_map x86, cpu hotplug: Fix stack frame warning in check_irq_vectors_for_cpu_disable() x86/intel/mid: Fix X86_INTEL_MID dependencies arch/x86/mm/srat: Skip NUMA_NO_NODE while parsing SLIT mm, x86: Revisit tlb_flushall_shift tuning for page flushes except on IvyBridge x86: mm: change tlb_flushall_shift for IvyBridge x86/mm: Eliminate redundant page table walk during TLB range flushing x86/mm: Clean up inconsistencies when flushing TLB ranges mm, x86: Account for TLB flushes only when debugging x86/AMD/NB: Fix amd_set_subcaches() parameter type x86/quirks: Add workaround for AMD F16h Erratum792 x86, doc, kconfig: Fix dud URL for Microcode data
2014-02-07Merge tag 'efi-urgent' into x86/urgentH. Peter Anvin51-1666/+3496
* Avoid WARN_ON() when mapping BGRT on Baytrail (EFI 32-bit). Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-06mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of ↵KOSAKI Motohiro1-2/+3
spin_lock_irq() During aio stress test, we observed the following lockdep warning. This mean AIO+numa_balancing is currently deadlockable. The problem is, aio_migratepage disable interrupt, but __set_page_dirty_nobuffers unintentionally enable it again. Generally, all helper function should use spin_lock_irqsave() instead of spin_lock_irq() because they don't know caller at all. other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&ctx->completion_lock)->rlock); <Interrupt> lock(&(&ctx->completion_lock)->rlock); *** DEADLOCK *** dump_stack+0x19/0x1b print_usage_bug+0x1f7/0x208 mark_lock+0x21d/0x2a0 mark_held_locks+0xb9/0x140 trace_hardirqs_on_caller+0x105/0x1d0 trace_hardirqs_on+0xd/0x10 _raw_spin_unlock_irq+0x2c/0x50 __set_page_dirty_nobuffers+0x8c/0xf0 migrate_page_copy+0x434/0x540 aio_migratepage+0xb1/0x140 move_to_new_page+0x7d/0x230 migrate_pages+0x5e5/0x700 migrate_misplaced_page+0xbc/0xf0 do_numa_page+0x102/0x190 handle_pte_fault+0x241/0x970 handle_mm_fault+0x265/0x370 __do_page_fault+0x172/0x5a0 do_page_fault+0x1a/0x70 page_fault+0x28/0x30 Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-06mm/swap: fix race on swap_info reuse between swapoff and swaponWeijie Yang1-1/+10
swapoff clear swap_info's SWP_USED flag prematurely and free its resources after that. A concurrent swapon will reuse this swap_info while its previous resources are not cleared completely. These late freed resources are: - p->percpu_cluster - swap_cgroup_ctrl[type] - block_device setting - inode->i_flags &= ~S_SWAPFILE This patch clears the SWP_USED flag after all its resources are freed, so that swapon can reuse this swap_info by alloc_swap_info() safely. [akpm@linux-foundation.org: tidy up code comment] Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-06swap: add a simple detector for inappropriate swapin readaheadShaohua Li1-3/+60
This is a patch to improve swap readahead algorithm. It's from Hugh and I slightly changed it. Hugh's original changelog: swapin readahead does a blind readahead, whether or not the swapin is sequential. This may be ok on harddisk, because large reads have relatively small costs, and if the readahead pages are unneeded they can be reclaimed easily - though, what if their allocation forced reclaim of useful pages? But on SSD devices large reads are more expensive than small ones: if the readahead pages are unneeded, reading them in caused significant overhead. This patch adds very simplistic random read detection. Stealing the PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages() simply looks at readahead's current success rate, and narrows or widens its readahead window accordingly. There is little science to its heuristic: it's about as stupid as can be whilst remaining effective. The table below shows elapsed times (in centiseconds) when running a single repetitive swapping load across a 1000MB mapping in 900MB ram with 1GB swap (the harddisk tests had taken painfully too long when I used mem=500M, but SSD shows similar results for that). Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch which Shaohua showed to be defective; HughNew this Nov 14 patch, with page_cluster as usual at default of 3 (8-page reads); HughPC4 this same patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0 (1-page reads: no readahead). HDD for swapping to harddisk, SSD for swapping to VertexII SSD. Seq for sequential access to the mapping, cycling five times around; Rand for the same number of random touches. Anon for a MAP_PRIVATE anon mapping; Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs. One weakness of Shaohua's vma/anon_vma approach was that it did not optimize Shmem: seen below. Konstantin's approach was perhaps mistuned, 50% slower on Seq: did not compete and is not shown below. HDD Vanilla Shaohua HughOld HughNew HughPC4 HughPC0 Seq Anon 73921 76210 75611 76904 78191 121542 Seq Shmem 73601 73176 73855 72947 74543 118322 Rand Anon 895392 831243 871569 845197 846496 841680 Rand Shmem 1058375 1053486 827935 764955 764376 756489 SSD Vanilla Shaohua HughOld HughNew HughPC4 HughPC0 Seq Anon 24634 24198 24673 25107 21614 70018 Seq Shmem 24959 24932 25052 25703 22030 69678 Rand Anon 43014 26146 28075 25989 26935 25901 Rand Shmem 45349 45215 28249 24268 24138 24332 These tests are, of course, two extremes of a very simple case: under heavier mixed loads I've not yet observed any consistent improvement or degradation, and wider testing would be welcome. Shaohua Li: Test shows Vanilla is slightly better in sequential workload than Hugh's patch. I observed with Hugh's patch sometimes the readahead size is shrinked too fast (from 8 to 1 immediately) in sequential workload if there is no hit. And in such case, continuing doing readahead is good actually. I don't prepare a sophisticated algorithm for the sequential workload because so far we can't guarantee sequential accessed pages are swap out sequentially. So I slightly change Hugh's heuristic - don't shrink readahead size too fast. Here is my test result (unit second, 3 runs average): Vanilla Hugh New Seq 356 370 360 Random 4525 2447 2444 Attached graph is the swapin/swapout throughput I collected with 'vmstat 2'. The first part is running a random workload (till around 1200 of the x-axis) and the second part is running a sequential workload. swapin and swapout throughput are almost identical in steady state in both workloads. These are expected behavior. while in Vanilla, swapin is much bigger than swapout especially in random workload (because wrong readahead). Original patches by: Shaohua Li and Konstantin Khlebnikov. [fengguang.wu@intel.com: swapin_nr_pages() can be static] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-02Merge branch 'slab/next' of ↵Linus Torvalds2-23/+35
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux Pull SLAB changes from Pekka Enberg: "Random bug fixes that have accumulated in my inbox over the past few months" * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: mm: Fix warning on make htmldocs caused by slab.c mm: slub: work around unneeded lockdep warning mm: sl[uo]b: fix misleading comments slub: Fix possible format string bug. slub: use lockdep_assert_held slub: Fix calculation of cpu slabs slab.h: remove duplicate kmalloc declaration and fix kernel-doc warnings
2014-01-31mm: Fix warning on make htmldocs caused by slab.cMasanari Iida1-1/+1
This patch fixed following errors while make htmldocs Warning(/mm/slab.c:1956): No description found for parameter 'page' Warning(/mm/slab.c:1956): Excess function parameter 'slabp' description in 'slab_destroy' Incorrect function parameter "slabp" was set instead of "page" Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2014-01-31mm: slub: work around unneeded lockdep warningDave Hansen1-0/+6
The slub code does some setup during early boot in early_kmem_cache_node_alloc() with some local data. There is no possible way that another CPU can see this data, so the slub code doesn't unnecessarily lock it. However, some new lockdep asserts check to make sure that add_partial() _always_ has the list_lock held. Just add the locking, even though it is technically unnecessary. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@arm.linux.org.uk> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2014-01-30memcg: fix mutex not unlocked on memcg_create_kmem_cache fail pathVladimir Davydov1-4/+3
Commit 842e2873697e ("memcg: get rid of kmem_cache_dup()") introduced a mutex for memcg_create_kmem_cache() to protect the tmp_name buffer that holds the memcg name. It failed to unlock the mutex if this buffer could not be allocated. This patch fixes the issue by appropriately unlocking the mutex if the allocation fails. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Glauber Costa <glommer@parallels.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30mm, oom: base root bonus on current usageDavid Rientjes1-1/+1
A 3% of system memory bonus is sometimes too excessive in comparison to other processes. With commit a63d83f427fb ("oom: badness heuristic rewrite"), the OOM killer tries to avoid killing privileged tasks by subtracting 3% of overall memory (system or cgroup) from their per-task consumption. But as a result, all root tasks that consume less than 3% of overall memory are considered equal, and so it only takes 33+ privileged tasks pushing the system out of memory for the OOM killer to do something stupid and kill dhclient or other root-owned processes. For example, on a 32G machine it can't tell the difference between the 1M agetty and the 10G fork bomb member. The changelog describes this 3% boost as the equivalent to the global overcommit limit being 3% higher for privileged tasks, but this is not the same as discounting 3% of overall memory from _every privileged task individually_ during OOM selection. Replace the 3% of system memory bonus with a 3% of current memory usage bonus. By giving root tasks a bonus that is proportional to their actual size, they remain comparable even when relatively small. In the example above, the OOM killer will discount the 1M agetty's 256 badness points down to 179, and the 10G fork bomb's 262144 points down to 183500 points and make the right choice, instead of discounting both to 0 and killing agetty because it's first in the task list. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30mm/slub.c: fix page->_count corruption (again)Dave Hansen1-2/+17
Commit abca7c496584 ("mm: fix slab->page _count corruption when using slub") notes that we can not _set_ a page->counters directly, except when using a real double-cmpxchg. Doing so can lose updates to ->_count. That is an absolute rule: You may not *set* page->counters except via a cmpxchg. Commit abca7c496584 fixed this for the folks who have the slub cmpxchg_double code turned off at compile time, but it left the bad case alone. It can still be reached, and the same bug triggered in two cases: 1. Turning on slub debugging at runtime, which is available on the distro kernels that I looked at. 2. On 64-bit CPUs with no CMPXCHG16B (some early AMD x86-64 cpus, evidently) There are at least 3 ways we could fix this: 1. Take all of the exising calls to cmpxchg_double_slab() and __cmpxchg_double_slab() and convert them to take an old, new and target 'struct page'. 2. Do (1), but with the newly-introduced 'slub_data'. 3. Do some magic inside the two cmpxchg...slab() functions to pull the counters out of new_counters and only set those fields in page->{inuse,frozen,objects}. I've done (2) as well, but it's a bunch more code. This patch is an attempt at (3). This was the most straightforward and foolproof way that I could think to do this. This would also technically allow us to get rid of the ugly #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \ defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE) in 'struct page', but leaving it alone has the added benefit that 'counters' stays 'unsigned' instead of 'unsigned long', so all the copies that the slub code does stay a bit smaller. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30mm/mempolicy.c: fix mempolicy printing in numa_mapsDavid Rientjes1-1/+1
As a result of commit 5606e3877ad8 ("mm: numa: Migrate on reference policy"), /proc/<pid>/numa_maps prints the mempolicy for any <pid> as "prefer:N" for the local node, N, of the process reading the file. This should only be printed when the mempolicy of <pid> is MPOL_PREFERRED for node N. If the process is actually only using the default mempolicy for local node allocation, make sure "default" is printed as expected. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Robert Lippert <rlippert@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: <stable@vger.kernel.org> [3.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30zsmalloc: add copyrightMinchan Kim1-0/+1
Add my copyright to the zsmalloc source code which I maintain. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30zsmalloc: move it under mmMinchan Kim3-0/+1131
This patch moves zsmalloc under mm directory. Before that, description will explain why we have needed custom allocator. Zsmalloc is a new slab-based memory allocator for storing compressed pages. It is designed for low fragmentation and high allocation success rate on large object, but <= PAGE_SIZE allocations. zsmalloc differs from the kernel slab allocator in two primary ways to achieve these design goals. zsmalloc never requires high order page allocations to back slabs, or "size classes" in zsmalloc terms. Instead it allows multiple single-order pages to be stitched together into a "zspage" which backs the slab. This allows for higher allocation success rate under memory pressure. Also, zsmalloc allows objects to span page boundaries within the zspage. This allows for lower fragmentation than could be had with the kernel slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE. With the kernel slab allocator, if a page compresses to 60% of it original size, the memory savings gained through compression is lost in fragmentation because another object of the same size can't be stored in the leftover space. This ability to span pages results in zsmalloc allocations not being directly addressable by the user. The user is given an non-dereferencable handle in response to an allocation request. That handle must be mapped, using zs_map_object(), which returns a pointer to the mapped region that can be used. The mapping is necessary since the object data may reside in two different noncontigious pages. The zsmalloc fulfills the allocation needs for zram perfectly [sjenning@linux.vnet.ibm.com: borrow Seth's quote] Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Nitin Gupta <ngupta@vflare.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-blockLinus Torvalds2-28/+26
Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...
2014-01-29memblock: add limit checking to memblock_virt_allocYinghai Lu1-0/+3
In original bootmem wrapper for memblock, we have limit checking. Add it to memblock_virt_alloc, to address arm and x86 booting crash. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Reported-by: Kevin Hilman <khilman@linaro.org> Tested-by: Kevin Hilman <khilman@linaro.org> Reported-by: Olof Johansson <olof@lixom.net> Tested-by: Olof Johansson <olof@lixom.net> Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: "Strashko, Grygorii" <grygorii.strashko@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-29mm/readahead.c: fix do_readahead() for no readpage(s)Mark Rutland1-10/+5
Commit 63d0f0a3c7e1 ("mm/readahead.c:do_readhead(): don't check for ->readpage") unintentionally made do_readahead return 0 for all valid files regardless of whether readahead was supported, rather than the expected -EINVAL. This gets forwarded on to userspace, and results in sys_readahead appearing to succeed in cases that don't make sense (e.g. when called on pipes or sockets). This issue is detected by the LTP readahead01 testcase. As the exact return value of force_page_cache_readahead is currently never used, we can simplify it to return only 0 or -EINVAL (when readpage or readpages is missing). With that in place we can simply forward on the return value of force_page_cache_readahead in do_readahead. This patch performs said change, restoring the expected semantics. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-29mm/slub.c: do not VM_BUG_ON_PAGE() for temporary on-stack pagesDave Hansen1-6/+6
Commit 309381feaee5 ("mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE") added a bunch of VM_BUG_ON_PAGE() calls. But, most of the ones in the slub code are for _temporary_ 'struct page's which are declared on the stack and likely have lots of gunk in them. Dumping their contents out will just confuse folks looking at bad_page() output. Plus, if we try to page_to_pfn() on them or soemthing, we'll probably oops anyway. Turn them back in to VM_BUG_ON()s. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-29slab: fix wrong retval on kmem_cache_create_memcg error pathDave Jones1-8/+11
On kmem_cache_create_memcg() error path we set 'err', but leave 's' (the new cache ptr) undefined. The latter can be NULL if we could not allocate the cache, or pointing to a freed area if we failed somewhere later while trying to initialize it. Initially we checked 'err' immediately before exiting the function and returned NULL if it was set ignoring the value of 's': out_unlock: ... if (err) { /* report error */ return NULL; } return s; Recently this check was, in fact, broken by commit f717eb3abb5e ("slab: do not panic if we fail to create memcg cache"), which turned it to: out_unlock: ... if (err && !memcg) { /* report error */ return NULL; } return s; As a result, if we are failing creating a cache for a memcg, we will skip the check and return 's' that can contain crap. Obviously, commit f717eb3abb5e intended not to return crap on error allocating a cache for a memcg, but only to remove the error reporting in this case, so the check should look like this: out_unlock: ... if (err) { if (!memcg) return NULL; /* report error */ return NULL; } return s; [rientjes@google.com: despaghettification] [vdavydov@parallels.com: patch monkeying] Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Dave Jones <davej@redhat.com> Reported-by: Dave Jones <davej@redhat.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-29mm/mempolicy.c: convert to pr_foo()Andrew Morton1-2/+2
A few printk(KERN_*'s have snuck in there. Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-29mm: numa: initialise numa balancing after jump label initialisationMel Gorman1-6/+11
The command line parsing takes place before jump labels are initialised which generates a warning if numa_balancing= is specified and CONFIG_JUMP_LABEL is set. On older kernels before commit c4b2c0c5f647 ("static_key: WARN on usage before jump_label_init was called") the kernel would have crashed. This patch enables automatic numa balancing later in the initialisation process if numa_balancing= is specified. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-29mm/page-writeback.c: do not count anon pages as dirtyable memoryJohannes Weiner3-25/+5
The VM is currently heavily tuned to avoid swapping. Whether that is good or bad is a separate discussion, but as long as the VM won't swap to make room for dirty cache, we can not consider anonymous pages when calculating the amount of dirtyable memory, the baseline to which dirty_background_ratio and dirty_ratio are applied. A simple workload that occupies a significant size (40+%, depending on memory layout, storage speeds etc.) of memory with anon/tmpfs pages and uses the remainder for a streaming writer demonstrates this problem. In that case, the actual cache pages are a small fraction of what is considered dirtyable overall, which results in an relatively large portion of the cache pages to be dirtied. As kswapd starts rotating these, random tasks enter direct reclaim and stall on IO. Only consider free pages and file pages dirtyable. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Tejun Heo <tj@kernel.org> Tested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-29mm/page-writeback.c: fix dirty_balance_reserve subtraction from dirtyable memoryJohannes Weiner1-31/+24
Tejun reported stuttering and latency spikes on a system where random tasks would enter direct reclaim and get stuck on dirty pages. Around 50% of memory was occupied by tmpfs backed by an SSD, and another disk (rotating) was reading and writing at max speed to shrink a partition. : The problem was pretty ridiculous. It's a 8gig machine w/ one ssd and 10k : rpm harddrive and I could reliably reproduce constant stuttering every : several seconds for as long as buffered IO was going on on the hard drive : either with tmpfs occupying somewhere above 4gig or a test program which : allocates about the same amount of anon memory. Although swap usage was : zero, turning off swap also made the problem go away too. : : The trigger conditions seem quite plausible - high anon memory usage w/ : heavy buffered IO and swap configured - and it's highly likely that this : is happening in the wild too. (this can happen with copying large files : to usb sticks too, right?) This patch (of 2): The dirty_balance_reserve is an approximation of the fraction of free pages that the page allocator does not make available for page cache allocations. As a result, it has to be taken into account when calculating the amount of "dirtyable memory", the baseline to which dirty_background_ratio and dirty_ratio are applied. However, currently the reserve is subtracted from the sum of free and reclaimable pages, which is non-sensical and leads to erroneous results when the system is dominated by unreclaimable pages and the dirty_balance_reserve is bigger than free+reclaimable. In that case, at least the already allocated cache should be considered dirtyable. Fix the calculation by subtracting the reserve from the amount of free pages, then adding the reclaimable pages on top. [akpm@linux-foundation.org: fix CONFIG_HIGHMEM build] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Tejun Heo <tj@kernel.org> Tested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-28Merge branch 'for-linus' of ↵Linus Torvalds2-56/+43
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Assorted stuff; the biggest pile here is Christoph's ACL series. Plus assorted cleanups and fixes all over the place... There will be another pile later this week" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits) __dentry_path() fixes vfs: Remove second variable named error in __dentry_path vfs: Is mounted should be testing mnt_ns for NULL or error. Fix race when checking i_size on direct i/o read hfsplus: remove can_set_xattr nfsd: use get_acl and ->set_acl fs: remove generic_acl nfs: use generic posix ACL infrastructure for v3 Posix ACLs gfs2: use generic posix ACL infrastructure jfs: use generic posix ACL infrastructure xfs: use generic posix ACL infrastructure reiserfs: use generic posix ACL infrastructure ocfs2: use generic posix ACL infrastructure jffs2: use generic posix ACL infrastructure hfsplus: use generic posix ACL infrastructure f2fs: use generic posix ACL infrastructure ext2/3/4: use generic posix ACL infrastructure btrfs: use generic posix ACL infrastructure fs: make posix_acl_create more useful fs: make posix_acl_chmod more useful ...
2014-01-27Merge branch 'akpm' (incoming from Andrew)Linus Torvalds4-19/+11
Merge misc updates from Andrew Morton: - a few hotfixes - dynamic-debug updates - ipc updates - various other sweepings off the factory floor * akpm: (31 commits) firmware/google: drop 'select EFI' to avoid recursive dependency compat: fix sys_fanotify_mark checkpatch.pl: check for function declarations without arguments mm/migrate.c: fix setting of cpupid on page migration twice against normal page softirq: use const char * const for softirq_to_name, whitespace neatening softirq: convert printks to pr_<level> softirq: use ffs() in __do_softirq() kernel/kexec.c: use vscnprintf() instead of vsnprintf() in vmcoreinfo_append_str() splice: fix unexpected size truncation ipc: fix compat msgrcv with negative msgtyp ipc,msg: document barriers ipc: delete seq_max field in struct ipc_ids ipc: simplify sysvipc_proc_open() return ipc: remove useless return statement ipc: remove braces for single statements ipc: standardize code comments ipc: whitespace cleanup ipc: change kern_ipc_perm.deleted type to bool ipc: introduce ipc_valid_object() helper to sort out IPC_RMID races ipc/sem.c: avoid overflow of semop undo (semadj) value ...
2014-01-27Merge branch 'next' of ↵Linus Torvalds1-3/+2
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc updates from Ben Herrenschmidt: "So here's my next branch for powerpc. A bit late as I was on vacation last week. It's mostly the same stuff that was in next already, I just added two patches today which are the wiring up of lockref for powerpc, which for some reason fell through the cracks last time and is trivial. The highlights are, in addition to a bunch of bug fixes: - Reworked Machine Check handling on kernels running without a hypervisor (or acting as a hypervisor). Provides hooks to handle some errors in real mode such as TLB errors, handle SLB errors, etc... - Support for retrieving memory error information from the service processor on IBM servers running without a hypervisor and routing them to the memory poison infrastructure. - _PAGE_NUMA support on server processors - 32-bit BookE relocatable kernel support - FSL e6500 hardware tablewalk support - A bunch of new/revived board support - FSL e6500 deeper idle states and altivec powerdown support You'll notice a generic mm change here, it has been acked by the relevant authorities and is a pre-req for our _PAGE_NUMA support" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (121 commits) powerpc: Implement arch_spin_is_locked() using arch_spin_value_unlocked() powerpc: Add support for the optimised lockref implementation powerpc/powernv: Call OPAL sync before kexec'ing powerpc/eeh: Escalate error on non-existing PE powerpc/eeh: Handle multiple EEH errors powerpc: Fix transactional FP/VMX/VSX unavailable handlers powerpc: Don't corrupt transactional state when using FP/VMX in kernel powerpc: Reclaim two unused thread_info flag bits powerpc: Fix races with irq_work Move precessing of MCE queued event out from syscall exit path. pseries/cpuidle: Remove redundant call to ppc64_runlatch_off() in cpu idle routines powerpc: Make add_system_ram_resources() __init powerpc: add SATA_MV to ppc64_defconfig powerpc/powernv: Increase candidate fw image size powerpc: Add debug checks to catch invalid cpu-to-node mappings powerpc: Fix the setup of CPU-to-Node mappings during CPU online powerpc/iommu: Don't detach device without IOMMU group powerpc/eeh: Hotplug improvement powerpc/eeh: Call opal_pci_reinit() on powernv for restoring config space powerpc/eeh: Add restore_config operation ...
2014-01-27Merge branch 'merge' of ↵Linus Torvalds1-9/+5
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc mremap fix from Ben Herrenschmidt: "This is the patch that I had sent after -rc8 and which we decided to wait before merging. It's based on a different tree than my -next branch (it needs some pre-reqs that were in -rc4 or so while my -next is based on -rc1) so I left it as a separate branch for your to pull. It's identical to the request I did 2 or 3 weeks back. This fixes crashes in mremap with THP on powerpc. The fix however requires a small change in the generic code. It moves a condition into a helper we can override from the arch which is harmless, but it *also* slightly changes the order of the set_pmd and the withdraw & deposit, which should be fine according to Kirill (who wrote that code) but I agree -rc8 is a bit late... It was acked by Kirill and Andrew told me to just merge it via powerpc" * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: powerpc/thp: Fix crash on mremap
2014-01-27mm/migrate.c: fix setting of cpupid on page migration twice against normal pageWanpeng Li1-2/+0
Commit 7851a45cd3f6 ("mm: numa: Copy cpupid on page migration") copies over the cpupid at page migration time. It is unnecessary to set it again in alloc_misplaced_dst_page(). Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27mm: bring back /sys/kernel/mmHugh Dickins1-1/+1
Commit da29bd36224b ("mm/mm_init.c: make creation of the mm_kobj happen earlier than device_initcall") changed to pure_initcall(mm_sysfs_init). That's too early: mm_sysfs_init() depends on core_initcall(ksysfs_init) to have made the kernel_kobj directory "kernel" in which to create "mm". Make it postcore_initcall(mm_sysfs_init). We could use core_initcall(), and depend upon Makefile link order kernel/ mm/ fs/ ipc/ security/ ... as core_initcall(debugfs_init) and core_initcall(securityfs_init) do; but better not. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27Revert "mm/vmalloc: interchage the implementation of vmalloc_to_{pfn,page}"malc1-10/+10
Revert commit ece86e222db4, which was intended as a small performance improvement. Despite the claim that the patch doesn't introduce any functional changes in fact it does. The "no page" path behaves different now. Originally, vmalloc_to_page might return NULL under some conditions, with new implementation it returns pfn_to_page(0) which is not the same as NULL. Simple test shows the difference. test.c #include <linux/kernel.h> #include <linux/module.h> #include <linux/vmalloc.h> #include <linux/mm.h> int __init myi(void) { struct page *p; void *v; v = vmalloc(PAGE_SIZE); /* trigger the "no page" path in vmalloc_to_page*/ vfree(v); p = vmalloc_to_page(v); pr_err("expected val = NULL, returned val = %p", p); return -EBUSY; } void __exit mye(void) { } module_init(myi) module_exit(mye) Before interchange: expected val = NULL, returned val = (null) After interchange: expected val = NULL, returned val = c7ebe000 Signed-off-by: Vladimir Murzin <murzin.v@gmail.com> Cc: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27memblock: don't silently align size in memblock_virt_alloc()Yinghai Lu1-6/+0
In original __alloc_memory_core_early() for bootmem wrapper, we do not align size silently. We should not do that, as later free with old size will leave some range not freed. It's obvious that code is copied from memblock_base_nid(), and that code is wrong for the same reason. Also remove that in memblock_alloc_base. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-26Fix race when checking i_size on direct i/o readSteven Whitehouse1-22/+20
So far I've had one ACK for this, and no other comments. So I think it is probably time to send this via some suitable tree. I'm guessing that the vfs tree would be the most appropriate route, but not sure that there is one at the moment (don't see anything recent at kernel.org) so in that case I think -mm is the "back up plan". Al, please let me know if you will take this? Steve. --------------------- Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes" thread on linux-fsdevel, this patch is my current version of the fix proposed as option (b) in that thread. Removing the i_size test from the direct i/o read path at vfs level means that filesystems now have to deal with requests which are beyond i_size themselves. These I've divided into three sets: a) Those with "no op" ->direct_IO (9p, cifs, ceph) These are obviously not going to be an issue b) Those with "home brew" ->direct_IO (nfs, fuse) I've been told that NFS should not have any problem with the larger i_size, however I've added an extra test to FUSE to duplicate the original behaviour just to be on the safe side. c) Those using __blockdev_direct_IO() These call through to ->get_block() which should deal with the EOF condition correctly. I've verified that with GFS2 and I believe that Zheng has verified it for ext4. I've also run the test on XFS and it passes both before and after this change. The part of the patch in filemap.c looks a lot larger than it really is - there are only two lines of real change. The rest is just indentation of the contained code. There remains a test of i_size though, which was added for btrfs. It doesn't cause the other filesystems a problem as the test is performed after ->direct_IO has been called. It is possible that there is a race that does matter to btrfs, however this patch doesn't change that, so its still an overall improvement. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Reported-by: Zheng Liu <gnehzuil.liu@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Acked-by: Miklos Szeredi <miklos@szeredi.hu> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-26fs: remove generic_aclChristoph Hellwig1-34/+23
And instead convert tmpfs to use the new generic ACL code, with two stub methods provided for in-memory filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25Merge branch 'linus' into x86/urgentIngo Molnar1-1/+4
Merge in the x86 changes to apply a fix. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-25mm, x86: Account for TLB flushes only when debuggingMel Gorman1-1/+3
Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm: vmstats: tlb flush counters") to cause overhead problems. The counters are undeniably useful but how often do we really need to debug TLB flush related issues? It does not justify taking the penalty everywhere so make it a debugging option. Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Alex Shi <alex.shi@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-23mm: ignore VM_SOFTDIRTY on VMA mergingCyrill Gorcunov1-2/+10
The VM_SOFTDIRTY bit affects vma merge routine: if two VMAs has all bits in vm_flags matched except dirty bit the kernel can't longer merge them and this forces the kernel to generate new VMAs instead. It finally may lead to the situation when userspace application reaches vm.max_map_count limit and get crashed in worse case | (gimp:11768): GLib-ERROR **: gmem.c:110: failed to allocate 4096 bytes | | (file-tiff-load:12038): LibGimpBase-WARNING **: file-tiff-load: gimp_wire_read(): error | xinit: connection to X server lost | | waiting for X server to shut down | /usr/lib64/gimp/2.0/plug-ins/file-tiff-load terminated: Hangup | /usr/lib64/gimp/2.0/plug-ins/script-fu terminated: Hangup | /usr/lib64/gimp/2.0/plug-ins/script-fu terminated: Hangup https://bugzilla.kernel.org/show_bug.cgi?id=67651 https://bugzilla.gnome.org/show_bug.cgi?id=719619#c0 Initial problem came from missed VM_SOFTDIRTY in do_brk() routine but even if we would set up VM_SOFTDIRTY here, there is still a way to prevent VMAs from merging: one can call | echo 4 > /proc/$PID/clear_refs and clear all VM_SOFTDIRTY over all VMAs presented in memory map, then new do_brk() will try to extend old VMA and finds that dirty bit doesn't match thus new VMA will be generated. As discussed with Pavel, the right approach should be to ignore VM_SOFTDIRTY bit when we're trying to merge VMAs and if merge successed we mark extended VMA with dirty bit where needed. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Reported-by: Bastian Hougaard <gnome@rvzt.net> Reported-by: Mel Gorman <mgorman@suse.de> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm/rmap: fix coccinelle warningsFengguang Wu1-2/+2
mm/rmap.c:851:9-10: WARNING: return of 0/1 in function 'invalid_mkclean_vma' with return type bool Return statements in functions returning bool should use true/false instead of 1/0. Generated by: coccinelle/misc/boolreturn.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm/swapfile.c: do not skip lowest_bit in scan_swap_map() scan loopJamie Liu1-1/+2
In the second half of scan_swap_map()'s scan loop, offset is set to si->lowest_bit and then incremented before entering the loop for the first time, causing si->swap_map[si->lowest_bit] to be skipped. Signed-off-by: Jamie Liu <jamieliu@google.com> Cc: Shaohua Li <shli@fusionio.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23memcg: remove unused code from kmem_cache_destroy_work_funcVladimir Davydov1-4/+2
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm: improve documentation of page_orderMel Gorman2-4/+9
Developers occasionally try and optimise PFN scanners by using page_order but miss that in general it requires zone->lock. This has happened twice for compaction.c and rejected both times. This patch clarifies the documentation of page_order and adds a note to compaction.c why page_order is not used. [akpm@linux-foundation.org: tweaks] [lauraa@codeaurora.org: Corrected a page_zone(page)->lock reference] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Laura Abbott <lauraa@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23memcg: fix css reference leak and endless loop in mem_cgroup_iterMichal Hocko1-5/+13
Commit 19f39402864e ("memcg: simplify mem_cgroup_iter") has reorganized mem_cgroup_iter code in order to simplify it. A part of that change was dropping an optimization which didn't call css_tryget on the root of the walked tree. The patch however didn't change the css_put part in mem_cgroup_iter which excludes root. This wasn't an issue at the time because __mem_cgroup_iter_next bailed out for root early without taking a reference as cgroup iterators (css_next_descendant_pre) didn't visit root themselves. Nevertheless cgroup iterators have been reworked to visit root by commit bd8815a6d802 ("cgroup: make css_for_each_descendant() and friends include the origin css in the iteration") when the root bypass have been dropped in __mem_cgroup_iter_next. This means that css_put is not called for root and so css along with mem_cgroup and other cgroup internal object tied by css lifetime are never freed. Fix the issue by reintroducing root check in __mem_cgroup_iter_next and do not take css reference for it. This reference counting magic protects us also from another issue, an endless loop reported by Hugh Dickins when reclaim races with root removal and css_tryget called by iterator internally would fail. There would be no other nodes to visit so __mem_cgroup_iter_next would return NULL and mem_cgroup_iter would interpret it as "start looping from root again" and so mem_cgroup_iter would loop forever internally. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Hugh Dickins <hughd@google.com> Tested-by: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Greg Thelen <gthelen@google.com> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23memcg: fix endless loop caused by mem_cgroup_iterMichal Hocko1-3/+14
Hugh has reported an endless loop when the hardlimit reclaim sees the same group all the time. This might happen when the reclaim races with the memcg removal. shrink_zone [rmdir root] mem_cgroup_iter(root, NULL, reclaim) // prev = NULL rcu_read_lock() mem_cgroup_iter_load last_visited = iter->last_visited // gets root || NULL css_tryget(last_visited) // failed last_visited = NULL [1] memcg = root = __mem_cgroup_iter_next(root, NULL) mem_cgroup_iter_update iter->last_visited = root; reclaim->generation = iter->generation mem_cgroup_iter(root, root, reclaim) // prev = root rcu_read_lock mem_cgroup_iter_load last_visited = iter->last_visited // gets root css_tryget(last_visited) // failed [1] The issue seemed to be introduced by commit 5f5781619718 ("memcg: relax memcg iter caching") which has replaced unconditional css_get/css_put by css_tryget/css_put for the cached iterator. This patch fixes the issue by skipping css_tryget on the root of the tree walk in mem_cgroup_iter_load and symmetrically doesn't release it in mem_cgroup_iter_update. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Hugh Dickins <hughd@google.com> Tested-by: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Greg Thelen <gthelen@google.com> Cc: <stable@vger.kernel.org> [3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm, oom: prefer thread group leaders for display purposesDavid Rientjes2-11/+20
When two threads have the same badness score, it's preferable to kill the thread group leader so that the actual process name is printed to the kernel log rather than the thread group name which may be shared amongst several processes. This was the behavior when select_bad_process() used to do for_each_process(), but it now iterates threads instead and leads to ambiguity. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm/memcg: iteration skip memcgs not yet fully initializedHugh Dickins1-4/+2
It is surprising that the mem_cgroup iterator can return memcgs which have not yet been fully initialized. By accident (or trial and error?) this appears not to present an actual problem; but it may be better to prevent such surprises, by skipping memcgs not yet online. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm/memcg: fix last_dead_count memory wastageHugh Dickins1-1/+1
Shorten mem_cgroup_reclaim_iter.last_dead_count from unsigned long to int: it's assigned from an int and compared with an int, and adjacent to an unsigned int: so there's no point to it being unsigned long, which wasted 104 bytes in every mem_cgroup_per_zone. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm: audit/fix non-modular users of module_init in core codePaul Gortmaker4-7/+6
Code that is obj-y (always built-in) or dependent on a bool Kconfig (built-in or absent) can never be modular. So using module_init as an alias for __initcall can be somewhat misleading. Fix these up now, so that we can relocate module_init from init.h into module.h in the future. If we don't do this, we'd have to add module.h to obviously non-modular code, and that would be a worse thing. The audit targets the following module_init users for change: mm/ksm.c bool KSM mm/mmap.c bool MMU mm/huge_memory.c bool TRANSPARENT_HUGEPAGE mm/mmu_notifier.c bool MMU_NOTIFIER Note that direct use of __initcall is discouraged, vs. one of the priority categorized subgroups. As __initcall gets mapped onto device_initcall, our use of subsys_initcall (which makes sense for these files) will thus change this registration from level 6-device to level 4-subsys (i.e. slightly earlier). However no observable impact of that difference has been observed during testing. One might think that core_initcall (l2) or postcore_initcall (l3) would be more appropriate for anything in mm/ but if we look at some actual init functions themselves, we see things like: mm/huge_memory.c --> hugepage_init --> hugepage_init_sysfs mm/mmap.c --> init_user_reserve --> sysctl_user_reserve_kbytes mm/ksm.c --> ksm_init --> sysfs_create_group and hence the choice of subsys_initcall (l4) seems reasonable, and at the same time minimizes the risk of changing the priority too drastically all at once. We can adjust further in the future. Also, several instances of missing ";" at EOL are fixed. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm/mm_init.c: make creation of the mm_kobj happen earlier than device_initcallPaul Gortmaker1-2/+1
The use of __initcall is to be eventually replaced by choosing one from the prioritized groupings laid out in init.h header: pure_initcall 0 core_initcall 1 postcore_initcall 2 arch_initcall 3 subsys_initcall 4 fs_initcall 5 device_initcall 6 late_initcall 7 In the interim, all __initcall are mapped onto device_initcall, which as can be seen above, comes quite late in the ordering. Currently the mm_kobj is created with __initcall in mm_sysfs_init(). This means that any other initcalls that want to reference the mm_kobj have to be device_initcall (or later), otherwise we will for example, trip the BUG_ON(!kobj) in sysfs's internal_create_group(). This unfairly restricts those users; for example something that clearly makes sense to be an arch_initcall will not be able to choose that. However, upon examination, it is only this way for historical reasons (i.e. simply not reprioritized yet). We see that sysfs is ready quite earlier in init/main.c via: vfs_caches_init |_ mnt_init |_ sysfs_init well ahead of the processing of the prioritized calls listed above. So we can recategorize mm_sysfs_init to be a pure_initcall, which in turn allows any mm_kobj initcall users a wider range (1 --> 7) of initcall priorities to choose from. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm: show message when updating min_free_kbytes in thpHan Pingtian3-2/+9
min_free_kbytes may be raised during THP's initialization. Sometimes, this will change the value which was set by the user. Showing this message will clarify this confusion. Only show this message when changing a value which was set by the user according to Michal Hocko's suggestion. Show the old value of min_free_kbytes according to Dave Hansen's suggestion. This will give user the chance to restore old value of min_free_kbytes. Signed-off-by: Han Pingtian <hanpt@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm/memory_hotplug.c: move register_memory_resource out of the ↵Nathan Zimmer1-3/+4
lock_memory_hotplug We don't need to do register_memory_resource() under lock_memory_hotplug() since it has its own lock and doesn't make any callbacks. Also register_memory_resource return NULL on failure so we don't have anything to cleanup at this point. The reason for this rfc is I was doing some experiments with hotplugging of memory on some of our larger systems. While it seems to work, it can be quite slow. With some preliminary digging I found that lock_memory_hotplug is clearly ripe for breakup. It could be broken up per nid or something but it also covers the online_page_callback. The online_page_callback shouldn't be very hard to break out. Also there is the issue of various structures(wmarks come to mind) that are only updated under the lock_memory_hotplug that would need to be dealt with. Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Hedi <hedi@sgi.com> Cc: Mike Travis <travis@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm/nobootmem: free_all_bootmem againPhilipp Hachtmann2-26/+16
get_allocated_memblock_reserved_regions_info() should work if it is compiled in. Extended the ifdef around get_allocated_memblock_memory_regions_info() to include get_allocated_memblock_reserved_regions_info() as well. Similar changes in nobootmem.c/free_low_memory_core_early() where the two functions are called. [akpm@linux-foundation.org: cleanup] Signed-off-by: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Cc: qiuxishi <qiuxishi@huawei.com> Cc: David Howells <dhowells@redhat.com> Cc: Daeseok Youn <daeseok.youn@gmail.com> Cc: Jiang Liu <liuj97@gmail.com> Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>