summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2012-05-29mm/fs: remove truncate_rangeHugh Dickins2-26/+0
Remove vmtruncate_range(), and remove the truncate_range method from struct inode_operations: only tmpfs ever supported it, and tmpfs has now converted over to using the fallocate method of file_operations. Update Documentation accordingly, adding (setlease and) fallocate lines. And while we're in mm.h, remove duplicate declarations of shmem_lock() and shmem_file_setup(): everyone is now using the ones in shmem_fs.h. Based-on-patch-by: Cong Wang <amwang@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Cong Wang <amwang@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLEHugh Dickins1-8/+7
Now tmpfs supports hole-punching via fallocate(), switch madvise_remove() to use do_fallocate() instead of vmtruncate_range(): which extends madvise(,,MADV_REMOVE) support from tmpfs to ext4, ocfs2 and xfs. There is one more user of vmtruncate_range() in our tree, staging/android's ashmem_shrink(): convert it to use do_fallocate() too (but if its unpinned areas are already unmapped - I don't know - then it would do better to use shmem_truncate_range() directly). Based-on-patch-by: Cong Wang <amwang@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Colin Cross <ccross@android.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger@dilger.ca> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Ben Myers <bpm@sgi.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29tmpfs: support fallocate FALLOC_FL_PUNCH_HOLEHugh Dickins1-11/+57
tmpfs has supported hole-punching since 2.6.16, via madvise(,,MADV_REMOVE). But nowadays fallocate(,FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,,) is the agreed way to punch holes. So add shmem_fallocate() to support that, and tweak shmem_truncate_range() to support partial pages at both the beginning and end of range (never needed for madvise, which demands rounded addr and rounds up length). Based-on-patch-by: Cong Wang <amwang@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Cong Wang <amwang@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29tmpfs: optimize clearing when writingHugh Dickins1-3/+17
Nick proposed years ago that tmpfs should avoid clearing its pages where write will overwrite them with new data, as ramfs has long done. But I messed it up and just got bad data. Tried again recently, it works fine. Here's time output for writing 4GiB 16 times on this Core i5 laptop: before: real 0m21.169s user 0m0.028s sys 0m21.057s real 0m21.382s user 0m0.016s sys 0m21.289s real 0m21.311s user 0m0.020s sys 0m21.217s after: real 0m18.273s user 0m0.032s sys 0m18.165s real 0m18.354s user 0m0.020s sys 0m18.265s real 0m18.440s user 0m0.032s sys 0m18.337s ramfs: real 0m16.860s user 0m0.028s sys 0m16.765s real 0m17.382s user 0m0.040s sys 0m17.273s real 0m17.133s user 0m0.044s sys 0m17.021s Yes, I have done perf reports, but they need more explanation than they deserve: in summary, clear_page vanishes, its cache loading shifts into copy_user_generic_unrolled; shmem_getpage_gfp goes down, and surprisingly mark_page_accessed goes way up - I think because they are respectively where the cache gets to be reloaded after being purged by clear or copy. Suggested-by: Nick Piggin <npiggin@gmail.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29tmpfs: enable NOSEC optimizationHugh Dickins1-0/+1
Let tmpfs into the NOSEC optimization (avoiding file_remove_suid() overhead on most common writes): set MS_NOSEC on its superblocks. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29shmem: replace page if mapping excludes its zoneHugh Dickins3-24/+136
The GMA500 GPU driver uses GEM shmem objects, but with a new twist: the backing RAM has to be below 4GB. Not a problem while the boards supported only 4GB: but now Intel's D2700MUD boards support 8GB, and their GMA3600 is managed by the GMA500 driver. shmem/tmpfs has never pretended to support hardware restrictions on the backing memory, but it might have appeared to do so before v3.1, and even now it works fine until a page is swapped out then back in. When read_cache_page_gfp() supplied a freshly allocated page for copy, that compensated for whatever choice might have been made by earlier swapin readahead; but swapoff was likely to destroy the illusion. We'd like to continue to support GMA500, so now add a new shmem_should_replace_page() check on the zone when about to move a page from swapcache to filecache (in swapin and swapoff cases), with shmem_replace_page() to allocate and substitute a suitable page (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32). This does involve a minor extension to mem_cgroup_replace_page_cache() (the page may or may not have already been charged); and I've removed a comment and call to mem_cgroup_uncharge_cache_page(), which in fact is always a no-op while PageSwapCache. Also removed optimization of an unlikely path in shmem_getpage_gfp(), now that we need to check PageSwapCache more carefully (a racing caller might already have made the copy). And at one point shmem_unuse_inode() needs to use the hitherto private page_swapcount(), to guard against racing with inode eviction. It would make sense to extend shmem_should_replace_page(), to cover cpuset and NUMA mempolicy restrictions too, but set that aside for now: needs a cleanup of shmem mempolicy handling, and more testing, and ought to handle swap faults in do_swap_page() as well as shmem. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Stephane Marchesin <marcheu@chromium.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Rob Clark <rob.clark@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocksBartlomiej Zolnierkiewicz3-28/+131
When MIGRATE_UNMOVABLE pages are freed from MIGRATE_UNMOVABLE type pageblock (and some MIGRATE_MOVABLE pages are left in it) waiting until an allocation takes ownership of the block may take too long. The type of the pageblock remains unchanged so the pageblock cannot be used as a migration target during compaction. Fix it by: * Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE], and COMPACT_SYNC) and then converting sync field in struct compact_control to use it. * Adding nr_pageblocks_skipped field to struct compact_control and tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type. If COMPACT_ASYNC_MOVABLE mode compaction ran fully in try_to_compact_pages() (COMPACT_COMPLETE) it implies that there is not a suitable page for allocation. In this case then check how if there were enough MIGRATE_UNMOVABLE pageblocks to try a second pass in COMPACT_ASYNC_UNMOVABLE mode. * Scanning the MIGRATE_UNMOVABLE pageblocks (during COMPACT_SYNC and COMPACT_ASYNC_UNMOVABLE compaction modes) and building a count based on finding PageBuddy pages, page_count(page) == 0 or PageLRU pages. If all pages within the MIGRATE_UNMOVABLE pageblock are in one of those three sets change the whole pageblock type to MIGRATE_MOVABLE. My particular test case (on a ARM EXYNOS4 device with 512 MiB, which means 131072 standard 4KiB pages in 'Normal' zone) is to: - allocate 120000 pages for kernel's usage - free every second page (60000 pages) of memory just allocated - allocate and use 60000 pages from user space - free remaining 60000 pages of kernel memory (now we have fragmented memory occupied mostly by user space pages) - try to allocate 100 order-9 (2048 KiB) pages for kernel's usage The results: - with compaction disabled I get 11 successful allocations - with compaction enabled - 14 successful allocations - with this patch I'm able to get all 100 successful allocations NOTE: If we can make kswapd aware of order-0 request during compaction, we can enhance kswapd with changing mode to COMPACT_ASYNC_FULL (COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE). Please see the following thread: http://marc.info/?l=linux-mm&m=133552069417068&w=2 [minchan@kernel.org: minor cleanups] Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: remove sparsemem allocation details from the bootmem allocatorJohannes Weiner3-57/+12
alloc_bootmem_section() derives allocation area constraints from the specified sparsemem section. This is a bit specific for a generic memory allocator like bootmem, though, so move it over to sparsemem. As __alloc_bootmem_node_nopanic() already retries failed allocations with relaxed area constraints, the fallback code in sparsemem.c can be removed and the code becomes a bit more compact overall. [akpm@linux-foundation.org: fix build] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: bootmem: pass pgdat instead of pgdat->bdata down the stackJohannes Weiner1-10/+10
Pass down the node descriptor instead of the more specific bootmem node descriptor down the call stack, like nobootmem does, when there is no good reason for the two to be different. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: nobootmem: unify allocation policy of (non-)panicking node allocationsJohannes Weiner1-52/+54
While the panicking node-specific allocation function tries to satisfy node+goal, goal, node, anywhere, the non-panicking function still does node+goal, goal, anywhere. Make it simpler: define the panicking version in terms of the non-panicking one, like the node-agnostic interface, so they always behave the same way apart from how to deal with allocation failure. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: nobootmem: panic on node-specific allocation failureJohannes Weiner1-4/+16
__alloc_bootmem_node and __alloc_bootmem_low_node documentation claims the functions panic on allocation failure. Do it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: bootmem: unify allocation policy of (non-)panicking node allocationsJohannes Weiner1-20/+24
While the panicking node-specific allocation function tries to satisfy node+goal, goal, node, anywhere, the non-panicking function still does node+goal, goal, anywhere. Make it simpler: define the panicking version in terms of the non-panicking one, like the node-agnostic interface, so they always behave the same way apart from how to deal with allocation failure. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: bootmem: allocate in order node+goal, goal, node, anywhereJohannes Weiner1-1/+13
Match the nobootmem version of __alloc_bootmem_node. Try to satisfy both the node and the goal, then just the goal, then just the node, then allocate anywhere before panicking. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: bootmem: split out goal-to-node mapping from goal droppingJohannes Weiner1-2/+15
Matching the desired goal to the right node is one thing, dropping the goal when it can not be satisfied is another. Split this into separate functions so that subsequent patches can use the node-finding but drop and handle the goal fallback on their own terms. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: bootmem: rename alloc_bootmem_core to alloc_bootmem_bdataJohannes Weiner1-7/+7
Callsites need to provide a bootmem_data_t *, make the naming more descriptive. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: bootmem: remove redundant offset check when finally freeing bootmemJohannes Weiner1-1/+1
When bootmem releases an unaligned BITS_PER_LONG pages chunk of memory to the page allocator, it checks the bitmap if there are still unreserved pages in the chunk (set bits), but also if the offset in the chunk indicates BITS_PER_LONG loop iterations already. But since the consulted bitmap is only a one-word-excerpt of the full per-node bitmap, there can not be more than BITS_PER_LONG bits set in it. The additional offset check is unnecessary. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: bootmem: fix checking the bitmap when finally freeing bootmemGavin Shan1-0/+1
When bootmem releases an unaligned chunk of memory at the beginning of a node to the page allocator, it iterates from that unaligned PFN but checks an aligned word of the page bitmap. The checked bits do not correspond to the PFNs and, as a result, reserved pages can be freed. Properly shift the bitmap word so that the lowest bit corresponds to the starting PFN before entering the freeing loop. This bug has been around since commit 41546c17418f ("bootmem: clean up free_all_bootmem_core") (2.6.27) without known reports. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm/page_alloc.c: remove pageblock_default_order()Andrew Morton1-18/+15
This has always been broken: one version takes an unsigned int and the other version takes no arguments. This bug was hidden because one version of set_pageblock_order() was a macro which doesn't evaluate its argument. Simplify it all and remove pageblock_default_order() altogether. Reported-by: rajman mekaco <rajman.mekaco@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: print physical addresses consistently with other parts of kernelBjorn Helgaas2-14/+19
Print physical address info in a style consistent with the %pR style used elsewhere in the kernel. For example: -Zone PFN ranges: +Zone ranges: - DMA32 0x00000010 -> 0x00100000 + DMA32 [mem 0x00010000-0xffffffff] - Normal 0x00100000 -> 0x01080000 + Normal [mem 0x100000000-0x107fffffff] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29Cross Memory Attach: make it KconfigurableChristopher Yeoh2-2/+15
Add a Kconfig option to allow people who don't want cross memory attach to not have it included in their build. Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm, thp: drop page_table_lock to uncharge memcg pagesDavid Rientjes1-0/+2
mm->page_table_lock is hotly contested for page fault tests and isn't necessary to do mem_cgroup_uncharge_page() in do_huge_pmd_wp_page(). Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: rename is_mlocked_vma() to mlocked_vma_newpage()Ying Han2-3/+4
Andrew pointed out that the is_mlocked_vma() is misnamed. A function with name like that would expect bool return and no side-effects. Since it is called on the fault path for new page, rename it in this patch. Signed-off-by: Ying Han <yinghan@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com> Reviewed-by: Minchan Kim <minchan@kernel.org> [akpm@linux-foundation.org: s/mlock_vma_newpage/mlock_vma_newpage/, per Minchan] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: memcg: count pte references from every member of the reclaimed hierarchyJohannes Weiner2-7/+15
The rmap walker checking page table references has historically ignored references from VMAs that were not part of the memcg that was being reclaimed during memcg hard limit reclaim. When transitioning global reclaim to memcg hierarchy reclaim, I missed that bit and now references from outside a memcg are ignored even during global reclaim. Reverting back to traditional behaviour - count all references during global reclaim and only mind references of the memcg being reclaimed during limit reclaim would be one option. However, the more generic idea is to ignore references exactly then when they are outside the hierarchy that is currently under reclaim; because only then will their reclamation be of any use to help the pressure situation. It makes no sense to ignore references from a sibling memcg and then evict a page that will be immediately refaulted by that sibling which contributes to the same usage of the common ancestor under reclaim. The solution: make the rmap walker ignore references from VMAs that are not part of the hierarchy that is being reclaimed. Flat limit reclaim will stay the same, hierarchical limit reclaim will mind the references only to pages that the hierarchy owns. Global reclaim, since it reclaims from all memcgs, will be fixed to regard all references. [akpm@linux-foundation.org: name the args in the declaration] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Konstantin Khlebnikov<khlebnikov@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29kernel: cgroup: push rcu read locking from css_is_ancestor() to callsiteJohannes Weiner1-5/+9
Library functions should not grab locks when the callsites can do it, even if the lock nests like the rcu read-side lock does. Push the rcu_read_lock() from css_is_ancestor() to its single user, mem_cgroup_same_or_subtree() in preparation for another user that may already hold the rcu read-side lock. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: do_migrate_pages(): rename argumentsAndrew Morton1-9/+9
s/from_nodes/from and s/to_nodes/to/. The "_nodes" is redundant - it duplicates the argument's type. Done in a fit of irritation over 80-col issues :( Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <mkosaki@redhat.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: do_migrate_pages() calls migrate_to_node() even if task is already on a ↵Larry Woodman1-0/+20
correct node While running an application that moves tasks from one cpuset to another I noticed that it takes much longer and moves many more pages than expected. The reason for this is do_migrate_pages() does its best to preserve the relative node differential from the first node of the cpuset because the application may have been written with that in mind. If memory was interleaved on the nodes of the source cpuset by an application do_migrate_pages() will try its best to maintain that interleaving on the nodes of the destination cpuset. This means copying the memory from all source nodes to the destination nodes even if the source and destination nodes overlap. This is a problem for userspace NUMA placement tools. The amount of time spent doing extra memory moves cancels out some of the NUMA performance improvements. Furthermore, if the number of source and destination nodes are to maintain the previous interleaving layout anyway. This patch changes do_migrate_pages() to only preserve the relative layout inside the program if the number of NUMA nodes in the source and destination mask are the same. If the number is different, we do a much more efficient migration by not touching memory that is in an allowed node. This preserves the old behaviour for programs that want it, while allowing a userspace NUMA placement tool to use the new, faster migration. This improves performance in our tests by up to a factor of 7. Without this change migrating tasks from a cpuset containing nodes 0-7 to a cpuset containing nodes 3-4, we migrate from ALL the nodes even if they are in the both the source and destination nodesets: Migrating 7 to 4 Migrating 6 to 3 Migrating 5 to 4 Migrating 4 to 3 Migrating 1 to 4 Migrating 3 to 4 Migrating 0 to 3 Migrating 2 to 3 With this change we only migrate from nodes that are not in the destination nodesets: Migrating 7 to 4 Migrating 6 to 3 Migrating 5 to 4 Migrating 2 to 3 Migrating 1 to 4 Migrating 0 to 3 Yet if we move from a cpuset containing nodes 2,3,4 to a cpuset containing 3,4,5 we still do move everything so that we preserve the desired NUMA offsets: Migrating 4 to 5 Migrating 3 to 4 Migrating 2 to 3 As far as performance is concerned this simple patch improves the time it takes to move 14, 20 and 26 large tasks from a cpuset containing nodes 0-7 to a cpuset containing nodes 1 & 3 by up to a factor of 7. Here are the timings with and without the patch: BEFORE PATCH -- Move times: 59, 140, 651 seconds ============ Moving 14 tasks from nodes (0-7) to nodes (1,3) numad(8780) do_migrate_pages (mm=0xffff88081d414400 from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x7 dest=0x3 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x6 dest=0x1 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x5 dest=0x3 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x4 dest=0x1 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x2 dest=0x1 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x1 dest=0x3 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x0 dest=0x1 flags=0x4) (Above moves repeated for each of the 14 tasks...) PID 8890 moved to node(s) 1,3 in 59.2 seconds Moving 20 tasks from nodes (0-7) to nodes (1,4-5) numad(8780) do_migrate_pages (mm=0xffff88081d88c700 from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x7 dest=0x4 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x6 dest=0x1 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x3 dest=0x1 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x2 dest=0x5 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x1 dest=0x4 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x0 dest=0x1 flags=0x4) (Above moves repeated for each of the 20 tasks...) PID 8962 moved to node(s) 1,4-5 in 139.88 seconds Moving 26 tasks from nodes (0-7) to nodes (1-3,5) numad(8780) do_migrate_pages (mm=0xffff88081d5bc740 from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x7 dest=0x5 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x6 dest=0x3 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x5 dest=0x2 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x3 dest=0x5 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x2 dest=0x3 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x1 dest=0x2 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x0 dest=0x1 flags=0x4) numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x4 dest=0x1 flags=0x4) (Above moves repeated for each of the 26 tasks...) PID 9058 moved to node(s) 1-3,5 in 651.45 seconds AFTER PATCH -- Move times: 42, 56, 93 seconds =========== Moving 14 tasks from nodes (0-7) to nodes (5,7) numad(33209) do_migrate_pages (mm=0xffff88101d5ff140 from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4) numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x6 dest=0x5 flags=0x4) numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x4 dest=0x5 flags=0x4) numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x3 dest=0x7 flags=0x4) numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x2 dest=0x5 flags=0x4) numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x1 dest=0x7 flags=0x4) numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x0 dest=0x5 flags=0x4) (Above moves repeated for each of the 14 tasks...) PID 33221 moved to node(s) 5,7 in 41.67 seconds Moving 20 tasks from nodes (0-7) to nodes (1,3,5) numad(33209) do_migrate_pages (mm=0xffff88101d6c37c0 from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4) numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x7 dest=0x3 flags=0x4) numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x6 dest=0x1 flags=0x4) numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x4 dest=0x3 flags=0x4) numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x2 dest=0x5 flags=0x4) numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x0 dest=0x1 flags=0x4) (Above moves repeated for each of the 20 tasks...) PID 33289 moved to node(s) 1,3,5 in 56.3 seconds Moving 26 tasks from nodes (0-7) to nodes (1,3,5,7) numad(33209) do_migrate_pages (mm=0xffff88101d924400 from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4) numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x6 dest=0x5 flags=0x4) numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x4 dest=0x1 flags=0x4) numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x2 dest=0x5 flags=0x4) numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x0 dest=0x1 flags=0x4) (Above moves repeated for each of the 26 tasks...) PID 33372 moved to node(s) 1,3,5,7 in 92.67 seconds [akpm@linux-foundation.org: clean up comment layout] Signed-off-by: Larry Woodman <lwoodman@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29thp, memcg: split hugepage for memcg oom on cowDavid Rientjes2-3/+18
On COW, a new hugepage is allocated and charged to the memcg. If the system is oom or the charge to the memcg fails, however, the fault handler will return VM_FAULT_OOM which results in an oom kill. Instead, it's possible to fallback to splitting the hugepage so that the COW results only in an order-0 page being allocated and charged to the memcg which has a higher liklihood to succeed. This is expensive because the hugepage must be split in the page fault handler, but it is much better than unnecessarily oom killing a process. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm/vmstat.c: remove debug fs entries on failure of file creation and made ↵Sasikantha babu1-3/+7
extfrag_debug_root dentry local Remove debug fs files and directory on failure. Since no one is using "extfrag_debug_root" dentry outside of extfrag_debug_init(), make it local to the function. Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm/mmap.c: find_vma(): remove unnecessary if(mm) checkRajman Mekaco1-26/+27
The "if (mm)" check is not required in find_vma, as the kernel code calls find_vma only when it is absolutely sure that the mm_struct arg to it is non-NULL. Remove the if(mm) check and adding the a WARN_ONCE(!mm) for now. This will serve the purpose of mandating that the execution context(user-mode/kernel-mode) be known before find_vma is called. Also fixed 2 checkpatch.pl errors in the declaration of the rb_node and vma_tmp local variables. I was browsing through the internet and read a discussion at https://lkml.org/lkml/2012/3/27/342 which discusses removal of the validation check within find_vma. Since no-one responded, I decided to send this patch with Andrew's suggestions. [akpm@linux-foundation.org: add remove-me comment] Signed-off-by: Rajman Mekaco <rajman.mekaco@gmail.com> Cc: Kautuk Consul <consul.kautuk@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: use kcalloc() instead of kzalloc() to allocate arrayThomas Meyer1-2/+2
The advantage of kcalloc is, that will prevent integer overflows which could result from the multiplication of number of elements and size and it is also a bit nicer to read. The semantic patch that makes this change is available in https://lkml.org/lkml/2011/11/25/107 Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: vmscan: remove reclaim_mode_tMel Gorman1-50/+22
There is little motiviation for reclaim_mode_t once RECLAIM_MODE_[A]SYNC and lumpy reclaim have been removed. This patch gets rid of reclaim_mode_t as well and improves the documentation about what reclaim/compaction is and when it is triggered. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: vmscan: do not stall on writeback during memory compactionMel Gorman1-77/+8
This patch stops reclaim/compaction entering sync reclaim as this was only intended for lumpy reclaim and an oversight. Page migration has its own logic for stalling on writeback pages if necessary and memory compaction is already using it. Waiting on page writeback is bad for a number of reasons but the primary one is that waiting on writeback to a slow device like USB can take a considerable length of time. Page reclaim instead uses wait_iff_congested() to throttle if too many dirty pages are being scanned. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: vmscan: remove lumpy reclaimMel Gorman1-129/+15
This series removes lumpy reclaim and some stalling logic that was unintentionally being used by memory compaction. The end result is that stalling on dirty pages during page reclaim now depends on wait_iff_congested(). Four kernels were compared 3.3.0 vanilla 3.4.0-rc2 vanilla 3.4.0-rc2 lumpyremove-v2 is patch one from this series 3.4.0-rc2 nosync-v2r3 is the full series Removing lumpy reclaim saves almost 900 bytes of text whereas the full series removes 1200 bytes. text data bss dec hex filename 6740375 1927944 2260992 10929311 a6c49f vmlinux-3.4.0-rc2-vanilla 6739479 1927944 2260992 10928415 a6c11f vmlinux-3.4.0-rc2-lumpyremove-v2 6739159 1927944 2260992 10928095 a6bfdf vmlinux-3.4.0-rc2-nosync-v2 There are behaviour changes in the series and so tests were run with monitoring of ftrace events. This disrupts results so the performance results are distorted but the new behaviour should be clearer. fs-mark running in a threaded configuration showed little of interest as it did not push reclaim aggressively FS-Mark Multi Threaded 3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3 Files/s min 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) Files/s mean 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) Files/s stddev 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Files/s max 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) Overhead min 508667.00 ( 0.00%) 521350.00 (-2.49%) 544292.00 (-7.00%) 547168.00 (-7.57%) Overhead mean 551185.00 ( 0.00%) 652690.73 (-18.42%) 991208.40 (-79.83%) 570130.53 (-3.44%) Overhead stddev 18200.69 ( 0.00%) 331958.29 (-1723.88%) 1579579.43 (-8578.68%) 9576.81 (47.38%) Overhead max 576775.00 ( 0.00%) 1846634.00 (-220.17%) 6901055.00 (-1096.49%) 585675.00 (-1.54%) MMTests Statistics: duration Sys Time Running Test (seconds) 309.90 300.95 307.33 298.95 User+Sys Time Running Test (seconds) 319.32 309.67 315.69 307.51 Total Elapsed Time (seconds) 1187.85 1193.09 1191.98 1193.73 MMTests Statistics: vmstat Page Ins 80532 82212 81420 79480 Page Outs 111434984 111456240 111437376 111582628 Swap Ins 0 0 0 0 Swap Outs 0 0 0 0 Direct pages scanned 44881 27889 27453 34843 Kswapd pages scanned 25841428 25860774 25861233 25843212 Kswapd pages reclaimed 25841393 25860741 25861199 25843179 Direct pages reclaimed 44881 27889 27453 34843 Kswapd efficiency 99% 99% 99% 99% Kswapd velocity 21754.791 21675.460 21696.029 21649.127 Direct efficiency 100% 100% 100% 100% Direct velocity 37.783 23.375 23.031 29.188 Percentage direct scans 0% 0% 0% 0% ftrace showed that there was no stalling on writeback or pages submitted for IO from reclaim context. postmark was similar and while it was more interesting, it also did not push reclaim heavily. POSTMARK 3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3 Transactions per second: 16.00 ( 0.00%) 20.00 (25.00%) 18.00 (12.50%) 17.00 ( 6.25%) Data megabytes read per second: 18.80 ( 0.00%) 24.27 (29.10%) 22.26 (18.40%) 20.54 ( 9.26%) Data megabytes written per second: 35.83 ( 0.00%) 46.25 (29.08%) 42.42 (18.39%) 39.14 ( 9.24%) Files created alone per second: 28.00 ( 0.00%) 38.00 (35.71%) 34.00 (21.43%) 30.00 ( 7.14%) Files create/transact per second: 8.00 ( 0.00%) 10.00 (25.00%) 9.00 (12.50%) 8.00 ( 0.00%) Files deleted alone per second: 556.00 ( 0.00%) 1224.00 (120.14%) 3062.00 (450.72%) 6124.00 (1001.44%) Files delete/transact per second: 8.00 ( 0.00%) 10.00 (25.00%) 9.00 (12.50%) 8.00 ( 0.00%) MMTests Statistics: duration Sys Time Running Test (seconds) 113.34 107.99 109.73 108.72 User+Sys Time Running Test (seconds) 145.51 139.81 143.32 143.55 Total Elapsed Time (seconds) 1159.16 899.23 980.17 1062.27 MMTests Statistics: vmstat Page Ins 13710192 13729032 13727944 13760136 Page Outs 43071140 42987228 42733684 42931624 Swap Ins 0 0 0 0 Swap Outs 0 0 0 0 Direct pages scanned 0 0 0 0 Kswapd pages scanned 9941613 9937443 9939085 9929154 Kswapd pages reclaimed 9940926 9936751 9938397 9928465 Direct pages reclaimed 0 0 0 0 Kswapd efficiency 99% 99% 99% 99% Kswapd velocity 8576.567 11051.058 10140.164 9347.109 Direct efficiency 100% 100% 100% 100% Direct velocity 0.000 0.000 0.000 0.000 It looks like here that the full series regresses performance but as ftrace showed no usage of wait_iff_congested() or sync reclaim I am assuming it's a disruption due to monitoring. Other data such as memory usage, page IO, swap IO all looked similar. Running a benchmark with a plain DD showed nothing very interesting. The full series stalled in wait_iff_congested() slightly less but stall times on vanilla kernels were marginal. Running a benchmark that hammered on file-backed mappings showed stalls due to congestion but not in sync writebacks MICRO 3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3 MMTests Statistics: duration Sys Time Running Test (seconds) 308.13 294.50 298.75 299.53 User+Sys Time Running Test (seconds) 330.45 316.28 318.93 320.79 Total Elapsed Time (seconds) 1814.90 1833.88 1821.14 1832.91 MMTests Statistics: vmstat Page Ins 108712 120708 97224 110344 Page Outs 155514576 156017404 155813676 156193256 Swap Ins 0 0 0 0 Swap Outs 0 0 0 0 Direct pages scanned 2599253 1550480 2512822 2414760 Kswapd pages scanned 69742364 71150694 68839041 69692533 Kswapd pages reclaimed 34824488 34773341 34796602 34799396 Direct pages reclaimed 53693 94750 61792 75205 Kswapd efficiency 49% 48% 50% 49% Kswapd velocity 38427.662 38797.901 37799.972 38022.889 Direct efficiency 2% 6% 2% 3% Direct velocity 1432.174 845.464 1379.807 1317.446 Percentage direct scans 3% 2% 3% 3% Page writes by reclaim 0 0 0 0 Page writes file 0 0 0 0 Page writes anon 0 0 0 0 Page reclaim immediate 0 0 0 1218 Page rescued immediate 0 0 0 0 Slabs scanned 15360 16384 13312 16384 Direct inode steals 0 0 0 0 Kswapd inode steals 4340 4327 1630 4323 FTrace Reclaim Statistics: congestion_wait Direct number congest waited 0 0 0 0 Direct time congest waited 0ms 0ms 0ms 0ms Direct full congest waited 0 0 0 0 Direct number conditional waited 900 870 754 789 Direct time conditional waited 0ms 0ms 0ms 20ms Direct full conditional waited 0 0 0 0 KSwapd number congest waited 2106 2308 2116 1915 KSwapd time congest waited 139924ms 157832ms 125652ms 132516ms KSwapd full congest waited 1346 1530 1202 1278 KSwapd number conditional waited 12922 16320 10943 14670 KSwapd time conditional waited 0ms 0ms 0ms 0ms KSwapd full conditional waited 0 0 0 0 Reclaim statistics are not radically changed. The stall times in kswapd are massive but it is clear that it is due to calls to congestion_wait() and that is almost certainly the call in balance_pgdat(). Otherwise stalls due to dirty pages are non-existant. I ran a benchmark that stressed high-order allocation. This is very artifical load but was used in the past to evaluate lumpy reclaim and compaction. Generally I look at allocation success rates and latency figures. STRESS-HIGHALLOC 3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3 Pass 1 81.00 ( 0.00%) 28.00 (-53.00%) 24.00 (-57.00%) 28.00 (-53.00%) Pass 2 82.00 ( 0.00%) 39.00 (-43.00%) 38.00 (-44.00%) 43.00 (-39.00%) while Rested 88.00 ( 0.00%) 87.00 (-1.00%) 88.00 ( 0.00%) 88.00 ( 0.00%) MMTests Statistics: duration Sys Time Running Test (seconds) 740.93 681.42 685.14 684.87 User+Sys Time Running Test (seconds) 2922.65 3269.52 3281.35 3279.44 Total Elapsed Time (seconds) 1161.73 1152.49 1159.55 1161.44 MMTests Statistics: vmstat Page Ins 4486020 2807256 2855944 2876244 Page Outs 7261600 7973688 7975320 7986120 Swap Ins 31694 0 0 0 Swap Outs 98179 0 0 0 Direct pages scanned 53494 57731 34406 113015 Kswapd pages scanned 6271173 1287481 1278174 1219095 Kswapd pages reclaimed 2029240 1281025 1260708 1201583 Direct pages reclaimed 1468 14564 16649 92456 Kswapd efficiency 32% 99% 98% 98% Kswapd velocity 5398.133 1117.130 1102.302 1049.641 Direct efficiency 2% 25% 48% 81% Direct velocity 46.047 50.092 29.672 97.306 Percentage direct scans 0% 4% 2% 8% Page writes by reclaim 1616049 0 0 0 Page writes file 1517870 0 0 0 Page writes anon 98179 0 0 0 Page reclaim immediate 103778 27339 9796 17831 Page rescued immediate 0 0 0 0 Slabs scanned 1096704 986112 980992 998400 Direct inode steals 223 215040 216736 247881 Kswapd inode steals 175331 61548 68444 63066 Kswapd skipped wait 21991 0 1 0 THP fault alloc 1 135 125 134 THP collapse alloc 393 311 228 236 THP splits 25 13 7 8 THP fault fallback 0 0 0 0 THP collapse fail 3 5 7 7 Compaction stalls 865 1270 1422 1518 Compaction success 370 401 353 383 Compaction failures 495 869 1069 1135 Compaction pages moved 870155 3828868 4036106 4423626 Compaction move failure 26429 23865 29742 27514 Success rates are completely hosed for 3.4-rc2 which is almost certainly due to commit fe2c2a106663 ("vmscan: reclaim at order 0 when compaction is enabled"). I expected this would happen for kswapd and impair allocation success rates (https://lkml.org/lkml/2012/1/25/166) but I did not anticipate this much a difference: 80% less scanning, 37% less reclaim by kswapd In comparison, reclaim/compaction is not aggressive and gives up easily which is the intended behaviour. hugetlbfs uses __GFP_REPEAT and would be much more aggressive about reclaim/compaction than THP allocations are. The stress test above is allocating like neither THP or hugetlbfs but is much closer to THP. Mainline is now impaired in terms of high order allocation under heavy load although I do not know to what degree as I did not test with __GFP_REPEAT. Keep this in mind for bugs related to hugepage pool resizing, THP allocation and high order atomic allocation failures from network devices. In terms of congestion throttling, I see the following for this test FTrace Reclaim Statistics: congestion_wait Direct number congest waited 3 0 0 0 Direct time congest waited 0ms 0ms 0ms 0ms Direct full congest waited 0 0 0 0 Direct number conditional waited 957 512 1081 1075 Direct time conditional waited 0ms 0ms 0ms 0ms Direct full conditional waited 0 0 0 0 KSwapd number congest waited 36 4 3 5 KSwapd time congest waited 3148ms 400ms 300ms 500ms KSwapd full congest waited 30 4 3 5 KSwapd number conditional waited 88514 197 332 542 KSwapd time conditional waited 4980ms 0ms 0ms 0ms KSwapd full conditional waited 49 0 0 0 The "conditional waited" times are the most interesting as this is directly impacted by the number of dirty pages encountered during scan. As lumpy reclaim is no longer scanning contiguous ranges, it is finding fewer dirty pages. This brings wait times from about 5 seconds to 0. kswapd itself is still calling congestion_wait() so it'll still stall but it's a lot less. In terms of the type of IO we were doing, I see this FTrace Reclaim Statistics: mm_vmscan_writepage Direct writes anon sync 0 0 0 0 Direct writes anon async 0 0 0 0 Direct writes file sync 0 0 0 0 Direct writes file async 0 0 0 0 Direct writes mixed sync 0 0 0 0 Direct writes mixed async 0 0 0 0 KSwapd writes anon sync 0 0 0 0 KSwapd writes anon async 91682 0 0 0 KSwapd writes file sync 0 0 0 0 KSwapd writes file async 822629 0 0 0 KSwapd writes mixed sync 0 0 0 0 KSwapd writes mixed async 0 0 0 0 In 3.2, kswapd was doing a bunch of async writes of pages but reclaim/compaction was never reaching a point where it was doing sync IO. This does not guarantee that reclaim/compaction was not calling wait_on_page_writeback() but I would consider it unlikely. It indicates that merging patches 2 and 3 to stop reclaim/compaction calling wait_on_page_writeback() should be safe. This patch: Lumpy reclaim had a purpose but in the mind of some, it was to kick the system so hard it trashed. For others the purpose was to complicate vmscan.c. Over time it was giving softer shoes and a nicer attitude but memory compaction needs to step up and replace it so this patch sends lumpy reclaim to the farm. The tracepoint format changes for isolating LRU pages with this patch applied. Furthermore reclaim/compaction can no longer queue dirty pages in pageout() if the underlying BDI is congested. Lumpy reclaim used this logic and reclaim/compaction was using it in error. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: remove swap token codeRik van Riel6-170/+2
The swap token code no longer fits in with the current VM model. It does not play well with cgroups or the better NUMA placement code in development, since we have only one swap token globally. It also has the potential to mess with scalability of the system, by increasing the number of non-reclaimable pages on the active and inactive anon LRU lists. Last but not least, the swap token code has been broken for a year without complaints, as reported by Konstantin Khlebnikov. This suggests we no longer have much use for it. The days of sub-1G memory systems with heavy use of swap are over. If we ever need thrashing reducing code in the future, we will have to implement something that does scale. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hughd@google.com> Acked-by: Bob Picco <bpicco@meloft.net> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm, thp: allow fallback when pte_alloc_one() fails for huge pmdDavid Rientjes1-5/+8
The transparent hugepages feature is careful to not invoke the oom killer when a hugepage cannot be allocated. pte_alloc_one() failing in __do_huge_pmd_anonymous_page(), however, currently results in VM_FAULT_OOM which invokes the pagefault oom killer to kill a memory-hogging task. This is unnecessary since it's possible to drop the reference to the hugepage and fallback to allocating a small page. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm, thp: remove unnecessary ret variableDavid Rientjes1-2/+1
The "ret" variable is unnecessary in __do_huge_pmd_anonymous_page(), so remove it. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm/hugetlb.c: use long vars instead of int in region_count()Wang Sheng-Hui1-2/+2
The arguments f & t and fields from & to of struct file_region are defined as long. So use long instead of int to type the temp vars. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm/mempolicy.c: use enum value MPOL_REBIND_ONCE in mpol_rebind_policy()Wang Sheng-Hui1-1/+1
We have enum definition in mempolicy.h: MPOL_REBIND_ONCE. It should replace the magic number 0 for step comparison in function mpol_rebind_policy. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm/memory_failure: let the compiler add the function nameBorislav Petkov1-4/+4
These things tend to get out of sync with time so let the compiler automatically enter the current function name using __func__. No functional change. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Acked-by: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-28Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linuxLinus Torvalds2-2/+3
Pull writeback tree from Wu Fengguang: "Mainly from Jan Kara to avoid iput() in the flusher threads." * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Avoid iput() from flusher thread vfs: Rename end_writeback() to clear_inode() vfs: Move waiting for inode writeback from end_writeback() to evict_inode() writeback: Refactor writeback_single_inode() writeback: Remove wb->list_lock from writeback_single_inode() writeback: Separate inode requeueing after writeback writeback: Move I_DIRTY_PAGES handling writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() writeback: Move clearing of I_SYNC into inode_sync_complete() writeback: initialize global_dirty_limit fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds mm: page-writeback.c: local functions should not be exposed globally
2012-05-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tileLinus Torvalds2-2/+3
Pull tile updates from Chris Metcalf: "These changes cover a range of new arch/tile features and optimizations. They've been through LKML review and on linux-next for a month or so. There's also one bug-fix that just missed 3.4, which I've marked for stable." Fixed up trivial conflict in arch/tile/Kconfig (new added tile Kconfig entries clashing with the generic timer/clockevents changes). * git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile: tile: default to tilegx_defconfig for ARCH=tile tile: fix bug where fls(0) was not returning 0 arch/tile: mark TILEGX as not EXPERIMENTAL tile/mm/fault.c: Port OOM changes to handle_page_fault arch/tile: add descriptive text if the kernel reports a bad trap arch/tile: allow querying cpu module information from the hypervisor arch/tile: fix hardwall for tilegx and generalize for idn and ipi arch/tile: support multiple huge page sizes dynamically mm: add new arch_make_huge_pte() method for tile support arch/tile: support kexec() for tilegx arch/tile: support <asm/cachectl.h> header for cacheflush() syscall arch/tile: Allow tilegx to build with either 16K or 64K page size arch/tile: optimize get_user/put_user and friends arch/tile: support building big-endian kernel arch/tile: allow building Linux with transparent huge pages enabled arch/tile: use interrupt critical sections less
2012-05-25mm: add new arch_make_huge_pte() method for tile supportChris Metcalf1-0/+1
The tile support for multiple-size huge pages requires tagging the hugetlb PTE with a "super" bit for PTEs that are multiples of the basic size of a pagetable span. To set that bit properly we need to tweak the PTe in make_huge_pte() based on the vma. This change provides the API for a subsequent tile-specific change to use. Reviewed-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
2012-05-25arch/tile: allow building Linux with transparent huge pages enabledChris Metcalf1-2/+2
The change adds some infrastructure for managing tile pmd's more generally, using pte_pmd() and pmd_pte() methods to translate pmd values to and from ptes, since on TILEPro a pmd is really just a nested structure holding a pgd (aka pte). Several existing pmd methods are moved into this framework, and a whole raft of additional pmd accessors are defined that are used by the transparent hugepage framework. The tile PTE now has a "client2" bit. The bit is used to indicate a transparent huge page is in the process of being split into subpages. This change also fixes a generic bug where the return value of the generic pmdp_splitting_flush() was incorrect. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
2012-05-25Merge branch 'for-linus' of ↵Linus Torvalds9-216/+675
git://git.linaro.org/people/mszyprowski/linux-dma-mapping Pull CMA and ARM DMA-mapping updates from Marek Szyprowski: "These patches contain two major updates for DMA mapping subsystem (mainly for ARM architecture). First one is Contiguous Memory Allocator (CMA) which makes it possible for device drivers to allocate big contiguous chunks of memory after the system has booted. The main difference from the similar frameworks is the fact that CMA allows to transparently reuse the memory region reserved for the big chunk allocation as a system memory, so no memory is wasted when no big chunk is allocated. Once the alloc request is issued, the framework migrates system pages to create space for the required big chunk of physically contiguous memory. For more information one can refer to nice LWN articles: - 'A reworked contiguous memory allocator': http://lwn.net/Articles/447405/ - 'CMA and ARM': http://lwn.net/Articles/450286/ - 'A deep dive into CMA': http://lwn.net/Articles/486301/ - and the following thread with the patches and links to all previous versions: https://lkml.org/lkml/2012/4/3/204 The main client for this new framework is ARM DMA-mapping subsystem. The second part provides a complete redesign in ARM DMA-mapping subsystem. The core implementation has been changed to use common struct dma_map_ops based infrastructure with the recent updates for new dma attributes merged in v3.4-rc2. This allows to use more than one implementation of dma-mapping calls and change/select them on the struct device basis. The first client of this new infractructure is dmabounce implementation which has been completely cut out of the core, common code. The last patch of this redesign update introduces a new, experimental implementation of dma-mapping calls on top of generic IOMMU framework. This lets ARM sub-platform to transparently use IOMMU for DMA-mapping calls if one provides required IOMMU hardware. For more information please refer to the following thread: http://www.spinics.net/lists/arm-kernel/msg175729.html The last patch merges changes from both updates and provides a resolution for the conflicts which cannot be avoided when patches have been applied on the same files (mainly arch/arm/mm/dma-mapping.c)." Acked by Andrew Morton <akpm@linux-foundation.org>: "Yup, this one please. It's had much work, plenty of review and I think even Russell is happy with it." * 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: (28 commits) ARM: dma-mapping: use PMD size for section unmap cma: fix migration mode ARM: integrate CMA with DMA-mapping subsystem X86: integrate CMA with DMA-mapping subsystem drivers: add Contiguous Memory Allocator mm: trigger page reclaim in alloc_contig_range() to stabilise watermarks mm: extract reclaim code from __alloc_pages_direct_reclaim() mm: Serialize access to min_free_kbytes mm: page_isolation: MIGRATE_CMA isolation functions added mm: mmzone: MIGRATE_CMA migration type added mm: page_alloc: change fallbacks array handling mm: page_alloc: introduce alloc_contig_range() mm: compaction: export some of the functions mm: compaction: introduce isolate_freepages_range() mm: compaction: introduce map_pages() mm: compaction: introduce isolate_migratepages_range() mm: page_alloc: remove trailing whitespace ARM: dma-mapping: add support for IOMMU mapper ARM: dma-mapping: use alloc, mmap, free from dma_ops ARM: dma-mapping: remove redundant code and do the cleanup ... Conflicts: arch/x86/include/asm/dma-mapping.h
2012-05-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-2/+5
Pull more networking updates from David Miller: "Ok, everything from here on out will be bug fixes." 1) One final sync of wireless and bluetooth stuff from John Linville. These changes have all been in his tree for more than a week, and therefore have had the necessary -next exposure. John was just away on a trip and didn't have a change to send the pull request until a day or two ago. 2) Put back some defines in user exposed header file areas that were removed during the tokenring purge. From Stephen Hemminger and Paul Gortmaker. 3) A bug fix for UDP hash table allocation got lost in the pile due to one of those "you got it.. no I've got it.." situations. :-) From Tim Bird. 4) SKB coalescing in TCP needs to have stricter checks, otherwise we'll try to coalesce overlapping frags and crash. Fix from Eric Dumazet. 5) RCU routing table lookups can race with free_fib_info(), causing crashes when we deref the device pointers in the route. Fix by releasing the net device in the RCU callback. From Yanmin Zhang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (293 commits) tcp: take care of overlaps in tcp_try_coalesce() ipv4: fix the rcu race between free_fib_info and ip_route_output_slow mm: add a low limit to alloc_large_system_hash ipx: restore token ring define to include/linux/ipx.h if: restore token ring ARP type to header xen: do not disable netfront in dom0 phy/micrel: Fix ID of KSZ9021 mISDN: Add X-Tensions USB ISDN TA XC-525 gianfar:don't add FCB length to hard_header_len Bluetooth: Report proper error number in disconnection Bluetooth: Create flags for bt_sk() Bluetooth: report the right security level in getsockopt Bluetooth: Lock the L2CAP channel when sending Bluetooth: Restore locking semantics when looking up L2CAP channels Bluetooth: Fix a redundant and problematic incoming MTU check Bluetooth: Add support for Foxconn/Hon Hai AR5BBU22 0489:E03C Bluetooth: Fix EIR data generation for mgmt_device_found Bluetooth: Fix Inquiry with RSSI event mask Bluetooth: improve readability of l2cap_seq_list code Bluetooth: Fix skb length calculation ...
2012-05-24Merge branch 'perf-uprobes-for-linus' of ↵Linus Torvalds2-1/+35
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull user-space probe instrumentation from Ingo Molnar: "The uprobes code originates from SystemTap and has been used for years in Fedora and RHEL kernels. This version is much rewritten, reviews from PeterZ, Oleg and myself shaped the end result. This tree includes uprobes support in 'perf probe' - but SystemTap (and other tools) can take advantage of user probe points as well. Sample usage of uprobes via perf, for example to profile malloc() calls without modifying user-space binaries. First boot a new kernel with CONFIG_UPROBE_EVENT=y enabled. If you don't know which function you want to probe you can pick one from 'perf top' or can get a list all functions that can be probed within libc (binaries can be specified as well): $ perf probe -F -x /lib/libc.so.6 To probe libc's malloc(): $ perf probe -x /lib64/libc.so.6 malloc Added new event: probe_libc:malloc (on 0x7eac0) You can now use it in all perf tools, such as: perf record -e probe_libc:malloc -aR sleep 1 Make use of it to create a call graph (as the flat profile is going to look very boring): $ perf record -e probe_libc:malloc -gR make [ perf record: Woken up 173 times to write data ] [ perf record: Captured and wrote 44.190 MB perf.data (~1930712 $ perf report | less 32.03% git libc-2.15.so [.] malloc | --- malloc 29.49% cc1 libc-2.15.so [.] malloc | --- malloc | |--0.95%-- 0x208eb1000000000 | |--0.63%-- htab_traverse_noresize 11.04% as libc-2.15.so [.] malloc | --- malloc | 7.15% ld libc-2.15.so [.] malloc | --- malloc | 5.07% sh libc-2.15.so [.] malloc | --- malloc | 4.99% python-config libc-2.15.so [.] malloc | --- malloc | 4.54% make libc-2.15.so [.] malloc | --- malloc | |--7.34%-- glob | | | |--93.18%-- 0x41588f | | | --6.82%-- glob | 0x41588f ... Or: $ perf report -g flat | less # Overhead Command Shared Object Symbol # ........ ............. ............. .......... # 32.03% git libc-2.15.so [.] malloc 27.19% malloc 29.49% cc1 libc-2.15.so [.] malloc 24.77% malloc 11.04% as libc-2.15.so [.] malloc 11.02% malloc 7.15% ld libc-2.15.so [.] malloc 6.57% malloc ... The core uprobes design is fairly straightforward: uprobes probe points register themselves at (inode:offset) addresses of libraries/binaries, after which all existing (or new) vmas that map that address will have a software breakpoint injected at that address. vmas are COW-ed to preserve original content. The probe points are kept in an rbtree. If user-space executes the probed inode:offset instruction address then an event is generated which can be recovered from the regular perf event channels and mmap-ed ring-buffer. Multiple probes at the same address are supported, they create a dynamic callback list of event consumers. The basic model is further complicated by the XOL speedup: the original instruction that is probed is copied (in an architecture specific fashion) and executed out of line when the probe triggers. The XOL area is a single vma per process, with a fixed number of entries (which limits probe execution parallelism). The API: uprobes are installed/removed via /sys/kernel/debug/tracing/uprobe_events, the API is integrated to align with the kprobes interface as much as possible, but is separate to it. Injecting a probe point is privileged operation, which can be relaxed by setting perf_paranoid to -1. You can use multiple probes as well and mix them with kprobes and regular PMU events or tracepoints, when instrumenting a task." Fix up trivial conflicts in mm/memory.c due to previous cleanup of unmap_single_vma(). * 'perf-uprobes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) perf probe: Detect probe target when m/x options are absent perf probe: Provide perf interface for uprobes tracing: Fix kconfig warning due to a typo tracing: Provide trace events interface for uprobes tracing: Extract out common code for kprobes/uprobes trace events tracing: Modify is_delete, is_return from int to bool uprobes/core: Decrement uprobe count before the pages are unmapped uprobes/core: Make background page replacement logic account for rss_stat counters uprobes/core: Optimize probe hits with the help of a counter uprobes/core: Allocate XOL slots for uprobes use uprobes/core: Handle breakpoint and singlestep exceptions uprobes/core: Rename bkpt to swbp uprobes/core: Make order of function parameters consistent across functions uprobes/core: Make macro names consistent uprobes: Update copyright notices uprobes/core: Move insn to arch specific structure uprobes/core: Remove uprobe_opcode_sz uprobes/core: Make instruction tables volatile uprobes: Move to kernel/events/ uprobes/core: Clean up, refactor and improve the code ...
2012-05-24mm: add a low limit to alloc_large_system_hashTim Bird1-2/+5
UDP stack needs a minimum hash size value for proper operation and also uses alloc_large_system_hash() for proper NUMA distribution of its hash tables and automatic sizing depending on available system memory. On some low memory situations, udp_table_init() must ignore the alloc_large_system_hash() result and reallocs a bigger memory area. As we cannot easily free old hash table, we leak it and kmemleak can issue a warning. This patch adds a low limit parameter to alloc_large_system_hash() to solve this problem. We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table allocation. Reported-by: Mark Asselstine <mark.asselstine@windriver.com> Reported-by: Tim Bird <tim.bird@am.sony.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-23mm: mempolicy: Let vma_merge and vma_split handle vma->vm_policy linkagesMel Gorman1-24/+17
Dave Jones' system call fuzz testing tool "trinity" triggered the following bug error with slab debugging enabled ============================================================================= BUG numa_policy (Not tainted): Poison overwritten ----------------------------------------------------------------------------- INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154 __slab_alloc+0x3d3/0x445 kmem_cache_alloc+0x29d/0x2b0 mpol_new+0xa3/0x140 sys_mbind+0x142/0x620 system_call_fastpath+0x16/0x1b INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154 __slab_free+0x2e/0x1de kmem_cache_free+0x25a/0x260 __mpol_put+0x27/0x30 remove_vma+0x68/0x90 exit_mmap+0x118/0x140 mmput+0x73/0x110 exit_mm+0x108/0x130 do_exit+0x162/0xb90 do_group_exit+0x4f/0xc0 sys_exit_group+0x17/0x20 system_call_fastpath+0x16/0x1b INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x (null) flags=0x20000000004080 INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0 This implied a reference counting bug and the problem happened during mbind(). mbind() applies a new memory policy to a range and uses mbind_range() to merge existing VMAs or split them as necessary. In the event of splits, mpol_dup() will allocate a new struct mempolicy and maintain existing reference counts whose rules are documented in Documentation/vm/numa_memory_policy.txt . The problem occurs with shared memory policies. The vm_op->set_policy increments the reference count if necessary and split_vma() and vma_merge() have already handled the existing reference counts. However, policy_vma() screws it up by replacing an existing vma->vm_policy with one that potentially has the wrong reference count leading to a premature free. This patch removes the damage caused by policy_vma(). With this patch applied Dave's trinity tool runs an mbind test for 5 minutes without error. /proc/slabinfo reported that there are no numa_policy or shared_policy_node objects allocated after the test completed and the shared memory region was deleted. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Jones <davej@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Stephen Wilson <wilsons@start.ca> Cc: Christoph Lameter <cl@linux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-23Merge branch 'for-linus' of ↵Linus Torvalds4-12/+22
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace enhancements from Eric Biederman: "This is a course correction for the user namespace, so that we can reach an inexpensive, maintainable, and reasonably complete implementation. Highlights: - Config guards make it impossible to enable the user namespace and code that has not been converted to be user namespace safe. - Use of the new kuid_t type ensures the if you somehow get past the config guards the kernel will encounter type errors if you enable user namespaces and attempt to compile in code whose permission checks have not been updated to be user namespace safe. - All uids from child user namespaces are mapped into the initial user namespace before they are processed. Removing the need to add an additional check to see if the user namespace of the compared uids remains the same. - With the user namespaces compiled out the performance is as good or better than it is today. - For most operations absolutely nothing changes performance or operationally with the user namespace enabled. - The worst case performance I could come up with was timing 1 billion cache cold stat operations with the user namespace code enabled. This went from 156s to 164s on my laptop (or 156ns to 164ns per stat operation). - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value. Most uid/gid setting system calls treat these value specially anyway so attempting to use -1 as a uid would likely cause entertaining failures in userspace. - If setuid is called with a uid that can not be mapped setuid fails. I have looked at sendmail, login, ssh and every other program I could think of that would call setuid and they all check for and handle the case where setuid fails. - If stat or a similar system call is called from a context in which we can not map a uid we lie and return overflowuid. The LFS experience suggests not lying and returning an error code might be better, but the historical precedent with uids is different and I can not think of anything that would break by lying about a uid we can't map. - Capabilities are localized to the current user namespace making it safe to give the initial user in a user namespace all capabilities. My git tree covers all of the modifications needed to convert the core kernel and enough changes to make a system bootable to runlevel 1." Fix up trivial conflicts due to nearby independent changes in fs/stat.c * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits) userns: Silence silly gcc warning. cred: use correct cred accessor with regards to rcu read lock userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq userns: Convert cgroup permission checks to use uid_eq userns: Convert tmpfs to use kuid and kgid where appropriate userns: Convert sysfs to use kgid/kuid where appropriate userns: Convert sysctl permission checks to use kuid and kgids. userns: Convert proc to use kuid/kgid where appropriate userns: Convert ext4 to user kuid/kgid where appropriate userns: Convert ext3 to use kuid/kgid where appropriate userns: Convert ext2 to use kuid/kgid where appropriate. userns: Convert devpts to use kuid/kgid where appropriate userns: Convert binary formats to use kuid/kgid where appropriate userns: Add negative depends on entries to avoid building code that is userns unsafe userns: signal remove unnecessary map_cred_ns userns: Teach inode_capable to understand inodes whose uids map to other namespaces. userns: Fail exec for suid and sgid binaries with ids outside our user namespace. userns: Convert stat to return values mapped from kuids and kgids userns: Convert user specfied uids and gids in chown into kuids and kgid userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs ...
2012-05-22Merge branch 'for-3.5' of ↵Linus Torvalds1-63/+52
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "cgroup file type addition / removal is updated so that file types are added and removed instead of individual files so that dynamic file type addition / removal can be implemented by cgroup and used by controllers. blkio controller changes which will come through block tree are dependent on this. Other changes include res_counter cleanup and disallowing kthread / PF_THREAD_BOUND threads to be attached to non-root cgroups. There's a reported bug with the file type addition / removal handling which can lead to oops on cgroup umount. The issue is being looked into. It shouldn't cause problems for most setups and isn't a security concern." Fix up trivial conflict in Documentation/feature-removal-schedule.txt * 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits) res_counter: Account max_usage when calling res_counter_charge_nofail() res_counter: Merge res_counter_charge and res_counter_charge_nofail cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads cgroup: remove cgroup_subsys->populate() cgroup: get rid of populate for memcg cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg cgroup: make css->refcnt clearing on cgroup removal optional cgroup: use negative bias on css->refcnt to block css_tryget() cgroup: implement cgroup_rm_cftypes() cgroup: introduce struct cfent cgroup: relocate __d_cgrp() and __d_cft() cgroup: remove cgroup_add_file[s]() cgroup: convert memcg controller to the new cftype interface memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP cgroup: convert all non-memcg controllers to the new cftype interface cgroup: relocate cftype and cgroup_subsys definitions in controllers cgroup: merge cft_release_agent cftype array into the base files array cgroup: implement cgroup_add_cftypes() and friends cgroup: build list of all cgroups under a given cgroupfs_root cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir() ...