summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2014-06-09Merge branch 'for-3.16' of ↵Linus Torvalds2-108/+117
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on cgroup side. Heavy restructuring including locking simplification took place to improve the code base and enable implementation of the unified hierarchy, which currently exists behind a __DEVEL__ mount option. The core support is mostly complete but individual controllers need further work. To explain the design and rationales of the the unified hierarchy Documentation/cgroups/unified-hierarchy.txt is added. Another notable change is css (cgroup_subsys_state - what each controller uses to identify and interact with a cgroup) iteration update. This is part of continuing updates on css object lifetime and visibility. cgroup started with reference count draining on removal way back and is now reaching a point where csses behave and are iterated like normal refcnted objects albeit with some complexities to allow distinguishing the state where they're being deleted. The css iteration update isn't taken advantage of yet but is planned to be used to simplify memcg significantly" * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits) cgroup: disallow disabled controllers on the default hierarchy cgroup: don't destroy the default root cgroup: disallow debug controller on the default hierarchy cgroup: clean up MAINTAINERS entries cgroup: implement css_tryget() device_cgroup: use css_has_online_children() instead of has_children() cgroup: convert cgroup_has_live_children() into css_has_online_children() cgroup: use CSS_ONLINE instead of CGRP_DEAD cgroup: iterate cgroup_subsys_states directly cgroup: introduce CSS_RELEASED and reduce css iteration fallback window cgroup: move cgroup->serial_nr into cgroup_subsys_state cgroup: link all cgroup_subsys_states in their sibling lists cgroup: move cgroup->sibling and ->children into cgroup_subsys_state cgroup: remove cgroup->parent device_cgroup: remove direct access to cgroup->children memcg: update memcg_has_children() to use css_next_child() memcg: remove tasks/children test from mem_cgroup_force_empty() cgroup: remove css_parent() cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css cgroup: use cgroup->self.refcnt for cgroup refcnting ...
2014-06-08Don't trigger congestion wait on dirty-but-not-writeout pagesLinus Torvalds1-7/+5
shrink_inactive_list() used to wait 0.1s to avoid congestion when all the pages that were isolated from the inactive list were dirty but not under active writeback. That makes no real sense, and apparently causes major interactivity issues under some loads since 3.11. The ostensible reason for it was to wait for kswapd to start writing pages, but that seems questionable as well, since the congestion wait code seems to trigger for kswapd itself as well. Also, the logic behind delaying anything when we haven't actually started writeback is not clear - it only delays actually starting that writeback. We'll still trigger the congestion waiting if (a) the process is kswapd, and we hit pages flagged for immediate reclaim (b) the process is not kswapd, and the zone backing dev writeback is actually congested. This probably needs to be revisited, but as it is this fixes a reported regression. Reported-by: Felipe Contreras <felipe.contreras@gmail.com> Pinpointed-by: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-08Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-5/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Clean ups and miscellaneous bug fixes, in particular for the new collapse_range and zero_range fallocate functions. In addition, improve the scalability of adding and remove inodes from the orphan list" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits) ext4: handle symlink properly with inline_data ext4: fix wrong assert in ext4_mb_normalize_request() ext4: fix zeroing of page during writeback ext4: remove unused local variable "stored" from ext4_readdir(...) ext4: fix ZERO_RANGE test failure in data journalling ext4: reduce contention on s_orphan_lock ext4: use sbi in ext4_orphan_{add|del}() ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged() ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access ext4: remove unnecessary double parentheses ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails ext4: make local functions static ext4: fix block bitmap validation when bigalloc, ^flex_bg ext4: fix block bitmap initialization under sparse_super2 ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem ext4: avoid unneeded lookup when xattr name is invalid ext4: fix data integrity sync in ordered mode ext4: remove obsoleted check ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode ext4: fix locking for O_APPEND writes ...
2014-06-08Merge branch 'next' (accumulated 3.16 merge window patches) into masterLinus Torvalds47-2724/+2974
Now that 3.15 is released, this merges the 'next' branch into 'master', bringing us to the normal situation where my 'master' branch is the merge window. * accumulated work in next: (6809 commits) ufs: sb mutex merge + mutex_destroy powerpc: update comments for generic idle conversion cris: update comments for generic idle conversion idle: remove cpu_idle() forward declarations nbd: zero from and len fields in NBD_CMD_DISCONNECT. mm: convert some level-less printks to pr_* MAINTAINERS: adi-buildroot-devel is moderated MAINTAINERS: add linux-api for review of API/ABI changes mm/kmemleak-test.c: use pr_fmt for logging fs/dlm/debug_fs.c: replace seq_printf by seq_puts fs/dlm/lockspace.c: convert simple_str to kstr fs/dlm/config.c: convert simple_str to kstr mm: mark remap_file_pages() syscall as deprecated mm: memcontrol: remove unnecessary memcg argument from soft limit functions mm: memcontrol: clean up memcg zoneinfo lookup mm/memblock.c: call kmemleak directly from memblock_(alloc|free) mm/mempool.c: update the kmemleak stack trace for mempool allocations lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations mm: introduce kmemleak_update_trace() mm/kmemleak.c: use %u to print ->checksum ...
2014-06-06mm: convert some level-less printks to pr_*Mitchel Humpherys4-12/+24
printk is meant to be used with an associated log level. There are some instances of printk scattered around the mm code where the log level is missing. Add a log level and adhere to suggestions by scripts/checkpatch.pl by moving to the pr_* macros. Also add the typical pr_fmt definition so that print statements can be easily traced back to the modules where they occur, correlated one with another, etc. This will require the removal of some (now redundant) prefixes on a few print statements. Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06mm/kmemleak-test.c: use pr_fmt for loggingFabian Frederick1-17/+19
Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06mm: mark remap_file_pages() syscall as deprecatedKirill A. Shutemov1-0/+4
The remap_file_pages() system call is used to create a nonlinear mapping, that is, a mapping in which the pages of the file are mapped into a nonsequential order in memory. The advantage of using remap_file_pages() over using repeated calls to mmap(2) is that the former approach does not require the kernel to create additional VMA (Virtual Memory Area) data structures. Supporting of nonlinear mapping requires significant amount of non-trivial code in kernel virtual memory subsystem including hot paths. Also to get nonlinear mapping work kernel need a way to distinguish normal page table entries from entries with file offset (pte_file). Kernel reserves flag in PTE for this purpose. PTE flags are scarce resource especially on some CPU architectures. It would be nice to free up the flag for other usage. Fortunately, there are not many users of remap_file_pages() in the wild. It's only known that one enterprise RDBMS implementation uses the syscall on 32-bit systems to map files bigger than can linearly fit into 32-bit virtual address space. This use-case is not critical anymore since 64-bit systems are widely available. The plan is to deprecate the syscall and replace it with an emulation. The emulation will create new VMAs instead of nonlinear mappings. It's going to work slower for rare users of remap_file_pages() but ABI is preserved. One side effect of emulation (apart from performance) is that user can hit vm.max_map_count limit more easily due to additional VMAs. See comment for DEFAULT_MAX_MAP_COUNT for more details on the limit. [akpm@linux-foundation.org: fix spello] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dave Jones <davej@redhat.com> Cc: Armin Rigo <arigo@tunes.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06mm: memcontrol: remove unnecessary memcg argument from soft limit functionsJohannes Weiner1-20/+14
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06mm: memcontrol: clean up memcg zoneinfo lookupJianyu Zhan1-50/+39
Memcg zoneinfo lookup sites have either the page, the zone, or the node id and zone index, but sites that only have the zone have to look up the node id and zone index themselves, whereas sites that already have those two integers use a function for a simple pointer chase. Provide mem_cgroup_zone_zoneinfo() that takes a zone pointer and let sites that already have node id and zone index - all for each node, for each zone iterators - use &memcg->nodeinfo[nid]->zoneinfo[zid]. Rename page_cgroup_zoneinfo() to mem_cgroup_page_zoneinfo() to match. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06mm/memblock.c: call kmemleak directly from memblock_(alloc|free)Catalin Marinas2-4/+8
Kmemleak could ignore memory blocks allocated via memblock_alloc() leading to false positives during scanning. This patch adds the corresponding callbacks and removes kmemleak_free_* calls in mm/nobootmem.c to avoid duplication. The kmemleak_alloc() in mm/nobootmem.c is kept since __alloc_memory_core_early() does not use memblock_alloc() directly. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06mm/mempool.c: update the kmemleak stack trace for mempool allocationsCatalin Marinas1-0/+6
When mempool_alloc() returns an existing pool object, kmemleak_alloc() is no longer called and the stack trace corresponds to the original object allocation. This patch updates the kmemleak allocation stack trace for such objects to make it more useful for debugging. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06mm: introduce kmemleak_update_trace()Catalin Marinas1-0/+34
The memory allocation stack trace is not always useful for debugging a memory leak (e.g. radix_tree_preload). This function, when called, updates the stack trace for an already allocated object. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06mm/kmemleak.c: use %u to print ->checksumJianpeng Ma1-1/+1
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06vmscan: memcg: always use swappiness of the reclaimed memcgMichal Hocko2-11/+9
Memory reclaim always uses swappiness of the reclaim target memcg (origin of the memory pressure) or vm_swappiness for global memory reclaim. This behavior was consistent (except for difference between global and hard limit reclaim) because swappiness was enforced to be consistent within each memcg hierarchy. After "mm: memcontrol: remove hierarchy restrictions for swappiness and oom_control" each memcg can have its own swappiness independent of hierarchical parents, though, so the consistency guarantee is gone. This can lead to an unexpected behavior. Say that a group is explicitly configured to not swapout by memory.swappiness=0 but its memory gets swapped out anyway when the memory pressure comes from its parent with a It is also unexpected that the knob is meaningless without setting the hard limit which would trigger the reclaim and enforce the swappiness. There are setups where the hard limit is configured higher in the hierarchy by an administrator and children groups are under control of somebody else who is interested in the swapout behavior but not necessarily about the memory limit. From a semantic point of view swappiness is an attribute defining anon vs. file proportional scanning of LRU which is memcg specific (unlike charges which are propagated up the hierarchy) so it should be applied to the particular memcg's LRU regardless where the memory pressure comes from. This patch removes vmscan_swappiness() and stores the swappiness into the scan_control structure. mem_cgroup_swappiness is then used to provide the correct value before shrink_lruvec is called. The global vm_swappiness is used for the root memcg. [hughd@google.com: oopses immediately when booted with cgroup_disable=memory] Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06mm: convert use of typedef ctl_table to struct ctl_tableJoe Perches2-7/+7
This typedef is unnecessary and should just be removed. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06slub: search partial list on numa_mem_id(), instead of numa_node_id()Joonsoo Kim1-1/+1
Currently, if allocation constraint to node is NUMA_NO_NODE, we search a partial slab on numa_node_id() node. This doesn't work properly on a system having memoryless nodes, since it can have no memory on that node so there must be no partial slab on that node. On that node, page allocation always falls back to numa_mem_id() first. So searching a partial slab on numa_node_id() in that case is the proper solution for the memoryless node case. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Han Pingtian <hanpt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06mm: vmscan: clear kswapd's special reclaim powers before exitingJohannes Weiner1-0/+3
When kswapd exits, it can end up taking locks that were previously held by allocating tasks while they waited for reclaim. Lockdep currently warns about this: On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote: > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage. > kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes: > (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130 > {RECLAIM_FS-ON-W} state was registered at: > mark_held_locks+0xb9/0x140 > lockdep_trace_alloc+0x7a/0xe0 > kmem_cache_alloc_trace+0x37/0x240 > flex_array_alloc+0x99/0x1a0 > cgroup_attach_task+0x63/0x430 > attach_task_by_pid+0x210/0x280 > cgroup_procs_write+0x16/0x20 > cgroup_file_write+0x120/0x2c0 > vfs_write+0xc0/0x1f0 > SyS_write+0x4c/0xa0 > tracesys+0xdd/0xe2 > irq event stamp: 49 > hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70 > hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0 > softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0 > softirqs last disabled at (0): (null) > > other info that might help us debug this: > Possible unsafe locking scenario: > > CPU0 > ---- > lock(&sig->group_rwsem); > <Interrupt> > lock(&sig->group_rwsem); > > *** DEADLOCK *** > > no locks held by kswapd2/1151. > > stack backtrace: > CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4 > Call Trace: > dump_stack+0x19/0x1b > print_usage_bug+0x1f7/0x208 > mark_lock+0x21d/0x2a0 > __lock_acquire+0x52a/0xb60 > lock_acquire+0xa2/0x140 > down_read+0x51/0xa0 > exit_signals+0x24/0x130 > do_exit+0xb5/0xa50 > kthread+0xdb/0x100 > ret_from_fork+0x7c/0xb0 This is because the kswapd thread is still marked as a reclaimer at the time of exit. But because it is exiting, nobody is actually waiting on it to make reclaim progress anymore, and it's nothing but a regular thread at this point. Be tidy and strip it of all its powers (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state) before returning from the thread function. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06mm: add !pte_present() check on existing hugetlb_entry callbacksNaoya Horiguchi1-1/+5
The age table walker doesn't check non-present hugetlb entry in common path, so hugetlb_entry() callbacks must check it. The reason for this behavior is that some callers want to handle it in its own way. [ I think that reason is bogus, btw - it should just do what the regular code does, which is to call the "pte_hole()" function for such hugetlb entries - Linus] However, some callers don't check it now, which causes unpredictable result, for example when we have a race between migrating hugepage and reading /proc/pid/numa_maps. This patch fixes it by adding !pte_present checks on buggy callbacks. This bug exists for years and got visible by introducing hugepage migration. ChangeLog v2: - fix if condition (check !pte_present() instead of pte_present()) Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Backported to 3.15. Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org> ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06mm: rmap: fix use-after-free in __put_anon_vmaAndrey Ryabinin1-2/+1
While working address sanitizer for kernel I've discovered use-after-free bug in __put_anon_vma. For the last anon_vma, anon_vma->root freed before child anon_vma. Later in anon_vma_free(anon_vma) we are referencing to already freed anon_vma->root to check rwsem. This fixes it by freeing the child anon_vma before freeing anon_vma->root. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> # v3.0+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05Merge branch 'x86/vdso' of ↵Linus Torvalds1-29/+60
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next Pull x86 cdso updates from Peter Anvin: "Vdso cleanups and improvements largely from Andy Lutomirski. This makes the vdso a lot less ''special''" * 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/vdso, build: Make LE access macros clearer, host-safe x86/vdso, build: Fix cross-compilation from big-endian architectures x86/vdso, build: When vdso2c fails, unlink the output x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls x86, mm: Improve _install_special_mapping and fix x86 vdso naming mm, fs: Add vm_ops->name as an alternative to arch_vma_name x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET x86, vdso: Remove vestiges of VDSO_PRELINK and some outdated comments x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO x86, vdso: Move the 32-bit vdso special pages after the text x86, vdso: Reimplement vdso.so preparation in build-time C x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c x86, vdso: Clean up 32-bit vs 64-bit vdso params x86, mm: Ensure correct alignment of the fixmap
2014-06-04mm/zswap: NUMA aware allocation for zswap_dstmemEric Dumazet1-1/+1
zswap_dstmem is a percpu block of memory, which should be allocated using kmalloc_node(), to get better NUMA locality. Without it, all the blocks are allocated from a single node. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Seth Jennings <sjennings@variantweb.net> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/zsmalloc: make zsmalloc module-buildableMinchan Kim1-1/+1
Now, we can build zsmalloc as module because unmap_kernel_range was exported. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/vmalloc.c: export unmap_kernel_range()Minchan Kim1-0/+1
zsmalloc needs exported unmap_kernel_range for building as a module. See https://lkml.org/lkml/2013/1/18/487 I didn't send a patch to make unmap_kernel_range exportable at that time because zram was staging stuff and I thought VM function exporting for staging stuff makes no sense. Now zsmalloc was promoted. If we can't build zsmalloc as module, it means we can't build zram as module, either. Additionally, buddy map_vm_area is already exported so let's export unmap_kernel_range to help his buddy. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04zsmalloc: fixup trivial zs size classes value in commentsWeijie Yang1-1/+1
According to calculation, ZS_SIZE_CLASSES value is 255 on systems with 4K page size, not 254. The old value may forget count the ZS_MIN_ALLOC_SIZE in. This patch fixes this trivial issue in the comments. Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/zbud.c: make size unsigned like unique callsiteFabian Frederick1-2/+2
zbud_alloc is only called by zswap_frontswap_store with unsigned int len. Change function parameter + update >= 0 check. Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm, memcg: periodically schedule when emptying page listHugh Dickins1-1/+1
mem_cgroup_force_empty_list() can iterate a large number of pages on an lru and mem_cgroup_move_parent() doesn't return an errno unless certain criteria, none of which indicate that the iteration may be taking too long, is met. We have encountered the following stack trace many times indicating "need_resched set for > 51000020 ns (51 ticks) without schedule", for example: scheduler_tick() <timer irq> mem_cgroup_move_account+0x4d/0x1d5 mem_cgroup_move_parent+0x8d/0x109 mem_cgroup_reparent_charges+0x149/0x2ba mem_cgroup_css_offline+0xeb/0x11b cgroup_offline_fn+0x68/0x16b process_one_work+0x129/0x350 If this iteration is taking too long, we still need to do cond_resched() even when an individual page is not busy. [rientjes@google.com: changelog] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/memory-failure.c: support use of a dedicated thread to handle ↵Naoya Horiguchi1-13/+43
SIGBUS(BUS_MCEERR_AO) Currently memory error handler handles action optional errors in the deferred manner by default. And if a recovery aware application wants to handle it immediately, it can do it by setting PF_MCE_EARLY flag. However, such signal can be sent only to the main thread, so it's problematic if the application wants to have a dedicated thread to handler such signals. So this patch adds dedicated thread support to memory error handler. We have PF_MCE_EARLY flags for each thread separately, so with this patch AO signal is sent to the thread with PF_MCE_EARLY flag set, not the main thread. If you want to implement a dedicated thread, you call prctl() to set PF_MCE_EARLY on the thread. Memory error handler collects processes to be killed, so this patch lets it check PF_MCE_EARLY flag on each thread in the collecting routines. No behavioral change for all non-early kill cases. Tony said: : The old behavior was crazy - someone with a multithreaded process might : well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then : that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if : that thread wasn't the main thread for the process. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: Kamil Iskra <iskra@mcs.anl.gov> Cc: Andi Kleen <andi@firstfloor.org> Cc: Borislav Petkov <bp@suse.de> Cc: Chen Gong <gong.chen@linux.jf.intel.com> Cc: <stable@vger.kernel.org> [3.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/memory-failure.c: don't let collect_procs() skip over processes for ↵Tony Luck1-9/+12
MF_ACTION_REQUIRED When Linux sees an "action optional" machine check (where h/w has reported an error that is not in the current execution path) we generally do not want to signal a process, since most processes do not have a SIGBUS handler - we'd just prematurely terminate the process for a problem that they might never actually see. task_early_kill() decides whether to consider a process - and it checks whether this specific process has been marked for early signals with "prctl", or if the system administrator has requested early signals for all processes using /proc/sys/vm/memory_failure_early_kill. But for MF_ACTION_REQUIRED case we must not defer. The error is in the execution path of the current thread so we must send the SIGBUS immediatley. Fix by passing a flag argument through collect_procs*() to task_early_kill() so it knows whether we can defer or must take action. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Borislav Petkov <bp@suse.de> Cc: Chen Gong <gong.chen@linux.jf.intel.com> Cc: <stable@vger.kernel.org> [3.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/memory-failure.c-failure: send right signal code to correct threadTony Luck1-2/+2
When a thread in a multi-threaded application hits a machine check because of an uncorrectable error in memory - we want to send the SIGBUS with si.si_code = BUS_MCEERR_AR to that thread. Currently we fail to do that if the active thread is not the primary thread in the process. collect_procs() just finds primary threads and this test: if ((flags & MF_ACTION_REQUIRED) && t == current) { will see that the thread we found isn't the current thread and so send a si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active thread at this time). We can fix this by checking whether "current" shares the same mm with the process that collect_procs() said owned the page. If so, we send the SIGBUS to current (with code BUS_MCEERR_AR). Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Otto Bruggeman <otto.g.bruggeman@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Borislav Petkov <bp@suse.de> Cc: Chen Gong <gong.chen@linux.jf.intel.com> Cc: <stable@vger.kernel.org> [3.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/page-writeback.c: remove outdated commentJianyu Zhan1-18/+0
There is an orphaned prehistoric comment , which used to be against get_dirty_limits(), the dawn of global_dirtyable_memory(). Back then, the implementation of get_dirty_limits() is complicated and full of magic numbers, so this comment is necessary. But we now use the clear and neat global_dirtyable_memory(), which renders this comment ambiguous and useless. Remove it. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/swapfile.c: delete the "last_in_cluster < scan_base" loop in the body of ↵Chen Yucong1-26/+3
scan_swap_map() Via commit ebc2a1a69111 ("swap: make cluster allocation per-cpu"), we can find that all SWP_SOLIDSTATE "seek is cheap"(SSD case) has already gone to si->cluster_info scan_swap_map_try_ssd_cluster() route. So that the "last_in_cluster < scan_base" loop in the body of scan_swap_map() has already become a dead code snippet, and it should have been deleted. This patch is to delete the redundant loop as Hugh and Shaohua suggested. [hughd@google.com: fix comment, simplify code] Signed-off-by: Chen Yucong <slaoub@gmail.com> Cc: Shaohua Li <shli@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04hugetlb: rename hugepage_migration_support() to ..._supported()Naoya Horiguchi2-2/+2
We already have a function named hugepages_supported(), and the similar name hugepage_migration_support() is a bit unconfortable, so let's rename it hugepage_migration_supported(). Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm: document do_fault_around() featureKirill A. Shutemov1-0/+27
Some clarification on how faultaround works. [akpm@linux-foundation.org: tweak comment text] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm: nominate faultaround area in bytes rather than page orderKirill A. Shutemov1-39/+23
There is evidencs that the faultaround feature is less relevant on architectures with page size bigger then 4k. Which makes sense since page fault overhead per byte of mapped area should be less there. Let's rework the feature to specify faultaround area in bytes instead of page order. It's 64 kilobytes for now. The patch effectively disables faultaround on architectures with page size >= 64k (like ppc64). It's possible that some other size of faultaround area is relevant for a platform. We can expose `fault_around_bytes' variable to arch-specific code once such platforms will be found. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Hugh Dickins <hughd@google.com> Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/page_alloc.c: cleanup add_active_range() related commentsZhang Zhen1-13/+8
add_active_range() has been repalced by memblock_set_node(). Clean up the comments to comply with that change. Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/rmap.c: cleanup ttu_flagsKonstantin Khlebnikov1-5/+5
Transform action part of ttu_flags into individiual bits. These flags aren't part of any uses-space visible api or even trace events. Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/rmap.c: don't call mmu_notifier_invalidate_page() during munlockKonstantin Khlebnikov1-1/+1
In its munmap mode, try_to_unmap_one() searches other mlocked vmas, it never unmaps pages. There is no reason for invalidation because ptes are left unchanged. Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/process_vm_access: move config option into init/KconfigKonstantin Khlebnikov1-10/+0
CONFIG_CROSS_MEMORY_ATTACH adds couple syscalls: process_vm_readv and process_vm_writev, it's a kind of IPC for copying data between processes. Currently this option is placed inside "Processor type and features". This patch moves it into "General setup" (where all other arch-independed syscalls and ipc features are placed) and changes prompt string to less cryptic. Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Christopher Yeoh <cyeoh@au1.ibm.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Hugh Dickins <hughd@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm: vmscan: use proportional scanning during direct reclaim and full scan at ↵Mel Gorman1-11/+25
DEF_PRIORITY Commit "mm: vmscan: obey proportional scanning requirements for kswapd" ensured that file/anon lists were scanned proportionally for reclaim from kswapd but ignored it for direct reclaim. The intent was to minimse direct reclaim latency but Yuanhan Liu pointer out that it substitutes one long stall for many small stalls and distorts aging for normal workloads like streaming readers/writers. Hugh Dickins pointed out that a side-effect of the same commit was that when one LRU list dropped to zero that the entirety of the other list was shrunk leading to excessive reclaim in memcgs. This patch scans the file/anon lists proportionally for direct reclaim to similarly age page whether reclaimed by kswapd or direct reclaim but takes care to abort reclaim if one LRU drops to zero after reclaiming the requested number of pages. Based on ext4 and using the Intel VM scalability test 3.15.0-rc5 3.15.0-rc5 shrinker proportion Unit lru-file-readonce elapsed 5.3500 ( 0.00%) 5.4200 ( -1.31%) Unit lru-file-readonce time_range 0.2700 ( 0.00%) 0.1400 ( 48.15%) Unit lru-file-readonce time_stddv 0.1148 ( 0.00%) 0.0536 ( 53.33%) Unit lru-file-readtwice elapsed 8.1700 ( 0.00%) 8.1700 ( 0.00%) Unit lru-file-readtwice time_range 0.4300 ( 0.00%) 0.2300 ( 46.51%) Unit lru-file-readtwice time_stddv 0.1650 ( 0.00%) 0.0971 ( 41.16%) The test cases are running multiple dd instances reading sparse files. The results are within the noise for the small test machine. The impact of the patch is more noticable from the vmstats 3.15.0-rc5 3.15.0-rc5 shrinker proportion Minor Faults 35154 36784 Major Faults 611 1305 Swap Ins 394 1651 Swap Outs 4394 5891 Allocation stalls 118616 44781 Direct pages scanned 4935171 4602313 Kswapd pages scanned 15921292 16258483 Kswapd pages reclaimed 15913301 16248305 Direct pages reclaimed 4933368 4601133 Kswapd efficiency 99% 99% Kswapd velocity 670088.047 682555.961 Direct efficiency 99% 99% Direct velocity 207709.217 193212.133 Percentage direct scans 23% 22% Page writes by reclaim 4858.000 6232.000 Page writes file 464 341 Page writes anon 4394 5891 Note that there are fewer allocation stalls even though the amount of direct reclaim scanning is very approximately the same. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Jan Kara <jack@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm: fix typo in comment in do_fault_around()Kirill A. Shutemov1-1/+1
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/msync.c: sync only the requested range in msync()Matthew Wilcox1-1/+7
msync() currently syncs more than POSIX requires or BSD or Solaris implement. It is supposed to be equivalent to fdatasync(), not fsync(), and it is only supposed to sync the portion of the file that overlaps the range passed to msync. If the VMA is non-linear, fall back to syncing the entire file, but we still optimise to only fdatasync() the entire file, not the full fsync(). akpm: there are obvious concerns with bck-compatibility: is anyone relying on the undocumented side-effect for their data integrity? And how would they ever know if this change broke their data integrity? We think the risk is reasonably low, and this patch brings the kernel into line with other OS's and with what the manpage has always said... Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Chris Mason <clm@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm, compaction: properly signal and act upon lock and need_sched() contentionVlastimil Babka2-11/+48
Compaction uses compact_checklock_irqsave() function to periodically check for lock contention and need_resched() to either abort async compaction, or to free the lock, schedule and retake the lock. When aborting, cc->contended is set to signal the contended state to the caller. Two problems have been identified in this mechanism. First, compaction also calls directly cond_resched() in both scanners when no lock is yet taken. This call either does not abort async compaction, or set cc->contended appropriately. This patch introduces a new compact_should_abort() function to achieve both. In isolate_freepages(), the check frequency is reduced to once by SWAP_CLUSTER_MAX pageblocks to match what the migration scanner does in the preliminary page checks. In case a pageblock is found suitable for calling isolate_freepages_block(), the checks within there are done on higher frequency. Second, isolate_freepages() does not check if isolate_freepages_block() aborted due to contention, and advances to the next pageblock. This violates the principle of aborting on contention, and might result in pageblocks not being scanned completely, since the scanning cursor is advanced. This problem has been noticed in the code by Joonsoo Kim when reviewing related patches. This patch makes isolate_freepages_block() check the cc->contended flag and abort. In case isolate_freepages() has already isolated some pages before aborting due to contention, page migration will proceed, which is OK since we do not want to waste the work that has been done, and page migration has own checks for contention. However, we do not want another isolation attempt by either of the scanners, so cc->contended flag check is added also to compaction_alloc() and compact_finished() to make sure compaction is aborted right after the migration. The outcome of the patch should be reduced lock contention by async compaction and lower latencies for higher-order allocations where direct compaction is involved. [akpm@linux-foundation.org: fix typo in comment] Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Tested-by: Shawn Guo <shawn.guo@linaro.org> Tested-by: Kevin Hilman <khilman@linaro.org> Tested-by: Stephen Warren <swarren@nvidia.com> Tested-by: Fabio Estevam <fabio.estevam@freescale.com> Cc: David Rientjes <rientjes@google.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/vmscan.c: use DIV_ROUND_UP for calculation of zone's balance_gap and ↵Jianyu Zhan1-6/+4
correct comments. Currently, we use (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) / KSWAPD_ZONE_BALANCE_GAP_RATIO to avoid a zero gap value. It's better to use DIV_ROUND_UP macro for neater code and clear meaning. Besides, the gap value is calculated against the per-zone "managed pages", not "present pages". This patch also corrects the comment and do some rephrasing. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm, hugetlb: move the error handle logic out of normal code pathJianyu Zhan1-13/+13
alloc_huge_page() now mixes normal code path with error handle logic. This patches move out the error handle logic, to make normal code path more clean and redue code duplicate. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/memory-failure.c: move commentNaoya Horiguchi1-5/+4
The comment about pages under writeback is far from the relevant code, so let's move it to the right place. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm: avoid unnecessary atomic operations during end_page_writeback()Mel Gorman1-1/+10
If a page is marked for immediate reclaim then it is moved to the tail of the LRU list. This occurs when the system is under enough memory pressure for pages under writeback to reach the end of the LRU but we test for this using atomic operations on every writeback. This patch uses an optimistic non-atomic test first. It'll miss some pages in rare cases but the consequences are not severe enough to warrant such a penalty. While the function does not dominate profiles during a simple dd test the cost of it is reduced. 73048 0.7428 vmlinux-3.15.0-rc5-mmotm-20140513 end_page_writeback 23740 0.2409 vmlinux-3.15.0-rc5-lessatomic end_page_writeback Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm: page_alloc: calculate classzone_idx once from the zonelist refMel Gorman1-25/+34
There is no need to calculate zone_idx(preferred_zone) multiple times or use the pgdat to figure it out. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm: non-atomically mark page accessed during page cache allocation where ↵Mel Gorman3-128/+91
possible aops->write_begin may allocate a new page and make it visible only to have mark_page_accessed called almost immediately after. Once the page is visible the atomic operations are necessary which is noticable overhead when writing to an in-memory filesystem like tmpfs but should also be noticable with fast storage. The objective of the patch is to initialse the accessed information with non-atomic operations before the page is visible. The bulk of filesystems directly or indirectly use grab_cache_page_write_begin or find_or_create_page for the initial allocation of a page cache page. This patch adds an init_page_accessed() helper which behaves like the first call to mark_page_accessed() but may called before the page is visible and can be done non-atomically. The primary APIs of concern in this care are the following and are used by most filesystems. find_get_page find_lock_page find_or_create_page grab_cache_page_nowait grab_cache_page_write_begin All of them are very similar in detail to the patch creates a core helper pagecache_get_page() which takes a flags parameter that affects its behavior such as whether the page should be marked accessed or not. Then old API is preserved but is basically a thin wrapper around this core function. Each of the filesystems are then updated to avoid calling mark_page_accessed when it is known that the VM interfaces have already done the job. There is a slight snag in that the timing of the mark_page_accessed() has now changed so in rare cases it's possible a page gets to the end of the LRU as PageReferenced where as previously it might have been repromoted. This is expected to be rare but it's worth the filesystem people thinking about it in case they see a problem with the timing change. It is also the case that some filesystems may be marking pages accessed that previously did not but it makes sense that filesystems have consistent behaviour in this regard. The test case used to evaulate this is a simple dd of a large file done multiple times with the file deleted on each iterations. The size of the file is 1/10th physical memory to avoid dirty page balancing. In the async case it will be possible that the workload completes without even hitting the disk and will have variable results but highlight the impact of mark_page_accessed for async IO. The sync results are expected to be more stable. The exception is tmpfs where the normal case is for the "IO" to not hit the disk. The test machine was single socket and UMA to avoid any scheduling or NUMA artifacts. Throughput and wall times are presented for sync IO, only wall times are shown for async as the granularity reported by dd and the variability is unsuitable for comparison. As async results were variable do to writback timings, I'm only reporting the maximum figures. The sync results were stable enough to make the mean and stddev uninteresting. The performance results are reported based on a run with no profiling. Profile data is based on a separate run with oprofile running. async dd 3.15.0-rc3 3.15.0-rc3 vanilla accessed-v2 ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%) tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%) btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%) ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%) xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%) The XFS figure is a bit strange as it managed to avoid a worst case by sheer luck but the average figures looked reasonable. samples percentage ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm: do not use unnecessary atomic operations when adding pages to the LRUMel Gorman1-2/+4
When adding pages to the LRU we clear the active bit unconditionally. As the page could be reachable from other paths we cannot use unlocked operations without risk of corruption such as a parallel mark_page_accessed. This patch tests if is necessary to clear the active flag before using an atomic operation. This potentially opens a tiny race when PageActive is checked as mark_page_accessed could be called after PageActive was checked. The race already exists but this patch changes it slightly. The consequence is that that the page may be promoted to the active list that might have been left on the inactive list before the patch. It's too tiny a race and too marginal a consequence to always use atomic operations for. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm: do not use atomic operations when releasing pagesMel Gorman1-1/+1
There should be no references to it any more and a parallel mark should not be reordered against us. Use non-locked varient to clear page active. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>