linux - Linux Kernel (branches are rebased on master from time to time)

Age	Commit message	Author	Files	Lines
2013-09-11	task_work: minor cleanups	Oleg Nesterov	1	-2/+2
	Trivial. Remove the unnecessary "work = NULL" initialization and turn read_barrier_depends() into smp_read_barrier_depends() in task_work_cancel(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	kernel/smp.c: quit unconditionally enabling irqs in on_each_cpu_mask().	David Daney	1	-4/+7
	As in commit f21afc25f9ed ("smp.h: Use local_irq_{save,restore}() in !SMP version of on_each_cpu()"), we don't want to enable irqs if they are not already enabled. I don't know of any bugs currently caused by this unconditional local_irq_enable(), but I want to use this function in MIPS/OCTEON early boot (when we have early_boot_irqs_disabled). This also makes this function have similar semantics to on_each_cpu() which is good in itself. Signed-off-by: David Daney <david.daney@cavium.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: Christoph Lameter <cl@linux.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	syscalls.h: add forward declarations for inplace syscall wrappers	Sergei Trofimovich	2	-0/+2
	Unclutter -Wmissing-prototypes warning types (enabled at make W=1) linux/include/linux/syscalls.h:190:18: warning: no previous prototype for 'SyS_semctl' [-Wmissing-prototypes] asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \ ^ linux/include/linux/syscalls.h:183:2: note: in expansion of macro '__SYSCALL_DEFINEx' __SYSCALL_DEFINEx(x, sname, __VA_ARGS__) ^ by adding forward declarations right before definitions. Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	extable: skip sorting if the table is empty	Uwe Kleine-König	1	-1/+1
	At least on ARM no-MMU the extable is empty and so there is nothing to sort. So add a check for the table to be empty which effectively only changes that the misleading pr_notice is suppressed. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: David Daney <david.daney@cavium.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	smp.h: move !SMP version of on_each_cpu() out-of-line	David Daney	2	-16/+16
	All of the other non-trivial !SMP versions of functions in smp.h are out-of-line in up.c. Move on_each_cpu() there as well. This allows us to get rid of the #include <linux/irqflags.h>. The drawback is that this makes both the x86_64 and i386 defconfig !SMP kernels about 200 bytes larger each. Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	up.c: use local_irq_{save,restore}() in smp_call_function_single.	David Daney	1	-3/+5
	The SMP version of this function doesn't unconditionally enable irqs, so neither should this !SMP version. There are no known problems caused by this, but we make the change for consistency's sake. Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	smp: quit unconditionally enabling irq in on_each_cpu_mask and on_each_cpu_cond	David Daney	2	-46/+55
	As in commit f21afc25f9ed ("smp.h: Use local_irq_{save,restore}() in !SMP version of on_each_cpu()"), we don't want to enable irqs if they are not already enabled. There are currently no known problematic callers of these functions, but since it is a known failure pattern, we preemptively fix them. Since they are not trivial functions, make them non-inline by moving them to up.c. This also makes it so we don't have to fix #include dependencies for preempt_{disable,enable}. Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	kernel/spinlock.c: add default arch_*_relax definitions for GENERIC_LOCKBREAK	Will Deacon	1	-0/+14
	When running with GENERIC_LOCKBREAK=y, the locking implementations emit calls to arch_{read,write,spin}_relax when spinning on a contended lock in order to allow architectures to favour the CPU owning the lock if possible. In reality, everybody apart from PowerPC and S390 just does cpu_relax() here, so make that the default behaviour and allow it to be overridden if required. Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
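The "default unless overridden" arrangement described above is commonly written with `#ifndef` guards. The following is a minimal userspace sketch of that pattern, not the kernel source: `cpu_relax()` is replaced by a hypothetical counting stub, and `struct demo_lock`, `relax_calls` and `spin_until_free()` are names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stub: counts calls instead of issuing a CPU pause hint. */
static unsigned long relax_calls;

static void cpu_relax(void)
{
	relax_calls++;
}

/* Hypothetical lock type, only here to give the macro an argument. */
struct demo_lock {
	int contended;
};

/*
 * Default definition: an architecture that wants different behaviour
 * defines arch_spin_relax before this point, and the guard then skips
 * the fallback.
 */
#ifndef arch_spin_relax
# define arch_spin_relax(l)	cpu_relax()
#endif

/* Spin while the lock looks contended, up to max_spins iterations. */
static unsigned long spin_until_free(struct demo_lock *lock,
				     unsigned long max_spins)
{
	unsigned long spins = 0;

	assert(lock != NULL);
	while (lock->contended && spins < max_spins) {
		arch_spin_relax(lock);
		spins++;
	}
	return spins;
}
```

With no override in place, each spin iteration ends up in `cpu_relax()`, which matches the behaviour the commit message says most architectures already had.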
2013-09-11	kernel/smp.c: free related resources when failure occurs in hotplug_cfd()	Chen Gang	1	-1/+4
	When a failure occurs in hotplug_cfd(), the related resources need to be released; otherwise memory is leaked. Signed-off-by: Chen Gang <gang.chen@asianux.com> Acked-by: Wang YanQing <udknight@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	fs/bio-integrity: fix a potential mem leak	Gu Zheng	1	-4/+5
	Free the bio_integrity_pool in the fail path of biovec_create_pool in function bioset_integrity_create(). Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	lto, watchdog/hpwdt.c: make assembler label global	Andi Kleen	1	-2/+4
	We cannot assume that the inline assembler code always ends up in the same file as the original C file. So make any assembler labels that C code references via "extern" global. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Wim Van Sebroeck <wim@iguana.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	kernel/modsign_pubkey.c: fix init const for module signing code	Andi Kleen	1	-3/+3
	const has to use __initconst, not __initdata Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	kernel-wide: fix missing validations on __get/__put/__copy_to/__copy_from_user()	Mathieu Desnoyers	5	-34/+39
	I found the following patterns, which lead to interesting findings: grep -r "ret.*|=.*__put_user" * ; grep -r "ret.*|=.*__get_user" * ; grep -r "ret.*|=.*__copy" * . The __put_user() calls are in compat_ioctl.c, ptrace compat and signal compat. Since those appear in compat code, we could probably expect the kernel addresses not to be reachable in the lower 32-bit range, so I think they might not be exploitable. For the "__get_user" cases, I don't think those are exploitable: the worst that can happen is that the kernel will copy kernel memory into in-kernel buffers, and will fail immediately afterward. The alpha csum_partial_copy_from_user() seems to be missing the access_ok() check entirely. The fix is inspired by x86. This could lead to an information leak on alpha. I also noticed that many architectures map csum_partial_copy_from_user() to csum_partial_copy_generic(), but I wonder if the latter performs the access checks on every architecture. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	drivers/firmware/google/gsmi.c: replace strict_strtoul() with kstrtoul()	Jingoo Han	1	-1/+1
	The use of strict_strtoul() is not preferred, because strict_strtoul() is obsolete. Thus, kstrtoul() should be used. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Tom Gundersen <teg@jklm.no> Cc: Mike Waychison <mikew@google.com> Acked-by: Mike Waychison <mikew@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	platform: convert apple-gmux driver to dev_pm_ops from legacy pm_ops	Shuah Khan	1	-4/+14
	Convert drivers/platform/x86/apple-gmux to use dev_pm_ops instead of legacy pm_ops. This patch depends on pnp driver bus ops change to invoke pnp_driver dev_pm_ops. Signed-off-by: Shuah Khan <shuah.kh@samsung.com> Cc: Matthew Garrett <matthew.garrett@nebula.com> Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com> Cc: Ashley Lai <ashley@ashleylai.com> Cc: Rajiv Andrade <mail@srajiv.net> Cc: Marcel Selhorst <tpmdd@selhorst.net> Cc: Sirrix AG <tpmdd@sirrix.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Peter Hüwe <PeterHuewe@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	tpm: convert tpm_tis driver to use dev_pm_ops from legacy pm_ops	Shuah Khan	1	-36/+24
	Convert drivers/char/tpm/tpm_tis.c to use dev_pm_ops instead of legacy pm_ops. This patch depends on pnp driver bus ops change to invoke pnp_driver dev_pm_ops. Signed-off-by: Shuah Khan <shuah.kh@samsung.com> Cc: Matthew Garrett <matthew.garrett@nebula.com> Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com> Cc: Ashley Lai <ashley@ashleylai.com> Cc: Rajiv Andrade <mail@srajiv.net> Cc: Marcel Selhorst <tpmdd@selhorst.net> Cc: Sirrix AG <tpmdd@sirrix.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Peter Hüwe <PeterHuewe@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	rtc: convert rtc-cmos to dev_pm_ops from legacy pm_ops	Shuah Khan	1	-19/+5
	Convert drivers/rtc/rtc-cmos to use dev_pm_ops instead of legacy pm_ops. This patch depends on pnp driver bus ops change to invoke pnp_driver dev_pm_ops. Signed-off-by: Shuah Khan <shuah.kh@samsung.com> Cc: Matthew Garrett <matthew.garrett@nebula.com> Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com> Cc: Ashley Lai <ashley@ashleylai.com> Cc: Rajiv Andrade <mail@srajiv.net> Cc: Marcel Selhorst <tpmdd@selhorst.net> Cc: Sirrix AG <tpmdd@sirrix.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Peter Hüwe <PeterHuewe@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	pnp: change pnp bus pm_ops to invoke pnp driver dev_pm_ops if specified	Shuah Khan	1	-0/+13
	pnp_bus_suspend() and pnp_bus_resume() invoke the legacy pm_ops from pnp_driver. Change pnp_bus_suspend() and pnp_bus_resume() to check whether the pnp driver has dev_pm_ops and, if so, call them. If dev_pm_ops don't exist, use the legacy pm_ops. Without this change, pnp_driver dev_pm_ops will not get called. This is patch 1 of 4: in addition to the pnp driver bus pm_ops change to invoke driver dev_pm_ops, the patch set contains changes to the rtc-cmos, tpm_tis, and apple-gmux pnp drivers to convert them from legacy pm_ops to dev_pm_ops. Signed-off-by: Shuah Khan <shuah.kh@samsung.com> Cc: Matthew Garrett <matthew.garrett@nebula.com> Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com> Cc: Ashley Lai <ashley@ashleylai.com> Cc: Rajiv Andrade <mail@srajiv.net> Cc: Marcel Selhorst <tpmdd@selhorst.net> Cc: Sirrix AG <tpmdd@sirrix.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Peter Hüwe <PeterHuewe@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	memcg: fix multiple large threshold notifications	Greg Thelen	1	-1/+7
	A memory cgroup with (1) multiple threshold notifications and (2) at least one threshold >=2G was not reliable. Specifically the notifications would either not fire or would not fire in the proper order. The __mem_cgroup_threshold() signaling logic depends on keeping 64 bit thresholds in sorted order. mem_cgroup_usage_register_event() sorts them with compare_thresholds(), which returns the difference of two 64 bit thresholds as an int. If the difference is positive but has bit[31] set, then sort() treats the difference as negative and breaks sort order. This fix compares the two arbitrary 64 bit thresholds returning the classic -1, 0, 1 result. The test below sets two notifications (at 0x1000 and 0x81001000): cd /sys/fs/cgroup/memory; mkdir x; for x in 4096 2164264960; do cgroup_event_listener x/memory.usage_in_bytes $x | sed "s/^/$x listener:/" & done; echo $$ > x/cgroup.procs; anon_leaker 500M. v3.11-rc7 fails to signal the 4096 event listener: Leaking... Done leaking pages. Patched v3.11-rc7 properly notifies: Leaking... 4096 listener:2013:8:31:14:13:36 Done leaking pages. The fixed bug is old. It appears to date back to the introduction of memcg threshold notifications in v2.6.34-rc1-116-g2e72b6347c94 "memcg: implement memory thresholds" Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
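The truncation problem described above can be reproduced in userspace with the two thresholds from the test (0x1000 and 0x81001000). This is a sketch, not the kernel code: `compare_truncating()`, `compare_three_way()` and `sort_thresholds()` are hypothetical names, and qsort() stands in for the kernel's sort(). The negative result of the truncating version assumes the usual two's-complement narrowing of an out-of-range value to `int`, which the C standard leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Buggy shape: the 64-bit difference is narrowed to int.  For
 * 0x81001000 - 0x1000 = 0x81000000, bit 31 is set, so the result is
 * negative on common platforms even though the first value is larger.
 */
static int compare_truncating(const void *a, const void *b)
{
	const uint64_t *x = a;
	const uint64_t *y = b;

	return (int)(*x - *y);
}

/* Fixed shape: compare the full 64-bit values and return -1, 0 or 1. */
static int compare_three_way(const void *a, const void *b)
{
	const uint64_t *x = a;
	const uint64_t *y = b;

	if (*x > *y)
		return 1;
	if (*x < *y)
		return -1;
	return 0;
}

/* Sort an array of thresholds in ascending order. */
static void sort_thresholds(uint64_t *thresholds, size_t count)
{
	assert(thresholds != NULL || count == 0);
	qsort(thresholds, count, sizeof(*thresholds), compare_three_way);
}
```

Sorting `{2164264960, 4096}` with `sort_thresholds()` puts 4096 first, whereas the truncating comparison would report 2164264960 as the smaller of the two.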
2013-09-11	mm/mempool.c: convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)	Joe Perches	1	-1/+1
	Use the helper function instead of __GFP_ZERO. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	lib/genalloc.c: convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)	Joe Perches	1	-1/+1
	Use the helper function instead of __GFP_ZERO. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/mmap: remove unnecessary assignment	Yanchuan Nian	1	-1/+0
	pgoff is not used after the statement "pgoff = vma->vm_pgoff;", so the assignment is redundant. Signed-off-by: Yanchuan Nian <ycnian@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	writeback: fix race that cause writeback hung	Junxiao Bi	1	-2/+2
	There is a race between marking an inode dirty and the writeback thread; see the following scenario. In this case, the writeback thread will not run even though there is dirty_io. __mark_inode_dirty() bdi_writeback_workfn() ... ... spin_lock(&inode->i_lock); ... if (bdi_cap_writeback_dirty(bdi)) { <<< assume wb has dirty_io, so wakeup_bdi is false. <<< the following inode_dirty also have wakeup_bdi false. if (!wb_has_dirty_io(&bdi->wb)) wakeup_bdi = true; } spin_unlock(&inode->i_lock); <<< assume last dirty_io is removed here. pages_written = wb_do_writeback(wb); ... <<< work_list empty and wb has no dirty_io, <<< delayed_work will not be queued. if (!list_empty(&bdi->work_list) || (wb_has_dirty_io(wb) && dirty_writeback_interval)) queue_delayed_work(bdi_wq, &wb->dwork, msecs_to_jiffies(dirty_writeback_interval * 10)); spin_lock(&bdi->wb.list_lock); inode->dirtied_when = jiffies; <<< new dirty_io is added. list_move(&inode->i_wb_list, &bdi->wb.b_dirty); spin_unlock(&bdi->wb.list_lock); <<< though there is dirty_io, but wakeup_bdi is false, <<< so writeback thread will not be waked up and <<< the new dirty_io will not be flushed. if (wakeup_bdi) bdi_wakeup_thread_delayed(bdi); Writeback will not run until a new flush work is queued. This may cause a lot of dirty pages to stay in memory for a long time. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/madvise.c:madvise_hwpoison(): remove local `ret'	Andrew Morton	1	-4/+5
	madvise_hwpoison() has two locals called "ret". Fix it all up. Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/madvise.c: fix return value of madvise_hwpoison()	Wanpeng Li	1	-1/+1
	The return value outside the for loop is always zero, which means madvise_hwpoison() reports success; however, this is not true when soft_offline_page() returns a failure. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/memory-failure.c: fix bug triggered by unpoisoning empty zero page	Wanpeng Li	1	-1/+1
	Injecting memory failure for page 0x19d0 at 0xb77d2000 MCE 0x19d0: non LRU page recovery: Ignored MCE: Software-unpoisoned page 0x19d0 BUG: Bad page state in process bash pfn:019d0 page:f3461a00 count:0 mapcount:0 mapping: (null) index:0x0 page flags: 0x40000404(referenced|reserved) Modules linked in: nfsd auth_rpcgss i915 nfs_acl nfs lockd video drm_kms_helper drm bnep rfcomm sunrpc bluetooth psmouse parport_pc ppdev lp serio_raw fscache parport gpio_ich lpc_ich mac_hid i2c_algo_bit tpm_tis wmi usb_storage hid_generic usbhid hid e1000e firewire_ohci firewire_core ahci ptp libahci pps_core crc_itu_t CPU: 3 PID: 2123 Comm: bash Not tainted 3.11.0-rc6+ #12 Hardware name: LENOVO 7034DD7/ , BIOS 9HKT47AUS 01//2012 00000000 00000000 e9625ea0 c15ec49b f3461a00 e9625eb8 c15ea119 c17cbf18 ef084314 000019d0 f3461a00 e9625ed8 c110dc8a f3461a00 00000001 00000000 f3461a00 40000404 00000000 e9625ef8 c110dcc1 f3461a00 f3461a00 000019d0 Call Trace: dump_stack+0x41/0x52 bad_page+0xcf/0xeb free_pages_prepare+0x12a/0x140 free_hot_cold_page+0x21/0x110 __put_single_page+0x21/0x30 put_page+0x25/0x40 unpoison_memory+0x107/0x200 hwpoison_unpoison+0x20/0x30 simple_attr_write+0xb6/0xd0 vfs_write+0xa0/0x1b0 SyS_write+0x4f/0x90 sysenter_do_call+0x12/0x22 Disabling lock debugging due to kernel taint Testcase: #define _GNU_SOURCE #include <stdlib.h> #include <stdio.h> #include <sys/mman.h> #include <unistd.h> #include <fcntl.h> #include <sys/types.h> #include <errno.h> #define PAGES_TO_TEST 1 #define PAGE_SIZE 4096 int main(void) { char *mem; mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1) return -1; munmap(mem, PAGES_TO_TEST * PAGE_SIZE); return 0; } There is one page reference count for the default empty zero page, and madvise_hwpoison adds another one via get_user_pages_fast. memory_hwpoison drops one page reference count since it's a non LRU page. unpoison_memory then releases the last page reference count and frees the empty zero page to the buddy system, which is not correct since the empty zero page has the PG_reserved flag. This patch fixes it by not reducing the page reference count below 1 for the empty zero page. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/hwpoison-inject.c: change permission of corrupt-pfn/unpoison-pfn to 0200	Wanpeng Li	1	-2/+2
	Hwpoison injection doesn't implement read method for corrupt-pfn/unpoison-pfn attributes: # cat /sys/kernel/debug/hwpoison/corrupt-pfn cat: /sys/kernel/debug/hwpoison/corrupt-pfn: Permission denied # cat /sys/kernel/debug/hwpoison/unpoison-pfn cat: /sys/kernel/debug/hwpoison/unpoison-pfn: Permission denied This patch changes the permission of corrupt-pfn/unpoison-pfn to 0200. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/hwpoison.c: fix held reference count after unpoisoning empty zero page	Wanpeng Li	1	-0/+4
	madvise hwpoison injection will poison the read-only empty zero page if there is no write access before the poison. The empty zero page's reference count is increased for hwpoison; a subsequent poison of the zero page returns directly since the page already has PG_hwpoison set, but the page reference count is still increased by get_user_pages_fast. The unpoison process unpoisons the empty zero page and decreases the reference count successfully the first time; however, a subsequent unpoison of the empty zero page returns directly since the page has already been unpoisoned, without decreasing the page reference count of the empty zero page. This patch fixes it by making madvise_hwpoison() put the page and return immediately (without calling memory_failure() or soft_offline_page()) when the page is already hwpoisoned. Testcase: #define _GNU_SOURCE #include <stdlib.h> #include <stdio.h> #include <sys/mman.h> #include <unistd.h> #include <fcntl.h> #include <sys/types.h> #include <errno.h> #define PAGES_TO_TEST 3 #define PAGE_SIZE 4096 int main(void) { char *mem; int i; mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1) return -1; munmap(mem, PAGES_TO_TEST * PAGE_SIZE); return 0; } Add printk to dump page reference count: [ 93.075959] Injecting memory failure for page 0x19d0 at 0xb77d8000 [ 93.076207] MCE 0x19d0: non LRU page recovery: Ignored [ 93.076209] pfn 0x19d0, page count = 1 after memory failure [ 93.076220] Injecting memory failure for page 0x19d0 at 0xb77d9000 [ 93.076221] MCE 0x19d0: already hardware poisoned [ 93.076222] pfn 0x19d0, page count = 2 after memory failure [ 93.076224] Injecting memory failure for page 0x19d0 at 0xb77da000 [ 93.076224] MCE 0x19d0: already hardware poisoned [ 93.076225] pfn 0x19d0, page count = 3 after memory failure Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
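The reference-count drift shown in the printk output of this commit can be followed with a small userspace model. It is only a sketch of the accounting as the commit message describes it: `struct model_page`, `model_memory_failure()` and `model_madvise_hwpoison()` are hypothetical names, and the starting count of 1 for the empty zero page is taken from the description in the log rather than from kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the shared empty zero page. */
struct model_page {
	int count;	/* page reference count */
	bool hwpoison;	/* models PG_hwpoison */
};

/* Models memory_failure() on a non-LRU page, following the log output. */
static void model_memory_failure(struct model_page *page)
{
	if (page->hwpoison)
		return;		/* "already hardware poisoned": no reference dropped */
	page->hwpoison = true;
	page->count--;		/* non-LRU page: one reference is released */
}

/*
 * Models madvise(MADV_HWPOISON) over nr_pages mappings that all resolve
 * to the same zero page.  With patched == true, an already poisoned page
 * has its freshly taken reference put back before it is skipped.
 */
static void model_madvise_hwpoison(struct model_page *page, int nr_pages,
				   bool patched)
{
	assert(page != NULL);
	for (int i = 0; i < nr_pages; i++) {
		page->count++;	/* reference taken by get_user_pages_fast() */
		if (patched && page->hwpoison) {
			page->count--;	/* put the page and skip it */
			continue;
		}
		model_memory_failure(page);
	}
}
```

Starting from a count of 1 and injecting three times, the unpatched model ends at 3, matching the "page count = 3" line in the log, while the patched model stays at 1.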
2013-09-11	mm/hwpoison: add '#' to madvise_hwpoison	Wanpeng Li	1	-2/+2
	Add '#' to madvise_hwpoison. Before patch: [ 95.892866] Injecting memory failure for page 19d0 at b7786000 [ 95.893151] MCE 0x19d0: non LRU page recovery: Ignored After patch: [ 95.892866] Injecting memory failure for page 0x19d0 at 0xb7786000 [ 95.893151] MCE 0x19d0: non LRU page recovery: Ignored Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/hwpoison: drop forward reference declarations __soft_offline_page()	Wanpeng Li	1	-66/+64
	Drop forward reference declarations __soft_offline_page. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/hwpoison: don't set migration type twice to avoid holding heavily contend zone->lock	Wanpeng Li	1	-1/+2
	Setting the pageblock migration type takes zone->lock, which is heavily contended, in order to avoid races. However, soft offline sets the pageblock migration type twice while getting the page if the page is in use, is not a hugetlbfs page and is not on an LRU list. It is unnecessary to set the pageblock migration type and take the heavily contended zone->lock again if the first get-page round has already set the pageblock to the right migration type. The trick here is that the migration type is MIGRATE_ISOLATE. Besides hwpoison, two other paths can change MIGRATE_ISOLATE. One is memory hotplug; however, we hold lock_memory_hotplug(), which avoids the race. The other is CMA, which unmovable page allocation requests can't fall back to. So it's safe here. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/hwpoison: replace atomic_long_sub() with atomic_long_dec()	Wanpeng Li	1	-1/+1
	Replace atomic_long_sub() with atomic_long_dec() since the page is normal page instead of hugetlbfs page or thp. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/hwpoison: fix race against poison thp	Wanpeng Li	1	-0/+10
	There is a race between hwpoisoning a page and unpoisoning it. memory_failure() sets the page hwpoison and increases num_poisoned_pages without holding the page lock, and one page is accounted against a thp in num_poisoned_pages. However, unpoison can occur before memory_failure() takes the page lock and splits the transparent hugepage; unpoison then decreases num_poisoned_pages by 1 << compound_order, since memory_failure() has not yet split the transparent hugepage with the page lock held. That means we account one page for hwpoison and 1 << compound_order for unpoison. This patch fixes it by inserting a PageTransHuge check before doing TestClearPageHWPoison, so that unpoison fails without clearing PageHWPoison or decreasing num_poisoned_pages. A B memory_failure TestSetPageHWPoison(p); if (PageHuge(p)) nr_pages = 1 << compound_order(hpage); else nr_pages = 1; atomic_long_add(nr_pages, &num_poisoned_pages); unpoison_memory nr_pages = 1<< compound_trans_order(page); if(TestClearPageHWPoison(p)) atomic_long_sub(nr_pages, &num_poisoned_pages); lock page if (!PageHWPoison(p)) unlock page and return hwpoison_user_mappings if (PageTransHuge(hpage)) split_huge_page(hpage); Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/hwpoison: don't need to hold compound lock for hugetlbfs page	Wanpeng Li	2	-20/+6
	The compound lock was introduced by commit e9da73d67 ("thp: compound_lock.") and is used to serialize put_page against __split_huge_page_refcount(). In addition, transparent hugepages are split in the hwpoison handler and just one subpage is poisoned. There is no need to hold the compound lock for a hugetlbfs page. This patch replaces compound_trans_order with compound_order in the places where the page is a hugetlbfs page. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/hwpoison: fix loss of PG_dirty for errors on mlocked pages	Wanpeng Li	1	-0/+3
	memory_failure() stores the page flags of the error page before unmapping, and (only) if the first check against the current page flags decides the error page is unknown, it does a second check with the stored page flags, since memory_failure() unmaps the error page before calling page_action(). This unmapping changes the page state; in particular page_remove_rmap() (called from try_to_unmap_one()) clears PG_mlocked, so page_action() can't catch mlocked pages after that. However, memory_failure() can't handle memory errors on dirty mlocked pages correctly: try_to_unmap_one() moves the dirty bit from the pte to the physical page, and the second check loses it because it checks the stored page flags. This patch fixes it by restoring the PG_dirty flag in the stored page flags if the page is dirty. Testcase: #define _GNU_SOURCE #include <stdlib.h> #include <stdio.h> #include <sys/mman.h> #include <sys/types.h> #include <errno.h> #define PAGES_TO_TEST 2 #define PAGE_SIZE 4096 int main(void) { char *mem; int i; mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, 0, 0); for (i = 0; i < PAGES_TO_TEST; i++) mem[i * PAGE_SIZE] = 'a'; if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1) return -1; return 0; } Before patch: [ 912.839247] Injecting memory failure for page 7dfb8 at 7f6b4e37b000 [ 912.839257] MCE 0x7dfb8: clean mlocked LRU page recovery: Recovered [ 912.845550] MCE 0x7dfb8: clean mlocked LRU page still referenced by 1 users [ 912.852586] Injecting memory failure for page 7e6aa at 7f6b4e37c000 [ 912.852594] MCE 0x7e6aa: clean mlocked LRU page recovery: Recovered [ 912.858936] MCE 0x7e6aa: clean mlocked LRU page still referenced by 1 users After patch: [ 163.590225] Injecting memory failure for page 91bc2f at 7f9f5b0e5000 [ 163.590264] MCE 0x91bc2f: dirty mlocked LRU page recovery: Recovered [ 163.596680] MCE 0x91bc2f: dirty mlocked LRU page still referenced by 1 users [ 163.603831] Injecting memory failure for page 91cdd3 at 7f9f5b0e6000 [ 163.603852] MCE 0x91cdd3: dirty mlocked LRU page recovery: Recovered [ 163.610305] MCE 0x91cdd3: dirty mlocked LRU page still referenced by 1 users Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
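The way a flags snapshot can miss the dirty bit, and how copying it back repairs the classification, can be sketched in userspace. Everything here is a hypothetical model built from the commit message: `struct demo_page`, the `DEMO_PG_*` bits, `demo_unmap()` and `demo_flags_for_classification()` are invented for illustration and do not mirror the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PG_DIRTY	(1u << 0)	/* models PG_dirty */
#define DEMO_PG_MLOCKED	(1u << 1)	/* models PG_mlocked */

/* Hypothetical page: flag word plus the dirty bit still held in the pte. */
struct demo_page {
	unsigned int flags;
	bool pte_dirty;
};

/*
 * Models unmapping: the dirty bit moves from the pte to the page flags
 * and the mlocked flag is cleared.
 */
static void demo_unmap(struct demo_page *page)
{
	if (page->pte_dirty) {
		page->flags |= DEMO_PG_DIRTY;
		page->pte_dirty = false;
	}
	page->flags &= ~DEMO_PG_MLOCKED;
}

/*
 * Returns the flag snapshot that the later classification step would see.
 * With patched == true, a dirty bit that appeared during unmap is copied
 * into the snapshot.
 */
static unsigned int demo_flags_for_classification(struct demo_page *page,
						  bool patched)
{
	unsigned int saved;

	assert(page != NULL);
	saved = page->flags;	/* snapshot taken before unmapping */
	demo_unmap(page);
	if (patched && (page->flags & DEMO_PG_DIRTY))
		saved |= DEMO_PG_DIRTY;
	return saved;
}
```

For a page that is mlocked and dirty only in the pte, the unpatched snapshot reads as "clean mlocked" and the patched one as "dirty mlocked", which corresponds to the before and after log lines in this commit.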
2013-09-11	hwpoison: always unset MIGRATE_ISOLATE before returning from soft_offline_page()	Naoya Horiguchi	1	-1/+2
	Soft offline code expects that MIGRATE_ISOLATE is set on the target page only during soft offlining work. But currently it doesn't work as expected when get_any_page() fails and returns a negative value. As a result, end users can end up with unexpectedly isolated pages. This patch fixes it by always unsetting MIGRATE_ISOLATE before returning. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
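The "always undo the isolation before returning" rule can be sketched as a single exit path. This is a minimal model, not the kernel code: `toy_block`, `toy_get_page()` and `toy_soft_offline()` are hypothetical names, and the error value -5 is arbitrary.

```c
#include <assert.h>

struct toy_block {
	int isolated;	/* stands in for the MIGRATE_ISOLATE state */
};

/* Hypothetical stand-in for get_any_page(): negative means failure. */
int toy_get_page(int should_fail)
{
	return should_fail ? -5 : 1;
}

int toy_soft_offline(struct toy_block *b, int should_fail)
{
	int ret;

	assert(b);
	b->isolated = 1;	/* isolated only for the duration of the work */

	ret = toy_get_page(should_fail);
	if (ret < 0)
		goto unset;	/* an early return here would leave it isolated */

	ret = 0;		/* the offlining work itself would go here */
unset:
	b->isolated = 0;	/* every exit path undoes the isolation */
	return ret;
}
```

Routing the failure case through the same label as the success case means the cleanup cannot be skipped when a new error path is added later.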
2013-09-11	mm: correct the comment about the value for buddy _mapcount	Wang Sheng-Hui	1	-4/+7
	The comment should say that _mapcount is set to PAGE_BUDDY_MAPCOUNT_VALUE to mark a page as buddy, not to the magic number -2. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: make sure _PAGE_SWP_SOFT_DIRTY bit is not set on present pte	Cyrill Gorcunov	2	-15/+22
	The _PAGE_SWP_SOFT_DIRTY bit should never be set on a present pte, so add a VM_BUG_ON to catch any potential future abuse. Also add a comment on the _PAGE_SWP_SOFT_DIRTY definition explaining the scope of its usage. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/page-writeback.c: add strictlimit feature	Maxim Patlasov	3	-62/+206
	The feature prevents untrusted filesystems (i.e. FUSE mounts created by unprivileged users) from growing a large number of dirty pages before throttling. For such filesystems balance_dirty_pages always checks bdi counters against bdi limits. I.e. even if global "nr_dirty" is under "freerun", it's not allowed to skip bdi checks. The only use case for now is fuse: it sets bdi max_ratio to 1% by default and system administrators are supposed to expect that this limit won't be exceeded. The feature is on if a BDI is marked with the BDI_CAP_STRICTLIMIT flag. A filesystem may set the flag when it initializes its BDI. The problematic scenario comes from the fact that nobody pays attention to the NR_WRITEBACK_TEMP counter (i.e. the number of pages under fuse writeback). The implementation of fuse writeback releases the original page (by calling end_page_writeback) almost immediately. A fuse request queued for real processing bears a copy of the original page. Hence, if the userspace fuse daemon doesn't finalize write requests in a timely manner, an aggressive mmap writer can pollute virtually all memory with those temporary fuse page copies. They are carefully accounted in NR_WRITEBACK_TEMP, but nobody cares. To make further explanations shorter, let me use "NR_WRITEBACK_TEMP problem" as a shortcut for "a possibility of uncontrolled growth of the amount of RAM consumed by temporary pages allocated by kernel fuse to process writeback". The problem was very easy to reproduce. There is a trivial example filesystem implementation in the fuse userspace distribution: fusexmp_fh.c. I added "sleep(1);" to the write methods, then recompiled and mounted it. Then I created a huge file on the mount point and ran a simple program which mmap-ed the file to a memory region, then wrote data to the region. An hour later I observed almost all RAM consumed by fuse writeback. Since then some unrelated changes in kernel fuse have made it more difficult to reproduce, but it is still possible now.
Putting this theoretical happens-in-the-lab thing aside, there is another thing that really hurts real-world (FUSE) users: the write-through page cache policy FUSE currently uses. I.e. when handling write(2), kernel fuse populates the page cache and flushes user data to the server synchronously. This is excessively suboptimal. Pavel Emelyanov's patches ("writeback cache policy") solve the problem, but they also make resolving the NR_WRITEBACK_TEMP problem absolutely necessary. Otherwise, simply copying a huge file to a fuse mount would result in memory starvation. Miklos, the maintainer of FUSE, believes the strictlimit feature is the way to go. And finally, putting FUSE topics aside, there is one more use case for the strictlimit feature. Using a slow USB stick (mass storage) in a machine with a huge amount of RAM installed is a well-known pain. Let's make some simple computations. Assuming 64GB of RAM installed, the existing implementation of balance_dirty_pages will start throttling only after 9.6GB of RAM becomes dirty (freerun == 15% of total RAM). So, the command "cp 9GB_file /media/my-usb-storage/" may return in a few seconds, but the subsequent "umount /media/my-usb-storage/" will take more than two hours if the effective throughput of the storage is, say, 1MB/sec. After inclusion of the strictlimit feature, it will be trivial to add a knob (e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand, manually or via a udev rule. Maybe I'm wrong, but it seems to be quite a natural desire to limit the amount of dirty memory for some devices we do not fully trust (in the sense of sustainable throughput).
[akpm@linux-foundation.org: fix warning in page-writeback.c] Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Cc: Jan Kara <jack@suse.cz> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
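The USB-stick computation in this commit message can be checked with a few lines of integer arithmetic. The helper names below are hypothetical, and the sketch assumes binary units (1GB = 2^30 bytes, 1MB = 2^20 bytes), which the message does not specify.

```c
#include <assert.h>
#include <stdint.h>

/* Dirty bytes allowed before throttling starts, for a freerun percentage. */
uint64_t freerun_bytes(uint64_t ram_bytes, unsigned percent)
{
	assert(percent <= 100);
	return ram_bytes / 100 * percent;
}

/* Whole seconds needed to flush `bytes` at `bytes_per_sec`. */
uint64_t flush_seconds(uint64_t bytes, uint64_t bytes_per_sec)
{
	assert(bytes_per_sec > 0);
	return bytes / bytes_per_sec;
}
```

With 64GB of RAM and a 15% freerun, `freerun_bytes(64ULL << 30, 15)` lands just under 9.6GB, so a 9GB copy fits entirely in dirty page cache. Flushing it with `flush_seconds(9ULL << 30, 1ULL << 20)` gives 9216 seconds, a little over two and a half hours, which is consistent with the "more than two hours" estimate.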
2013-09-11	mm/backing-dev.c: check user buffer length before copying data to the related user buffer	Chen Gang	1	-1/+1
	'*lenp' may be less than "sizeof(kbuf)", so we must check this before the next copy_to_user(). pdflush_proc_obsolete() is called by the sysctl whose 'procname' is "nr_pdflush_threads"; if the user passes a buffer length less than "sizeof(kbuf)", copy_to_user() will write past the end of the user's buffer. Signed-off-by: Chen Gang <gang.chen@asianux.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
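The length check described above can be sketched in userspace, with memcpy() standing in for copy_to_user(). The function name `toy_report()` and its return convention (number of text bytes written) are assumptions made for this illustration and are not the kernel handler's actual interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copies a fixed reply into `buf` only if the caller's buffer (*lenp bytes)
 * can hold all of it, including the terminating NUL. On a short buffer it
 * reports zero bytes instead of overrunning the destination.
 */
int toy_report(char *buf, size_t *lenp)
{
	static const char kbuf[] = "0\n";

	assert(buf && lenp);
	if (*lenp < sizeof(kbuf)) {
		*lenp = 0;		/* nothing written */
		return 0;
	}
	memcpy(buf, kbuf, sizeof(kbuf));
	*lenp = sizeof(kbuf) - 1;	/* text bytes, excluding the NUL */
	return (int)*lenp;
}
```

The point of the sketch is the ordering: the caller-supplied length is compared against the full size of the source before any bytes are copied.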
2013-09-11	mm/mremap.c: call pud_free() after fail calling pmd_alloc()	Chen Gang	1	-1/+4
	In alloc_new_pmd(), if pud_alloc() was called successfully, but pmd_alloc() fails, avoid leaking `pud'. Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/vmalloc: use wrapper function get_vm_area_size to caculate size of vm area	Wanpeng Li	1	-6/+6
	Use wrapper function get_vm_area_size to calculate size of vm area. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/writeback: make writeback_inodes_wb static	Wanpeng Li	2	-3/+1
	It's not used globally and could be static. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/sparse: introduce alloc_usemap_and_memmap	Wanpeng Li	1	-76/+57
	After commit 9bdac9142407 ("sparsemem: Put mem map for one node together."), the vmemmap for one node is allocated together, and its logic is similar to the memory allocation for pageblock flags. This patch introduces alloc_usemap_and_memmap to factor out the logic shared by the memory allocation for pageblock flags and for vmemmap. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: vmscan: fix do_try_to_free_pages() livelock	Lisa Du	9	-42/+44
	This patch is based on KOSAKI's work, to which I have added a little more description; please refer to https://lkml.org/lkml/2012/6/14/74. Currently, I found that the system can enter a state in which a zone has lots of free pages but only order-0 and order-1 pages, which means the zone is heavily fragmented. A high-order allocation can then stall for a long time in the direct reclaim path (e.g. 60 seconds), especially in an environment with no swap and no compaction. This problem happened on v3.4, but it seems the issue still lives in the current tree. The reason is that do_try_to_free_pages() enters a livelock: kswapd will go to sleep if the zones have been fully scanned and are still not balanced, as kswapd thinks there's little point in trying all over again and wants to avoid an infinite loop. Instead it changes order from high-order to order-0 because kswapd thinks order-0 is the most important. See 73ce02e9 for details. If watermarks are OK, kswapd will go back to sleep and may leave zone->all_unreclaimable = 0. It assumes high-order users can still perform direct reclaim if they wish. Direct reclaim continues to reclaim for a high order which is not a COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on zone->all_unreclaimable. This is to avoid a too-early oom-kill. So direct reclaim depends on kswapd to break this loop. In the worst case, direct reclaim may continue reclaiming pages forever while kswapd sleeps forever, until something like a watchdog detects this and finally kills the process. As described in: http://thread.gmane.org/gmane.linux.kernel.mm/103737 We can't turn on zone->all_unreclaimable from the direct reclaim path because the direct reclaim path doesn't take any lock, so doing so would be racy. Thus this patch removes the zone->all_unreclaimable field completely and recalculates the zone's reclaimable state every time. Note: we can't take the approach where direct reclaim looks at zone->pages_scanned directly while kswapd continues to use zone->all_unreclaimable, because it is racy.
commit 929bea7c71 (vmscan: all_unreclaimable() use zone->all_unreclaimable as a name) describes the detail. [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()] Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Nick Piggin <npiggin@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Bob Liu <lliubbo@gmail.com> Cc: Neil Zhang <zhangwm@marvell.com> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Lisa Du <cldu@marvell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
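The idea of recalculating the reclaimable state on every query, rather than caching it in a flag that only one path may safely set, can be sketched as a pure predicate. The struct, the function name, and in particular the threshold factor of 6 are assumptions for this illustration; the commit message itself only says that the state is recomputed from the zone's counters.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed threshold factor for this sketch only. */
#define TOY_SCAN_FACTOR 6

struct toy_zone {
	unsigned long pages_scanned;		/* scanned since the last progress */
	unsigned long reclaimable_pages;	/* pages that could be reclaimed */
};

/*
 * Derived on demand from the counters, so there is no cached flag that
 * one path must set on behalf of another.
 */
bool toy_zone_reclaimable(const struct toy_zone *z)
{
	assert(z);
	return z->pages_scanned < z->reclaimable_pages * TOY_SCAN_FACTOR;
}
```

Because the answer depends only on the current counters, a direct reclaimer and a background reclaimer calling this predicate reach the same conclusion without either having to wake up and update shared state for the other.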
2013-09-11	mm: munlock: manual pte walk in fast path instead of follow_page_mask()	Vlastimil Babka	2	-37/+85
	Currently munlock_vma_pages_range() calls follow_page_mask() to obtain each individual struct page. This entails repeated full page table translations and page table lock taken for each page separately. This patch avoids the costly follow_page_mask() where possible, by iterating over ptes within single pmd under single page table lock. The first pte is obtained by get_locked_pte() for non-THP page acquired by the initial follow_page_mask(). The rest of the on-stack pagevec for munlock is filled up using pte_walk as long as pte_present() and vm_normal_page() are sufficient to obtain the struct page. After this patch, a 14% speedup was measured for munlocking a 56GB large memory area with THP disabled. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jörn Engel <joern@logfs.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: munlock: remove redundant get_page/put_page pair on the fast path	Vlastimil Babka	1	-12/+14
	The performance of the fast path in munlock_vma_range() can be further improved by avoiding atomic ops of a redundant get_page()/put_page() pair. When calling get_page() during page isolation, we already have the pin from follow_page_mask(). This pin will be then returned by __pagevec_lru_add(), after which we do not reference the pages anymore. After this patch, an 8% speedup was measured for munlocking a 56GB large memory area with THP disabled. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: munlock: bypass per-cpu pvec for putback_lru_page	Vlastimil Babka	1	-4/+69
	After introducing batching by pagevecs into munlock_vma_range(), we can further improve performance by bypassing the copying into the per-cpu pagevec and the get_page/put_page pair associated with that. Instead we perform LRU putback directly from our pagevec. However, this is possible only for single-mapped pages that are evictable after munlock. Unevictable pages require rechecking after putting on the unevictable list, so for those we fall back to putback_lru_page(), which handles that. After this patch, a 13% speedup was measured for munlocking a 56GB large memory area with THP disabled. [akpm@linux-foundation.org:clarify comment] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: munlock: batch NR_MLOCK zone state updates	Vlastimil Babka	1	-3/+3
	Building on the previous patch, which introduced batched isolation in munlock_vma_range(), we can also batch the updates of NR_MLOCK page stats. After the whole pagevec is processed for page isolation, the stats are updated only once with the number of successful isolations. There were, however, no measurable performance gains. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: munlock: batch non-THP page isolation and munlock+putback using pagevec	Vlastimil Babka	1	-40/+156
	Currently, munlock_vma_range() calls munlock_vma_page on each page in a loop, which results in repeated taking and releasing of the lru_lock spinlock for isolating pages one by one. This patch batches the munlock operations using an on-stack pagevec, so that isolation is done under single lru_lock. For THP pages, the old behavior is preserved as they might be split while putting them into the pagevec. After this patch, a 9% speedup was measured for munlocking a 56GB large memory area with THP disabled. A new function __munlock_pagevec() is introduced that takes a pagevec and: 1) It clears PageMlocked and isolates all pages under lru_lock. Zone page stats can be also updated using the variant which assumes disabled interrupts. 2) It finishes the munlock and lru putback on all pages under their lock_page. Note that previously, lock_page covered also the PageMlocked clearing and page isolation, but it is not needed for those operations. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
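The effect of batching on lock traffic can be modeled by counting lock acquisitions. This is a toy sketch, not the kernel implementation: the `toy_*` names are hypothetical, the lock is just a counter, and the batch size of 14 is an assumption chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGEVEC_SIZE 14	/* assumed batch size for this sketch */

struct toy_lru {
	unsigned long lock_acquisitions;	/* how often the lock was taken */
	unsigned long isolated;			/* pages isolated so far */
};

void toy_lock(struct toy_lru *lru)   { lru->lock_acquisitions++; }
void toy_unlock(struct toy_lru *lru) { (void)lru; }

/* Old scheme: one lock round trip per page. */
void toy_munlock_one_by_one(struct toy_lru *lru, size_t npages)
{
	assert(lru);
	for (size_t i = 0; i < npages; i++) {
		toy_lock(lru);
		lru->isolated++;
		toy_unlock(lru);
	}
}

/* Batched scheme: one lock round trip per batch of pages. */
void toy_munlock_batched(struct toy_lru *lru, size_t npages)
{
	size_t done = 0;

	assert(lru);
	while (done < npages) {
		size_t n = npages - done;

		if (n > TOY_PAGEVEC_SIZE)
			n = TOY_PAGEVEC_SIZE;
		toy_lock(lru);
		lru->isolated += n;	/* whole batch handled under one lock */
		toy_unlock(lru);
		done += n;
	}
}
```

For 100 pages the one-by-one variant takes the lock 100 times, while the batched variant takes it 8 times (seven full batches of 14 plus one batch of 2). The same structure also allows a single statistics update per batch instead of one per page.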