summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2010-10-26mm: add account_page_writeback()Michael Rubin1-1/+12
To help developers and applications gain visibility into writeback behaviour this patch adds two counters to /proc/vmstat. # grep nr_dirtied /proc/vmstat nr_dirtied 3747 # grep nr_written /proc/vmstat nr_written 3618 These entries allow user apps to understand writeback behaviour over time and learn how it is impacting their performance. Currently there is no way to inspect dirty and writeback speed over time. It's not possible for nr_dirty/nr_writeback. These entries are necessary to give visibility into writeback behaviour. We have /proc/diskstats which lets us understand the io in the block layer. We have blktrace for more in depth understanding. We have e2fsprogs and debugsfs to give insight into the file systems behaviour, but we don't offer our users the ability understand what writeback is doing. There is no way to know how active it is over the whole system, if it's falling behind or to quantify it's efforts. With these values exported users can easily see how much data applications are sending through writeback and also at what rates writeback is processing this data. Comparing the rates of change between the two allow developers to see when writeback is not able to keep up with incoming traffic and the rate of dirty memory being sent to the IO back end. This allows folks to understand their io workloads and track kernel issues. Non kernel engineers at Google often use these counters to solve puzzling performance problems. Patch #4 adds a pernode vmstat file with nr_dirtied and nr_written Patch #5 add writeback thresholds to /proc/vmstat Currently these values are in debugfs. But they should be promoted to /proc since they are useful for developers who are writing databases and file servers and are not debugging the kernel. The output is as below: # grep threshold /proc/vmstat nr_pages_dirty_threshold 409111 nr_pages_dirty_background_threshold 818223 This patch: This allows code outside of the mm core to safely manipulate page writeback state and not worry about the other accounting. Not using these routines means that some code will lose track of the accounting and we get bugs. Modify nilfs2 to use interface. Signed-off-by: Michael Rubin <mrubin@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Jiro SEKIBA <jir@unicus.jp> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm/mempolicy.c: check return code of check_rangeVasiliy Kulikov1-1/+4
Function check_range may return ERR_PTR(...). Check for it. Signed-off-by: Vasiliy Kulikov <segooon@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26vmscan: prevent background aging of anon page in no swap systemMinchan Kim1-1/+16
Ying Han reported that backing aging of anon pages in no swap system causes unnecessary TLB flush. When I sent a patch(69c8548175), I wanted this patch but Rik pointed out and allowed aging of anon pages to give a chance to promote from inactive to active LRU. It has a two problem. 1) non-swap system Never make sense to age anon pages. 2) swap configured but still doesn't swapon It doesn't make sense to age anon pages until swap-on time. But it's arguable. If we have aged anon pages by swapon, VM have moved anon pages from active to inactive. And in the time swapon by admin, the VM can't reclaim hot pages so we can protect hot pages swapout. But let's think about it. When does swap-on happen? It depends on admin. we can't expect it. Nonetheless, we have done aging of anon pages to protect hot pages swapout. It means we lost run time overhead when below high watermark but gain hot page swap-[in/out] overhead when VM decide swapout. Is it true? Let's think more detail. We don't promote anon pages in case of non-swap system. So even though VM does aging of anon pages, the pages would be in inactive LRU for a long time. It means many of pages in there would mark access bit again. So access bit hot/code separation would be pointless. This patch prevents unnecessary anon pages demotion in not-yet-swapon and non-configured swap system. Even, in non-configuared swap system inactive_anon_is_low can be compiled out. It could make side effect that hot anon pages could swap out when admin does swap on. But I think sooner or later it would be steady state. So it's not a big problem. We could lose someting but gain more thing(TLB flush and unnecessary function call to demote anon pages). Signed-off-by: Ying Han <yinghan@google.com> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26memory hotplug: unify is_removable and offline detection codeKAMEZAWA Hiroyuki2-36/+68
Now, sysfs interface of memory hotplug shows whether the section is removable or not. But it checks only migrateype of pages and doesn't check details of cluster of pages. Next, memory hotplug's set_migratetype_isolate() has the same kind of check, too. This patch adds the function __count_unmovable_pages() and makes above 2 checks to use the same logic. Then, is_removable and hotremove code uses the same logic. No changes in the hotremove logic itself. TODO: need to find a way to check RECLAMABLE. But, considering bit, calling shrink_slab() against a range before starting memory hotremove sounds better. If so, this patch's logic doesn't need to be changed. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reported-by: Michal Hocko <mhocko@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26memory hotplug: fix notifier's return value checkKAMEZAWA Hiroyuki1-1/+1
Even if notifier cannot find any pages, it doesn't mean no pages are available...And, if there are no notifiers registered, this condition will be always true and memory hotplug will show -EBUSY. This is a bug but not critical. In most case, a pageblock which will be offlined is MIGRATE_MOVABLE This "notifier" is called only when the pageblock is _not_ MIGRATE_MOVABLE. But if not MIGRATE_MOVABLE, it's common case that memory hotplug will fail. So, no one notice this bug. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm: compaction: fix COMPACTPAGEFAILED countingMinchan Kim4-7/+18
Presently update_nr_listpages() doesn't have a role. That's because lists passed is always empty just after calling migrate_pages. The migrate_pages cleans up page list which have failed to migrate before returning by aaa994b3. [PATCH] page migration: handle freeing of pages in migrate_pages() Do not leave pages on the lists passed to migrate_pages(). Seems that we will not need any postprocessing of pages. This will simplify the handling of pages by the callers of migrate_pages(). At that time, we thought we don't need any postprocessing of pages. But the situation is changed. The compaction need to know the number of failed to migrate for COMPACTPAGEFAILED stat This patch makes new rule for caller of migrate_pages to call putback_lru_pages. So caller need to clean up the lists so it has a chance to postprocess the pages. [suggested by Christoph Lameter] Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm: only build per-node scan_unevictable functions when NUMA is enabledThadeu Lima de Souza Cascardo1-1/+2
Non-NUMA systems do never create these files anyway, since they are only created by driver subsystem when NUMA is configured. [akpm@linux-foundation.org: cleanup] Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26writeback: remove nonblocking/encountered_congestion referencesWu Fengguang2-2/+0
This removes more dead code that was somehow missed by commit 0d99519efef (writeback: remove unused nonblocking and congestion checks). There are no behavior change except for the removal of two entries from one of the ext4 tracing interface. The nonblocking checks in ->writepages are no longer used because the flusher now prefer to block on get_request_wait() than to skip inodes on IO congestion. The latter will lead to more seeky IO. The nonblocking checks in ->writepage are no longer used because it's redundant with the WB_SYNC_NONE check. We no long set ->nonblocking in VM page out and page migration, because a) it's effectively redundant with WB_SYNC_NONE in current code b) it's old semantic of "Don't get stuck on request queues" is mis-behavior: that would skip some dirty inodes on congestion and page out others, which is unfair in terms of LRU age. Inspired by Christoph Hellwig. Thanks! Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: David Howells <dhowells@redhat.com> Cc: Sage Weil <sage@newdream.net> Cc: Steve French <sfrench@samba.org> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26oom: kill all threads sharing oom killed task's mmDavid Rientjes1-0/+24
It's necessary to kill all threads that share an oom killed task's mm if the goal is to lead to future memory freeing. This patch reintroduces the code removed in 8c5cd6f3 (oom: oom_kill doesn't kill vfork parent (or child)) since it is obsoleted. It's now guaranteed that any task passed to oom_kill_task() does not share an mm with any thread that is unkillable. Thus, we're safe to issue a SIGKILL to any thread sharing the same mm. This is especially necessary to solve an mm->mmap_sem livelock issue whereas an oom killed thread must acquire the lock in the exit path while another thread is holding it in the page allocator while trying to allocate memory itself (and will preempt the oom killer since a task was already killed). Since tasks with pending fatal signals are now granted access to memory reserves, the thread holding the lock may quickly allocate and release the lock so that the oom killed task may exit. This mainly is for threads that are cloned with CLONE_VM but not CLONE_THREAD, so they are in a different thread group. Non-NPTL threads exist in the wild and this change is necessary to prevent the livelock in such cases. We care more about preventing the livelock than incurring the additional tasklist in the oom killer when a task has been killed. Systems that are sufficiently large to not want the tasklist scan in the oom killer in the first place already have the option of enabling /proc/sys/vm/oom_kill_allocating_task, which was designed specifically for that purpose. This code had existed in the oom killer for over eight years dating back to the 2.4 kernel. [akpm@linux-foundation.org: add nice comment] Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26oom: avoid killing a task if a thread sharing its mm cannot be killedDavid Rientjes1-4/+5
The oom killer's goal is to kill a memory-hogging task so that it may exit, free its memory, and allow the current context to allocate the memory that triggered it in the first place. Thus, killing a task is pointless if other threads sharing its mm cannot be killed because of its /proc/pid/oom_adj or /proc/pid/oom_score_adj value. This patch checks whether any other thread sharing p->mm has an oom_score_adj of OOM_SCORE_ADJ_MIN. If so, the thread cannot be killed and oom_badness(p) returns 0, meaning it's unkillable. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm, page-allocator: do not check the state of a non-existant buddy during freeMel Gorman1-1/+1
There is a bug in commit 6dda9d55 ("page allocator: reduce fragmentation in buddy allocator by adding buddies that are merging to the tail of the free lists") that means a buddy at order MAX_ORDER is checked for merging. A page of this order never exists so at times, an effectively random piece of memory is being checked. Alan Curry has reported that this is causing memory corruption in userspace data on a PPC32 platform (http://lkml.org/lkml/2010/10/9/32). It is not clear why this is happening. It could be a cache coherency problem where pages mapped in both user and kernel space are getting different cache lines due to the bad read from kernel space (http://lkml.org/lkml/2010/10/13/179). It could also be that there are some special registers being io-remapped at the end of the memmap array and that a read has special meaning on them. Compiler bugs have been ruled out because the assembly before and after the patch looks relatively harmless. This patch fixes the problem by ensuring we are not reading a possibly invalid location of memory. It's not clear why the read causes corruption but one way or the other it is a buggy read. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Corrado Zoccolo <czoccolo@gmail.com> Reported-by: Alan Curry <pacman@kosh.dhis.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm: fix return value of scan_lru_pages in memory unplugKAMEZAWA Hiroyuki1-1/+1
scan_lru_pages returns pfn. So, it's type should be "unsigned long" not "int". Note: I guess this has been work until now because memory hotplug tester's machine has not very big memory.... physical address < 32bit << PAGE_SHIFT. Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26Merge branch 'hwpoison' of ↵Linus Torvalds5-156/+514
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (22 commits) Add _addr_lsb field to ia64 siginfo Fix migration.c compilation on s390 HWPOISON: Remove retry loop for try_to_unmap HWPOISON: Turn addr_valid from bitfield into char HWPOISON: Disable DEBUG by default HWPOISON: Convert pr_debugs to pr_info HWPOISON: Improve comments in memory-failure.c x86: HWPOISON: Report correct address granuality for huge hwpoison faults Encode huge page size for VM_FAULT_HWPOISON errors Fix build error with !CONFIG_MIGRATION hugepage: move is_hugepage_on_freelist inside ifdef to avoid warning Clean up __page_set_anon_rmap HWPOISON, hugetlb: fix unpoison for hugepage HWPOISON, hugetlb: soft offlining for hugepage HWPOSION, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASED hugetlb: move refcounting in hugepage allocation inside hugetlb_lock HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page() hugetlb: hugepage migration core hugetlb: redefine hugepage copy functions hugetlb: add allocate function for hugepage migration ...
2010-10-25[S390] add support for nonquiescing sskeMartin Schwidefsky1-2/+2
Improve performance of the sske operation by using the nonquiescing variant if the affected page has no mappings established. On machines with no support for the new sske variant the mask bit will be ignored. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2010-10-24Merge branch 'for-next' of ↵Linus Torvalds2-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) Update broken web addresses in arch directory. Update broken web addresses in the kernel. Revert "drivers/usb: Remove unnecessary return's from void functions" for musb gadget Revert "Fix typo: configuation => configuration" partially ida: document IDA_BITMAP_LONGS calculation ext2: fix a typo on comment in ext2/inode.c drivers/scsi: Remove unnecessary casts of private_data drivers/s390: Remove unnecessary casts of private_data net/sunrpc/rpc_pipe.c: Remove unnecessary casts of private_data drivers/infiniband: Remove unnecessary casts of private_data drivers/gpu/drm: Remove unnecessary casts of private_data kernel/pm_qos_params.c: Remove unnecessary casts of private_data fs/ecryptfs: Remove unnecessary casts of private_data fs/seq_file.c: Remove unnecessary casts of private_data arm: uengine.c: remove C99 comments arm: scoop.c: remove C99 comments Fix typo configue => configure in comments Fix typo: configuation => configuration Fix typo interrest[ing|ed] => interest[ing|ed] Fix various typos of valid in comments ... Fix up trivial conflicts in: drivers/char/ipmi/ipmi_si_intf.c drivers/usb/gadget/rndis.c net/irda/irnet/irnet_ppp.c
2010-10-24Merge branch 'for-linus' of ↵Linus Torvalds2-373/+419
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: (27 commits) SLUB: Fix memory hotplug with !NUMA slub: Move functions to reduce #ifdefs slub: Enable sysfs support for !CONFIG_SLUB_DEBUG SLUB: Optimize slab_free() debug check slub: Move NUMA-related functions under CONFIG_NUMA slub: Add lock release annotation slub: Fix signedness warnings slub: extract common code to remove objects from partial list without locking SLUB: Pass active and inactive redzone flags instead of boolean to debug functions slub: reduce differences between SMP and NUMA Revert "Slub: UP bandaid" percpu: clear memory allocated with the km allocator percpu: use percpu allocator on UP too percpu: reduce PCPU_MIN_UNIT_SIZE to 32k vmalloc: pcpu_get/free_vm_areas() aren't needed on UP SLUB: Fix merged slab cache names Slub: UP bandaid slub: fix SLUB_RESILIENCY_TEST for dynamic kmalloc caches slub: Fix up missing kmalloc_cache -> kmem_cache_node case for memoryhotplug slub: Add dummy functions for the !SLUB_DEBUG case ...
2010-10-24Merge branch 'kvm-updates/2.6.37' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-0/+13
* 'kvm-updates/2.6.37' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (321 commits) KVM: Drop CONFIG_DMAR dependency around kvm_iommu_map_pages KVM: Fix signature of kvm_iommu_map_pages stub KVM: MCE: Send SRAR SIGBUS directly KVM: MCE: Add MCG_SER_P into KVM_MCE_CAP_SUPPORTED KVM: fix typo in copyright notice KVM: Disable interrupts around get_kernel_ns() KVM: MMU: Avoid sign extension in mmu_alloc_direct_roots() pae root address KVM: MMU: move access code parsing to FNAME(walk_addr) function KVM: MMU: audit: check whether have unsync sps after root sync KVM: MMU: audit: introduce audit_printk to cleanup audit code KVM: MMU: audit: unregister audit tracepoints before module unloaded KVM: MMU: audit: fix vcpu's spte walking KVM: MMU: set access bit for direct mapping KVM: MMU: cleanup for error mask set while walk guest page table KVM: MMU: update 'root_hpa' out of loop in PAE shadow path KVM: x86 emulator: Eliminate compilation warning in x86_decode_insn() KVM: x86: Fix constant type in kvm_get_time_scale KVM: VMX: Add AX to list of registers clobbered by guest switch KVM guest: Move a printk that's using the clock before it's ready KVM: x86: TSC catchup mode ...
2010-10-24Merge branch 'master' into for-linusPekka Enberg13-504/+851
Conflicts: include/linux/percpu.h mm/percpu.c
2010-10-24export __get_user_pages_fast() functionXiao Guangrong1-0/+13
This function is used by KVM to pin process's page in the atomic context. Define the 'weak' function to avoid other architecture not support it Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-10-22Merge branch 'for-linus' of ↵Linus Torvalds6-206/+250
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: update comments to reflect that percpu allocations are always zero-filled percpu: Optimize __get_cpu_var() x86, percpu: Optimize this_cpu_ptr percpu: clear memory allocated with the km allocator percpu: fix build breakage on s390 and cleanup build configuration tests percpu: use percpu allocator on UP too percpu: reduce PCPU_MIN_UNIT_SIZE to 32k vmalloc: pcpu_get/free_vm_areas() aren't needed on UP Fixed up trivial conflicts in include/linux/percpu.h
2010-10-22Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds1-2/+0
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: remove in_workqueue_context() workqueue: Clarify that schedule_on_each_cpu is synchronous memory_hotplug: drop spurious calls to flush_scheduled_work() shpchp: update workqueue usage pciehp: update workqueue usage isdn/eicon: don't call flush_scheduled_work() from diva_os_remove_soft_isr() workqueue: add and use WQ_MEM_RECLAIM flag workqueue: fix HIGHPRI handling in keep_working() workqueue: add queue_work and activate_work trace points workqueue: prepare for more tracepoints workqueue: implement flush[_delayed]_work_sync() workqueue: factor out start_flush_work() workqueue: cleanup flush/cancel functions workqueue: implement alloc_ordered_workqueue() Fix up trivial conflict in fs/gfs2/main.c as per Tejun
2010-10-22Merge branch 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds1-3/+3
* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits) xen-blkfront: disable barrier/flush write support Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c block: remove BLKDEV_IFL_WAIT aic7xxx_old: removed unused 'req' variable block: remove the BH_Eopnotsupp flag block: remove the BLKDEV_IFL_BARRIER flag block: remove the WRITE_BARRIER flag swap: do not send discards as barriers fat: do not send discards as barriers ext4: do not send discards as barriers jbd2: replace barriers with explicit flush / FUA usage jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier jbd: replace barriers with explicit flush / FUA usage nilfs2: replace barriers with explicit flush / FUA usage reiserfs: replace barriers with explicit flush / FUA usage gfs2: replace barriers with explicit flush / FUA usage btrfs: replace barriers with explicit flush / FUA usage xfs: replace barriers with explicit flush / FUA usage block: pass gfp_mask and flags to sb_issue_discard dm: convey that all flushes are processed as empty ...
2010-10-22Merge branch 'hwpoison-hugepages' into hwpoisonAndi Kleen5-115/+482
Conflicts: mm/memory-failure.c
2010-10-22Merge branch 'hwpoison-cleanups' into hwpoisonAndi Kleen1-41/+32
2010-10-21Merge branch 'core-memblock-for-linus' of ↵Linus Torvalds4-316/+631
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-memblock-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (74 commits) x86-64: Only set max_pfn_mapped to 512 MiB if we enter via head_64.S xen: Cope with unmapped pages when initializing kernel pagetable memblock, bootmem: Round pfn properly for memory and reserved regions memblock: Annotate memblock functions with __init_memblock memblock: Allow memblock_init to be called early memblock/arm: Fix memblock_region_is_memory() typo x86, memblock: Remove __memblock_x86_find_in_range_size() memblock: Fix wraparound in find_region() x86-32, memblock: Make add_highpages honor early reserved ranges x86, memblock: Fix crashkernel allocation arm, memblock: Fix the sparsemem build memblock: Fix section mismatch warnings powerpc, memblock: Fix memblock API change fallout memblock, microblaze: Fix memblock API change fallout x86: Remove old bootmem code x86, memblock: Use memblock_memory_size()/memblock_free_memory_size() to get correct dma_reserve x86: Remove not used early_res code x86, memblock: Replace e820_/_early string with memblock_ x86: Use memblock to replace early_res x86, memblock: Use memblock_debug to control debug message print out ... Fix up trivial conflicts in arch/x86/kernel/setup.c and kernel/Makefile
2010-10-21Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds2-1/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86-32, percpu: Correct the ordering of the percpu readmostly section x86, mm: Enable ARCH_DMA_ADDR_T_64BIT with X86_64 || HIGHMEM64G x86: Spread tlb flush vector between nodes percpu: Introduce a read-mostly percpu API x86, mm: Fix incorrect data type in vmalloc_sync_all() x86, mm: Hold mm->page_table_lock while doing vmalloc_sync x86, mm: Fix bogus whitespace in sync_global_pgds() x86-32: Fix sparse warning for the __PHYSICAL_MASK calculation x86, mm: Add RESERVE_BRK_ARRAY() helper mm, x86: Saving vmcore with non-lazy freeing of vmas x86, kdump: Change copy_oldmem_page() to use cached addressing x86, mm: fix uninitialized addr in kernel_physical_mapping_init() x86, kmemcheck: Remove double test x86, mm: Make spurious_fault check explicitly check the PRESENT bit x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping changes x86, mm: Separate x86_64 vmalloc_sync_all() into separate functions x86, mm: Avoid unnecessary TLB flush
2010-10-19memory_hotplug: drop spurious calls to flush_scheduled_work()Tejun Heo1-2/+0
lru_add_drain_all() uses schedule_on_each_cpu() which is synchronous. There is no reason to call flush_scheduled_work() after lru_add_drain_all(). Drop the spurious calls. This is to prepare for the deprecation and removal of flush_scheduled_work(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Mel Gorman <mel@csn.ul.ie>
2010-10-19Merge branch 'v2.6.36-rc8' into for-2.6.37/barrierJens Axboe23-212/+319
Conflicts: block/blk-core.c drivers/block/loop.c mm/swapfile.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-11Merge branch 'x86/urgent' into core/memblockH. Peter Anvin3-11/+15
Reason for merge: Forward-port urgent change to arch/x86/mm/srat_64.c to the memblock tree. Resolved Conflicts: arch/x86/mm/srat_64.c Originally-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2010-10-11memblock: Annotate memblock functions with __init_memblockYinghai Lu1-2/+2
Stephen found WARNING: mm/built-in.o(.text+0x25ab8): Section mismatch in reference from the function memblock_find_base() to the function .init.text:memblock_find_region() The function memblock_find_base() references the function __init memblock_find_region(). This is often because memblock_find_base lacks a __init annotation or the annotation of memblock_find_region is wrong. So let memblock_find_region() to use __init_memblock instead of __init directly. Also fix one function that did not have __init* to be __init_memblock. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4CB366B1.40405@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2010-10-11memblock: Allow memblock_init to be called earlyJeremy Fitzhardinge1-0/+6
The Xen setup code needs to call memblock_x86_reserve_range() very early, so allow it to initialize the memblock subsystem before doing so. The second memblock_init() is ignored. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> LKML-Reference: <4CACFDAD.3090900@goop.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2010-10-11Fix migration.c compilation on s390Andi Kleen1-0/+2
31bit s390 doesn't have huge pages and failed with: > mm/migrate.c: In function 'remove_migration_pte': > mm/migrate.c:143:3: error: implicit declaration of function 'pte_mkhuge' > mm/migrate.c:143:7: error: incompatible types when assigning to type 'pte_t' from type 'int' Put that code into a ifdef. Reported by Heiko Carstens Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08HWPOISON: Remove retry loop for try_to_unmapAndi Kleen1-14/+1
We don't reply in other temporary failure cases and there were no reports of replies happening. I think the original reason it was added was also just an early bug, not an observation of the race. So remove the loop for now, but keep a warning message. Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08HWPOISON: Turn addr_valid from bitfield into charAndi Kleen1-1/+1
The addr_valid flag is the only flag in "to_kill" and it's slightly more efficient to have it as char instead of a bitfield. Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08HWPOISON: Disable DEBUG by defaultAndi Kleen1-1/+0
Now that only a few obscure messages are left as pr_debug disable outputting of pr_debug in memory-failure.c by default. Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08HWPOISON: Convert pr_debugs to pr_infoAndi Kleen1-12/+12
Convert a lot of pr_debugs in memory-failure.c that are generally useful to pr_info. It's reasonable to print at least one message why offlining succeeded or failed by default. Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08HWPOISON: Improve comments in memory-failure.cAndi Kleen1-13/+18
Clean up and improve the overview comment in memory-failure.c Tidy some grammar issues in other comments. Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08Encode huge page size for VM_FAULT_HWPOISON errorsAndi Kleen2-3/+6
This fixes a problem introduced with the hugetlb hwpoison handling The user space SIGBUS signalling wants to know the size of the hugepage that caused a HWPOISON fault. Unfortunately the architecture page fault handlers do not have easy access to the struct page. Pass the information out in the fault error code instead. I added a separate VM_FAULT_HWPOISON_LARGE bit for this case and encode the hpage index in some free upper bits of the fault code. The small page hwpoison keeps stays with the VM_FAULT_HWPOISON name to minimize changes. Also add code to hugetlb.h to convert that index into a page shift. Will be used in a further patch. Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: fengguang.wu@intel.com Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08hugepage: move is_hugepage_on_freelist inside ifdef to avoid warningAndi Kleen1-1/+2
Fixes warning reported by Stephen Rothwell mm/hugetlb.c:2950: warning: 'is_hugepage_on_freelist' defined but not used for the !CONFIG_MEMORY_FAILURE case. Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08Clean up __page_set_anon_rmapAndi Kleen1-17/+8
Linus asked for a cleanup of __page_set_anon_rmap to make it look more like the cleaner huge pages version. Factor out the duplicated PageAnon check into a single check at the beginning of the function. Remove obsolete comments and rewrite them into standard English. No functional changes. Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08HWPOISON, hugetlb: fix unpoison for hugepageNaoya Horiguchi1-2/+2
Currently unpoisoning hugepages doesn't work correctly because clearing PG_HWPoison is done outside if (TestClearPageHWPoison). This patch fixes it. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08HWPOISON, hugetlb: soft offlining for hugepageNaoya Horiguchi1-4/+55
This patch extends soft offlining framework to support hugepage. When memory corrected errors occur repeatedly on a hugepage, we can choose to stop using it by migrating data onto another hugepage and disabling the original (maybe half-broken) one. ChangeLog since v4: - branch soft_offline_page() for hugepage ChangeLog since v3: - remove comment about "ToDo: hugepage soft-offline" ChangeLog since v2: - move refcount handling into isolate_lru_page() ChangeLog since v1: - add double check in isolating hwpoisoned hugepage - define free/non-free checker for hugepage - postpone calling put_page() for hugepage in soft_offline_page() Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08HWPOSION, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASEDNaoya Horiguchi2-1/+33
Currently error recovery for free hugepage works only for MF_COUNT_INCREASED. This patch enables !MF_COUNT_INCREASED case. Free hugepages can be handled directly by alloc_huge_page() and dequeue_hwpoisoned_huge_page(), and both of them are protected by hugetlb_lock, so there is no race between them. Note that this patch defines the refcount of HWPoisoned hugepage dequeued from freelist is 1, deviated from present 0, thereby we can avoid race between unpoison and memory failure on free hugepage. This is reasonable because unlikely to free buddy pages, free hugepage is governed by hugetlbfs even after error handling finishes. And it also makes unpoison code added in the later patch cleaner. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08hugetlb: move refcounting in hugepage allocation inside hugetlb_lockNaoya Horiguchi1-22/+13
Currently alloc_huge_page() raises page refcount outside hugetlb_lock. but it causes race when dequeue_hwpoison_huge_page() runs concurrently with alloc_huge_page(). To avoid it, this patch moves set_page_refcounted() in hugetlb_lock. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page()Naoya Horiguchi2-6/+29
This check is necessary to avoid race between dequeue and allocation, which can cause a free hugepage to be dequeued twice and get kernel unstable. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08hugetlb: hugepage migration coreNaoya Horiguchi2-19/+231
This patch extends page migration code to support hugepage migration. One of the potential users of this feature is soft offlining which is triggered by memory corrected errors (added by the next patch.) Todo: - there are other users of page migration such as memory policy, memory hotplug and memocy compaction. They are not ready for hugepage support for now. ChangeLog since v4: - define migrate_huge_pages() - remove changes on isolation/putback_lru_page() ChangeLog since v2: - refactor isolate/putback_lru_page() to handle hugepage - add comment about race on unmap_and_move_huge_page() ChangeLog since v1: - divide migration code path for hugepage - define routine checking migration swap entry for hugetlb - replace "goto" with "if/else" in remove_migration_pte() Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08hugetlb: redefine hugepage copy functionsNaoya Horiguchi1-5/+40
This patch modifies hugepage copy functions to have only destination and source hugepages as arguments for later use. The old ones are renamed from copy_{gigantic,huge}_page() to copy_user_{gigantic,huge}_page(). This naming convention is consistent with that between copy_highpage() and copy_user_highpage(). ChangeLog since v4: - add blank line between local declaration and code - remove unnecessary might_sleep() ChangeLog since v2: - change copy_huge_page() from macro to inline dummy function to avoid compile warning when !CONFIG_HUGETLB_PAGE. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08hugetlb: add allocate function for hugepage migrationNaoya Horiguchi1-25/+54
We can't use existing hugepage allocation functions to allocate hugepage for page migration, because page migration can happen asynchronously with the running processes and page migration users should call the allocation function with physical addresses (not virtual addresses) as arguments. ChangeLog since v3: - unify alloc_buddy_huge_page() and alloc_buddy_huge_page_node() ChangeLog since v2: - remove unnecessary get/put_mems_allowed() (thanks to David Rientjes) ChangeLog since v1: - add comment on top of alloc_huge_page_no_vma() Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08hugetlb: fix metadata corruption in hugetlb_fault()Naoya Horiguchi1-12/+9
Since the PageHWPoison() check is for avoiding hwpoisoned page remained in pagecache mapping to the process, it should be done in "found in pagecache" branch, not in the common path. Otherwise, metadata corruption occurs if memory failure happens between alloc_huge_page() and lock_page() because page fault fails with metadata changes remained (such as refcount, mapcount, etc.) This patch moves the check to "found in pagecache" branch and fix the problem. ChangeLog since v2: - remove retry check in "new allocation" path. - make description more detailed - change patch name from "HWPOISON, hugetlb: move PG_HWPoison bit check" Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08Merge commit 'v2.6.36-rc7' into core/memblockIngo Molnar20-181/+269
Merge reason: Update from -rc3 to -rc7. Signed-off-by: Ingo Molnar <mingo@elte.hu>