summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2008-02-05page migraton: handle orphaned pagesShaohua Li2-7/+25
Orphaned page might have fs-private metadata, the page is truncated. As the page hasn't mapping, page migration refuse to migrate the page. It appears the page is only freed in page reclaim and if zone watermark is low, the page is never freed, as a result migration always fail. I thought we could free the metadata so such page can be freed in migration and make migration more reliable. [akpm@linux-foundation.org: go direct to try_to_free_buffers()] Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05Document lowmem_reserve_ratioYasunori Goto1-17/+63
Though the lower_zone_protection was changed to lowmem_reserve_ratio, the document has been not changed. The lowmem_reserve_ratio seems quite hard to estimate, but there is no guidance. This patch is to change document for it. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Andrea Arcangeli <andrea@cpushare.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05check ADVICE of fadvise64_64 even if get_xip_page is givenMasatake YAMATO1-2/+14
I've written some test programs in ltp project. During writing I met an problem which I cannot solve in user land. So I wrote a patch for linux kernel. Please, include this patch if acceptable. The test program tests the 4th parameter of fadvise64_64: long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice); My test case calls fadvise64_64 with invalid advice value and checks errno is set to EINVAL. About the advice parameter man page says: ... Permissible values for advice include: POSIX_FADV_NORMAL ... POSIX_FADV_SEQUENTIAL ... POSIX_FADV_RANDOM ... POSIX_FADV_NOREUSE ... POSIX_FADV_WILLNEED ... POSIX_FADV_DONTNEED ... ERRORS ... EINVAL An invalid value was specified for advice. However, I got a bug report that the system call invocations in my test case returned 0 unexpectedly. I've inspected the kernel code: asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice) { struct file *file = fget(fd); struct address_space *mapping; struct backing_dev_info *bdi; loff_t endbyte; /* inclusive */ pgoff_t start_index; pgoff_t end_index; unsigned long nrpages; int ret = 0; if (!file) return -EBADF; if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) { ret = -ESPIPE; goto out; } mapping = file->f_mapping; if (!mapping || len < 0) { ret = -EINVAL; goto out; } if (mapping->a_ops->get_xip_page) /* no bad return value, but ignore advice */ goto out; ... out: fput(file); return ret; } I found the advice parameter is just ignored in the case mapping->a_ops->get_xip_page is given. This behavior is different from what is written on the man page. Is this o.k.? get_xip_page is given if CONFIG_EXT2_FS_XIP is true. Anyway I cannot find the easy way to detect get_xip_page field is given or CONFIG_EXT2_FS_XIP is true from the user space. I propose the following patch which checks the advice parameter even if get_xip_page is given. Signed-off-by: Masatake YAMATO <yamato@redhat.com> Acked-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05Include count of pagecache pages in show_mem() outputLarry Woodman1-0/+2
The show_mem() output does not include the total number of pagecache pages. This would be helpful when analyzing the debug information in the /var/log/messages file after OOM kills occur. This patch includes the total pagecache pages in that output. Signed-off-by: Larry Woodman <lwoodman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05Fix dirty page accounting leak with ext3 data=journalBjorn Steinbrink1-2/+2
In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 ("Clean up and make try_to_free_buffers() not race with dirty pages"), try_to_free_buffers was changed to bail out if the page was dirty. That in turn caused truncate_complete_page to leak massive amounts of memory, because the dirty bit was only cleared after the call to try_to_free_buffers. So the call to cancel_dirty_page was moved up to have the dirty bit cleared early in 3e67c0987d7567ad666641164a153dca9a43b11d ("truncate: clear page dirtiness before running try_to_free_buffers()"). The problem with that fix is, that the page can be redirtied after cancel_dirty_page was called, eg. like this: truncate_complete_page() cancel_dirty_page() // PG_dirty cleared, decr. dirty pages do_invalidatepage() ext3_invalidatepage() journal_invalidatepage() journal_unmap_buffer() __dispose_buffer() __journal_unfile_buffer() __journal_temp_unlink_buffer() mark_buffer_dirty(); // PG_dirty set, incr. dirty pages And then we end up with dirty pages being wrongly accounted. As a result, in ecdfc9787fe527491baefc22dce8b2dbd5b2908d ("Resurrect 'try_to_free_buffers()' VM hackery") the changes to try_to_free_buffers were reverted, so the original reason for the massive memory leak is gone, and we can also revert the move of the call to cancel_dirty_page from truncate_complete_page and get the accounting right again. I'm not sure if it matters, but opposed to the final check in __remove_from_page_cache, this one also cares about the task io accounting, so maybe we want to use this instead, although it's not quite the clean fix either. Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de> Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl> Cc: Jan Kara <jack@ucw.cz> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Osterried <osterried@jesse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05set_page_refcounted() VM_BUG_ON fixQi Yong1-1/+1
The current PageTail semantic is that a PageTail page is first a PageCompound page. So remove the redundant PageCompound test in set_page_refcounted(). Signed-off-by: Qi Yong <qiyong@fc-cn.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05mm: remove fastcall from mm/Harvey Harrison7-23/+24
fastcall is always defined to be empty, remove it [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05page allocator: remove unused arguments in zone_init_free_lists()Andi Kleen1-3/+2
Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05skip writing data pages when inode is under I_SYNCQi Yong1-12/+1
Since I_SYNC was split out from I_LOCK, the concern in commit 4b89eed93e0fa40a63e3d7b1796ec1337ea7a3aa ("Write back inode data pages even when the inode itself is locked") is not longer valid. We should revert to the original behavior: in __writeback_single_inode(), when we find an I_SYNC-ed inode and we're not doing a data-integrity sync, skip writing entirely. Otherwise, we are double calling do_writepages() Signed-off-by: Qi Yong <qiyong@fc-cn.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Cc: Joern Engel <joern@wohnheim.fh-wedel.de> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Cc: Michael Rubin <mrubin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05mm: don't waste swap on locked pagesHugh Dickins1-1/+4
try_to_unmap always fails on a page found in a VM_LOCKED vma (unless migrating), and recycles it back to the active list. But if it's an anonymous page, we've already allocated swap to it: just wasting swap. Spot locked pages in page_referenced_one and treat them as referenced. Signed-off-by: Hugh Dickins <hugh@veritas.com> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ethan Solomita <solo@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05vmstat: remove prefetchChristoph Lameter1-9/+2
Remove the prefetch logic in order to avoid touching impossible per cpu areas. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Mike Travis <travis@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05Fix /proc dcache deadlock in do_exitAndrea Arcangeli1-1/+2
This patch fixes a sles9 system hang in start_this_handle from a customer with some heavy workload where all tasks are waiting on kjournald to commit the transaction, but kjournald waits on t_updates to go down to zero (it never does). This was reported as a lowmem shortage deadlock but when checking the debug data I noticed the VM wasn't under pressure at all (well it was really under vm pressure, because lots of tasks hanged in the VM prune_dcache methods trying to flush dirty inodes, but no task was hanging in GFP_NOFS mode, the holder of the journal handle should have if this was a vm issue in the first place). No task was apparently holding the leftover handle in the committing transaction, so I deduced t_updates was stuck to 1 because a journal_stop was never run by some path (this turned out to be correct). With a debug patch adding proper reverse links and stack trace logging in ext3 deployed in production, I found journal_stop is never run because mark_inode_dirty_sync is called inside release_task called by do_exit. (that was quite fun because I would have never thought about this subtleness, I thought a regular path in ext3 had a bug and it forgot to call journal_stop) do_exit->release_task->mark_inode_dirty_sync->schedule() (will never come back to run journal_stop) The reason is that shrink_dcache_parent is racy by design (feature not a bug) and it can do blocking I/O in some case, but the point is that calling shrink_dcache_parent at the last stage of do_exit isn't safe for self-reaping tasks. I guess the memory pressure of the unbalanced highmem system allowed to trigger this more easily. Now mainline doesn't have this line in iput (like sles9 has): if (inode->i_state & I_DIRTY_DELAYED) mark_inode_dirty_sync(inode); so it will probably not crash with ext3, but for example ext2 implements an I/O-blocking ext2_put_inode that will lead to similar screwups with ext2_free_blocks never coming back and it's definitely wrong to call blocking-IO paths inside do_exit. So this should fix a subtle bug in mainline too (not verified in practice though). The equivalent fix for ext3 is also not verified yet to fix the problem in sles9 but I don't have doubt it will (it usually takes days to crash, so it'll take weeks to be sure). An alternate fix would be to offload that work to a kernel thread, but I don't think a reschedule for this is worth it, the vm should be able to collect those entries for the synchronous release_task. Signed-off-by: Andrea Arcangeli <andrea@suse.de> Cc: Jan Kara <jack@ucw.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05mm/page-writeback: highmem_is_dirtyable optionBron Gondwana5-5/+47
Add vm.highmem_is_dirtyable toggle A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of approximately 2Gb size which contains a hash format that is written randomly by the dbclean process. On 2.6.16 this process took a few minutes. With lowmem only accounting of dirty ratios, this takes about 12 hours of 100% disk IO, all random writes. Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to add the highmem back to the total available memory count. [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build] Signed-off-by: Bron Gondwana <brong@fastmail.fm> Cc: Ethan Solomita <solo@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05Page allocator: get rid of the list of cold pagesChristoph Lameter3-49/+40
We have repeatedly discussed if the cold pages still have a point. There is one way to join the two lists: Use a single list and put the cold pages at the end and the hot pages at the beginning. That way a single list can serve for both types of allocations. The discussion of the RFC for this and Mel's measurements indicate that there may not be too much of a point left to having separate lists for hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2). Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Martin Bligh <mbligh@mbligh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05mm: don't allow ioremapping of ranges larger than vmalloc spaceRobert Bragg1-0/+4
When running with a 16M IOREMAP_MAX_ORDER (on armv7) we found that the vmlist search routine in __get_vm_area_node can mistakenly allow a driver to ioremap a range larger than vmalloc space. If at the time of the ioremap all existing vmlist areas sit below the determined alignment then the search routine continues past all entries and exits the for loop - straight into the found: label - without ever testing for integer wrapping or that the requested size fits. We were seeing a driver successfully ioremap 128M of flash even though there was only 120M of vmalloc space. From that point the system was left with the remainder of the first 16M of space to vmalloc/ioremap within. Signed-off-by: Robert Bragg <robert@sixbynine.org> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05vmstat: small revisions to refresh_cpu_vm_stats()Christoph Lameter1-4/+16
1. Add comments explaining how the function can be called. 2. Collect global diffs in a local array and only spill them once into the global counters when the zone scan is finished. This means that we only touch each global counter once instead of each time we fold cpu counters into zone counters. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05arch_rebalance_pgtables callMartin Schwidefsky1-1/+5
In order to change the layout of the page tables after an mmap has crossed the adress space limit of the current page table layout a architecture hook in get_unmapped_area is needed. The arguments are the address of the new mapping and the length of it. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05add mm argument to pte/pmd/pud/pgd_freeBenjamin Herrenschmidt43-151/+151
(with Martin Schwidefsky <schwidefsky@de.ibm.com>) The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as first argument. The free functions do not get the mm_struct argument. This is 1) asymmetrical and 2) to do mm related page table allocations the mm argument is needed on the free function as well. [kamalesh@linux.vnet.ibm.com: i386 fix] [akpm@linux-foundation.org: coding-syle fixes] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05Page allocator: clean up pcp draining functionsChristoph Lameter5-44/+48
- Add comments explaing how drain_pages() works. - Eliminate useless functions - Rename drain_all_local_pages to drain_all_pages(). It does drain all pages not only those of the local processor. - Eliminate useless interrupt off / on sequences. drain_pages() disables interrupts on its own. The execution thread is pinned to processor by the caller. So there is no need to disable interrupts. - Put drain_all_pages() declaration in gfp.h and remove the declarations from suspend.h and from mm/memory_hotplug.c - Make software suspend call drain_all_pages(). The draining of processor local pages is may not the right approach if software suspend wants to support SMP. If they call drain_all_pages then we can make drain_pages() static. [akpm@linux-foundation.org: fix build] Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Daniel Walker <dwalker@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05radix-tree: avoid atomic allocations for preloaded insertionsNick Piggin3-6/+11
Most pagecache (and some other) radix tree insertions have the great opportunity to preallocate a few nodes with relaxed gfp flags. But the preallocation is squandered when it comes time to allocate a node, we default to first attempting a GFP_ATOMIC allocation -- that doesn't normally fail, but it can eat into atomic memory reserves that we don't need to be using. Another upshot of this is that it removes the sometimes highly contended zone->lock from underneath tree_lock. Pagecache insertions are always performed with a radix tree preload, and after this change, such a situation will never fall back to kmem_cache_alloc within radix_tree_node_alloc. David Miller reports seeing this allocation fail on a highly threaded sparc64 system: [527319.459981] dd: page allocation failure. order:0, mode:0x20 [527319.460403] Call Trace: [527319.460568] [00000000004b71e0] __slab_alloc+0x1b0/0x6a8 [527319.460636] [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8 [527319.460698] [000000000055309c] radix_tree_node_alloc+0x20/0x90 [527319.460763] [0000000000553238] radix_tree_insert+0x12c/0x260 [527319.460830] [0000000000495cd0] add_to_page_cache+0x38/0xb0 [527319.460893] [00000000004e4794] mpage_readpages+0x6c/0x134 [527319.460955] [000000000049c7fc] __do_page_cache_readahead+0x170/0x280 [527319.461028] [000000000049cc88] ondemand_readahead+0x208/0x214 [527319.461094] [0000000000496018] do_generic_mapping_read+0xe8/0x428 [527319.461152] [0000000000497948] generic_file_aio_read+0x108/0x170 [527319.461217] [00000000004badac] do_sync_read+0x88/0xd0 [527319.461292] [00000000004bb5cc] vfs_read+0x78/0x10c [527319.461361] [00000000004bb920] sys_read+0x34/0x60 [527319.461424] [0000000000406294] linux_sparc_syscall32+0x3c/0x40 The calltrace is significant: __do_page_cache_readahead allocates a number of pages with GFP_KERNEL, and hence it should have reclaimed sufficient memory to satisfy GFP_ATOMIC allocations. However after the list of pages goes to mpage_readpages, there can be significant intervals (including disk IO) before all the pages are inserted into the radix-tree. So the reserves can easily be depleted at that point. The patch is confirmed to fix the problem. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05make __vmalloc_area_node() staticAdrian Bunk1-2/+2
__vmalloc_area_node() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05Remove unused code from mm/tiny-shmem.cBalbir Singh1-12/+0
This code in mm/tiny-shmem.c is under #if 0 - remove it. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05mm/page-writeback.c: make a function staticAdrian Bunk1-1/+1
task_dirty_limit() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05maps4: make page monitoring /proc file optionalMatt Mackall5-3/+20
Make /proc/ page monitoring configurable This puts the following files under an embedded config option: /proc/pid/clear_refs /proc/pid/smaps /proc/pid/pagemap /proc/kpagecount /proc/kpageflags [akpm@linux-foundation.org: Kconfig fix] Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05maps4: add /proc/kpageflags interfaceMatt Mackall1-1/+83
This makes a subset of physical page flags available to userspace. Together with /proc/pid/kpagemap, this allows tracking of a wide variety of VM behaviors. Exported flags are decoupled from the kernel's internal flags. This allows us to reorder flag bits, and synthesize any bits that get redefined in terms of other bits. [akpm@linux-foundation.org: remove unneeded access_ok()] [akpm@linux-foundation.org: s/0/NULL/] Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05maps4: add /proc/kpagecount interfaceMatt Mackall1-0/+50
This makes physical page map counts available to userspace. Together with /proc/pid/pagemap and /proc/pid/clear_refs, this can be used to monitor memory usage on a per-page basis. [akpm@linux-foundation.org: remove unneeded access_ok()] [bunk@stusta.de: make struct proc_kpagemap static] Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05maps4: add /proc/pid/pagemap interfaceMatt Mackall3-2/+204
This interface provides a mapping for each page in an address space to its physical page frame number, allowing precise determination of what pages are mapped and what pages are shared between processes. New in this version: - headers gone again (as recommended by Dave Hansen and Alan Cox) - 64-bit entries (as per discussion with Andi Kleen) - swap pte information exported (from Dave Hansen) - page walker callback for holes (from Dave Hansen) - direct put_user I/O (as suggested by Rusty Russell) This patch folds in cleanups and swap PTE support from Dave Hansen <haveblue@us.ibm.com>. Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05maps4: regroup task_mmu by interfaceMatt Mackall1-207/+210
Reorder source so that all the code and data for each interface is together. Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05maps4: move clear_refs code to task_mmu.cMatt Mackall4-54/+39
This puts all the clear_refs code where it belongs and probably lets things compile on MMU-less systems as well. Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05maps4: simplify interdependence of maps and smapsMatt Mackall1-26/+26
This pulls the shared map display code out of show_map and puts it in show_smap where it belongs. Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05maps4: use pagewalker in clear_refs and smapsMatt Mackall1-78/+17
Use the generic pagewalker for smaps and clear_refs Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05maps4: introduce a generic page walkerMatt Mackall3-1/+154
Introduce a general page table walker Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05maps4: move is_swap_pteMatt Mackall2-5/+6
Move is_swap_pte helper function to swapops.h for use by pagemap code Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05maps4: rework TASK_SIZE macrosDave Hansen6-4/+14
The following replaces the earlier patches sent. It should address David Rientjes's comments, and has been compile tested on all the architectures that it touches, save for parisc. For the /proc/<pid>/pagemap code[1], we need to able to query how much virtual address space a particular task has. The trick is that we do it through /proc and can't use TASK_SIZE since it references "current" on some arches. The process opening the /proc file might be a 32-bit process opening a 64-bit process's pagemap file. x86_64 already has a TASK_SIZE_OF() macro: #define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_IA32)) ? IA32_PAGE_OFFSET : TASK_SIZE64) I'd like to have that for other architectures. So, add it for all the architectures that actually use "current" in their TASK_SIZE. For the others, just add a quick #define in sched.h to use plain old TASK_SIZE. 1. http://www.linuxworld.com/news/2007/042407-kernel.html - MIPS portion from Ralf Baechle <ralf@linux-mips.org> [akpm@linux-foundation.org: fix mips build] Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Matt Mackall <mpm@selenic.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05maps4: add proportional set size accounting in smapsFengguang Wu1-1/+27
The "proportional set size" (PSS) of a process is the count of pages it has in memory, where each page is divided by the number of processes sharing it. So if a process has 1000 pages all to itself, and 1000 shared with one other process, its PSS will be 1500. - lwn.net: "ELC: How much memory are applications really using?" The PSS proposed by Matt Mackall is a very nice metic for measuring an process's memory footprint. So collect and export it via /proc/<pid>/smaps. Matt Mackall's pagemap/kpagemap and John Berthels's exmap can also do the job. They are comprehensive tools. But for PSS, let's do it in the simple way. Cc: John Berthels <jjberthels@gmail.com> Cc: Bernardo Innocenti <bernie@codewiz.org> Cc: Padraig Brady <P@draigBrady.com> Cc: Denys Vlasenko <vda.linux@googlemail.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Hugh Dickins <hugh@veritas.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05clean up vmtruncateChristoph Hellwig1-35/+34
vmtruncate is a twisted maze of gotos, this patch cleans it up to have a proper if else for the two major cases of extending and truncating truncate and thus makes it a lot more readable while keeping exactly the same functinality. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: fix shmem_swaplist racesHugh Dickins1-12/+25
Intensive swapoff testing shows shmem_unuse spinning on an entry in shmem_swaplist pointing to itself: how does that come about? Days pass... First guess is this: shmem_delete_inode tests list_empty without taking the global mutex (so the swapping case doesn't slow down the common case); but there's an instant in shmem_unuse_inode's list_move_tail when the list entry may appear empty (a rare case, because it's actually moving the head not the the list member). So there's a danger of leaving the inode on the swaplist when it's freed, then reinitialized to point to itself when reused. Fix that by skipping the list_move_tail when it's a no-op, which happens to plug this. But this same spinning then surfaces on another machine. Ah, I'd never suspected it, but shmem_writepage's swaplist manipulation is unsafe: though we still hold page lock, which would hold off inode deletion if the page were in pagecache, it doesn't hold off once it's in swapcache (free_swap_and_cache doesn't wait on locked pages). Hmm: we could put the the inode on swaplist earlier, but then shmem_unuse_inode could never prune unswapped inodes. Fix this with an igrab before dropping info->lock, as in shmem_unuse_inode; though I am a little uneasy about the iput which has to follow - it works, and I see nothing wrong with it, but it is surprising that shmem inode deletion may now occur below shmem_writepage. Revisit this fix later? And while we're looking at these races: the way shmem_unuse tests swapped without holding info->lock looks unsafe, if we've more than one swap area: a racing shmem_writepage on another page of the same inode could be putting it in swapcache, just as we're deciding to remove the inode from swaplist - there's a danger of going on swap without being listed, so a later swapoff would hang, being unable to locate the entry. Move that test and removal down into shmem_unuse_inode, once info->lock is held. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: radix_tree_preloadingHugh Dickins1-7/+18
Nick has observed that shmem.c still uses GFP_ATOMIC when adding to page cache or swap cache, without any radix tree preload: so tending to deplete emergency reserves of memory. GFP_ATOMIC remains appropriate in shmem_writepage's add_to_swap_cache: it's being called under memory pressure, so must not wait for more memory to become available. But shmem_unuse_inode now has a window in which it can and should preload with GFP_KERNEL, and say GFP_NOWAIT instead of GFP_ATOMIC in its add_to_page_cache. shmem_getpage is not so straightforward: its filepage/swappage integrity relies upon exchanging between caches under spinlock, and it would need a lot of restructuring to place the preloads correctly. Instead, follow its pattern of retrying on races: use GFP_NOWAIT instead of GFP_ATOMIC in add_to_page_cache, and begin each circuit of the repeat loop with a sleeping radix_tree_preload, followed immediately by radix_tree_preload_end - that won't guarantee success in the next add_to_page_cache, but doesn't need to. And we can then remove that bothersome congestion_wait: when needed, it'll automatically get done in the course of the radix_tree_preload. Signed-off-by: Hugh Dickins <hugh@veritas.com> Looks-good-to: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: open a window in shmem_unuse_inodeHugh Dickins2-35/+45
There are a couple of reasons (patches follow) why it would be good to open a window for sleep in shmem_unuse_inode, between its search for a matching swap entry, and its handling of the entry found. shmem_unuse_inode must then use igrab to hold the inode against deletion in that window, and its corresponding iput might result in deletion: so it had better unlock_page before the iput, and might as well release the page too. Nor is there any need to hold on to shmem_swaplist_mutex once we know we'll leave the loop. So this unwinding moves from try_to_unuse and shmem_unuse into shmem_unuse_inode, in the case when it finds a match. Let try_to_unuse break on error in the shmem_unuse case, as it does in the unuse_mm case: though at this point in the series, no error to break on. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: make shmem_unuse more preemptibleHugh Dickins1-7/+16
shmem_unuse is at present an unbroken search through every swap vector page of every tmpfs file which might be swapped, all under shmem_swaplist_lock. This dates from long ago, when the caller held mmlist_lock over it all too: long gone, but there's never been much pressure for preemptible swapoff. Make it a little more preemptible, replacing shmem_swaplist_lock by shmem_swaplist_mutex, inserting a cond_resched in the main loop, and a cond_resched_lock (on info->lock) at one convenient point in the shmem_unuse_inode loop, where it has no outstanding kmap_atomic. If we're serious about preemptible swapoff, there's much further to go e.g. I'm stupid to let the kmap_atomics of the decreasingly significant HIGHMEM case dictate preemptiblility for other configs. But as in the earlier patch to make swapoff scan ptes preemptibly, my hidden agenda is really towards making memcgroups work, hardly about preemptibility at all. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: allocate on read when stackedHugh Dickins1-1/+13
tmpfs is expected to limit the memory used (unless mounted with nr_blocks=0 or size=0). But if a stacked filesystem such as unionfs gets pages from a sparse tmpfs file by reading holes, and then writes to them, it can easily exceed any such limit at present. So suppress the SGP_READ "don't allocate page" ZERO_PAGE optimization when reading for the kernel (a KERNEL_DS check, ugh, sorry about that). Indeed, pessimistically mark such pages as dirty, so they cannot get reclaimed and unaccounted by mistake. The venerable shmem_recalc_inode code (originally to account for the reclaim of clean pages) suffices to get the accounting right when swappages are dropped in favour of more uptodate filepages. This also fixes the NULL shmem_swp_entry BUG or oops in shmem_writepage, caused by unionfs writing to a very sparse tmpfs file: to minimize memory allocation in swapout, tmpfs requires the swap vector be allocated upfront, which wasn't always happening in this stacked case. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: allow filepage alongside swappageHugh Dickins1-25/+44
tmpfs has long allowed for a fresh filepage to be created in pagecache, just before shmem_getpage gets the chance to match it up with the swappage which already belongs to that offset. But unionfs_writepage now does a find_or_create_page, divorced from shmem_getpage, which leaves conflicting filepage and swappage outstanding indefinitely, when unionfs is over tmpfs. Therefore shmem_writepage (where a page is swizzled from file to swap) must now be on the lookout for existing swap, ready to free it in favour of the more uptodate filepage, instead of BUGging on that clash. And when the add_to_page_cache fails in shmem_unuse_inode, it must defer to an uptodate filepage, otherwise swapoff would hang. Whereas when add_to_page_cache fails in shmem_getpage, it should retry in the same way it already does. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: move swap swizzling into shmemHugh Dickins3-49/+17
move_to_swap_cache and move_from_swap_cache functions (which swizzle a page between tmpfs page cache and swap cache, to avoid page copying) are only used by shmem.c; and our subsequent fix for unionfs needs different treatments in the two instances of move_from_swap_cache. Move them from swap_state.c into their callsites shmem_writepage, shmem_unuse_inode and shmem_getpage, making add_to_swap_cache externally visible. shmem.c likes to say set_page_dirty where swap_state.c liked to say SetPageDirty: respect that diversity, which __set_page_dirty_no_writeback makes moot (and implies we should lose that "shift page from clean_pages to dirty_pages list" comment: it's on neither). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: shuffle add_to_swap_cachesHugh Dickins1-34/+19
add_to_swap_cache doesn't amount to much: merge it into its sole caller read_swap_cache_async. But we'll be needing to call __add_to_swap_cache from shmem.c, so promote it to the new add_to_swap_cache. Both were static, so there's no interface confusion to worry about. And lose that inappropriate "Anon pages are already on the LRU" comment in the merging: they're not already on the LRU, as Nick Piggin noticed. Signed-off-by: Hugh Dickins <hugh@veritas.com> No-problems-with: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: move swap_state stats updateHugh Dickins1-16/+6
Both unionfs and memcgroups pose challenges to tmpfs and shmem. To help fix, it's best to move the swap swizzling functions from swap_state.c to shmem.c. As a preliminary to that, move swap stats updating down into __add_to_swap_cache, which will remain internal to swap_state.c. Well, actually, just move down the incrementation of add_total: remove noent_race and exist_race completely, they are relics of my 2.4.11 testing. Alt-SysRq-m users will be thrilled if 2.6.25 is at last free of "race M+N"s. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05tmpfs: fix mounts when size is less than the page sizeMichael Marineau1-1/+1
When tmpfs is mounted with a size less than one page, the number of blocks is set to 0 which makes the tmpfs mount unlimited. This can lead to a quick and surprising death if someone typos a tmpfs mount command and writes too much. tmpfs can still be mounted as unlimited if size or nr_blocks is exactly 0, as Documentation/filesystems/tmpfs.txt says. Hugh: do this by rounding size up instead of down in all cases: which slightly expands other odd-sized tmpfs mounts, but in a consistent way. Signed-off-by: Michael Marineau <mike@marineau.org> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05shmem: factor out sbi->free_inodes manipulationsPavel Emelyanov1-39/+38
The shmem_sb_info structure has a number of free_inodes. This value is altered in appropriate places under spinlock and with the sbi->max_inodes != 0 check. Consolidate these manipulations into two helpers. This is minus 42 bytes of shmem.o and minus 4 :) lines of code. [akpm@linux-foundation.org: fix error return values] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05swapoff: scan ptes preemptiblyHugh Dickins1-7/+31
Provided that CONFIG_HIGHPTE is not set, unuse_pte_range can reduce latency in swapoff by scanning the page table preemptibly: so long as unuse_pte is careful to recheck that entry under pte lock. (To tell the truth, this patch was not inspired by any cries for lower latency here: rather, this restructuring permits a future memory controller patch to allocate with GFP_KERNEL in unuse_pte, where before it could not. But it would be wrong to tuck this change away inside a memcgroup patch.) Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05swapin: fix valid_swaphandles defectHugh Dickins1-16/+33
valid_swaphandles is supposed to do a quick pass over the swap map entries neigbouring the entry which swapin_readahead is targetting, to determine for it a range worth reading all together. But since it always starts its search from the beginning of the swap "cluster", a reject (free entry) there immediately curtails the readaround, and every swapin_readahead from that cluster is for just a single page. Instead scan forwards and backwards around the target entry. Use better names for some variables: a swap_info pointer is usually called "si" not "swapdev". And at the end, if only the target page should be read, return count of 0 to disable readaround, to avoid the unnecessarily repeated call to read_swap_cache_async. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05shmem_file_write is redundantHugh Dickins1-106/+3
With the old aops, writing to a tmpfs file had to use its own special method: the generic method would pass in a fresh page to prepare_write when the right page was there in swapcache - which was inefficient to handle, even once we'd concocted the code to handle it. With the new aops, the generic method uses shmem_write_end, which lets shmem_getpage find the right page: so now abandon shmem_file_write in favour of the generic method. Yes, that does do several things that tmpfs hasn't really needed (notably balance_dirty_pages_ratelimited, which ramfs also calls); but more use of common code is preferable. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>