summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2009-01-06Dmitry has been renamedDmitry Eremin-Solenikov2-1/+2
My surname has changed due to marriage. Change MAINTAINERS entry and add .mailmap entry to reflect that. Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06percpu_counter: FBC_BATCH should be a variableEric Dumazet4-14/+20
For NR_CPUS >= 16 values, FBC_BATCH is 2*NR_CPUS Considering more and more distros are using high NR_CPUS values, it makes sense to use a more sensible value for FBC_BATCH, and get rid of NR_CPUS. A sensible value is 2*num_online_cpus(), with a minimum value of 32 (This minimum value helps branch prediction in __percpu_counter_add()) We already have a hotcpu notifier, so we can adjust FBC_BATCH dynamically. We rename FBC_BATCH to percpu_counter_batch since its not a constant anymore. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06Allow times and time system calls to return small negative valuesPaul Mackerras3-2/+9
At the moment, the times() system call will appear to fail for a period shortly after boot, while the value it want to return is between -4095 and -1. The same thing will also happen for the time() system call on 32-bit platforms some time in 2106 or so. On some platforms, such as x86, this is unavoidable because of the system call ABI, but other platforms such as powerpc have a separate error indication from the return value, so system calls can in fact return small negative values without indicating an error. On those platforms, force_successful_syscall_return() provides a way to indicate that the system call return value should not be treated as an error even if it is in the range which would normally be taken as a negative error number. This adds a force_successful_syscall_return() call to the time() and times() system calls plus their 32-bit compat versions, so that they don't erroneously indicate an error on those platforms whose system call ABI has a separate error indication. This will not affect anything on other platforms. Joakim Tjernlund added the fix for time() and the compat versions of time() and times(), after I did the fix for times(). Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06documentation: when to BUG(), and when to not BUG()David Brownell1-0/+17
Provide some basic advice about when to use BUG()/BUG_ON(): never, unless there's really no better option. This matches my understanding of the standard policy ... which seems not to be written down so far, outside of LKML messages that I haven't bookmarked. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06poll: allow f_op->poll to sleepTejun Heo4-21/+76
f_op->poll is the only vfs operation which is not allowed to sleep. It's because poll and select implementation used task state to synchronize against wake ups, which doesn't have to be the case anymore as wait/wake interface can now use custom wake up functions. The non-sleep restriction can be a bit tricky because ->poll is not called from an atomic context and the result of accidentally sleeping in ->poll only shows up as temporary busy looping when the timing is right or rather wrong. This patch converts poll/select to use custom wake up function and use separate triggered variable to synchronize against wake up events. The only added overhead is an extra function call during wake up and negligible. This patch removes the one non-sleep exception from vfs locking rules and is beneficial to userland filesystem implementations like FUSE, 9p or peculiar fs like spufs as it's very difficult for those to implement non-sleeping poll method. While at it, make the following cosmetic changes to make poll.h and select.c checkpatch friendly. * s/type * symbol/type *symbol/ : three places in poll.h * remove blank line before EXPORT_SYMBOL() : two places in select.c Oleg: spotted missing barrier in poll_schedule_timeout() Davide: spotted missing write barrier in pollwake() Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Ingo Molnar <mingo@elte.hu> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Brad Boyer <flar@allandria.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Roland McGrath <roland@redhat.com> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06fs: use menuconfig to control the Misc. filesystems menuRandy Dunlap1-2/+15
Have one option to control Miscellaneous filesystems. This makes it easy to disable all of them at one time. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06scripts: script from kerneloops.org to pretty print oops dumpsArjan van de Ven1-0/+162
We're struggling all the time to figure out where the code came from that oopsed.. The script below (a adaption from a script used by kerneloops.org) can help developers quite a bit, at least for non-module cases. It works and looks like this: [/home/arjan/linux]$ dmesg | perl scripts/markup_oops.pl vmlinux { struct agp_memory *memory; memory = agp_allocate_memory(agp_bridge, pg_count, type); c055c10f: 89 c2 mov %eax,%edx if (memory == NULL) c055c111: 74 19 je c055c12c <agp_allocate_memory_wrap+0x30> /* This function must only be called when current_controller != NULL */ static void agp_insert_into_pool(struct agp_memory * temp) { struct agp_memory *prev; prev = agp_fe.current_controller->pool; c055c113: a1 ec dc 8f c0 mov 0xc08fdcec,%eax *c055c118: 8b 40 10 mov 0x10(%eax),%eax <----- faulting instruction if (prev != NULL) { c055c11b: 85 c0 test %eax,%eax c055c11d: 74 05 je c055c124 <agp_allocate_memory_wrap+0x28> prev->prev = temp; c055c11f: 89 50 04 mov %edx,0x4(%eax) temp->next = prev; c055c122: 89 02 mov %eax,(%edx) } agp_fe.current_controller->pool = temp; c055c124: a1 ec dc 8f c0 mov 0xc08fdcec,%eax c055c129: 89 50 10 mov %edx,0x10(%eax) if (memory == NULL) return NULL; agp_insert_into_pool(memory); so in this case, we faulted while dereferencing agp_fe.current_controller pointer, and we get to see exactly which function and line it affects... Personally I find this very useful, and I can see value for having this script in the kernel for more-than-just-me to use. Caveats: * It only works for oopses not-in-modules * It only works nicely for kernels compiled with CONFIG_DEBUG_INFO * It's not very fast. * It only works on x86 Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06oops: increment the oops UUID every time we oopsArjan van de Ven1-0/+2
... because we do want repeated same-oops to be seen by automated tools like kerneloops.org Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06strict_strto* is not strict enoughPavel Machek1-0/+4
It decodes "\n" as 0, which is bad, because stray echo into backlight will turn your backlight off, etc... Signed-off-by: Pavel Machek <pavel@suse.cz> Cc: Yi Yang <yi.y.yang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06autodetect_raid: add missing __init markingMike Frysinger1-1/+1
The function autodetect_raid is only used by __init functions, and it refers to __initdata, so it needs __init markings. Fixes this error: The function autodetect_raid() references the variable __initdata raid_noautodetect. This is often because autodetect_raid lacks a __initdata annotation or the annotation of raid_noautodetect is wrong. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06samples: mark {static|__init|__exit} for {init|exit} functionsQinghuang Feng7-13/+13
None of these (init|exit) functions is called from other functions which is outside the kernel module mechanism or kernel itself, so mark them as {static|__init|__exit}. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06Create a DIV_ROUND_CLOSEST macro to do division with roundingDarrick J. Wong1-0/+6
Create a helper macro to divide two numbers and round the result to the nearest whole number. This is a helper macro for hwmon drivers that want to convert incoming sysfs values per standard hwmon practice, though the macro itself can be used by anyone. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06lib: proportions.c trivial sparse lock annotationHarvey Harrison1-0/+2
Suppresses sparse warning: lib/proportions.c:159:16: warning: context imbalance in 'prop_get_global': wrong count at exit lib/proportions.c:159:16: context 'RCU': wanted 0, got 1 lib/proportions.c:164:2: warning: context imbalance in 'prop_put_global': unexpected unlock lib/proportions.c:164:2: context 'RCU': wanted 0, got -1 Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06lib: radix_tree.c make percpu variable staticHarvey Harrison1-1/+1
radix_tree_preloads is unused outside of this file, make it static. Noticed by sparse: lib/radix-tree.c:84:1: warning: symbol 'per_cpu__radix_tree_preloads' was not declared. Should it be static? Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06lib: fix sparse shadowed variable warningHarvey Harrison1-1/+1
pos is always set before being used, no need to declare a second one inside the if() block. lib/prio_heap.c:34:7: warning: symbol 'pos' shadows an earlier one lib/prio_heap.c:30:6: originally declared here Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06fork.c: cleanup for copy_sighand()Zhaolei1-1/+1
Check CLONE_SIGHAND only is enough, because combination of CLONE_THREAD and CLONE_SIGHAND is already done in copy_process(). Impact: cleanup, no functionality changed Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06Remove remaining unwinder codeAlexey Dobriyan7-105/+0
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Gabor Gombas <gombasg@sztaki.hu> Cc: Jan Beulich <jbeulich@novell.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu>, Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06fs/exec.c:__bprm_mm_init(): clean up error handlingLuiz Fernando N. Capitulino1-14/+6
Untangle the error unwinding in this function, saving a test of local variable `vma'. Signed-off-by: Luiz Fernando N. Capitulino <lcapitulino@mandriva.com.br> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06do_mounts: add device info to mount messageMarton Balint1-2/+2
In the past, I used the root=... command line parameter to specify the root filesystem to the kernel. Now it seems that specifying it is not necessary. The kernel detects the root filesystem even if the kernel command line is empty. My root fs is on a raid1 device by the way, and I am not using initrd for the boot process. If the kernel detects the root filesystem somehow, I think it should print out the result of this detection, otherwise I will not know which device has the root filesystem. Or is there an easy way to get this information on a running system? I had a quick look at the /proc and /sys filesystems, but haven't found anything useful there. Signed-off-by: Marton Balint <cus@fazekas.hu> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06oops handling: ensure that any oops is flushed to the mtdoops consoleViktor Rosendahl1-0/+2
This used to work unpatched with older kernels, during the development phase of mtdoops. Before commit e3e8a75d2acfc61ebf25524666a0a2c6abb0620c a space was printed with console_loglevel set to 15, which probably flushed the oops message as a side effect. This is another patch from the Nokia N810 kernel. Signed-off-by: Viktor Rosendahl <viktor.rosendahl@nokia.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06Check fops_get() return valueLaurent Pinchart3-0/+13
Several subsystem open handlers dereference the fops_get() return value without checking it for nullness. This opens a race condition between the open handler and module unloading. A module can be marked as being unloaded (MODULE_STATE_GOING) before its exit function is called and gets the chance to unregister the driver. During that window open handlers can still be called, and fops_get() will fail in try_module_get() and return a NULL pointer. This change checks the fops_get() return value and returns -ENODEV if NULL. Reported-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk> Signed-off-by: Laurent Pinchart <laurent.pinchart@skynet.be> Acked-by: Takashi Iwai <tiwai@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Airlie <airlied@linux.ie> Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06pci: use pci_ioremap_bar() in drivers/miscArjan van de Ven2-4/+2
Use the newly introduced pci_ioremap_bar() function in drivers/misc. pci_ioremap_bar() just takes a pci device and a bar number, with the goal of making it really hard to get wrong, while also having a central place to stick sanity checks. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06atomic_t: unify all arch definitionsMatthew Wilcox23-104/+33
The atomic_t type cannot currently be used in some header files because it would create an include loop with asm/atomic.h. Move the type definition to linux/types.h to break the loop. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06init: properly placing noinline keywordRakib Mullick1-2/+2
checkpatch warns about 'static void noinline'. It wants `static noinline void'. Both are permissible, but the kernel consistently uses `static inline' and `static noinline', and consistency is good. Hence let's keep the checkpatch warning and fix up this code site. [akpm@linux-foundation.org: rewrote changelog] Signed-off-by: Md.Rakib H. Mullick <rakib.mullick@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: hugetlb: remove redundant `if' operationCyrill Gorcunov1-3/+2
At this point we already know that 'addr' is not NULL so get rid of redundant 'if'. Probably gcc eliminate it by optimization pass. [akpm@linux-foundation.org: use __weak, too] Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: stop kswapd's infinite loop at high order allocationKOSAKI Motohiro1-0/+17
Wassim Dagash reported following kswapd infinite loop problem. kswapd runs in some infinite loop trying to swap until order 10 of zone highmem is OK.... kswapd will continue to try to balance order 10 of zone highmem forever (or until someone release a very large chunk of highmem). For non order-0 allocations, the system may never be balanced due to fragmentation but kswapd should not infinitely loop as a result. Instead, recheck all watermarks at order-0 as they are the most important. If watermarks are ok, kswapd will go back to sleep. [akpm@linux-foundation.org: fix comment] Reported-by: wassim dagash <wassim.dagash@gmail.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06bootmem: print request details before BUG_ON(them)Johannes Weiner1-4/+4
Moving the request details print-out before the sanity checks that might panic() enables us to analyse invalid requests without having access to the line information of the stack dump. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: check for no mmaps in exit_mmap()Johannes Weiner1-0/+3
When dup_mmap() ooms we can end up with mm->mmap == NULL. The error path does mmput() and unmap_vmas() gets a NULL vma which it dereferences. In exit_mmap() there is nothing to do at all for this case, we can cancel the callpath right there. [akpm@linux-foundation.org: add sorely-needed comment] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: kill page_queue_congested()KOSAKI Motohiro1-20/+0
page_queue_congested() was introduced in 2002, but it was never used Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: remove CONFIG_OUT_OF_LINE_PFN_TO_PAGEKOSAKI Motohiro2-20/+0
No architectures use CONFIG_OUT_OF_LINE_PFN_TO_PAGE - it can be removed. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accountingOleg Nesterov4-7/+7
xacct_add_tsk() relies on do_exit()->update_hiwater_xxx() and uses mm->hiwater_xxx directly, this leads to 2 problems: - taskstats_user_cmd() can call fill_pid()->xacct_add_tsk() at any moment before the task exits, so we should check the current values of rss/vm anyway. - do_exit()->update_hiwater_xxx() calls are racy. An exiting thread can be preempted right before mm->hiwater_xxx = new_val, and another thread can use A_LOT of memory and exit in between. When the first thread resumes it can be the last thread in the thread group, in that case we report the wrong hiwater_xxx values which do not take A_LOT into account. Introduce get_mm_hiwater_rss() and get_mm_hiwater_vm() helpers and change xacct_add_tsk() to use them. The first helper will also be used by rusage->ru_maxrss accounting. Kill do_exit()->update_hiwater_xxx() calls. Unless we are going to decrease rss/vm there is no point to update mm->hiwater_xxx, and nobody can look at this mm_struct when exit_mmap() actually unmaps the memory. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: pagecache gfp flags fixNick Piggin1-2/+9
Frustratingly, gfp_t is really divided into two classes of flags. One are the context dependent ones (can we sleep? can we enter filesystem? block subsystem? should we use some extra reserves, etc.). The other ones are the type of memory required and depend on how the algorithm is implemented rather than the point at which the memory is allocated (highmem? dma memory? etc). Some of the functions which allocate a page and add it to page cache take a gfp_t, but sometimes those functions or their callers aren't really doing the right thing: when allocating pagecache page, the memory type should be mapping_gfp_mask(mapping). When allocating radix tree nodes, the memory type should be kernel mapped (not highmem) memory. The gfp_t argument should only really be needed for context dependent options. This patch doesn't really solve that tangle in a nice way, but it does attempt to fix a couple of bugs. - find_or_create_page changes its radix-tree allocation to only include the main context dependent flags in order so the pagecache page may be allocated from arbitrary types of memory without affecting the radix-tree. In practice, slab allocations don't come from highmem anyway, and radix-tree only uses slab allocations. So there isn't a practical change (unless some fs uses GFP_DMA for pages). - grab_cache_page_nowait() is changed to allocate radix-tree nodes with GFP_NOFS, because it is not supposed to reenter the filesystem. This bug could cause lock recursion if a filesystem is not expecting the function to reenter the fs (as-per documentation). Filesystems should be careful about exactly what semantics they want and what they get when fiddling with gfp_t masks to allocate pagecache. One should be as liberal as possible with the type of memory that can be used, and same for the the context specific flags. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06fs: sys_sync fixNick Piggin2-20/+1
s_syncing livelock avoidance was breaking data integrity guarantee of sys_sync, by allowing sys_sync to skip writing or waiting for superblocks if there is a concurrent sys_sync happening. This livelock avoidance is much less important now that we don't have the get_super_to_sync() call after every sb that we sync. This was replaced by __put_super_and_need_restart. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06fs: sync_sb_inodes fixNick Piggin1-7/+53
Fix data integrity semantics required by sys_sync, by iterating over all inodes and waiting for any writeback pages after the initial writeout. Comments explain the exact problem. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06fs: remove WB_SYNC_HOLDNick Piggin2-11/+2
Remove WB_SYNC_HOLD. The primary motiviation is the design of my anti-starvation code for fsync. It requires taking an inode lock over the sync operation, so we could run into lock ordering problems with multiple inodes. It is possible to take a single global lock to solve the ordering problem, but then that would prevent a future nice implementation of "sync multiple inodes" based on lock order via inode address. Seems like a backward step to remove this, but actually it is busted anyway: we can't use the inode lists for data integrity wait: an inode can be taken off the dirty lists but still be under writeback. In order to satisfy data integrity semantics, we should wait for it to finish writeback, but if we only search the dirty lists, we'll miss it. It would be possible to have a "writeback" list, for sys_sync, I suppose. But why complicate things by prematurely optimise? For unmounting, we could avoid the "livelock avoidance" code, which would be easier, but again premature IMO. Fixing the existing data integrity problem will come next. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06UBIFS: do not use WB_SYNC_HOLDArtem Bityutskiy1-1/+8
WB_SYNC_HOLD is going to be zapped so we should not use it. Use %WB_SYNC_NONE instead. Here is what akpm said: "I think I'll just switch that to WB_SYNC_NONE. The `wait==0' mode is just an advisory thing to help the fs shove lots of data into the queues. If some gets missed then it'll be picked up on the second ->sync_fs call, with wait==1." Thanks to Randy Dunlap for catching this. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: direct IO starvation improvementNick Piggin1-11/+5
Direct IO can invalidate and sync a lot of pagecache pages in the mapping. A 4K direct IO will actually try to sync and/or invalidate the pagecache of the entire file, for example (which might be many GB or TB large). Improve this by doing range syncs. Also, memory no longer has to be unmapped to catch the dirty bits for syncing, as dirty bits would remain coherent due to dirty mmap accounting. This fixes the immediate DM deadlocks when doing direct IO reads to block device with a mounted filesystem, if only by papering over the problem somewhat rather than addressing the fsync starvation cases. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm/mmap.c: fix coding styleZhenwenXu1-9/+8
Fix a little of the coding style in mm/mmap.c [akpm@linux-foundation.org: cleanup] Signed-off-by: ZhenwenXu <helight.xu@gmail.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06shmem: unify regular and tiny shmemMatt Mackall4-151/+72
tiny-shmem shares most of its 130 lines of code with shmem and tends to break when particular bits of shmem get modified. Unifying saves code and makes keeping these two in sync much easier. before: 14367 392 24 14783 39bf mm/shmem.o 396 72 8 476 1dc mm/tiny-shmem.o after: 14367 392 24 14783 39bf mm/shmem.o 412 72 8 492 1ec mm/shmem.o tiny Signed-off-by: Matt Mackall <mpm@selenic.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06block_write_begin(): remove useless gotoFranck Bui-Huu1-1/+0
Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: make get_user_pages() interruptibleYing Han3-9/+15
The initial implementation of checking TIF_MEMDIE covers the cases of OOM killing. If the process has been OOM killed, the TIF_MEMDIE is set and it return immediately. This patch includes: 1. add the case that the SIGKILL is sent by user processes. The process can try to get_user_pages() unlimited memory even if a user process has sent a SIGKILL to it(maybe a monitor find the process exceed its memory limit and try to kill it). In the old implementation, the SIGKILL won't be handled until the get_user_pages() returns. 2. change the return value to be ERESTARTSYS. It makes no sense to return ENOMEM if the get_user_pages returned by getting a SIGKILL signal. Considering the general convention for a system call interrupted by a signal is ERESTARTNOSYS, so the current return value is consistant to that. Lee: An unfortunate side effect of "make-get_user_pages-interruptible" is that it prevents a SIGKILL'd task from munlock-ing pages that it had mlocked, resulting in freeing of mlocked pages. Freeing of mlocked pages, in itself, is not so bad. We just count them now--altho' I had hoped to remove this stat and add PG_MLOCKED to the free pages flags check. However, consider pages in shared libraries mapped by more than one task that a task mlocked--e.g., via mlockall(). If the task that mlocked the pages exits via SIGKILL, these pages would be left mlocked and unevictable. Proposed fix: Add another GUP flag to ignore sigkill when calling get_user_pages from munlock()--similar to Kosaki Motohiro's 'IGNORE_VMA_PERMISSIONS flag for the same purpose. We are not actually allocating memory in this case, which "make-get_user_pages-interruptible" intends to avoid. We're just munlocking pages that are already resident and mapped, and we're reusing get_user_pages() to access those pages. ?? Maybe we should combine 'IGNORE_VMA_PERMISSIONS and '_IGNORE_SIGKILL into a single flag: GUP_FLAGS_MUNLOCK ??? [Lee.Schermerhorn@hp.com: ignore sigkill in get_user_pages during munlock] Signed-off-by: Paul Menage <menage@google.com> Signed-off-by: Ying Han <yinghan@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh@veritas.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rohit Seth <rohitseth@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06hugetlb: unsigned ret cannot be negativeRoel Kluin1-4/+8
unsigned long ret cannot be negative, but ret can get -EFAULT. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06vmscan: shrink_active_list(): reduce lru_lock hold timeAndrew Morton1-7/+7
These three statements manipulate local variables and do not need the lock coverage. Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06badpage: KERN_ALERT BUG instead of KERN_EMERGHugh Dickins2-12/+12
bad_page() and rmap Eeek messages have said KERN_EMERG for a few years, which I've followed in print_bad_pte(). These are serious system errors, on a par with BUGs, but they're not quite emergencies, and we do our best to carry on: say KERN_ALERT "BUG: " like the x86 oops does. And remove the "Trying to fix it up, but a reboot is needed" line: it's not untrue, but I hope the KERN_ALERT "BUG: " conveys as much. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06badpage: ratelimit print_bad_pte and bad_pageHugh Dickins2-1/+48
print_bad_pte() and bad_page() might each need ratelimiting - especially for their dump_stacks, almost never of interest, yet not quite dispensible. Correlating corruption across neighbouring entries can be very helpful, so allow a burst of 60 reports before keeping quiet for the remainder of that minute (or allow a steady drip of one report per second). Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06badpage: remove vma from page_remove_rmapHugh Dickins5-10/+8
Remove page_remove_rmap()'s vma arg, which was only for the Eeek message. And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's page_dup_rmap(): we're trying to be more resilient about that than BUGs. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06badpage: zap print_bad_pte on swap and fileHugh Dickins3-16/+14
Complete zap_pte_range()'s coverage of bad pagetable entries by calling print_bad_pte() on a pte_file in a linear vma and on a bad swap entry. That needs free_swap_and_cache() to tell it, which will also have shown one of those "swap_free" errors (but with much less information). Similar checks in fork's copy_one_pte()? No, that would be more noisy than helpful: we'll see them when parent and child exec or exit. Where do_nonlinear_fault() calls print_bad_pte(): omit !VM_CAN_NONLINEAR case, that could only be a bug in sys_remap_file_pages(), not a bad pte. VM_FAULT_OOM rather than VM_FAULT_SIGBUS? Well, okay, that is consistent with what happens if do_swap_page() operates a bad swap entry; but don't we have patches to be more careful about killing when VM_FAULT_OOM? Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06badpage: vm_normal_page use print_bad_pteHugh Dickins3-10/+15
print_bad_pte() is so far being called only when zap_pte_range() finds negative page_mapcount, or there's a fault on a pte_file where it does not belong. That's weak coverage when we suspect pagetable corruption. Originally, it was called when vm_normal_page() found an invalid pfn: but pfn_valid is expensive on some architectures and configurations, so 2.6.24 put that under CONFIG_DEBUG_VM (which doesn't help in the field), then 2.6.26 replaced it by a VM_BUG_ON (likewise). Reinstate the print_bad_pte() in vm_normal_page(), but use a cheaper test than pfn_valid(): memmap_init_zone() (used in bootup and hotplug) keep a __read_mostly note of the highest_memmap_pfn, vm_normal_page() then check pfn against that. We could call this pfn_plausible() or pfn_sane(), but I doubt we'll need it elsewhere: of course it's not reliable, but gives much stronger pagetable validation on many boxes. Also use print_bad_pte() when the pte_special bit is found outside a VM_PFNMAP or VM_MIXEDMAP area, instead of VM_BUG_ON. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06badpage: replace page_remove_rmap Eeek and BUGHugh Dickins3-36/+48
Now that bad pages are kept out of circulation, there is no need for the infamous page_remove_rmap() BUG() - once that page is freed, its negative mapcount will issue a "Bad page state" message and the page won't be freed. Removing the BUG() allows more info, on subsequent pages, to be gathered. We do have more info about the page at this point than bad_page() can know - notably, what the pmd is, which might pinpoint something like low 64kB corruption - but page_remove_rmap() isn't given the address to find that. In practice, there is only one call to page_remove_rmap() which has ever reported anything, that from zap_pte_range() (usually on exit, sometimes on munmap). It has all the info, so remove page_remove_rmap()'s "Eeek" message and leave it all to zap_pte_range(). mm/memory.c already has a hardly used print_bad_pte() function, showing some of the appropriate info: extend it to show what we want for the rmap case: pte info, page info (when there is a page) and vma info to compare. zap_pte_range() already knows the pmd, but print_bad_pte() is easier to use if it works that out for itself. Some of this info is also shown in bad_page()'s "Bad page state" message. Keep them separate, but adjust them to match each other as far as possible. Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE there too. print_bad_pte() show current->comm unconditionally (though it should get repeated in the usually irrelevant stack trace): sorry, I misled Nick Piggin to make it conditional on vm_mm == current->mm, but current->mm is already NULL in the exit case. Usually current->comm is good, though exceptionally it may not be that of the mm (when "swapoff" for example). Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06badpage: keep any bad page out of circulationHugh Dickins1-28/+24
Until now the bad_page() checkers have special-cased PageReserved, keeping those pages out of circulation thereafter. Now extend the special case to all: we want to keep ANY page with bad state out of circulation - the "free" page may well be in use by something. Leave the bad state of those pages untouched, for examination by debuggers; except for PageBuddy - leaving that set would risk bringing the page back. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>