summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2008-02-13Linux Kernel Markers: support multiple probesMathieu Desnoyers2-176/+508
RCU style multiple probes support for the Linux Kernel Markers. Common case (one probe) is still fast and does not require dynamic allocation or a supplementary pointer dereference on the fast path. - Move preempt disable from the marker site to the callback. Since we now have an internal callback, move the preempt disable/enable to the callback instead of the marker site. Since the callback change is done asynchronously (passing from a handler that supports arguments to a handler that does not setup the arguments is no arguments are passed), we can safely update it even if it is outside the preempt disable section. - Move probe arm to probe connection. Now, a connected probe is automatically armed. Remove MARK_MAX_FORMAT_LEN, unused. This patch modifies the Linux Kernel Markers API : it removes the probe "arm/disarm" and changes the probe function prototype : it now expects a va_list * instead of a "...". If we want to have more than one probe connected to a marker at a given time (LTTng, or blktrace, ssytemtap) then we need this patch. Without it, connecting a second probe handler to a marker will fail. It allow us, for instance, to do interesting combinations : Do standard tracing with LTTng and, eventually, to compute statistics with SystemTAP, or to have a special trigger on an event that would call a systemtap script which would stop flight recorder tracing. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mike Mason <mmlnx@us.ibm.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: David Smith <dsmith@redhat.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-13hugetlb: fix overcommit lockingNishanth Aravamudan1-2/+2
proc_doulongvec_minmax() calls copy_to_user()/copy_from_user(), so we can't hold hugetlb_lock over the call. Use a dummy variable to store the sysctl result, like in hugetlb_sysctl_handler(), then grab the lock to update nr_overcommit_huge_pages. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Reported-by: Miles Lane <miles.lane@gmail.com> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-13remove final fastcall usersHarvey Harrison1-1/+1
fastcall always expands to empty, remove it. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-13rcupdate: fix commentPaul E. McKenney1-1/+4
This comment caused some consternation during fastcall removal. Make it truthful. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-schedLinus Torvalds5-196/+487
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: sched: rt-group: refure unrunnable tasks sched: rt-group: clean up the ifdeffery sched: rt-group: make rt groups scheduling configurable sched: rt-group: interface sched: rt-group: deal with PI sched: fix incorrect irq lock usage in normalize_rt_tasks() sched: fair-group: separate tg->shares from task_group_lock hrtimer: more hrtimer_init_sleeper() fallout.
2008-02-13sched: rt-group: refure unrunnable tasksPeter Zijlstra1-0/+15
Refuse to accept or create RT tasks in groups that can't run them. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13sched: rt-group: clean up the ifdefferyPeter Zijlstra1-71/+139
Clean up some of the excessive ifdeffery introduces in the last patch. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13sched: rt-group: make rt groups scheduling configurablePeter Zijlstra3-56/+126
Make the rt group scheduler compile time configurable. Keep it experimental for now. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13sched: rt-group: interfacePeter Zijlstra4-76/+178
Change the rt_ratio interface to rt_runtime_us, to match rt_period_us. This avoids picking a granularity for the ratio. Extend the /sys/kernel/uids/<uid>/ interface to allow setting the group's rt_runtime. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13sched: rt-group: deal with PIPeter Zijlstra2-5/+41
Steven mentioned the fun case where a lock holding task will be throttled. Simple fix: allow groups that have boosted tasks to run anyway. If a runnable task in a throttled group gets boosted the dequeue/enqueue done by rt_mutex_setprio() is enough to unthrottle the group. This is ofcourse not quite correct. Two possible ways forward are: - second prio array for boosted tasks - boost to a prio ceiling (this would also work for deadline scheduling) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13sched: fix incorrect irq lock usage in normalize_rt_tasks()Peter Zijlstra1-4/+4
lockdep spotted this bogus irq locking. normalize_rt_tasks() can be called from hardirq context through sysrq-n Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13sched: fair-group: separate tg->shares from task_group_lockPeter Zijlstra1-20/+17
On Mon, 2008-02-11 at 15:09 +0300, Denis V. Lunev wrote: > BUG: sleeping function called from invalid context > at /home/den/src/linux-netns26/kernel/mutex.c:209 > in_atomic():1, irqs_disabled():0 > no locks held by swapper/0. > Pid: 0, comm: swapper Not tainted 2.6.24 #304 > > Call Trace: > <IRQ> [<ffffffff80252d1e>] ? __debug_show_held_locks+0x15/0x27 > [<ffffffff8022c2a8>] __might_sleep+0xc0/0xdf > [<ffffffff8049f1df>] mutex_lock_nested+0x28/0x2a9 > [<ffffffff80231294>] sched_destroy_group+0x18/0xea > [<ffffffff8023e835>] sched_destroy_user+0xd/0xf > [<ffffffff8023e8c1>] free_uid+0x8a/0xab > [<ffffffff80233e24>] __put_task_struct+0x3f/0xd3 > [<ffffffff80236708>] delayed_put_task_struct+0x23/0x25 > [<ffffffff8026fda7>] __rcu_process_callbacks+0x8d/0x215 > [<ffffffff8026ff52>] rcu_process_callbacks+0x23/0x44 > [<ffffffff8023a2ae>] __do_softirq+0x79/0xf8 > [<ffffffff8020f8c3>] ? profile_pc+0x2a/0x67 > [<ffffffff8020d38c>] call_softirq+0x1c/0x30 > [<ffffffff8020f689>] do_softirq+0x61/0x9c > [<ffffffff8023a233>] irq_exit+0x51/0x53 > [<ffffffff8021bd1a>] smp_apic_timer_interrupt+0x77/0xad > [<ffffffff8020ce3b>] apic_timer_interrupt+0x6b/0x70 > <EOI> [<ffffffff8020b0dd>] ? default_idle+0x43/0x76 > [<ffffffff8020b0db>] ? default_idle+0x41/0x76 > [<ffffffff8020b09a>] ? default_idle+0x0/0x76 > [<ffffffff8020b186>] ? cpu_idle+0x76/0x98 separate the tg->shares protection from the task_group lock. Reported-by: Denis V. Lunev <den@openvz.org> Tested-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13hrtimer: more hrtimer_init_sleeper() fallout.Peter Zijlstra1-1/+4
Missed an instance... futex_lock_pi() hrtimer_init_sleeper() rt_mutex_timed_lock() rt_mutex_timed_fastlock() rt_mutex_slowlock() hrtimer_start() Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-12timeconst.pl: correct reversal of USEC_TO_HZ and HZ_TO_USECH. Peter Anvin1-1/+1
The USEC_TO_HZ and HZ_TO_USEC constant sets were mislabelled, with seriously incorrect results. This among other things manifested itself as cpufreq not working when a tickless kernel was configured. Signed-off-by: H. Peter Anvin <hpa@zytor.com> Tested-by: Carlos R. Mafra <crmafra@ift.unesp.br> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-10hrtimer: don't modify restart_block->fn in restart functionsOleg Nesterov2-5/+0
hrtimer_nanosleep_restart() clears/restores restart_block->fn. This is pointless and complicates its usage. Note that if sys_restart_syscall() doesn't actually happen, we have a bogus "pending" restart->fn anyway, this is harmless. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Alexey Dobriyan <adobriyan@sw.ru> Cc: Pavel Emelyanov <xemul@sw.ru> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Toyo Abe <toyoa@mvista.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-02-10hrtimer: fix *rmtp/restarts handling in compat_sys_nanosleep()Oleg Nesterov1-4/+40
Spotted by Pavel Emelyanov and Alexey Dobriyan. compat_sys_nanosleep() implicitly uses hrtimer_nanosleep_restart(), this can't work. Make a suitable compat_nanosleep_restart() helper. Introduced by commit c70878b4e0b6cf8d2f1e46319e48e821ef4a8aba hrtimer: hook compat_sys_nanosleep up to high res timer code Also, set ->addr_limit = KERNEL_DS before doing hrtimer_nanosleep(), this func was changed by the previous patch and now takes the "__user *" parameter. Thanks to Ingo Molnar for fixing the bug in this patch. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@sw.ru> Cc: Pavel Emelyanov <xemul@sw.ru> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Toyo Abe <toyoa@mvista.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-02-10hrtimer: fix *rmtp handling in hrtimer_nanosleep()Oleg Nesterov2-38/+30
Spotted by Pavel Emelyanov and Alexey Dobriyan. hrtimer_nanosleep() sets restart_block->arg1 = rmtp, but this rmtp points to the local variable which lives in the caller's stack frame. This means that if sys_restart_syscall() actually happens and it is interrupted as well, we don't update the user-space variable, but write into the already dead stack frame. Introduced by commit 04c227140fed77587432667a574b14736a06dd7f hrtimer: Rework hrtimer_nanosleep to make sys_compat_nanosleep easier Change the callers to pass "__user *rmtp" to hrtimer_nanosleep(), and change hrtimer_nanosleep() to use copy_to_user() to actually update *rmtp. Small problem remains. man 2 nanosleep states that *rtmp should be written if nanosleep() was interrupted (it says nothing whether it is OK to update *rmtp if nanosleep returns 0), but (with or without this patch) we can dirty *rem even if nanosleep() returns 0. NOTE: this patch doesn't change compat_sys_nanosleep(), because it has other bugs. Fixed by the next patch. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Alexey Dobriyan <adobriyan@sw.ru> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Pavel Emelyanov <xemul@sw.ru> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Toyo Abe <toyoa@mvista.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> include/linux/hrtimer.h | 2 - kernel/hrtimer.c | 51 +++++++++++++++++++++++++----------------------- kernel/posix-timers.c | 14 +------------ 3 files changed, 30 insertions(+), 37 deletions(-)
2008-02-10ntp: correct inconsistent interval/tick_length usagejohn stultz1-4/+0
clocksource initialization and error accumulation. This corrects a 280ppm drift seen on some systems using acpi_pm, and affects other clocksources as well (likely to a lesser degree). Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-02-09Update kernel/.gitignore with new auto-generated filesS.Çağlar Onur1-0/+1
Signed-off-by: S.Çağlar Onur <caglar@pardus.org.tr> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08Merge branch 'for-2.6.25' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc * 'for-2.6.25' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: [POWERPC] Add arch-specific walk_memory_remove() for 64-bit powerpc [POWERPC] Enable hotplug memory remove for 64-bit powerpc [POWERPC] Add remove_memory() for 64-bit powerpc [POWERPC] Make cell IOMMU fixed mapping printk more useful [POWERPC] Fix potential cell IOMMU bug when switching back to default DMA ops [POWERPC] Don't enable cell IOMMU fixed mapping if there are no dma-ranges [POWERPC] Fix cell IOMMU null pointer explosion on old firmwares [POWERPC] spufs: Fix timing dependent false return from spufs_run_spu [POWERPC] spufs: No need to have a runnable SPU for libassist update [POWERPC] spufs: Update SPU_Status[CISHP] in backing runcntl write [POWERPC] spufs: Fix state_mutex leaks [POWERPC] Disable G5 NAP mode during SMU commands on U3
2008-02-08IRQ_NOPROBE helper functionsRalf Baechle1-0/+36
Probing non-ISA interrupts using the handle_percpu_irq as their handle_irq method may crash the system because handle_percpu_irq does not check IRQ_WAITING. This for example hits the MIPS Qemu configuration. This patch provides two helper functions set_irq_noprobe and set_irq_probe to set rsp. clear the IRQ_NOPROBE flag. The only current caller is MIPS code but this really belongs into generic code. As an aside, interrupt probing these days has become a mostly obsolete if not dangerous art. I think Linux interrupts should be changed to default to non-probing but that's subject of this patch. Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Acked-and-tested-by: Rob Landley <rob@landley.net> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08Add new string functions strict_strto* and convert kernel params to use themYi Yang1-10/+10
Currently, for every sysfs node, the callers will be responsible for implementing store operation, so many many callers are doing duplicate things to validate input, they have the same mistakes because they are calling simple_strtol/ul/ll/uul, especially for module params, they are just numeric, but you can echo such values as 0x1234xxx, 07777888 and 1234aaa, for these cases, module params store operation just ignores succesive invalid char and converts prefix part to a numeric although input is acctually invalid. This patch tries to fix the aforementioned issues and implements strict_strtox serial functions, kernel/params.c uses them to strictly validate input, so module params will reject such values as 0x1234xxxx and returns an error: write error: Invalid argument Any modules which export numeric sysfs node can use strict_strtox instead of simple_strtox to reject any invalid input. Here are some test results: Before applying this patch: [root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak 4096 [root@yangyi-dev /]# echo 0x1000 > /sys/module/e1000/parameters/copybreak [root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak 4096 [root@yangyi-dev /]# echo 0x1000g > /sys/module/e1000/parameters/copybreak [root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak 4096 [root@yangyi-dev /]# echo 0x1000gggggggg > /sys/module/e1000/parameters/copybreak [root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak 4096 [root@yangyi-dev /]# echo 010000 > /sys/module/e1000/parameters/copybreak [root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak 4096 [root@yangyi-dev /]# echo 0100008 > /sys/module/e1000/parameters/copybreak [root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak 4096 [root@yangyi-dev /]# echo 010000aaaaa > /sys/module/e1000/parameters/copybreak [root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak 4096 [root@yangyi-dev /]# After applying this patch: [root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak 4096 [root@yangyi-dev /]# echo 0x1000 > /sys/module/e1000/parameters/copybreak [root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak 4096 [root@yangyi-dev /]# echo 0x1000g > /sys/module/e1000/parameters/copybreak -bash: echo: write error: Invalid argument [root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak 4096 [root@yangyi-dev /]# echo 0x1000gggggggg > /sys/module/e1000/parameters/copybreak -bash: echo: write error: Invalid argument [root@yangyi-dev /]# echo 010000 > /sys/module/e1000/parameters/copybreak [root@yangyi-dev /]# echo 0100008 > /sys/module/e1000/parameters/copybreak -bash: echo: write error: Invalid argument [root@yangyi-dev /]# echo 010000aaaaa > /sys/module/e1000/parameters/copybreak -bash: echo: write error: Invalid argument [root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak 4096 [root@yangyi-dev /]# echo -n 4096 > /sys/module/e1000/parameters/copybreak [root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak 4096 [root@yangyi-dev /]# [akpm@linux-foundation.org: fix compiler warnings] [akpm@linux-foundation.org: fix off-by-one found by tiwai@suse.de] Signed-off-by: Yi Yang <yi.y.yang@intel.com> Cc: Greg KH <greg@kroah.com> Cc: "Randy.Dunlap" <rdunlap@xenotime.net> Cc: Takashi Iwai <tiwai@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08cpu: fix section mismatch warnings for enable_nonboot_cpusSam Ravnborg1-1/+1
Fix following warning: WARNING: o-x86_64/kernel/built-in.o(.text+0x36d8b): Section mismatch in reference from the function enable_nonboot_cpus() to the function .cpuinit.text:_cpu_up() enable_nonboot_cpus() are used solely from CONFIG_CONFIG_PM_SLEEP_SMP=y and PM_SLEEP_SMP imply HOTPLUG_CPU therefore the reference to _cpu_up() is valid. Annotate enable_nonboot_cpus() with __ref to silence modpost. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08Don't operate with pid_t in rtmutex testerPavel Emelyanov2-4/+10
The proper behavior to store task's pid and get this task later is to get the struct pid pointer and get the task with the pid_task() call. Make it for rt_mutex_waiter->deadlock_task_pid field. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08Use find_task_by_vpid in posix timersPavel Emelyanov2-5/+5
All the functions that need to lookup a task by pid in posix timers obtain this pid from a user space, and thus this value refers to a task in the same namespace, as the current task lives in. So the proper behavior is to call find_task_by_vpid() here. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08avoid overflows in kernel/time.cH. Peter Anvin3-8/+431
When the conversion factor between jiffies and milli- or microseconds is not a single multiply or divide, as for the case of HZ == 300, we currently do a multiply followed by a divide. The intervening result, however, is subject to overflows, especially since the fraction is not simplified (for HZ == 300, we multiply by 300 and divide by 1000). This is exposed to the user when passing a large timeout to poll(), for example. This patch replaces the multiply-divide with a reciprocal multiplication on 32-bit platforms. When the input is an unsigned long, there is no portable way to do this on 64-bit platforms there is no portable way to do this since it requires a 128-bit intermediate result (which gcc does support on 64-bit platforms but may generate libgcc calls, e.g. on 64-bit s390), but since the output is a 32-bit integer in the cases affected, just simplify the multiply-divide (*3/10 instead of *300/1000). The reciprocal multiply used can have off-by-one errors in the upper half of the valid output range. This could be avoided at the expense of having to deal with a potential 65-bit intermediate result. Since the intent is to avoid overflow problems and most of the other time conversions are only semiexact, the off-by-one errors were considered an acceptable tradeoff. At Ralf Baechle's suggestion, this version uses a Perl script to compute the necessary constants. We already have dependencies on Perl for kernel compiles. This does, however, require the Perl module Math::BigInt, which is included in the standard Perl distribution starting with version 5.8.0. In order to support older versions of Perl, include a table of canned constants in the script itself, and structure the script so that Math::BigInt isn't required if pulling values from said table. Running the script requires that the HZ value is available from the Makefile. Thus, this patch also adds the Kconfig variable CONFIG_HZ to the architectures which didn't already have it (alpha, cris, frv, h8300, m32r, m68k, m68knommu, sparc, v850, and xtensa.) It does *not* touch the sh or sh64 architectures, since Paul Mundt has dealt with those separately in the sh tree. Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: Ralf Baechle <ralf@linux-mips.org>, Cc: Sam Ravnborg <sam@ravnborg.org>, Cc: Paul Mundt <lethal@linux-sh.org>, Cc: Richard Henderson <rth@twiddle.net>, Cc: Michael Starvik <starvik@axis.com>, Cc: David Howells <dhowells@redhat.com>, Cc: Yoshinori Sato <ysato@users.sourceforge.jp>, Cc: Hirokazu Takata <takata@linux-m32r.org>, Cc: Geert Uytterhoeven <geert@linux-m68k.org>, Cc: Roman Zippel <zippel@linux-m68k.org>, Cc: William L. Irwin <sparclinux@vger.kernel.org>, Cc: Chris Zankel <chris@zankel.net>, Cc: H. Peter Anvin <hpa@zytor.com>, Cc: Jan Engelhardt <jengelh@computergmbh.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08printk_ratelimit() functions should use CONFIG_PRINTKJoe Perches2-10/+12
Makes an embedded image a bit smaller. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08workqueue: make delayed_work_timer_fn() staticLi Zefan1-1/+1
delayed_work_timer_fn() is a timer function, make it static. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08The scheduled 'time' option removalAdrian Bunk1-13/+0
The scheduled removal of the 'time' option. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08Nuke duplicate header from sysctl.cJesper Juhl1-1/+0
Don't include linux/security.h twice in kernel/sysctl.c Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08Nuke a duplicate include from profile.cJesper Juhl1-1/+0
Remove duplicate inclusion of linux/profile.h from kernel/profile.c Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08Nuke duplicate include from printk.cJesper Juhl1-1/+0
Remove the duplicate inclusion of linux/jiffies.h from kernel/printk.c Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08constify tables in kernel/sysctl_check.cJan Beulich1-76/+75
Remains the question whether it is intended that many, perhaps even large, tables are compiled in without ever having a chance to get used, i.e. whether there shouldn't #ifdef CONFIG_xxx get added. [akpm@linux-foundation.org: fix cut-n-paste error] Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08kernel: remove fastcall in kernel/*Harvey Harrison12-68/+67
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08time: fix typo in commentsLi Zefan3-4/+4
Fix typo in comments. BTW: I have to fix coding style in arch/ia64/kernel/time.c also, otherwise checkpatch.pl will be complaining. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08timekeeping: rename timekeeping_is_continuous to timekeeping_valid_for_hresLi Zefan2-3/+3
Function timekeeping_is_continuous() no longer checks flag CLOCK_IS_CONTINUOUS, and it checks CLOCK_SOURCE_VALID_FOR_HRES now. So rename the function accordingly. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08clockevent: simplify list operationsLi Zefan1-7/+4
list_for_each_safe() suffices here. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08clocksource: remove redundant codeLi Zefan1-1/+0
Flag CLOCK_SOURCE_WATCHDOG is cleared twice. Note clocksource_change_rating() won't do anyting with the cs flag. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08Get rid of the kill_pgrp_info() functionPavel Emelyanov1-13/+8
There's only one caller left - the kill_pgrp one - so merge these two functions and forget the kill_pgrp_info one. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Reviewed-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08Clean up the kill_something_infoPavel Emelyanov1-11/+15
This is the first step (of two) in removing the kill_pgrp_info. All the users of this function are in kernel/signal.c, but all they need is to call __kill_pgrp_info() with the tasklist_lock read-locked. Fortunately, one of its users is the kill_something_info(), which already needs this lock in one of its branches, so clean these branches up and call the __kill_pgrp_info() directly. Based on Oleg's view of how this function should look. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08Pidns: make full use of xxx_vnr() callsPavel Emelyanov5-17/+8
Some time ago the xxx_vnr() calls (e.g. pid_vnr or find_task_by_vpid) were _all_ converted to operate on the current pid namespace. After this each call like xxx_nr_ns(foo, current->nsproxy->pid_ns) is nothing but a xxx_vnr(foo) one. Switch all the xxx_nr_ns() callers to use the xxx_vnr() calls where appropriate. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Reviewed-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08ITIMER_REAL: convert to use struct pidOleg Nesterov3-16/+2
signal_struct->tsk points to the ->group_leader and thus we have the nasty code in de_thread() which has to change it and restart ->real_timer if the leader is changed. Use "struct pid *leader_pid" instead. This also allows us to kill now unneeded send_group_sig_info(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Pavel Emelyanov <xemul@openvz.org> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08uglify kill_pid_info() to fix kill() vs exec() raceOleg Nesterov1-3/+12
kill_pid_info()->pid_task() could be the old leader of the execing process. In that case it is possible that the leader will be released before we take siglock. This means that kill_pid_info() (and thus sys_kill()) can return a false -ESRCH. Change the code to retry when lock_task_sighand() fails. The endless loop is not possible, __exit_signal() both clears ->sighand and does detach_pid(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08sys_getsid: don't use ->nsproxy directlyOleg Nesterov1-7/+4
With the new semantics of find_vpid() we don't need to play with ->nsproxy explicitely, _vxx() do the right things. Also s/tasklist/rcu/. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08pid: Extend/Fix pid_vnrEric W. Biederman1-0/+6
pid_vnr returns the user space pid with respect to the pid namespace the struct pid was allocated in. What we want before we return a pid to user space is the user space pid with respect to the pid namespace of current. pid_vnr is a very nice optimization but because it isn't quite what we want it is easy to use pid_vnr at times when we aren't certain the struct pid was allocated in our pid namespace. Currently this describes at least tiocgpgrp and tiocgsid in ttyio.c the parent process reported in the core dumps and the parent process in get_signal_to_deliver. So unless the performance impact is huge having an interface that does what we want instead of always what we want should be much more reliable and much less error prone. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08pid: sys_wait... fixesEric W. Biederman1-24/+56
This modifies do_wait and eligible child to take a pair of enum pid_type and struct pid *pid to precisely specify what set of processes are eligible to be waited for, instead of the raw pid_t value from sys_wait4. This fixes a bug in sys_waitid where you could not wait for children in just process group 1. This fixes a pid namespace crossing case in eligible_child. Allowing us to wait for a processes in our current process group even if our current process group == 0. This allows the no child with this pid case to be optimized. This allows us to optimize the pid membership test in eligible child to be optimized. This even closes a theoretical pid wraparound race where in a threaded parent if two threads are waiting for the same child and one thread picks up the child and the pid numbers wrap around and generate another child with that same pid before the other thread is scheduled (teribly insanely unlikely) we could end up waiting on the second child with the same pid# and not discover that the specific child we were waiting for has exited. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08move the related code from exit_notify() to exit_signals()Oleg Nesterov2-23/+22
The previous bugfix was not optimal, we shouldn't care about group stop when we are the only thread or the group stop is in progress. In that case nothing special is needed, just set PF_EXITING and return. Also, take the related "TIF_SIGPENDING re-targeting" code from exit_notify(). So, from the performance POV the only difference is that we don't trust !signal_pending() until we take ->siglock. But this in fact fixes another ___pure___ theoretical minor race. __group_complete_signal() finds the task without PF_EXITING and chooses it as the target for signal_wake_up(). But nothing prevents this task from exiting in between without noticing the pending signal and thus unpredictably delaying the actual delivery. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08sys_setsid: remove now unneeded session != 1 checkOleg Nesterov1-4/+1
Eric's "fix clone(CLONE_NEWPID)" eliminated the last reason for this hack. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08fix group stop with exit raceOleg Nesterov2-2/+27
do_signal_stop() counts all sub-thread and sets ->group_stop_count accordingly. Every thread should decrement ->group_stop_count and stop, the last one should notify the parent. However a sub-thread can exit before it notices the signal_pending(), or it may be somewhere in do_exit() already. In that case the group stop never finishes properly. Note: this is a minimal fix, we can add some optimizations later. Say we can return quickly if thread_group_empty(). Also, we can move some signal related code from exit_notify() to exit_signals(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08start the global /sbin/init with 0,0 special pidsOleg Nesterov1-5/+4
As Eric pointed out, there is no problem with init starting with sid == pgid == 0, and this was historical linux behavior changed in 2.6.18. Remove kernel_init()->__set_special_pids(), this is unneeded and complicates the rules for sys_setsid(). This change and the previous change in daemonize() mean that /sbin/init does not need the special "session != 1" hack in sys_setsid() any longer. We can't remove this check yet, we should cleanup copy_process(CLONE_NEWPID) first, so update the comment only. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>