summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2006-03-20[PATCH] audit string fields interface + consumerAmy Griffis4-141/+418
Updated patch to dynamically allocate audit rule fields in kernel's internal representation. Added unlikely() calls for testing memory allocation result. Amy Griffis wrote: [Wed Jan 11 2006, 02:02:31PM EST] > Modify audit's kernel-userspace interface to allow the specification > of string fields in audit rules. > > Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> (cherry picked from 5ffc4a863f92351b720fe3e9c5cd647accff9e03 commit)
2006-03-20[PATCH] Minor cosmetic cleanups to the code moved into auditfilter.cDavid Woodhouse1-7/+4
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20[PATCH] Fix audit record filtering with !CONFIG_AUDITSYSCALLDavid Woodhouse5-377/+454
This fixes the per-user and per-message-type filtering when syscall auditing isn't enabled. [AV: folded followup fix from the same author] Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20[PATCH] Miscellaneous bug and warning fixesDustin Kirkland1-9/+12
This patch fixes a couple of bugs revealed in new features recently added to -mm1: * fixes warnings due to inconsistent use of const struct inode *inode * fixes bug that prevent a kernel from booting with audit on, and SELinux off due to a missing function in security/dummy.c * fixes a bug that throws spurious audit_panic() messages due to a missing return just before an error_path label * some reasonable house cleaning in audit_ipc_context(), audit_inode_context(), and audit_log_task_context() Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20[PATCH] Capture selinux subject/object context information.Dustin Kirkland2-9/+135
This patch extends existing audit records with subject/object context information. Audit records associated with filesystem inodes, ipc, and tasks now contain SELinux label information in the field "subj" if the item is performing the action, or in "obj" if the item is the receiver of an action. These labels are collected via hooks in SELinux and appended to the appropriate record in the audit code. This additional information is required for Common Criteria Labeled Security Protection Profile (LSPP). [AV: fixed kmalloc flags use] [folded leak fixes] [folded cleanup from akpm (kfree(NULL)] [folded audit_inode_context() leak fix] [folded akpm's fix for audit_ipc_perm() definition in case of !CONFIG_AUDIT] Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20[PATCH] Exclude messages by message typeDustin Kirkland2-1/+37
- Add a new, 5th filter called "exclude". - And add a new field AUDIT_MSGTYPE. - Define a new function audit_filter_exclude() that takes a message type as input and examines all rules in the filter. It returns '1' if the message is to be excluded, and '0' otherwise. - Call the audit_filter_exclude() function near the top of audit_log_start() just after asserting audit_initialized. If the message type is not to be audited, return NULL very early, before doing a lot of work. [combined with followup fix for bug in original patch, Nov 4, same author] [combined with later renaming AUDIT_FILTER_EXCLUDE->AUDIT_FILTER_TYPE and audit_filter_exclude() -> audit_filter_type()] Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20[PATCH] Collect more inode information during syscall processing.Amy Griffis1-24/+118
This patch augments the collection of inode info during syscall processing. It represents part of the functionality that was provided by the auditfs patch included in RHEL4. Specifically, it: - Collects information for target inodes created or removed during syscalls. Previous code only collects information for the target inode's parent. - Adds the audit_inode() hook to syscalls that operate on a file descriptor (e.g. fchown), enabling audit to do inode filtering for these calls. - Modifies filtering code to check audit context for either an inode # or a parent inode # matching a given rule. - Modifies logging to provide inode # for both parent and child. - Protect debug info from NULL audit_names.name. [AV: folded a later typo fix from the same author] Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20[PATCH] Pass dentry, not just name, in fsnotify creation hooks.Amy Griffis1-1/+1
The audit hooks (to be added shortly) will want to see dentry->d_inode too, not just the name. Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20[PATCH] Define new range of userspace messages.Steve Grubb1-0/+2
The attached patch updates various items for the new user space messages. Please apply. Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20[PATCH] Filter rule comparatorsDustin Kirkland1-42/+75
Currently, audit only supports the "=" and "!=" operators in the -F filter rules. This patch reworks the support for "=" and "!=", and adds support for ">", ">=", "<", and "<=". This turned out to be a pretty clean, and simply process. I ended up using the high order bits of the "field", as suggested by Steve and Amy. This allowed for no changes whatsoever to the netlink communications. See the documentation within the patch in the include/linux/audit.h area, where there is a table that explains the reasoning of the bitmask assignments clearly. The patch adds a new function, audit_comparator(left, op, right). This function will perform the specified comparison (op, which defaults to "==" for backward compatibility) between two values (left and right). If the negate bit is on, it will negate whatever that result was. This value is returned. Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20[PATCH] AUDIT: kerneldoc for kernel/audit*.cRandy Dunlap2-46/+238
- add kerneldoc for non-static functions; - don't init static data to 0; - limit lines to < 80 columns; - fix long-format style; - delete whitespace at end of some lines; (chrisw: resend and update to current audit-2.6 tree) Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Chris Wright <chrisw@osdl.org> Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20[PATCH] make vm86 call audit_syscall_exitJason Baron1-5/+0
hi, The motivation behind the patch below was to address messages in /var/log/messages such as: Jan 31 10:54:15 mets kernel: audit(:0): major=252 name_count=0: freeing multiple contexts (1) Jan 31 10:54:15 mets kernel: audit(:0): major=113 name_count=0: freeing multiple contexts (2) I can reproduce by running 'get-edid' from: http://john.fremlin.de/programs/linux/read-edid/. These messages come about in the log b/c the vm86 calls do not exit via the normal system call exit paths and thus do not call 'audit_syscall_exit'. The next system call will then free the context for itself and for the vm86 context, thus generating the above messages. This patch addresses the issue by simply adding a call to 'audit_syscall_exit' from the vm86 code. Besides fixing the above error messages the patch also now allows vm86 system calls to become auditable. This is useful since strace does not appear to properly record the return values from sys_vm86. I think this patch is also a step in the right direction in terms of cleaning up some core auditing code. If we can correct any other paths that do not properly call the audit exit and entries points, then we can also eliminate the notion of context chaining. I've tested this patch by verifying that the log messages no longer appear, and that the audit records for sys_vm86 appear to be correct. Also, 'read_edid' produces itentical output. thanks, -Jason Signed-off-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-18[PATCH] disable unshare(CLONE_VM) for nowOleg Nesterov1-3/+1
sys_unshare() does mmput(new_mm). This is not enough if we have mm->core_waiters. This patch is a temporary fix for soon to be released 2.6.16. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> [ Checked with Uli: "I'm not planning to use unshare(CLONE_VM). It's not needed for any functionality planned so far. What we (as in Red Hat) need unshare() for now is the filesystem side." ] Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17[PATCH] posix-timers: fix requeue accounting when signal is ignoredRoman Zippel1-0/+1
When the posix-timer signal is ignored then the timer is rearmed by the callback function. The requeue pending accounting has to be fixed up else the state might be wrong. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17[PATCH] time_interpolator: add __read_mostlyChristoph Lameter1-2/+2
The pointer to the current time interpolator and the current list of time interpolators are typically only changed during bootup. Adding __read_mostly takes them away from possibly hot cachelines. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17[PATCH] unshare: Use rcu_assign_pointer when setting sighandEric W. Biederman1-1/+1
The sighand pointer only needs the rcu_read_lock on the read side. So only depending on task_lock protection when setting this pointer is not enough. We also need a memory barrier to ensure the initialization is seen first. Use rcu_assign_pointer as it does this for us, and clearly documents that we are setting an rcu readable pointer. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-14[PATCH] Fix sigaltstack corruption among cloned threadsGOTO Masanori1-0/+6
This patch fixes alternate signal stack corruption among cloned threads with CLONE_SIGHAND (and CLONE_VM) for linux-2.6.16-rc6. The value of alternate signal stack is currently inherited after a call of clone(... CLONE_SIGHAND | CLONE_VM). But if sigaltstack is set by a parent thread, and then if multiple cloned child threads (+ parent threads) call signal handler at the same time, some threads may be conflicted - because they share to use the same alternative signal stack region. Finally they get sigsegv. It's an undesirable race condition. Note that child threads created from NPTL pthread_create() also hit this conflict when the parent thread uses sigaltstack, without my patch. To fix this problem, this patch clears the child threads' sigaltstack information like exec(). This behavior follows the SUSv3 specification. In SUSv3, pthread_create() says "The alternate stack shall not be inherited (when new threads are initialized)". It means that sigaltstack should be cleared when sigaltstack memory space is shared by cloned threads with CLONE_SIGHAND. Note that I chose "if (clone_flags & CLONE_SIGHAND)" line because: - If clone_flags line is not existed, fork() does not inherit sigaltstack. - CLONE_VM is another choice, but vfork() does not inherit sigaltstack. - CLONE_SIGHAND implies CLONE_VM, and it looks suitable. - CLONE_THREAD is another candidate, and includes CLONE_SIGHAND + CLONE_VM, but this flag has a bit different semantics. I decided to use CLONE_SIGHAND. [ Changed to test for CLONE_VM && !CLONE_VFORK after discussion --Linus ] Signed-off-by: GOTO Masanori <gotom@sanori.org> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Linus Torvalds <torvalds@osdl.org> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-11[PATCH] remove __put_task_struct_cb export againChristoph Hellwig2-8/+3
The patch '[PATCH] RCU signal handling' [1] added an export for __put_task_struct_cb, a put_task_struct helper newly introduced in that patch. But the put_task_struct couldn't be used modular previously as __put_task_struct wasn't exported. There are not callers of it in modular code, and it shouldn't be exported because we don't want drivers to hold references to task_structs. This patch removes the export and folds __put_task_struct into __put_task_struct_cb as there's no other caller. [1] http://www2.kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e56d090310d7625ecb43a1eeebd479f04affb48b Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08[PATCH] fix file countingDipankar Sarma1-1/+4
I have benchmarked this on an x86_64 NUMA system and see no significant performance difference on kernbench. Tested on both x86_64 and powerpc. The way we do file struct accounting is not very suitable for batched freeing. For scalability reasons, file accounting was constructor/destructor based. This meant that nr_files was decremented only when the object was removed from the slab cache. This is susceptible to slab fragmentation. With RCU based file structure, consequent batched freeing and a test program like Serge's, we just speed this up and end up with a very fragmented slab - llm22:~ # cat /proc/sys/fs/file-nr 587730 0 758844 At the same time, I see only a 2000+ objects in filp cache. The following patch I fixes this problem. This patch changes the file counting by removing the filp_count_lock. Instead we use a separate percpu counter, nr_files, for now and all accesses to it are through get_nr_files() api. In the sysctl handler for nr_files, we populate files_stat.nr_files before returning to user. Counting files as an when they are created and destroyed (as opposed to inside slab) allows us to correctly count open files with RCU. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08[PATCH] rcu batch tuningDipankar Sarma1-18/+58
This patch adds new tunables for RCU queue and finished batches. There are two types of controls - number of completed RCU updates invoked in a batch (blimit) and monitoring for high rate of incoming RCUs on a cpu (qhimark, qlowmark). By default, the per-cpu batch limit is set to a small value. If the input RCU rate exceeds the high watermark, we do two things - force quiescent state on all cpus and set the batch limit of the CPU to INTMAX. Setting batch limit to INTMAX forces all finished RCUs to be processed in one shot. If we have more than INTMAX RCUs queued up, then we have bigger problems anyway. Once the incoming queued RCUs fall below the low watermark, the batch limit is set to the default. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08[PATCH] idle threads should have a sane ->timestamp valueIngo Molnar1-0/+1
Idle threads should have a sane ->timestamp value, to avoid init kernel thread(s) from inheriting it and causing miscalculations in try_to_wake_up(). Reported-by: Mike Galbraith <efault@gmx.de>. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06[PATCH] time: add barrier after updating jiffies_64Atsushi Nemoto1-0/+2
Add a compiler barrier so that we don't read jiffies before updating jiffies_64. Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06[PATCH] fix next_timer_interrupt() for hrtimerTony Lindgren2-0/+51
Also from Thomas Gleixner <tglx@linutronix.de> Function next_timer_interrupt() got broken with a recent patch 6ba1b91213e81aa92b5cf7539f7d2a94ff54947c as sys_nanosleep() was moved to hrtimer. This broke things as next_timer_interrupt() did not check hrtimer tree for next event. Function next_timer_interrupt() is needed with dyntick (CONFIG_NO_IDLE_HZ, VST) implementations, as the system can be in idle when next hrtimer event was supposed to happen. At least ARM and S390 currently use next_timer_interrupt(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06Add early-boot-safety check to cond_resched()Linus Torvalds1-0/+2
Just to be safe, we should not trigger a conditional reschedule during the early boot sequence. We've historically done some questionable early on, and the safety warnings in __might_sleep() are generally turned off during that period, so there might be problems lurking. This affects CONFIG_PREEMPT_VOLUNTARY, which takes over might_sleep() to cause a voluntary conditional reschedule. Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-02[PATCH] time_interpolator: Use readq_relaxed() instead of readq().Christoph Lameter1-2/+2
On some platforms readq performs additional work to make sure I/O is done in a coherent way. This is not needed for time retrieval as done by the time interpolator. So we can use readq_relaxed instead which will improve performance. It affects sparc64 and ia64 only. Apparently it makes a significant difference on ia64. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: john stultz <johnstul@us.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-02[PATCH] fix acpi_video_flags on x86-64Stefan Seyfried1-1/+1
acpi_video_flags variable is unsigned long, so it should be set as such. This actually matters on x86-64. Signed-off-by: Stefan Seyfried <seife@suse.de> Signed-off-by: Pavel Machek <pavel@suse.cz> Cc: "Brown, Len" <len.brown@intel.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-28[IA64] sysctl option to silence unaligned trap warningsJes Sorensen1-0/+14
Allow sysadmin to disable all warnings about userland apps making unaligned accesses by using: # echo 1 > /proc/sys/kernel/ignore-unaligned-usertrap Rather than having to use prctl on a process by process basis. Default behaivour leaves the warnings enabled. Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2006-02-20[PATCH] kjournald keeps reference to namespaceBjörn Steinbrink1-0/+3
In daemonize() a new thread gets cleaned up and 'merged' with init_task. The current fs_struct is handled there, but not the current namespace. This adds the namespace part. [ Eric Biederman pointed out the namespace wrappers, and also notes that we can't ever count on using our parents namespace because we already have called exit_fs(), which is the only way to the namespace from a process. ] Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de> Acked-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20Merge branch 'fixes.b8' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/viro/bird
2006-02-20[PATCH] Fix compile for CONFIG_SYSVIPC=n or CONFIG_SYSCTL=nStephen Rothwell1-0/+2
The compat syscalls are added to sys_ni.c since they are not defined if the above CONFIG options are off. Also, nfs would not build with CONFIG_SYSCTL off. Noticed by Arthur Othieno. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20[PATCH] Fix undefined symbols for nommu architectureLuke Yang1-0/+2
Signed-off-by: Luke Yang <luke.adi@gmail.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20[PATCH] suspend-to-ram: allow video options to be set at runtimePavel Machek1-4/+12
Currently, acpi video options can only be set on kernel command line. That's little inflexible; I'd like userland s2ram application that just works, and modifying kernel command line according to whitelist is not fun. It is better to just allow s2ram application to set video options just before suspend (according to the whitelist). This implements sysctl to allow setting suspend video options without reboot. (akpm: Documentation updates for this new sysctl are pending..) Signed-off-by: Pavel Machek <pavel@suse.cz> Cc: "Brown, Len" <len.brown@intel.com> Cc: "Antonino A. Daplas" <adaplas@pol.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-18[PATCH] GFP_KERNEL allocations in atomic (auditsc)Al Viro1-3/+3
audit_log_exit() is called from atomic contexts and gets explicit gfp_mask argument; it should use it for all allocations rather than doing some with gfp_mask and some with GFP_KERNEL. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-02-17[PATCH] swsusp: fix breakage with swap on LVMRafael J. Wysocki1-3/+1
Restore the compatibility with the older code and make it possible to suspend if the kernel command line doesn't contain the "resume=" argument Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17[PATCH] Introduce CONFIG_DEFAULT_MIGRATION_COSTIngo Molnar1-1/+12
Heiko Carstens <heiko.carstens@de.ibm.com> wrote: The boot sequence on s390 sometimes takes ages and we spend a very long time (up to one or two minutes) in calibrate_migration_costs. The time spent there differs from boot to boot. Also the calculated costs differ a lot. I've seen differences by up to a factor of 15 (yes, factor not percent). Also I doubt that making these measurements make much sense on a completely virtualized architecture where you cannot tell how much cpu time you will get anyway. So introduce the CONFIG_DEFAULT_MIGRATION_COST method for an architecture to set the scheduler migration costs. This turns off automatic detection of migration costs. Makes sense on virtual platforms, where migration costs are hard to measure accurately. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17[PATCH] Provide an interface for getting the current tick lengthPaul Mackerras1-5/+34
This provides an interface for arch code to find out how many nanoseconds are going to be added on to xtime by the next call to do_timer. The value returned is a fixed-point number in 52.12 format in nanoseconds. The reason for this format is that it gives the full precision that the timekeeping code is using internally. The motivation for this is to fix a problem that has arisen on 32-bit powerpc in that the value returned by do_gettimeofday drifts apart from xtime if NTP is being used. PowerPC is now using a lockless do_gettimeofday based on reading the timebase register and performing some simple arithmetic. (This method of getting the time is also exported to userspace via the VDSO.) However, the factor and offset it uses were calculated based on the nominal tick length and weren't being adjusted when NTP varied the tick length. Note that 64-bit powerpc has had the lockless do_gettimeofday for a long time now. It also had an extremely hairy routine that got called from the 32-bit compat routine for adjtimex, which adjusted the factor and offset according to what it thought the timekeeping code was going to do. Not only was this only called if a 32-bit task did adjtimex (i.e. not if a 64-bit task did adjtimex), it was also duplicating computations from kernel/timer.c and it wasn't clear that it was (still) correct. The simple solution is to ask the timekeeping code how long the current jiffy will be on each timer interrupt, after calling do_timer. If this jiffy will be a different length from the last one, we then need to compute new values for the factor and offset used in the lockless do_gettimeofday. In this way we can keep xtime and do_gettimeofday in sync, even when NTP is varying the tick length. Note that when adjtimex varies the tick length, it almost always introduces the variation from the next tick on. The only case I could see where adjtimex would vary the length of the current tick is when an old-style adjtime adjustment is being cancelled. (It's not clear to me why the adjustment has to be cancelled immediately rather than from the next tick on.) Thus I don't see any real need for a hook in adjtimex; the rare case of an old-style adjustment being cancelled can be fixed up at the next tick. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: john stultz <johnstul@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17[PATCH] x86_64: Add boot option to disable randomized mappings and cleanupAndi Kleen1-2/+0
AMD SimNow!'s JIT doesn't like them at all in the guest. For distribution installation it's easiest if it's a boot time option. Also I moved the variable to a more appropiate place and make it independent from sysctl And marked __read_mostly which it is. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15[PATCH] swsusp: nuke noisy messageAndrew Morton1-3/+1
I get about 88 squillion of these when suspending an old ad450nx server. Cc: Pavel Roskin <proski@gnu.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15[PATCH] cpuset: oops in exit on null cpuset fixPaul Jackson1-1/+34
Fix a latent bug in cpuset_exit() handling. If a task tried to allocate memory after calling cpuset_exit(), it oops'd in cpuset_update_task_memory_state() on a NULL cpuset pointer. So set the exiting tasks cpuset to the root cpuset instead of to NULL. A distro kernel hit this with an added kernel package that had just such a hook (allocating memory) in the exit code path. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15[PATCH] fix zap_thread's ptrace related problemsOleg Nesterov1-10/+15
1. The tracee can go from ptrace_stop() to do_signal_stop() after __ptrace_unlink(p). 2. It is unsafe to __ptrace_unlink(p) while p->parent may wait for tasklist_lock in ptrace_detach(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Christoph Hellwig <hch@lst.de> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15[PATCH] fix kill_proc_info() vs fork() theoretical raceOleg Nesterov1-2/+2
copy_process: attach_pid(p, PIDTYPE_PID, p->pid); attach_pid(p, PIDTYPE_TGID, p->tgid); What if kill_proc_info(p->pid) happens in between? copy_process() holds current->sighand.siglock, so we are safe in CLONE_THREAD case, because current->sighand == p->sighand. Otherwise, p->sighand is unlocked, the new process is already visible to the find_task_by_pid(), but have a copy of parent's 'struct pid' in ->pids[PIDTYPE_TGID]. This means that __group_complete_signal() may hang while doing do ... while (next_thread() != p) We can solve this problem if we reverse these 2 attach_pid()s: attach_pid() does wmb() group_send_sig_info() calls spin_lock(), which provides a read barrier. // Yes ? I don't think we can hit this race in practice, but still. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15[PATCH] fix kill_proc_info() vs CLONE_THREAD raceOleg Nesterov1-3/+2
There is a window after copy_process() unlocks ->sighand.siglock and before it adds the new thread to the thread list. In that window __group_complete_signal(SIGKILL) will not see the new thread yet, so this thread will start running while the whole thread group was supposed to exit. I beleive we have another good reason to place attach_pid(PID/TGID) under ->sighand.siglock. We can do the same for release_task()->__unhash_process() de_thread()->switch_exec_pids() After that we don't need tasklist_lock to iterate over the thread list, and we can simplify things, see for example do_sigaction() or sys_times(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14[PATCH] hrtimer: round up relative start time on low-res archesIngo Molnar1-1/+12
CONFIG_TIME_LOW_RES is a temporary way for architectures to signal that they simply return xtime in do_gettimeoffset(). In this corner-case we want to round up by resolution when starting a relative timer, to avoid short timeouts. This will go away with the GTOD framework. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14[PATCH] sched: revert "filter affine wakeups"Chen, Kenneth W1-9/+1
Revert commit d7102e95b7b9c00277562c29aad421d2d521c5f6: [PATCH] sched: filter affine wakeups Apparently caused more than 10% performance regression for aim7 benchmark. The setup in use is 16-cpu HP rx8620, 64Gb of memory and 12 MSA1000s with 144 disks. Each disk is 72Gb with a single ext3 filesystem (courtesy of HP, who supplied benchmark results). The problem is, for aim7, the wake-up pattern is random, but it still needs load balancing action in the wake-up path to achieve best performance. With the above commit, lack of load balancing hurts that workload. However, for workloads like database transaction processing, the requirement is exactly opposite. In the wake up path, best performance is achieved with absolutely zero load balancing. We simply wake up the process on the CPU that it was previously run. Worst performance is obtained when we do load balancing at wake up. There isn't an easy way to auto detect the workload characteristics. Ingo's earlier patch that detects idle CPU and decide whether to load balance or not doesn't perform with aim7 either since all CPUs are busy (it causes even bigger perf. regression). Revert commit d7102e95b7b9c00277562c29aad421d2d521c5f6, which causes more than 10% performance regression with aim7. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14[PATCH] compound page: no access_process_vm checkHugh Dickins1-2/+1
The PageCompound check before access_process_vm's set_page_dirty_lock is no longer necessary, so remove it. But leave the PageCompound checks in bio_set_pages_dirty, dio_bio_complete and nfs_free_user_pages: at least some of those were introduced as a little optimization on hugetlb pages. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-10[PATCH] prevent recursive panic from softlockup watchdogJan Beulich1-0/+1
When panic_timeout is zero, suppress triggering a nested panic due to soft lockup detection. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-10[PATCH] sched: remove smpniceNick Piggin1-111/+18
I don't think the code is quite ready, which is why I asked for Peter's additions to also be merged before I acked it (although it turned out that it still isn't quite ready with his additions either). Basically I have had similar observations to Suresh in that it does not play nicely with the rest of the balancing infrastructure (and raised similar concerns in my review). The samples (group of 4) I got for "maximum recorded imbalance" on a 2x2 SMP+HT Xeon are as follows: | Following boot | hackbench 20 | hackbench 40 -----------+----------------+---------------------+--------------------- 2.6.16-rc2 | 30,37,100,112 | 5600,5530,6020,6090 | 6390,7090,8760,8470 +nosmpnice | 3, 2, 4, 2 | 28, 150, 294, 132 | 348, 348, 294, 347 Hackbench raw performance is down around 15% with smpnice (but that in itself isn't a huge deal because it is just a benchmark). However, the samples show that the imbalance passed into move_tasks is increased by about a factor of 10-30. I think this would also go some way to explaining latency blips turning up in the balancing code (though I haven't actually measured that). We'll probably have to revert this in the SUSE kernel. Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Peter Williams <pwil3058@bigpond.net.au> Cc: "Martin J. Bligh" <mbligh@aracnet.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-09[PATCH] do_sigaction: cleanup ->sa_mask manipulationOleg Nesterov1-5/+3
Clear unblockable signals beforehand. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-09[PATCH] sys_signal: initialize ->sa_maskOleg Nesterov1-0/+1
Pointed out by Linus Torvalds. sys_signal() forgets to initialize ->sa_mask. ( I suspect arch/ia64/ia32/ia32_signal.c:sys32_signal() also needs this fix ) Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07[PATCH] kernel/sys.c NULL noise removalAl Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>