summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2015-08-06audit: clean simple fsnotify implementationRichard Guy Briggs4-2/+232
This is to be used to audit by executable path rules, but audit watches should be able to share this code eventually. At the moment the audit watch code is a lot more complex. That code only creates one fsnotify watch per parent directory. That 'audit_parent' in turn has a list of 'audit_watches' which contain the name, ino, dev of the specific object we care about. This just creates one fsnotify watch per object we care about. So if you watch 100 inodes in /etc this code will create 100 fsnotify watches on /etc. The audit_watch code will instead create 1 fsnotify watch on /etc (the audit_parent) and then 100 individual watches chained from that fsnotify mark. We should be able to convert the audit_watch code to do one fsnotify mark per watch and simplify things/remove a whole lot of code. After that conversion we should be able to convert the audit_fsnotify code to support that hierarchy if the optimization is necessary. Move the access to the entry for audit_match_signal() to the beginning of the audit_del_rule() function in case the entry found is the same one passed in. This will enable it to be used by audit_autoremove_mark_rule(), kill_rules() and audit_remove_parent_watches(). This is a heavily modified and merged version of two patches originally submitted by Eric Paris. Cc: Peter Moody <peter@hda3.com> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: added a space after a declaration to keep ./scripts/checkpatch happy] Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-08-06audit: use macros for unset inode and device valuesRichard Guy Briggs3-8/+8
Clean up a number of places were casted magic numbers are used to represent unset inode and device numbers in preparation for the audit by executable path patch set. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: enclosed the _UNSET macros in parentheses for ./scripts/checkpatch] Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-08-05audit: make audit_del_rule() more robustRichard Guy Briggs1-6/+6
Move the access to the entry for audit_match_signal() to earlier in the function in case the entry found is the same one passed in. This will enable it to be used by audit_remove_mark_rule(). Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: tweaked subject line as it no longer made sense after multiple revs] Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-08-05audit: fix uninitialized variable in audit_add_rule()Paul Moore1-1/+1
As reported by the 0-Day testing service: kernel/auditfilter.c: In function 'audit_rule_change': >> kernel/auditfilter.c:864:6: warning: 'err' may be used uninit... int err; Cc: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-08-04audit: eliminate unnecessary extra layer of watch parent referencesRichard Guy Briggs1-4/+2
The audit watch parent count was imbalanced, adding an unnecessary layer of watch parent references. Decrement the additional parent reference when a watch is reused, already having a reference to the parent. audit_find_parent() gets a reference to the parent, if the parent is already known. This additional parental reference is not needed if the watch is subsequently found by audit_add_to_parent(), and consumed if the watch does not already exist, so we need to put the parent if the watch is found, and do nothing if this new watch is added to the parent. If the parent wasn't already known, it is created with a refcount of 1 and added to the audit_watch_group, then incremented by one to be subsequently consumed by the newly created watch in audit_add_to_parent(). The rule points to the watch, not to the parent, so the rule's refcount gets bumped, not the parent's. See LKML, 2015-07-16 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-08-04audit: eliminate unnecessary extra layer of watch referencesRichard Guy Briggs2-16/+5
The audit watch count was imbalanced, adding an unnecessary layer of watch references. Only add the second reference when it is added to a parent. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-06-11audit: Fix check of return value of strnlen_user()Jan Kara1-1/+1
strnlen_user() returns 0 when it hits fault, not -1. Fix the test in audit_log_single_execve_arg(). Luckily this shouldn't ever happen unless there's a kernel bug so it's mostly a cosmetic fix. CC: Paul Moore <pmoore@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-05-29audit: obsolete audit_context check is removed in audit_filter_rules()Mikhail Klementyev1-3/+1
Signed-off-by: Mikhail Klementyev <jollheef@riseup.net> [PM: patch applied by hand due to HTML mangling, rewrote subject line] Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-05-29audit: fix for typo in comment to function audit_log_link_denied()Shailendra Verma1-1/+1
Signed-off-by: Shailendra Verma <shailendra.capricorn@gmail.com> [PM: tweaked subject line] Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-04-09Merge tag 'pm+acpi-4.0-rc8' of ↵Linus Torvalds1-20/+1
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management and ACPI fixes from Rafael Wysocki: "These are stable-candidate fixes of some recently reported issues in the cpufreq core, cpuidle core, the ACPI cpuidle driver and the hibernate core. Specifics: - Revert a 3.17 hibernate commit that was supposed to fix an issue related to e820 reserved regions, but broke resume from hibernation on Lenovo x230 (Rafael J Wysocki). - Prevent the ACPI cpuidle driver from overwriting the name and description of the C0 state set by the core when the list of C-states changes (Thomas Schlichter). - Remove the no longer needed state_count field from struct cpuidle_device which prevents the list of C-states shown by the sysfs interface from becoming incorrect when the current number of them is different from the number of C-states on boot (Bartlomiej Zolnierkiewicz). - The cpufreq core updates the policy object of the only online CPU during system resume to make it reflect the current hardware state, but it always assumes that CPU to be CPU0 which need not be the case, so fix the code to avoid that assumption (Viresh Kumar)" * tag 'pm+acpi-4.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: Revert "PM / hibernate: avoid unsafe pages in e820 reserved regions" cpuidle: ACPI: do not overwrite name and description of C0 cpuidle: remove state_count field from struct cpuidle_device cpufreq: Schedule work for the first-online CPU on resume
2015-04-09Merge branches 'pm-sleep', 'pm-cpufreq' and 'pm-cpuidle'Rafael J. Wysocki1-20/+1
* pm-sleep: Revert "PM / hibernate: avoid unsafe pages in e820 reserved regions" * pm-cpufreq: cpufreq: Schedule work for the first-online CPU on resume * pm-cpuidle: cpuidle: ACPI: do not overwrite name and description of C0 cpuidle: remove state_count field from struct cpuidle_device
2015-04-08Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+3
Merge misc fixes from Andrew Morton: "Three fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: numa: disable change protection for vma(VM_HUGETLB) include/linux/dmapool.h: declare struct device mm: move zone lock to a different cache line than order-0 free page lists
2015-04-08Copy the kernel module data from user space in chunksLinus Torvalds1-1/+18
Unlike most (all?) other copies from user space, kernel module loading is almost unlimited in size. So we do a potentially huge "copy_from_user()" when we copy the module data from user space to the kernel buffer, which can be a latency concern when preemption is disabled (or voluntary). Also, because 'copy_from_user()' clears the tail of the kernel buffer on failures, even a *failed* copy can end up wasting a lot of time. Normally neither of these are concerns in real life, but they do trigger when doing stress-testing with trinity. Running in a VM seems to add its own overheadm causing trinity module load testing to even trigger the watchdog. The simple fix is to just chunk up the module loading, so that it never tries to copy insanely big areas in one go. That bounds the latency, and also the amount of (unnecessarily, in this case) cleared memory for the failure case. Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-07mm: numa: disable change protection for vma(VM_HUGETLB)Naoya Horiguchi1-1/+3
Currently when a process accesses a hugetlb range protected with PROTNONE, unexpected COWs are triggered, which finally puts the hugetlb subsystem into a broken/uncontrollable state, where for example h->resv_huge_pages is subtracted too much and wraps around to a very large number, and the free hugepage pool is no longer maintainable. This patch simply stops changing protection for vma(VM_HUGETLB) to fix the problem. And this also allows us to avoid useless overhead of minor faults. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Suggested-by: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-07Revert "PM / hibernate: avoid unsafe pages in e820 reserved regions"Rafael J. Wysocki1-20/+1
Commit 84c91b7ae07c (PM / hibernate: avoid unsafe pages in e820 reserved regions) is reported to make resume from hibernation on Lenovo x230 unreliable, so revert it. We will revisit the issue the commit in question was supposed to fix in the future. Link: https://bugzilla.kernel.org/show_bug.cgi?id=96111 Reported-by: rhn <kebuac.rhn@porcupinefactory.org> Cc: 3.17+ <stable@vger.kernel.org> # 3.17+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-04-01Merge tag 'lazytime_fix' of ↵Linus Torvalds1-0/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull lazytime fixes from Ted Ts'o: "This fixes a problem in the lazy time patches, which can cause frequently updated inods to never have their timestamps updated. These changes guarantee that no timestamp on disk will be stale by more than 24 hours" * tag 'lazytime_fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: fs: add dirtytime_expire_seconds sysctl fs: make sure the timestamps for lazytime inodes eventually get written
2015-03-28Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds1-2/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "Two clocksource driver fixes, and an idle loop RCU warning fix" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource/drivers/sun5i: Fix cpufreq interaction with sched_clock() clocksource/drivers: Fix various !CONFIG_HAS_IOMEM build errors timers/tick/broadcast-hrtimer: Fix suspicious RCU usage in idle loop
2015-03-28Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "A single sched/rt corner case fix for RLIMIT_RTIME correctness" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix RLIMIT_RTTIME when PI-boosting to RT
2015-03-28Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds1-0/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Ingo Molnar: "A perf kernel side fix for a fuzzer triggered lockup" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix irq_work 'tail' recursion
2015-03-28Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds2-30/+59
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "A module unload lockdep race fix" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: lockdep: Fix the module unload key range freeing logic
2015-03-25mm: numa: slow PTE scan rate if migration failures occurMel Gorman1-2/+6
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226 Across the board the 4.0-rc1 numbers are much slower, and the degradation is far worse when using the large memory footprint configs. Perf points straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config: - 56.07% 56.07% [kernel] [k] default_send_IPI_mask_sequence_phys - default_send_IPI_mask_sequence_phys - 99.99% physflat_send_IPI_mask - 99.37% native_send_call_func_ipi smp_call_function_many - native_flush_tlb_others - 99.85% flush_tlb_page ptep_clear_flush try_to_unmap_one rmap_walk try_to_unmap migrate_pages migrate_misplaced_page - handle_mm_fault - 99.73% __do_page_fault trace_do_page_fault do_async_page_fault + async_page_fault 0.63% native_send_call_func_single_ipi generic_exec_single smp_call_function_single This is showing excessive migration activity even though excessive migrations are meant to get throttled. Normally, the scan rate is tuned on a per-task basis depending on the locality of faults. However, if migrations fail for any reason then the PTE scanner may scan faster if the faults continue to be remote. This means there is higher system CPU overhead and fault trapping at exactly the time we know that migrations cannot happen. This patch tracks when migration failures occur and slows the PTE scanner. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Dave Chinner <david@fromorbit.com> Tested-by: Dave Chinner <david@fromorbit.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-23timers/tick/broadcast-hrtimer: Fix suspicious RCU usage in idle loopPreeti U Murthy1-2/+9
The hrtimer mode of broadcast queues hrtimers in the idle entry path so as to wakeup cpus in deep idle states. The associated call graph is : cpuidle_idle_call() |____ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, ....)) |_____tick_broadcast_set_event() |____clockevents_program_event() |____bc_set_next() The hrtimer_{start/cancel} functions call into tracing which uses RCU. But it is not legal to call into RCU in cpuidle because it is one of the quiescent states. Hence protect this region with RCU_NONIDLE which informs RCU that the cpu is momentarily non-idle. As an aside it is helpful to point out that the clock event device that is programmed here is not a per-cpu clock device; it is a pseudo clock device, used by the broadcast framework alone. The per-cpu clock device programming never goes through bc_set_next(). Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: linuxppc-dev@ozlabs.org Cc: mpe@ellerman.id.au Cc: tglx@linutronix.de Link: http://lkml.kernel.org/r/20150318104705.17763.56668.stgit@preeti.in.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23lockdep: Fix the module unload key range freeing logicPeter Zijlstra2-30/+59
Module unload calls lockdep_free_key_range(), which removes entries from the data structures. Most of the lockdep code OTOH assumes the data structures are append only; in specific see the comments in add_lock_to_list() and look_up_lock_class(). Clearly this has only worked by accident; make it work proper. The actual scenario to make it go boom would involve the memory freed by the module unlock being re-allocated and re-used for a lock inside of a rcu-sched grace period. This is a very unlikely scenario, still better plug the hole. Use RCU list iteration in all places and ammend the comments. Change lockdep_free_key_range() to issue a sync_sched() between removal from the lists and returning -- which results in the memory being freed. Further ensure the callers are placed correctly and comment the requirements. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andrey Tsyvarev <tsyvarev@ispras.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23sched: Fix RLIMIT_RTTIME when PI-boosting to RTBrian Silverman1-0/+2
When non-realtime tasks get priority-inheritance boosted to a realtime scheduling class, RLIMIT_RTTIME starts to apply to them. However, the counter used for checking this (the same one used for SCHED_RR timeslices) was not getting reset. This meant that tasks running with a non-realtime scheduling class which are repeatedly boosted to a realtime one, but never block while they are running realtime, eventually hit the timeout without ever running for a time over the limit. This patch resets the realtime timeslice counter when un-PI-boosting from an RT to a non-RT scheduling class. I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT mutex which induces priority boosting and spins while boosted that gets killed by a SIGXCPU on non-fixed kernels but doesn't with this patch applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and does happen eventually with PREEMPT_VOLUNTARY kernels. Signed-off-by: Brian Silverman <brian@peloton-tech.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: austin@peloton-tech.com Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1424305436-6716-1-git-send-email-brian@peloton-tech.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23perf: Fix irq_work 'tail' recursionPeter Zijlstra1-0/+10
Vince reported a watchdog lockup like: [<ffffffff8115e114>] perf_tp_event+0xc4/0x210 [<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160 [<ffffffff810b7f10>] lock_release+0x130/0x260 [<ffffffff816c7474>] _raw_spin_unlock_irqrestore+0x24/0x40 [<ffffffff8107bb4d>] do_send_sig_info+0x5d/0x80 [<ffffffff811f69df>] send_sigio_to_task+0x12f/0x1a0 [<ffffffff811f71ce>] send_sigio+0xae/0x100 [<ffffffff811f72b7>] kill_fasync+0x97/0xf0 [<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0 [<ffffffff8115d103>] perf_pending_event+0x33/0x60 [<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80 [<ffffffff8114e448>] irq_work_run+0x18/0x40 [<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0 [<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80 Which is caused by an irq_work generating new irq_work and therefore not allowing forward progress. This happens because processing the perf irq_work triggers another perf event (tracepoint stuff) which in turn generates an irq_work ad infinitum. Avoid this by raising the recursion counter in the irq_work -- which effectively disables all software events (including tracepoints) from actually triggering again. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20150219170311.GH21418@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-18Merge branch 'for-linus' of ↵Linus Torvalds1-4/+26
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching Pull livepatching fix from Jiri Kosina: - fix for potential race with module loading, from Petr Mladek. The race is very unlikely to be seen in real world and has been found by code inspection, but should be fixed for 4.0 anyway. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching: livepatch: Fix subtle race with coming and going modules
2015-03-17Merge branches 'perf-urgent-for-linus' and 'timers-urgent-for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf and timer fixes from Ingo Molnar: "Two small perf fixes: - kernel side context leak fix - tooling crash fix And two clocksource driver fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix context leak in put_event() perf annotate: Fix fallback to unparsed disassembler line * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clockevents: sun5i: Fix setup_irq init sequence clocksource: efm32: Fix a NULL pointer dereference
2015-03-17fs: add dirtytime_expire_seconds sysctlTheodore Ts'o1-0/+8
Add a tuning knob so we can adjust the dirtytime expiration timeout, which is very useful for testing lazytime. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2015-03-17livepatch: Fix subtle race with coming and going modulesPetr Mladek1-4/+26
There is a notifier that handles live patches for coming and going modules. It takes klp_mutex lock to avoid races with coming and going patches but it does not keep the lock all the time. Therefore the following races are possible: 1. The notifier is called sometime in STATE_MODULE_COMING. The module is visible by find_module() in this state all the time. It means that new patch can be registered and enabled even before the notifier is called. It might create wrong order of stacked patches, see below for an example. 2. New patch could still see the module in the GOING state even after the notifier has been called. It will try to initialize the related object structures but the module could disappear at any time. There will stay mess in the structures. It might even cause an invalid memory access. This patch solves the problem by adding a boolean variable into struct module. The value is true after the coming and before the going handler is called. New patches need to be applied when the value is true and they need to ignore the module when the value is false. Note that we need to know state of all modules on the system. The races are related to new patches. Therefore we do not know what modules will get patched. Also note that we could not simply ignore going modules. The code from the module could be called even in the GOING state until mod->exit() finishes. If we start supporting patches with semantic changes between function calls, we need to apply new patches to any still usable code. See below for an example. Finally note that the patch solves only the situation when a new patch is registered. There are no such problems when the patch is being removed. It does not matter who disable the patch first, whether the normal disable_patch() or the module notifier. There is nothing to do once the patch is disabled. Alternative solutions: ====================== + reject new patches when a patched module is coming or going; this is ugly + wait with adding new patch until the module leaves the COMING and GOING states; this might be dangerous and complicated; we would need to release kgr_lock in the middle of the patch registration to avoid a deadlock with the coming and going handlers; also we might need a waitqueue for each module which seems to be even bigger overhead than the boolean + stop modules from entering COMING and GOING states; wait until modules leave these states when they are already there; looks complicated; we would need to ignore the module that asked to stop the others to avoid a deadlock; also it is unclear what to do when two modules asked to stop others and both are in COMING state (situation when two new patches are applied) + always register/enable new patches and fix up the potential mess (registered patches order) in klp_module_init(); this is nasty and prone to regressions in the future development + add another MODULE_STATE where the kallsyms are visible but the module is not used yet; this looks too complex; the module states are checked on "many" locations Example of patch stacking breakage: =================================== The notifier could _not_ _simply_ ignore already initialized module objects. For example, let's have three patches (P1, P2, P3) for functions a() and b() where a() is from vmcore and b() is from a module M. Something like: a() b() P1 a1() b1() P2 a2() b2() P3 a3() b3(3) If you load the module M after all patches are registered and enabled. The ftrace ops for function a() and b() has listed the functions in this order: ops_a->func_stack -> list(a3,a2,a1) ops_b->func_stack -> list(b3,b2,b1) , so the pointer to b3() is the first and will be used. Then you might have the following scenario. Let's start with state when patches P1 and P2 are registered and enabled but the module M is not loaded. Then ftrace ops for b() does not exist. Then we get into the following race: CPU0 CPU1 load_module(M) complete_formation() mod->state = MODULE_STATE_COMING; mutex_unlock(&module_mutex); klp_register_patch(P3); klp_enable_patch(P3); # STATE 1 klp_module_notify(M) klp_module_notify_coming(P1); klp_module_notify_coming(P2); klp_module_notify_coming(P3); # STATE 2 The ftrace ops for a() and b() then looks: STATE1: ops_a->func_stack -> list(a3,a2,a1); ops_b->func_stack -> list(b3); STATE2: ops_a->func_stack -> list(a3,a2,a1); ops_b->func_stack -> list(b2,b1,b3); therefore, b2() is used for the module but a3() is used for vmcore because they were the last added. Example of the race with going modules: ======================================= CPU0 CPU1 delete_module() #SYSCALL try_stop_module() mod->state = MODULE_STATE_GOING; mutex_unlock(&module_mutex); klp_register_patch() klp_enable_patch() #save place to switch universe b() # from module that is going a() # from core (patched) mod->exit(); Note that the function b() can be called until we call mod->exit(). If we do not apply patch against b() because it is in MODULE_STATE_GOING, it will call patched a() with modified semantic and things might get wrong. [jpoimboe@redhat.com: use one boolean instead of two] Signed-off-by: Petr Mladek <pmladek@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-03-13perf: Fix context leak in put_event()Leon Yu1-1/+1
Commit: a83fe28e2e45 ("perf: Fix put_event() ctx lock") changed the locking logic in put_event() by replacing mutex_lock_nested() with perf_event_ctx_lock_nested(), but didn't fix the subsequent mutex_unlock() with a correct counterpart, perf_event_ctx_unlock(). Contexts are thus leaked as a result of incremented refcount in perf_event_ctx_lock_nested(). Signed-off-by: Leon Yu <chianglungyu@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Fixes: a83fe28e2e45 ("perf: Fix put_event() ctx lock") Link: http://lkml.kernel.org/r/1424954613-5034-1-git-send-email-chianglungyu@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-12kasan, module, vmalloc: rework shadow allocation for modulesAndrey Ryabinin1-2/+0
Current approach in handling shadow memory for modules is broken. Shadow memory could be freed only after memory shadow corresponds it is no longer used. vfree() called from interrupt context could use memory its freeing to store 'struct llist_node' in it: void vfree(const void *addr) { ... if (unlikely(in_interrupt())) { struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred); if (llist_add((struct llist_node *)addr, &p->list)) schedule_work(&p->wq); Later this list node used in free_work() which actually frees memory. Currently module_memfree() called in interrupt context will free shadow before freeing module's memory which could provoke kernel crash. So shadow memory should be freed after module's memory. However, such deallocation order could race with kasan_module_alloc() in module_alloc(). Free shadow right before releasing vm area. At this point vfree()'d memory is not used anymore and yet not available for other allocations. New VM_KASAN flag used to indicate that vm area has dynamically allocated shadow memory so kasan frees shadow only if it was previously allocated. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-09Merge tag 'trace-fixes-v4.0-rc2-2' of ↵Linus Torvalds1-10/+30
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull seq-buf/ftrace fixes from Steven Rostedt: "This includes fixes for seq_buf_bprintf() truncation issue. It also contains fixes to ftrace when /proc/sys/kernel/ftrace_enabled and function tracing are started. Doing the following causes some issues: # echo 0 > /proc/sys/kernel/ftrace_enabled # echo function_graph > /sys/kernel/debug/tracing/current_tracer # echo 1 > /proc/sys/kernel/ftrace_enabled # echo nop > /sys/kernel/debug/tracing/current_tracer # echo function_graph > /sys/kernel/debug/tracing/current_tracer As well as with function tracing too. Pratyush Anand first reported this issue to me and supplied a patch. When I tested this on my x86 test box, it caused thousands of backtraces and warnings to appear in dmesg, which also caused a denial of service (a warning for every function that was listed). I applied Pratyush's patch but it did not fix the issue for me. I looked into it and found a slight problem with trampoline accounting. I fixed it and sent Pratyush a patch, but he said that it did not fix the issue for him. I later learned tha Pratyush was using an ARM64 server, and when I tested on my ARM board, I was able to reproduce the same issue as Pratyush. After applying his patch, it fixed the problem. The above test uncovered two different bugs, one in x86 and one in ARM and ARM64. As this looked like it would affect PowerPC, I tested it on my PPC64 box. It too broke, but neither the patch that fixed ARM or x86 fixed this box (the changes were all in generic code!). The above test, uncovered two more bugs that affected PowerPC. Again, the changes were only done to generic code. It's the way the arch code expected things to be done that was different between the archs. Some where more sensitive than others. The rest of this series fixes the PPC bugs as well" * tag 'trace-fixes-v4.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace: Fix ftrace enable ordering of sysctl ftrace_enabled ftrace: Fix en(dis)able graph caller when en(dis)abling record via sysctl ftrace: Clear REGS_EN and TRAMP_EN flags on disabling record via sysctl seq_buf: Fix seq_buf_bprintf() truncation seq_buf: Fix seq_buf_vprintf() truncation
2015-03-09Merge branch 'for-4.0-fixes' of ↵Linus Torvalds1-5/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "The cgroup iteration update two years ago and the recent cpuset restructuring introduced regressions in subset of cpuset configurations. Three patches to fix them. All are marked for -stable" * 'for-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: Fix cpuset sched_relax_domain_level cpuset: fix a warning when clearing configured masks in old hierarchy cpuset: initialize effective masks when clone_children is enabled
2015-03-09Merge branch 'for-4.0-fixes' of ↵Linus Torvalds1-4/+52
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: "One fix patch for a subtle livelock condition which can happen on PREEMPT_NONE kernels involving two racing cancel_work calls. Whoever comes in the second has to wait for the previous one to finish. This was implemented by making the later one block for the same condition that the former would be (work item completion) and then loop and retest; unfortunately, depending on the wake up order, the later one could lock out the former one to finish by busy looping on the cpu. This is fixed by implementing explicit wait mechanism. Work item might not belong anywhere at this point and there's remote possibility of thundering herd problem. I originally tried to use bit_waitqueue but it didn't work for static work items on modules. It's currently using single wait queue with filtering wake up function and exclusive wakeup. If this ever becomes a problem, which is not very likely, we can try to figure out a way to piggy back on bit_waitqueue" * 'for-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE
2015-03-09ftrace: Fix ftrace enable ordering of sysctl ftrace_enabledSteven Rostedt (Red Hat)1-3/+3
Some archs (specifically PowerPC), are sensitive with the ordering of the enabling of the calls to function tracing and setting of the function to use to be traced. That is, update_ftrace_function() sets what function the ftrace_caller trampoline should call. Some archs require this to be set before calling ftrace_run_update_code(). Another bug was discovered, that ftrace_startup_sysctl() called ftrace_run_update_code() directly. If the function the ftrace_caller trampoline changes, then it will not be updated. Instead a call to ftrace_startup_enable() should be called because it tests to see if the callback changed since the code was disabled, and will tell the arch to update appropriately. Most archs do not need this notification, but PowerPC does. The problem could be seen by the following commands: # echo 0 > /proc/sys/kernel/ftrace_enabled # echo function > /sys/kernel/debug/tracing/current_tracer # echo 1 > /proc/sys/kernel/ftrace_enabled # cat /sys/kernel/debug/tracing/trace The trace will show that function tracing was not active. Cc: stable@vger.kernel.org # 2.6.27+ Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-03-09ftrace: Fix en(dis)able graph caller when en(dis)abling record via sysctlPratyush Anand1-6/+22
When ftrace is enabled globally through the proc interface, we must check if ftrace_graph_active is set. If it is set, then we should also pass the FTRACE_START_FUNC_RET command to ftrace_run_update_code(). Similarly, when ftrace is disabled globally through the proc interface, we must check if ftrace_graph_active is set. If it is set, then we should also pass the FTRACE_STOP_FUNC_RET command to ftrace_run_update_code(). Consider the following situation. # echo 0 > /proc/sys/kernel/ftrace_enabled After this ftrace_enabled = 0. # echo function_graph > /sys/kernel/debug/tracing/current_tracer Since ftrace_enabled = 0, ftrace_enable_ftrace_graph_caller() is never called. # echo 1 > /proc/sys/kernel/ftrace_enabled Now ftrace_enabled will be set to true, but still ftrace_enable_ftrace_graph_caller() will not be called, which is not desired. Further if we execute the following after this: # echo nop > /sys/kernel/debug/tracing/current_tracer Now since ftrace_enabled is set it will call ftrace_disable_ftrace_graph_caller(), which causes a kernel warning on the ARM platform. On the ARM platform, when ftrace_enable_ftrace_graph_caller() is called, it checks whether the old instruction is a nop or not. If it's not a nop, then it returns an error. If it is a nop then it replaces instruction at that address with a branch to ftrace_graph_caller. ftrace_disable_ftrace_graph_caller() behaves just the opposite. Therefore, if generic ftrace code ever calls either ftrace_enable_ftrace_graph_caller() or ftrace_disable_ftrace_graph_caller() consecutively two times in a row, then it will return an error, which will cause the generic ftrace code to raise a warning. Note, x86 does not have an issue with this because the architecture specific code for ftrace_enable_ftrace_graph_caller() and ftrace_disable_ftrace_graph_caller() does not check the previous state, and calling either of these functions twice in a row has no ill effect. Link: http://lkml.kernel.org/r/e4fbe64cdac0dd0e86a3bf914b0f83c0b419f146.1425666454.git.panand@redhat.com Cc: stable@vger.kernel.org # 2.6.31+ Signed-off-by: Pratyush Anand <panand@redhat.com> [ removed extra if (ftrace_start_up) and defined ftrace_graph_active as 0 if CONFIG_FUNCTION_GRAPH_TRACER is not set. ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-03-09ftrace: Clear REGS_EN and TRAMP_EN flags on disabling record via sysctlSteven Rostedt (Red Hat)1-2/+6
When /proc/sys/kernel/ftrace_enabled is set to zero, all function tracing is disabled. But the records that represent the functions still hold information about the ftrace_ops that are hooked to them. ftrace_ops may request "REGS" (have a full set of pt_regs passed to the callback), or "TRAMP" (the ops has its own trampoline to use). When the record is updated to represent the state of the ops hooked to it, it sets "REGS_EN" and/or "TRAMP_EN" to state that the callback points to the correct trampoline (REGS has its own trampoline). When ftrace_enabled is set to zero, all ftrace locations are a nop, so they do not point to any trampoline. But the _EN flags are still set. This can cause the accounting to go wrong when ftrace_enabled is cleared and an ops that has a trampoline is registered or unregistered. For example, the following will cause ftrace to crash: # echo function_graph > /sys/kernel/debug/tracing/current_tracer # echo 0 > /proc/sys/kernel/ftrace_enabled # echo nop > /sys/kernel/debug/tracing/current_tracer # echo 1 > /proc/sys/kernel/ftrace_enabled # echo function_graph > /sys/kernel/debug/tracing/current_tracer As function_graph uses a trampoline, when ftrace_enabled is set to zero the updates to the record are not done. When enabling function_graph again, the record will still have the TRAMP_EN flag set, and it will look for an op that has a trampoline other than the function_graph ops, and fail to find one. Cc: stable@vger.kernel.org # 3.17+ Reported-by: Pratyush Anand <panand@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-03-08Merge tag 'tty-4.0-rc3' of ↵Linus Torvalds2-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial fixes from Greg KH: "Here are some tty and serial driver fixes for 4.0-rc3. Along with the atime fix that you know about, here are some other serial driver bugfixes as well. Most notable is a wait_until_sent bugfix that was traced back to being around since before 2.6.12 that Johan has fixed up. All have been in linux-next successfully" * tag 'tty-4.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: TTY: fix tty_wait_until_sent maximum timeout TTY: fix tty_wait_until_sent on 64-bit machines USB: serial: fix infinite wait_until_sent timeout TTY: bfin_jtag_comm: remove incorrect wait_until_sent operation net: irda: fix wait_until_sent poll timeout serial: uapi: Declare all userspace-visible io types serial: core: Fix iotype userspace breakage serial: sprd: Fix missing spin_unlock in sprd_handle_irq() console: Fix console name size mismatch tty: fix up atime/mtime mess, take four serial: 8250_dw: Fix get_mctrl behaviour serial:8250:8250_pci: delete unneeded quirk entries serial:8250:8250_pci: fix redundant entry report for WCH_CH352_2S Change email address for 8250_pci serial: 8250: Revert "tty: serial: 8250_core: read only RX if there is something in the FIFO" Revert "tty/serial: of_serial: add DT alias ID handling"
2015-03-07Merge tag 'arm64-fixes' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Catalin Marinas: "arm64 and generic kernel/module.c (acked by Rusty) fixes for CONFIG_DEBUG_SET_MODULE_RONX" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: kernel/module.c: Update debug alignment after symtable generation arm64: Don't use is_module_addr in setting page attributes
2015-03-07console: Fix console name size mismatchPeter Hurley2-1/+2
commit 6ae9200f2cab7 ("enlarge console.name") increased the storage for the console name to 16 bytes, but not the corresponding struct console_cmdline::name storage. Console names longer than 8 bytes cause read beyond end-of-string and failure to match console; I'm not sure if there are other unexpected consequences. Cc: <stable@vger.kernel.org> # 2.6.22+ Signed-off-by: Peter Hurley <peter@hurleysoftware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06Merge branch 'for-linus' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching Pull livepatching fix from Jiri Kosina: "Fix an RCU unlock misplacement in live patching infrastructure, from Peter Zijlstra" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching: livepatch: fix RCU usage in klp_find_external_symbol()
2015-03-06kernel/module.c: Update debug alignment after symtable generationLaura Abbott1-0/+2
When CONFIG_DEBUG_SET_MODULE_RONX is enabled, the sizes of module sections are aligned up so appropriate permissions can be applied. Adjusting for the symbol table may cause them to become unaligned. Make sure to re-align the sizes afterward. Signed-off-by: Laura Abbott <lauraa@codeaurora.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2015-03-06Merge branch 'irq-pm'Rafael J. Wysocki2-2/+12
* irq-pm: genirq / PM: describe IRQF_COND_SUSPEND tty: serial: atmel: rework interrupt and wakeup handling watchdog: at91sam9: request the irq with IRQF_NO_SUSPEND clk: at91: implement suspend/resume for the PMC irqchip rtc: at91rm9200: rework wakeup and interrupt handling rtc: at91sam9: rework wakeup and interrupt handling PM / wakeup: export pm_system_wakeup symbol genirq / PM: Add flag for shared NO_SUSPEND interrupt lines genirq / PM: better describe IRQF_NO_SUSPEND semantics
2015-03-05Merge branch 'suspend-to-idle'Rafael J. Wysocki1-21/+33
* suspend-to-idle: cpuidle / sleep: Use broadcast timer for states that stop local timer cpuidle: Clean up fallback handling in cpuidle_idle_call() cpuidle / sleep: Do sanity checks in cpuidle_enter_freeze() too idle / sleep: Avoid excessive disabling and enabling interrupts
2015-03-05cpuidle / sleep: Use broadcast timer for states that stop local timerRafael J. Wysocki1-9/+21
Commit 381063133246 (PM / sleep: Re-implement suspend-to-idle handling) overlooked the fact that entering some sufficiently deep idle states by CPUs may cause their local timers to stop and in those cases it is necessary to switch over to a broadcast timer prior to entering the idle state. If the cpuidle driver in use does not provide the new ->enter_freeze callback for any of the idle states, that problem affects suspend-to-idle too, but it is not taken into account after the changes made by commit 381063133246. Fix that by changing the definition of cpuidle_enter_freeze() and re-arranging of the code in cpuidle_idle_call(), so the former does not call cpuidle_enter() any more and the fallback case is handled by cpuidle_idle_call() directly. Fixes: 381063133246 (PM / sleep: Re-implement suspend-to-idle handling) Reported-and-tested-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2015-03-05workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for ↵Tejun Heo1-4/+52
PREEMPT_NONE cancel[_delayed]_work_sync() are implemented using __cancel_work_timer() which grabs the PENDING bit using try_to_grab_pending() and then flushes the work item with PENDING set to prevent the on-going execution of the work item from requeueing itself. try_to_grab_pending() can always grab PENDING bit without blocking except when someone else is doing the above flushing during cancelation. In that case, try_to_grab_pending() returns -ENOENT. In this case, __cancel_work_timer() currently invokes flush_work(). The assumption is that the completion of the work item is what the other canceling task would be waiting for too and thus waiting for the same condition and retrying should allow forward progress without excessive busy looping Unfortunately, this doesn't work if preemption is disabled or the latter task has real time priority. Let's say task A just got woken up from flush_work() by the completion of the target work item. If, before task A starts executing, task B gets scheduled and invokes __cancel_work_timer() on the same work item, its try_to_grab_pending() will return -ENOENT as the work item is still being canceled by task A and flush_work() will also immediately return false as the work item is no longer executing. This puts task B in a busy loop possibly preventing task A from executing and clearing the canceling state on the work item leading to a hang. task A task B worker executing work __cancel_work_timer() try_to_grab_pending() set work CANCELING flush_work() block for work completion completion, wakes up A __cancel_work_timer() while (forever) { try_to_grab_pending() -ENOENT as work is being canceled flush_work() false as work is no longer executing } This patch removes the possible hang by updating __cancel_work_timer() to explicitly wait for clearing of CANCELING rather than invoking flush_work() after try_to_grab_pending() fails with -ENOENT. Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com v3: bit_waitqueue() can't be used for work items defined in vmalloc area. Switched to custom wake function which matches the target work item and exclusive wait and wakeup. v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if the target bit waitqueue has wait_bit_queue's on it. Use DEFINE_WAIT_BIT() and __wake_up_bit() instead. Reported by Tomeu Vizoso. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Rabin Vincent <rabin.vincent@axis.com> Cc: Tomeu Vizoso <tomeu.vizoso@gmail.com> Cc: stable@vger.kernel.org Tested-by: Jesper Nilsson <jesper.nilsson@axis.com> Tested-by: Rabin Vincent <rabin.vincent@axis.com>
2015-03-04genirq / PM: Add flag for shared NO_SUSPEND interrupt linesRafael J. Wysocki2-2/+12
It currently is required that all users of NO_SUSPEND interrupt lines pass the IRQF_NO_SUSPEND flag when requesting the IRQ or the WARN_ON_ONCE() in irq_pm_install_action() will trigger. That is done to warn about situations in which unprepared interrupt handlers may be run unnecessarily for suspended devices and may attempt to access those devices by mistake. However, it may cause drivers that have no technical reasons for using IRQF_NO_SUSPEND to set that flag just because they happen to share the interrupt line with something like a timer. Moreover, the generic handling of wakeup interrupts introduced by commit 9ce7a25849e8 (genirq: Simplify wakeup mechanism) only works for IRQs without any NO_SUSPEND users, so the drivers of wakeup devices needing to use shared NO_SUSPEND interrupt lines for signaling system wakeup generally have to detect wakeup in their interrupt handlers. Thus if they happen to share an interrupt line with a NO_SUSPEND user, they also need to request that their interrupt handlers be run after suspend_device_irqs(). In both cases the reason for using IRQF_NO_SUSPEND is not because the driver in question has a genuine need to run its interrupt handler after suspend_device_irqs(), but because it happens to share the line with some other NO_SUSPEND user. Otherwise, the driver would do without IRQF_NO_SUSPEND just fine. To make it possible to specify that condition explicitly, introduce a new IRQ action handler flag for shared IRQs, IRQF_COND_SUSPEND, that, when set, will indicate to the IRQ core that the interrupt user is generally fine with suspending the IRQ, but it also can tolerate handler invocations after suspend_device_irqs() and, in particular, it is capable of detecting system wakeup and triggering it as appropriate from its interrupt handler. That will allow us to work around a problem with a shared timer interrupt line on at91 platforms. Link: http://marc.info/?l=linux-kernel&m=142252777602084&w=2 Link: http://marc.info/?t=142252775300011&r=1&w=2 Link: https://lkml.org/lkml/2014/12/15/552 Reported-by: Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com>
2015-03-03livepatch: fix RCU usage in klp_find_external_symbol()Peter Zijlstra1-1/+2
While one must hold RCU-sched (aka. preempt_disable) for find_symbol() one must equally hold it over the use of the object returned. The moment you release the RCU-sched read lock, the object can be dead and gone. [jkosina@suse.cz: change subject line to be aligned with other patches] Cc: Seth Jennings <sjenning@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Petr Mladek <pmladek@suse.cz> Cc: Jiri Kosina <jkosina@suse.cz> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-03-02cpuidle: Clean up fallback handling in cpuidle_idle_call()Rafael J. Wysocki1-14/+15
Move the fallback code path in cpuidle_idle_call() to the end of the function to avoid jumping to a label in an if () branch. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-03-02cpuset: Fix cpuset sched_relax_domain_levelJason Low1-3/+0
The cpuset.sched_relax_domain_level can control how far we do immediate load balancing on a system. However, it was found on recent kernels that echo'ing a value into cpuset.sched_relax_domain_level did not reduce any immediate load balancing. The reason this occurred was because the update_domain_attr_tree() traversal did not update for the "top_cpuset". This resulted in nothing being changed when modifying the sched_relax_domain_level parameter. This patch is able to address that problem by having update_domain_attr_tree() allow updates for the root in the cpuset traversal. Fixes: fc560a26acce ("cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre()") Cc: <stable@vger.kernel.org> # 3.9+ Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Serge Hallyn <serge.hallyn@canonical.com>