summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2014-07-16sched: Remove proliferation of wait_on_bit() action functionsNeilBrown2-7/+19
The current "wait_on_bit" interface requires an 'action' function to be provided which does the actual waiting. There are over 20 such functions, many of them identical. Most cases can be satisfied by one of just two functions, one which uses io_schedule() and one which just uses schedule(). So: Rename wait_on_bit and wait_on_bit_lock to wait_on_bit_action and wait_on_bit_lock_action to make it explicit that they need an action function. Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io which are *not* given an action function but implicitly use a standard one. The decision to error-out if a signal is pending is now made based on the 'mode' argument rather than being encoded in the action function. All instances of the old wait_on_bit and wait_on_bit_lock which can use the new version have been changed accordingly and their action functions have been discarded. wait_on_bit{_lock} does not return any specific error code in the event of a signal so the caller must check for non-zero and interpolate their own error code as appropriate. The wait_on_bit() call in __fscache_wait_on_invalidate() was ambiguous as it specified TASK_UNINTERRUPTIBLE but used fscache_wait_bit_interruptible as an action function. David Howells confirms this should be uniformly "uninterruptible" The main remaining user of wait_on_bit{,_lock}_action is NFS which needs to use a freezer-aware schedule() call. A comment in fs/gfs2/glock.c notes that having multiple 'action' functions is useful as they display differently in the 'wchan' field of 'ps'. (and /proc/$PID/wchan). As the new bit_wait{,_io} functions are tagged "__sched", they will not show up at all, but something higher in the stack. So the distinction will still be visible, only with different function names (gds2_glock_wait versus gfs2_glock_dq_wait in the gfs2/glock.c case). Since first version of this patch (against 3.15) two new action functions appeared, on in NFS and one in CIFS. CIFS also now uses an action function that makes the same freezer aware schedule call as NFS. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: David Howells <dhowells@redhat.com> (fscache, keys) Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2) Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steve French <sfrench@samba.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16Merge tag 'v3.16-rc5' into sched/core, to refresh the branch before applying ↵Ingo Molnar22-137/+530
bigger tree-wide changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16sched/numa: Revert "Use effective_load() to balance NUMA loads"Peter Zijlstra1-14/+6
Due to divergent trees, Rik find that this patch is no longer required. Requested-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-u6odkgkw8wz3m7orgsjfo5pi@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16sched: Fix static_key race with sched_feat()Jason Baron1-0/+5
As pointed out by Andi Kleen, the usage of static keys can be racy in sched_feat_disable() vs. sched_feat_enable(). Currently, we first check the value of keys->enabled, and subsequently update the branch direction. This, can be racy and can potentially leave the keys in an inconsistent state. Take the i_mutex around these calls to resolve the race. Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Jason Baron <jbaron@akamai.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/9d7780c83db26683955cd01e6bc654ee2586e67f.1404315388.git.jbaron@akamai.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16sched: Remove extra static_key*() function indirectionJason Baron1-11/+1
I think its a bit simpler without having to follow an extra layer of static inline fuctions. No functional change just cosmetic. Signed-off-by: Jason Baron <jbaron@akamai.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: rostedt@goodmis.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/2ce52233ce200faad93b6029d90f1411cd926667.1404315388.git.jbaron@akamai.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16sched/rt: Fix replenish_dl_entity() comments to match the current upstream codexiaofeng.yan1-1/+1
Signed-off-by: xiaofeng.yan <xiaofeng.yan@huawei.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1404712744-16986-1-git-send-email-xiaofeng.yan@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16sched: Transform resched_task() into resched_curr()Kirill Tkhai6-45/+47
We always use resched_task() with rq->curr argument. It's not possible to reschedule any task but rq's current. The patch introduces resched_curr(struct rq *) to replace all of the repeating patterns. The main aim is cleanup, but there is a little size profit too: (before) $ size kernel/sched/built-in.o text data bss dec hex filename 155274 16445 7042 178761 2ba49 kernel/sched/built-in.o $ size vmlinux text data bss dec hex filename 7411490 1178376 991232 9581098 92322a vmlinux (after) $ size kernel/sched/built-in.o text data bss dec hex filename 155130 16445 7042 178617 2b9b9 kernel/sched/built-in.o $ size vmlinux text data bss dec hex filename 7411362 1178376 991232 9580970 9231aa vmlinux I was choosing between resched_curr() and resched_rq(), and the first name looks better for me. A little lie in Documentation/trace/ftrace.txt. I have not actually collected the tracing again. With a hope the patch won't make execution times much worse :) Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20140628200219.1778.18735.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16sched/deadline: Kill task_struct->pi_top_taskOleg Nesterov2-4/+3
Remove task_struct->pi_top_task. The only user, rt_mutex_setprio(), can use a local. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Daeseok Youn <daeseok.youn@gmail.com> Cc: Dario Faggioli <raistlin@linux.it> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: David Rientjes <rientjes@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Dempsky <mdempsky@chromium.org> Cc: Michal Simek <michal.simek@xilinx.com> Cc: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/20140606165206.GB29465@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-10Merge branch 'for-3.16-fixes' of ↵Linus Torvalds2-9/+69
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Mostly fixes for the fallouts from the recent cgroup core changes. The decoupled nature of cgroup dynamic hierarchy management (hierarchies are created dynamically on mount but may or may not be reused once unmounted depending on remaining usages) led to more ugliness being added to kernfs. Hopefully, this is the last of it" * 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: break kernfs active protection in cpuset_write_resmask() cgroup: fix a race between cgroup_mount() and cgroup_kill_sb() kernfs: introduce kernfs_pin_sb() cgroup: fix mount failure in a corner case cpuset,mempolicy: fix sleeping function called from invalid context cgroup: fix broken css_has_online_children()
2014-07-10Merge branch 'for-3.16-fixes' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "Two workqueue fixes. Both are one liners. One fixes missing uevent for workqueue files on sysfs. The other one fixes missing zeroing of NUMA cpu masks which can lead to oopses among other things" * 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: zero cpumask of wq_numa_possible_cpumask on init workqueue: fix dev_set_uevent_suppress() imbalance
2014-07-07workqueue: zero cpumask of wq_numa_possible_cpumask on initYasuaki Ishimatsu1-1/+1
When hot-adding and onlining CPU, kernel panic occurs, showing following call trace. BUG: unable to handle kernel paging request at 0000000000001d08 IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10 PGD 0 Oops: 0000 [#1] SMP ... Call Trace: [<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50 [<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0 [<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0 [<ffffffff811926f1>] new_slab+0x91/0x300 [<ffffffff815de95a>] __slab_alloc+0x2bb/0x482 [<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0 [<ffffffff810a3c78>] ? load_balance+0x218/0x890 [<ffffffff8101a679>] ? sched_clock+0x9/0x10 [<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10 [<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200 [<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0 [<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60 [<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140 [<ffffffff8105d0ec>] do_fork+0xbc/0x360 [<ffffffff8105d3b6>] kernel_thread+0x26/0x30 [<ffffffff81086652>] kthreadd+0x2c2/0x300 [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60 [<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0 [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60 In my investigation, I found the root cause is wq_numa_possible_cpumask. All entries of wq_numa_possible_cpumask is allocated by alloc_cpumask_var_node(). And these entries are used without initializing. So these entries have wrong value. When hot-adding and onlining CPU, wq_update_unbound_numa() is called. wq_update_unbound_numa() calls alloc_unbound_pwq(). And alloc_unbound_pwq() calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set as follow: 3592 /* if cpumask is contained inside a NUMA node, we belong to that node */ 3593 if (wq_numa_enabled) { 3594 for_each_node(node) { 3595 if (cpumask_subset(pool->attrs->cpumask, 3596 wq_numa_possible_cpumask[node])) { 3597 pool->node = node; 3598 break; 3599 } 3600 } 3601 } But wq_numa_possible_cpumask[node] does not have correct cpumask. So, wrong node is selected. As a result, kernel panic occurs. By this patch, all entries of wq_numa_possible_cpumask are allocated by zalloc_cpumask_var_node to initialize them. And the panic disappeared. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Fixes: bce903809ab3 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
2014-07-05Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A few minor fixlets in ARM SoC irq drivers and a fix for a memory leak which I introduced in the last round of cleanups :(" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Fix memory leak when calling irq_free_hwirqs() irqchip: spear_shirq: Fix interrupt offset irqchip: brcmstb-l2: Level-2 interrupts are edge sensitive irqchip: armada-370-xp: Mask all interrupts during initialization.
2014-07-05genirq: Fix memory leak when calling irq_free_hwirqs()Keith Busch1-2/+2
irq_free_hwirqs() always calls irq_free_descs() with a cnt == 0 which makes it a no-op since the interrupt count to free is decremented in itself. Fixes: 7b6ef1262549f6afc5c881aaef80beb8fd15f908 Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/1404167084-8070-1-git-send-email-keith.busch@intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-07-05sched: Rework check_for_tasks()Kirill Tkhai1-13/+20
1) Iterate thru all of threads in the system. Check for all threads, not only for group leaders. 2) Check for p->on_rq instead of p->state and cputime. Preempted task in !TASK_RUNNING state OR just created task may be queued, that we want to be reported too. 3) Use read_lock() instead of write_lock(). This function does not change any structures, and read_lock() is enough. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ben Segall <bsegall@google.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Cc: Konstantin Khorenko <khorenko@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael wang <wangyun@linux.vnet.ibm.com> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Paul Turner <pjt@google.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Todd E Brandt <todd.e.brandt@linux.intel.com> Cc: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403684395.3462.44.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05sched/rt: Enqueue just unthrottled rt_rq back on the stack in ↵Kirill Tkhai1-0/+3
__disable_runtime() Make rt_rq available for pick_next_task(). Otherwise, their tasks stay prisoned long time till dead cpu becomes alive again. Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> CC: Konstantin Khorenko <khorenko@parallels.com> CC: Ben Segall <bsegall@google.com> CC: Paul Turner <pjt@google.com> CC: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403684388.3462.43.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05sched/fair: Disable runtime_enabled on dying rqKirill Tkhai2-1/+29
We kill rq->rd on the CPU_DOWN_PREPARE stage: cpuset_cpu_inactive -> cpuset_update_active_cpus -> partition_sched_domains -> -> cpu_attach_domain -> rq_attach_root -> set_rq_offline This unthrottles all throttled cfs_rqs. But the cpu is still able to call schedule() till take_cpu_down->__cpu_disable() is called from stop_machine. This case the tasks from just unthrottled cfs_rqs are pickable in a standard scheduler way, and they are picked by dying cpu. The cfs_rqs becomes throttled again, and migrate_tasks() in migration_call skips their tasks (one more unthrottle in migrate_tasks()->CPU_DYING does not happen, because rq->rd is already NULL). Patch sets runtime_enabled to zero. This guarantees, the runtime is not accounted, and the cfs_rqs won't exceed given cfs_rq->runtime_remaining = 1, and tasks will be pickable in migrate_tasks(). runtime_enabled is recalculated again when rq becomes online again. Ben Segall also noticed, we always enable runtime in tg_set_cfs_bandwidth(). Actually, we should do that for online cpus only. To prevent races with unthrottle_offline_cfs_rqs() we take get_online_cpus() lock. Reviewed-by: Ben Segall <bsegall@google.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> CC: Konstantin Khorenko <khorenko@parallels.com> CC: Paul Turner <pjt@google.com> CC: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403684382.3462.42.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05sched/numa: Change scan period code to match intentRik van Riel1-4/+4
Reading through the scan period code and comment, it appears the intent was to slow down NUMA scanning when a majority of accesses are on the local node, specifically a local:remote ratio of 3:1. However, the code actually tests local / (local + remote), and the actual cut-off point was around 30% local accesses, well before a task has actually converged on a node. Changing the threshold to 7 means scanning slows down when a task has around 70% of its accesses local, which appears to match the intent of the code more closely. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403538095-31256-8-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05sched/numa: Rework best node setting in task_numa_migrate()Rik van Riel1-6/+13
Fix up the best node setting in task_numa_migrate() to deal with a task in a pseudo-interleaved NUMA group, which is already running in the best location. Set the task's preferred nid to the current nid, so task migration is not retried at a high rate. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403538095-31256-7-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05sched/numa: Examine a task move when examining a task swapRik van Riel1-2/+21
Running "perf bench numa mem -0 -m -P 1000 -p 8 -t 20" on a 4 node system results in 160 runnable threads on a system with 80 CPU threads. Once a process has nearly converged, with 39 threads on one node and 1 thread on another node, the remaining thread will be unable to migrate to its preferred node through a task swap. However, a simple task move would make the workload converge, witout causing an imbalance. Test for this unlikely occurrence, and attempt a task move to the preferred nid when it happens. # Running main, "perf bench numa mem -p 8 -t 20 -0 -m -P 1000" ### # 160 tasks will execute (on 4 nodes, 80 CPUs): # -1x 0MB global shared mem operations # -1x 1000MB process shared mem operations # -1x 0MB thread local mem operations ### ### # # 0.0% [0.2 mins] 0/0 1/1 36/2 0/0 [36/3 ] l: 0-0 ( 0) {0-2} # 0.0% [0.3 mins] 43/3 37/2 39/2 41/3 [ 6/10] l: 0-1 ( 1) {1-2} # 0.0% [0.4 mins] 42/3 38/2 40/2 40/2 [ 4/9 ] l: 1-2 ( 1) [50.0%] {1-2} # 0.0% [0.6 mins] 41/3 39/2 40/2 40/2 [ 2/9 ] l: 2-4 ( 2) [50.0%] {1-2} # 0.0% [0.7 mins] 40/2 40/2 40/2 40/2 [ 0/8 ] l: 3-5 ( 2) [40.0%] ( 41.8s converged) Without this patch, this same perf bench numa mem run had to rely on the scheduler load balancer to first balance out the load (moving a random task), before a task swap could complete the NUMA convergence. The load balancer does not normally take action unless the load difference exceeds 25%. Convergence times of over half an hour have been observed without this patch. With this patch, the NUMA balancing code will simply migrate the task, if that does not cause an imbalance. Also skip examining a CPU in detail if the improvement on that CPU is no more than the best we already have. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: chegu_vinod@hp.com Cc: mgorman@suse.de Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-ggthh0rnh0yua6o5o3p6cr1o@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05sched/numa: Simplify task_numa_compare()Rik van Riel1-6/+1
When a task is part of a numa_group, the comparison should always use the group weight, in order to make workloads converge. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: chegu_vinod@hp.com Cc: mgorman@suse.de Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403538378-31571-4-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05sched/numa: Use effective_load() to balance NUMA loadsRik van Riel1-6/+14
When CONFIG_FAIR_GROUP_SCHED is enabled, the load that a task places on a CPU is determined by the group the task is in. The active groups on the source and destination CPU can be different, resulting in a different load contribution by the same task at its source and at its destination. As a result, the load needs to be calculated separately for each CPU, instead of estimated once with task_h_load(). Getting this calculation right allows some workloads to converge, where previously the last thread could get stuck on another node, without being able to migrate to its final destination. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403538378-31571-3-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05sched/numa: Move power adjustment into load_too_imbalanced()Rik van Riel1-15/+24
Currently the NUMA code scales the load on each node with the amount of CPU power available on that node, but it does not apply any adjustment to the load of the task that is being moved over. On systems with SMT/HT, this results in a task being weighed much more heavily than a CPU core, and a task move that would even out the load between nodes being disallowed. The correct thing is to apply the power correction to the numbers after we have first applied the move of the tasks' loads to them. This also allows us to do the power correction with a multiplication, rather than a division. Also drop two function arguments for load_too_unbalanced, since it takes various factors from env already. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: chegu_vinod@hp.com Cc: mgorman@suse.de Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403538378-31571-2-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05sched/numa: Use group's max nid as task's preferred nidRik van Riel1-16/+1
From task_numa_placement, always try to consolidate the tasks in a group on the group's top nid. In case this task is part of a group that is interleaved over multiple nodes, task_numa_migrate will set the task's preferred nid to the best node it could find for the task, so this patch will cause at most one run through task_numa_migrate. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403538095-31256-2-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05sched/fair: Implement fast idling of CPUs when the system is partially loadedTim Chen2-5/+28
When a system is lightly loaded (i.e. no more than 1 job per cpu), attempt to pull job to a cpu before putting it to idle is unnecessary and can be skipped. This patch adds an indicator so the scheduler can know when there's no more than 1 active job is on any CPU in the system to skip needless job pulls. On a 4 socket machine with a request/response kind of workload from clients, we saw about 0.13 msec delay when we go through a full load balance to try pull job from all the other cpus. While 0.1 msec was spent on processing the request and generating a response, the 0.13 msec load balance overhead was actually more than the actual work being done. This overhead can be skipped much of the time for lightly loaded systems. With this patch, we tested with a netperf request/response workload that has the server busy with half the cpus in a 4 socket system. We found the patch eliminated 75% of the load balance attempts before idling a cpu. The overhead of setting/clearing the indicator is low as we already gather the necessary info while we call add_nr_running() and update_sd_lb_stats.() We switch to full load balance load immediately if any cpu got more than one job on its run queue in add_nr_running. We'll clear the indicator to avoid load balance when we detect no cpu's have more than one job when we scan the work queues in update_sg_lb_stats(). We are aggressive in turning on the load balance and opportunistic in skipping the load balance. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Jason Low <jason.low2@hp.com> Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Alex Shi <alex.shi@linaro.org> Cc: Michel Lespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403551009.2970.613.camel@schen9-DESK Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05sched/idle: Drop !! while calculating 'broadcast'Viresh Kumar1-2/+2
We don't need 'broadcast' to be set to 'zero or one', but to 'zero or non-zero' and so the extra operation to convert it to 'zero or one' can be skipped. Also change type of 'broadcast' to unsigned int, i.e. type of drv->states[*].flags. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: linaro-kernel@lists.linaro.org Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/0dfbe2976aa108c53e08d3477ea90f6360c1f54c.1403584026.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05sched: Fix clock_gettime(CLOCK_[PROCESS/THREAD]_CPUTIME_ID) monotonicityMike Galbraith1-2/+11
If a task has been dequeued, it has been accounted. Do not project cycles that may or may not ever be accounted to a dequeued task, as that may make clock_gettime() both inaccurate and non-monotonic. Protect update_rq_clock() from slight TSC skew while at it. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: kosaki.motohiro@jp.fujitsu.com Cc: pjt@google.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1403588980.29711.11.camel@marge.simpson.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05sched: Fix potential near-infinite distribute_cfs_runtime() loopBen Segall1-21/+20
distribute_cfs_runtime() intentionally only hands out enough runtime to bring each cfs_rq to 1 ns of runtime, expecting the cfs_rqs to then take the runtime they need only once they actually get to run. However, if they get to run sufficiently quickly, the period timer is still in distribute_cfs_runtime() and no runtime is available, causing them to throttle. Then distribute has to handle them again, and this can go on until distribute has handed out all of the runtime 1ns at a time, which takes far too long. Instead allow access to the same runtime that distribute is handing out, accepting that corner cases with very low quota may be able to spend the entire cfs_b->runtime during distribute_cfs_runtime, meaning that the runtime directly handed out by distribute_cfs_runtime was over quota. In addition, if a cfs_rq does manage to throttle like this, make sure the existing distribute_cfs_runtime no longer loops over it again. Signed-off-by: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140620222120.13814.21652.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05sched/core: Fix formatting issues in sched_can_stop_tick()Viresh Kumar1-7/+3
sched_can_stop_tick() is using 7 spaces instead of 8 spaces or a 'tab' at the beginning of few lines. Which doesn't align well with the Coding Guidelines. Also remove local variable 'rq' as it is used at only one place and we can directly use this_rq() instead. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: fweisbec@gmail.com Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/afb781733e4a9ffbced5eb9fd25cc0aa5c6ffd7a.1403596966.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05irq_work: Remove BUG_ON in irq_work_run()Peter Zijlstra1-42/+4
Because of a collision with 8d056c48e486 ("CPU hotplug, smp: flush any pending IPI callbacks before CPU offline"), which ends up calling hotplug_cfd()->flush_smp_call_function_queue()->irq_work_run(), which is not from IRQ context. And since that already calls irq_work_run() from the hotplug path, remove our entire hotplug handling. Reported-by: Stephen Warren <swarren@wwwdotorg.org> Tested-by: Stephen Warren <swarren@wwwdotorg.org> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-busatzs2gvz4v62258agipuf@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-05Merge branch 'timers/nohz' into sched/coreIngo Molnar5-45/+84
Merge these two, because upcoming patches will touch both areas. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-03Merge tag 'trace-fixes-v3.16-rc3' of ↵Linus Torvalds3-24/+30
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Oleg Nesterov found and fixed a bug in the perf/ftrace/uprobes code where running: # perf probe -x /lib/libc.so.6 syscall # echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable # perf record -e probe_libc:syscall whatever kills the uprobe. Along the way he found some other minor bugs and clean ups that he fixed up making it a total of 4 patches. Doing unrelated work, I found that the reading of the ftrace trace file disables all function tracer callbacks. This was fine when ftrace was the only user, but now that it's used by perf and kprobes, this is a bug where reading trace can disable kprobes and perf. A very unexpected side effect and should be fixed" * tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Remove ftrace_stop/start() from reading the trace file tracing/uprobes: Fix the usage of uprobe_buffer_enable() in probe_event_enable() tracing/uprobes: Kill the bogus UPROBE_HANDLER_REMOVE code in uprobe_dispatcher() uprobes: Change unregister/apply to WARN() if uprobe/consumer is gone tracing/uprobes: Revert "Support mix of ftrace and perf"
2014-07-03kernel/printk/printk.c: revert "printk: enable interrupts before calling ↵Andrew Morton1-26/+18
console_trylock_for_printk()" Revert commit 939f04bec1a4 ("printk: enable interrupts before calling console_trylock_for_printk()"). Andreas reported: : None of the post 3.15 kernel boot for me. They all hang at the GRUB : screen telling me it loaded and started the kernel, but the kernel : itself stops before it prints anything (or even replaces the GRUB : background graphics). 939f04bec1a4 is modest latency reduction. Revert it until we understand the reason for these failures. Reported-by: Andreas Bombe <aeb@debian.org> Cc: Jan Kara <jack@suse.cz> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-01cpuset: break kernfs active protection in cpuset_write_resmask()Tejun Heo1-0/+12
Writing to either "cpuset.cpus" or "cpuset.mems" file flushes cpuset_hotplug_work so that cpu or memory hotunplug doesn't end up migrating tasks off a cpuset after new resources are added to it. As cpuset_hotplug_work calls into cgroup core via cgroup_transfer_tasks(), this flushing adds the dependency to cgroup core locking from cpuset_write_resmak(). This used to be okay because cgroup interface files were protected by a different mutex; however, 8353da1f91f1 ("cgroup: remove cgroup_tree_mutex") simplified the cgroup core locking and this dependency became a deadlock hazard - cgroup file removal performed under cgroup core lock tries to drain on-going file operation which is trying to flush cpuset_hotplug_work blocked on the same cgroup core lock. The locking simplification was done because kernfs added an a lot easier way to deal with circular dependencies involving kernfs active protection. Let's use the same strategy in cpuset and break active protection in cpuset_write_resmask(). While it isn't the prettiest, this is a very rare, likely unique, situation which also goes away on the unified hierarchy. The commands to trigger the deadlock warning without the patch and the lockdep output follow. localhost:/ # mount -t cgroup -o cpuset xxx /cpuset localhost:/ # mkdir /cpuset/tmp localhost:/ # echo 1 > /cpuset/tmp/cpuset.cpus localhost:/ # echo 0 > cpuset/tmp/cpuset.mems localhost:/ # echo $$ > /cpuset/tmp/tasks localhost:/ # echo 0 > /sys/devices/system/cpu/cpu1/online ====================================================== [ INFO: possible circular locking dependency detected ] 3.16.0-rc1-0.1-default+ #7 Not tainted ------------------------------------------------------- kworker/1:0/32649 is trying to acquire lock: (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150 but task is already holding lock: (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (cpuset_hotplug_work){+.+...}: ... -> #1 (s_active#175){++++.+}: ... -> #0 (cgroup_mutex){+.+.+.}: ... other info that might help us debug this: Chain exists of: cgroup_mutex --> s_active#175 --> cpuset_hotplug_work Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(cpuset_hotplug_work); lock(s_active#175); lock(cpuset_hotplug_work); lock(cgroup_mutex); *** DEADLOCK *** 2 locks held by kworker/1:0/32649: #0: ("events"){.+.+.+}, at: [<ffffffff81085412>] process_one_work+0x192/0x520 #1: (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520 stack backtrace: CPU: 1 PID: 32649 Comm: kworker/1:0 Not tainted 3.16.0-rc1-0.1-default+ #7 ... Call Trace: [<ffffffff815a5f78>] dump_stack+0x72/0x8a [<ffffffff810c263f>] print_circular_bug+0x10f/0x120 [<ffffffff810c481e>] check_prev_add+0x43e/0x4b0 [<ffffffff810c4ee6>] validate_chain+0x656/0x7c0 [<ffffffff810c53d2>] __lock_acquire+0x382/0x660 [<ffffffff810c57a9>] lock_acquire+0xf9/0x170 [<ffffffff815aa13f>] mutex_lock_nested+0x6f/0x380 [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150 [<ffffffff811129c0>] hotplug_update_tasks_insane+0x110/0x1d0 [<ffffffff81112bbd>] cpuset_hotplug_update_tasks+0x13d/0x180 [<ffffffff811148ec>] cpuset_hotplug_workfn+0x18c/0x630 [<ffffffff810854d4>] process_one_work+0x254/0x520 [<ffffffff810875dd>] worker_thread+0x13d/0x3d0 [<ffffffff8108e0c8>] kthread+0xf8/0x100 [<ffffffff815acaec>] ret_from_fork+0x7c/0xb0 Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Li Zefan <lizefan@huawei.com> Tested-by: Li Zefan <lizefan@huawei.com>
2014-07-01tracing: Remove ftrace_stop/start() from reading the trace fileSteven Rostedt (Red Hat)1-2/+0
Disabling reading and writing to the trace file should not be able to disable all function tracing callbacks. There's other users today (like kprobes and perf). Reading a trace file should not stop those from happening. Cc: stable@vger.kernel.org # 3.0+ Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-30tracing/uprobes: Fix the usage of uprobe_buffer_enable() in probe_event_enable()Oleg Nesterov1-12/+19
The usage of uprobe_buffer_enable() added by dcad1a20 is very wrong, 1. uprobe_buffer_enable() and uprobe_buffer_disable() are not balanced, _enable() should be called only if !enabled. 2. If uprobe_buffer_enable() fails probe_event_enable() should clear tp.flags and free event_file_link. 3. If uprobe_register() fails it should do uprobe_buffer_disable(). Link: http://lkml.kernel.org/p/20140627170146.GA18332@redhat.com Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Fixes: dcad1a204f72 "tracing/uprobes: Fetch args before reserving a ring buffer" Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-30tracing/uprobes: Kill the bogus UPROBE_HANDLER_REMOVE code in ↵Oleg Nesterov1-6/+0
uprobe_dispatcher() I do not know why dd9fa555d7bb "tracing/uprobes: Move argument fetching to uprobe_dispatcher()" added the UPROBE_HANDLER_REMOVE, but it looks wrong. OK, perhaps it makes sense to avoid store_trace_args() if the tracee is nacked by uprobe_perf_filter(). But then we should kill the same code in uprobe_perf_func() and unify the TRACE/PROFILE filtering (we need to do this anyway to mix perf/ftrace). Until then this code actually adds the pessimization because uprobe_perf_filter() will be called twice and return T in likely case. Link: http://lkml.kernel.org/p/20140627170143.GA18329@redhat.com Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-30uprobes: Change unregister/apply to WARN() if uprobe/consumer is goneOleg Nesterov1-3/+3
Add WARN_ON's into uprobe_unregister() and uprobe_apply() to ensure that nobody tries to play with the dead uprobe/consumer. This helps to catch the bugs like the one fixed by the previous patch. In the longer term we should fix this poorly designed interface. uprobe_register() should return "struct uprobe *" which should be passed to apply/unregister. Plus other semantic changes, see the changelog in commit 41ccba029e94. Link: http://lkml.kernel.org/p/20140627170140.GA18322@redhat.com Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-30tracing/uprobes: Revert "Support mix of ftrace and perf"Oleg Nesterov1-1/+8
This reverts commit 43fe98913c9f67e3b523615ee3316f9520a623e0. This patch is very wrong. Firstly, this change leads to unbalanced uprobe_unregister(). Just for example, # perf probe -x /lib/libc.so.6 syscall # echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable # perf record -e probe_libc:syscall whatever after that uprobe is dead (unregistered) but the user of ftrace/perf can't know this, and it looks as if nobody hits this probe. This would be easy to fix, but there are other reasons why it is not simple to mix ftrace and perf. If nothing else, they can't share the same ->consumer.filter. This is fixable too, but probably we need to fix the poorly designed uprobe_register() interface first. At least "register" and "apply" should be clearly separated. Link: http://lkml.kernel.org/p/20140627170136.GA18319@redhat.com Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: "zhangwei(Jovi)" <jovi.zhangwei@huawei.com> Cc: stable@vger.kernel.org # v3.14 Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-06-30cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()Li Zefan1-7/+26
We've converted cgroup to kernfs so cgroup won't be intertwined with vfs objects and locking, but there are dark areas. Run two instances of this script concurrently: for ((; ;)) { mount -t cgroup -o cpuacct xxx /cgroup umount /cgroup } After a while, I saw two mount processes were stuck at retrying, because they were waiting for a subsystem to become free, but the root associated with this subsystem never got freed. This can happen, if thread A is in the process of killing superblock but hasn't called percpu_ref_kill(), and at this time thread B is mounting the same cgroup root and finds the root in the root list and performs percpu_ref_try_get(). To fix this, we try to increase both the refcnt of the superblock and the percpu refcnt of cgroup root. v2: - we should try to get both the superblock refcnt and cgroup_root refcnt, because cgroup_root may have no superblock assosiated with it. - adjust/add comments. tj: Updated comments. Renamed @sb to @pinned_sb. Cc: <stable@vger.kernel.org> # 3.15 Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-06-30cgroup: fix mount failure in a corner caseLi Zefan1-0/+23
# cat test.sh #! /bin/bash mount -t cgroup -o cpu xxx /cgroup umount /cgroup mount -t cgroup -o cpu,cpuacct xxx /cgroup umount /cgroup # ./test.sh mount: xxx already mounted or /cgroup busy mount: according to mtab, xxx is already mounted on /cgroup It's because the cgroupfs_root of the first mount was under destruction asynchronously. Fix this by delaying and then retrying mount for this case. v3: - put the refcnt immediately after getting it. (Tejun) v2: - use percpu_ref_tryget_live() rather that introducing percpu_ref_alive(). (Tejun) - adjust comment. tj: Updated the comment a bit. Cc: <stable@vger.kernel.org> # 3.15 Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-06-25cpuset,mempolicy: fix sleeping function called from invalid contextGu Zheng1-1/+7
When runing with the kernel(3.15-rc7+), the follow bug occurs: [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586 [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python [ 9969.441175] INFO: lockdep is turned off. [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85 [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012 [ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18 [ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c [ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000 [ 9969.974071] Call Trace: [ 9970.003403] [<ffffffff8162f523>] dump_stack+0x4d/0x66 [ 9970.065074] [<ffffffff8109995a>] __might_sleep+0xfa/0x130 [ 9970.130743] [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0 [ 9970.200638] [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210 [ 9970.272610] [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140 [ 9970.344584] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150 [ 9970.409282] [<ffffffff811b1385>] __mpol_dup+0xe5/0x150 [ 9970.471897] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150 [ 9970.536585] [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40 [ 9970.613763] [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10 [ 9970.683660] [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50 [ 9970.759795] [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40 [ 9970.834885] [<ffffffff8106a598>] do_fork+0xd8/0x380 [ 9970.894375] [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0 [ 9970.969470] [<ffffffff8106a8c6>] SyS_clone+0x16/0x20 [ 9971.030011] [<ffffffff81642009>] stub_clone+0x69/0x90 [ 9971.091573] [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b The cause is that cpuset_mems_allowed() try to take mutex_lock(&callback_mutex) under the rcu_read_lock(which was hold in __mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock protection region to protect the access to cpuset only in current_cpuset_is_being_rebound(). So that we can avoid this bug. This patch is a temporary solution that just addresses the bug mentioned above, can not fix the long-standing issue about cpuset.mems rebinding on fork(): "When the forker's task_struct is duplicated (which includes ->mems_allowed) and it races with an update to cpuset_being_rebound in update_tasks_nodemask() then the task's mems_allowed doesn't get updated. And the child task's mems_allowed can be wrong if the cpuset's nodemask changes before the child has been added to the cgroup's tasklist." Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable <stable@vger.kernel.org>
2014-06-25Merge tag 'trace-fixes-v3.16-rc1-v2' of ↵Linus Torvalds2-15/+13
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing cleanups and fixes from Steven Rostedt: "This includes three patches from Oleg Nesterov. The first is a fix to a race condition that happens between enabling/disabling syscall tracepoints and new process creations (the check to go into the ptrace path for a process can be set when it shouldn't, or not set when it should). Not a major bug but one that should be fixed and even applied to stable. The other two patches are cleanup/fixes that are not that critical, but for an -rc1 release would be nice to have. They both deal with syscall tracepoints. It also includes a patch to introduce a new macro for the TRACE_EVENT() format called __field_struct(). Originally, __field() was used to record any variable into a trace event, but with the addition of setting the "is signed" attribute, the check causes anything but a primitive variable to fail to compile. That is, structs and unions can't be used as they once were. When the "is signed" check was introduce there were only primitive variables being recorded. But that will change soon and it was reported that __field() causes build failures. To solve the __field() issue, __field_struct() is introduced to allow trace_events to be able to record complex types too" * tag 'trace-fixes-v3.16-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Add __field_struct macro for TRACE_EVENT() tracing: syscall_regfunc() should not skip kernel threads tracing: Change syscall_*regfunc() to check PF_KTHREAD and use for_each_process_thread() tracing: Fix syscall_*regfunc() vs copy_process() race
2014-06-23kernel/watchdog.c: print traces for all cpus on lockup detectionAaron Tomlin2-0/+50
A 'softlockup' is defined as a bug that causes the kernel to loop in kernel mode for more than a predefined period to time, without giving other tasks a chance to run. Currently, upon detection of this condition by the per-cpu watchdog task, debug information (including a stack trace) is sent to the system log. On some occasions, we have observed that the "victim" rather than the actual "culprit" (i.e. the owner/holder of the contended resource) is reported to the user. Often this information has proven to be insufficient to assist debugging efforts. To avoid loss of useful debug information, for architectures which support NMI, this patch makes it possible to improve soft lockup reporting. This is accomplished by issuing an NMI to each cpu to obtain a stack trace. If NMI is not supported we just revert back to the old method. A sysctl and boot-time parameter is available to toggle this feature. [dzickus@redhat.com: add CONFIG_SMP in certain areas] [akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations] [mq@suse.cz: fix warning] Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jan Moskyto Matejka <mq@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23mm, pcp: allow restoring percpu_pagelist_fraction defaultDavid Rientjes1-2/+1
Oleg reports a division by zero error on zero-length write() to the percpu_pagelist_fraction sysctl: divide error: 0000 [#1] SMP DEBUG_PAGEALLOC CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000 RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120 RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246 RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010 RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50 R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060 R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800 FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0 Call Trace: proc_sys_call_handler+0xb3/0xc0 proc_sys_write+0x14/0x20 vfs_write+0xba/0x1e0 SyS_write+0x46/0xb0 tracesys+0xe1/0xe6 However, if the percpu_pagelist_fraction sysctl is set by the user, it is also impossible to restore it to the kernel default since the user cannot write 0 to the sysctl. This patch allows the user to write 0 to restore the default behavior. It still requires a fraction equal to or larger than 8, however, as stated by the documentation for sanity. If a value in the range [1, 7] is written, the sysctl will return EINVAL. This successfully solves the divide by zero issue at the same time. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Drokin <green@linuxhacker.ru> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23kernel/watchdog.c: remove preemption restrictions when restarting lockup ↵Don Zickus1-2/+0
detector Peter Wu noticed the following splat on his machine when updating /proc/sys/kernel/watchdog_thresh: BUG: sleeping function called from invalid context at mm/slub.c:965 in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init 3 locks held by init/1: #0: (sb_writers#3){.+.+.+}, at: [<ffffffff8117b663>] vfs_write+0x143/0x180 #1: (watchdog_proc_mutex){+.+.+.}, at: [<ffffffff810e02d3>] proc_dowatchdog+0x33/0x110 #2: (cpu_hotplug.lock){.+.+.+}, at: [<ffffffff810589c2>] get_online_cpus+0x32/0x80 Preemption disabled at:[<ffffffff810e0384>] proc_dowatchdog+0xe4/0x110 CPU: 0 PID: 1 Comm: init Not tainted 3.16.0-rc1-testing #34 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x4e/0x7a __might_sleep+0x11d/0x190 kmem_cache_alloc_trace+0x4e/0x1e0 perf_event_alloc+0x55/0x440 perf_event_create_kernel_counter+0x26/0xe0 watchdog_nmi_enable+0x75/0x140 update_timers_all_cpus+0x53/0xa0 proc_dowatchdog+0xe4/0x110 proc_sys_call_handler+0xb3/0xc0 proc_sys_write+0x14/0x20 vfs_write+0xad/0x180 SyS_write+0x49/0xb0 system_call_fastpath+0x16/0x1b NMI watchdog: disabled (cpu0): hardware events not enabled What happened is after updating the watchdog_thresh, the lockup detector is restarted to utilize the new value. Part of this process involved disabling preemption. Once preemption was disabled, perf tried to allocate a new event (as part of the restart). This caused the above BUG_ON as you can't sleep with preemption disabled. The preemption restriction seemed agressive as we are not doing anything on that particular cpu, but with all the online cpus (which are protected by the get_online_cpus lock). Remove the restriction and the BUG_ON goes away. Signed-off-by: Don Zickus <dzickus@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reported-by: Peter Wu <peter@lekensteyn.nl> Tested-by: Peter Wu <peter@lekensteyn.nl> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [3.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23kexec: save PG_head_mask in VMCOREINFOPetr Tesarik1-0/+1
To allow filtering of huge pages, makedumpfile must be able to identify them in the dump. This can be done by checking the appropriate page flag, so communicate its value to makedumpfile through the VMCOREINFO interface. There's only one small catch. Depending on how many page flags are available on a given architecture, this bit can be called PG_head or PG_compound. I sent a similar patch back in 2012, but Eric Biederman did not like using an #ifdef. So, this time I'm adding a common symbol (PG_head_mask) instead. See https://lkml.org/lkml/2012/11/28/91 for the previous version. Signed-off-by: Petr Tesarik <ptesarik@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Shaohua Li <shli@kernel.org> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23CPU hotplug, smp: flush any pending IPI callbacks before CPU offlineSrivatsa S. Bhat1-8/+49
There is a race between the CPU offline code (within stop-machine) and the smp-call-function code, which can lead to getting IPIs on the outgoing CPU, *after* it has gone offline. Specifically, this can happen when using smp_call_function_single_async() to send the IPI, since this API allows sending asynchronous IPIs from IRQ disabled contexts. The exact race condition is described below. During CPU offline, in stop-machine, we don't enforce any rule in the _DISABLE_IRQ stage, regarding the order in which the outgoing CPU and the other CPUs disable their local interrupts. Due to this, we can encounter a situation in which an IPI is sent by one of the other CPUs to the outgoing CPU (while it is *still* online), but the outgoing CPU ends up noticing it only *after* it has gone offline. CPU 1 CPU 2 (Online CPU) (CPU going offline) Enter _PREPARE stage Enter _PREPARE stage Enter _DISABLE_IRQ stage = Got a device interrupt, and | Didn't notice the IPI the interrupt handler sent an | since interrupts were IPI to CPU 2 using | disabled on this CPU. smp_call_function_single_async() | = Enter _DISABLE_IRQ stage Enter _RUN stage Enter _RUN stage = Busy loop with interrupts | Invoke take_cpu_down() disabled. | and take CPU 2 offline = Enter _EXIT stage Enter _EXIT stage Re-enable interrupts Re-enable interrupts The pending IPI is noted immediately, but alas, the CPU is offline at this point. This of course, makes the smp-call-function IPI handler code running on CPU 2 unhappy and it complains about "receiving an IPI on an offline CPU". One real example of the scenario on CPU 1 is the block layer's complete-request call-path: __blk_complete_request() [interrupt-handler] raise_blk_irq() smp_call_function_single_async() However, if we look closely, the block layer does check that the target CPU is online before firing the IPI. So in this case, it is actually the unfortunate ordering/timing of events in the stop-machine phase that leads to receiving IPIs after the target CPU has gone offline. In reality, getting a late IPI on an offline CPU is not too bad by itself (this can happen even due to hardware latencies in IPI send-receive). It is a bug only if the target CPU really went offline without executing all the callbacks queued on its list. (Note that a CPU is free to execute its pending smp-call-function callbacks in a batch, without waiting for the corresponding IPIs to arrive for each one of those callbacks). So, fixing this issue can be broken up into two parts: 1. Ensure that a CPU goes offline only after executing all the callbacks queued on it. 2. Modify the warning condition in the smp-call-function IPI handler code such that it warns only if an offline CPU got an IPI *and* that CPU had gone offline with callbacks still pending in its queue. Achieving part 1 is straight-forward - just flush (execute) all the queued callbacks on the outgoing CPU in the CPU_DYING stage[1], including those callbacks for which the source CPU's IPIs might not have been received on the outgoing CPU yet. Once we do this, an IPI that arrives late on the CPU going offline (either due to the race mentioned above, or due to hardware latencies) will be completely harmless, since the outgoing CPU would have executed all the queued callbacks before going offline. Overall, this fix (parts 1 and 2 put together) additionally guarantees that we will see a warning only when the *IPI-sender code* is buggy - that is, if it queues the callback _after_ the target CPU has gone offline. [1]. The CPU_DYING part needs a little more explanation: by the time we execute the CPU_DYING notifier callbacks, the CPU would have already been marked offline. But we want to flush out the pending callbacks at this stage, ignoring the fact that the CPU is offline. So restructure the IPI handler code so that we can by-pass the "is-cpu-offline?" check in this particular case. (Of course, the right solution here is to fix CPU hotplug to mark the CPU offline _after_ invoking the CPU_DYING notifiers, but this requires a lot of audit to ensure that this change doesn't break any existing code; hence lets go with the solution proposed above until that is done). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Suggested-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Borislav Petkov <bp@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <mgalbraith@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Rik van Riel <riel@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sachin Kamat <sachin.kamat@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23workqueue: fix dev_set_uevent_suppress() imbalanceMaxime Bizon1-0/+1
Uevents are suppressed during attributes registration, but never restored, so kobject_uevent() does nothing. Signed-off-by: Maxime Bizon <mbizon@freebox.fr> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Fixes: 226223ab3c4118ddd10688cc2c131135848371ab
2014-06-21Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2-4/+36
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "This is larger than usual: the main reason are the ARM symbol lookup speedups that came in late and were hard to resist. There's also a kprobes fix and various tooling fixes, plus the minimal re-enablement of the mmap2 support interface" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) x86/kprobes: Fix build errors and blacklist context_track_user perf tests: Add test for closing dso objects on EMFILE error perf tests: Add test for caching dso file descriptors perf tests: Allow reuse of test_file function perf tests: Spawn child for each test perf tools: Add dso__data_* interface descriptons perf tools: Allow to close dso fd in case of open failure perf tools: Add file size check and factor dso__data_read_offset perf tools: Cache dso data file descriptor perf tools: Add global count of opened dso objects perf tools: Add global list of opened dso objects perf tools: Add data_fd into dso object perf tools: Separate dso data related variables perf tools: Cache register accesses for unwind processing perf record: Fix to honor user freq/interval properly perf timechart: Reflow documentation perf probe: Improve error messages in --line option perf probe: Improve an error message of perf probe --vars mode perf probe: Show error code and description in verbose mode perf probe: Improve error message for unknown member of data structure ...
2014-06-21Merge branch 'locking-urgent-for-linus.patch' of ↵Linus Torvalds3-35/+218
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull rtmutex fixes from Thomas Gleixner: "Another three patches to make the rtmutex code more robust. That's the last urgent fallout from the big futex/rtmutex investigation" * 'locking-urgent-for-linus.patch' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rtmutex: Plug slow unlock race rtmutex: Detect changes in the pi lock chain rtmutex: Handle deadlock detection smarter