summaryrefslogtreecommitdiffstats
path: root/kernel/sched
AgeCommit message (Collapse)AuthorFilesLines
2013-08-08cgroup: add/update accessors which obtain subsys specific data from cssTejun Heo2-6/+13
css (cgroup_subsys_state) is usually embedded in a subsys specific data structure. Subsystems either use container_of() directly to cast from css to such data structure or has an accessor function wrapping such cast. As cgroup as whole is moving towards using css as the main interface handle, add and update such accessors to ease dealing with css's. All accessors explicitly handle NULL input and return NULL in those cases. While this looks like an extra branch in the code, as all controllers specific data structures have css as the first field, the casting doesn't involve any offsetting and the compiler can trivially optimize out the branch. * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such accessor. Added. * memory, hugetlb and devices already had one but didn't explicitly handle NULL input. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-08-08cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/Tejun Heo3-7/+7
The names of the two struct cgroup_subsys_state accessors - cgroup_subsys_state() and task_subsys_state() - are somewhat awkward. The former clashes with the type name and the latter doesn't even indicate it's somehow related to cgroup. We're about to revamp large portion of cgroup API, so, let's rename them so that they're less awkward. Most per-controller usages of the accessors are localized in accessor wrappers and given the amount of scheduled changes, this isn't gonna add any noticeable headache. Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state() to task_css(). This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-07-06Merge branch 'timers-core-for-linus' of ↵Linus Torvalds1-3/+36
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer core updates from Thomas Gleixner: "The timer changes contain: - posix timer code consolidation and fixes for odd corner cases - sched_clock implementation moved from ARM to core code to avoid duplication by other architectures - alarm timer updates - clocksource and clockevents unregistration facilities - clocksource/events support for new hardware - precise nanoseconds RTC readout (Xen feature) - generic support for Xen suspend/resume oddities - the usual lot of fixes and cleanups all over the place The parts which touch other areas (ARM/XEN) have been coordinated with the relevant maintainers. Though this results in an handful of trivial to solve merge conflicts, which we preferred over nasty cross tree merge dependencies. The patches which have been committed in the last few days are bug fixes plus the posix timer lot. The latter was in akpms queue and next for quite some time; they just got forgotten and Frederic collected them last minute." * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (59 commits) hrtimer: Remove unused variable hrtimers: Move SMP function call to thread context clocksource: Reselect clocksource when watchdog validated high-res capability posix-cpu-timers: don't account cpu timer after stopped thread runtime accounting posix_timers: fix racy timer delta caching on task exit posix-timers: correctly get dying task time sample in posix_cpu_timer_schedule() selftests: add basic posix timers selftests posix_cpu_timers: consolidate expired timers check posix_cpu_timers: consolidate timer list cleanups posix_cpu_timer: consolidate expiry time type tick: Sanitize broadcast control logic tick: Prevent uncontrolled switch to oneshot mode tick: Make oneshot broadcast robust vs. CPU offlining x86: xen: Sync the CMOS RTC as well as the Xen wallclock x86: xen: Sync the wallclock when the system time is set timekeeping: Indicate that clock was set in the pvclock gtod notifier timekeeping: Pass flags instead of multiple bools to timekeeping_update() xen: Remove clock_was_set() call in the resume path hrtimers: Support resuming with two or more CPUs online (but stopped) timer: Fix jiffies wrap behavior of round_jiffies_common() ...
2013-07-04posix-cpu-timers: don't account cpu timer after stopped thread runtime ↵KOSAKI Motohiro1-3/+36
accounting When tsk->signal->cputimer->running is 1, signal->cputimer (i.e. per process timer account) and tsk->sum_sched_runtime (i.e. per thread timer account) increase at the same pace because update_curr() increases both accounting. However, there is one exception. When thread exiting, __exit_signal() turns over task's sum_shced_runtime to sig->sum_sched_runtime, but it doesn't stop signal->cputimer accounting. This inconsistency makes POSIX timer wake up too early. This patch fixes it. Original-patch-by: Olivier Langlois <olivier@trillion01.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Olivier Langlois <olivier@trillion01.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2013-07-01Merge tag 'v3.10' into sched/coreIngo Molnar2-4/+4
Merge in a recent upstream commit: c2853c8df57f include/linux/math64.h: add div64_ul() because: 72a4cf20cb71 sched: Change cfs_rq load avg to unsigned long relies on it. [ We don't rebase sched/core for this, because the handful of followup commits after the broken commit are not behavioral changes so are unlikely to be needed during bisection. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-28sched/debug: Remove CONFIG_FAIR_GROUP_SCHED maskAlex Shi1-4/+6
Now that we are using runnable load avg in sched balance, we don't need to keep it under CONFIG_FAIR_GROUP_SCHED. Also align the code style to #ifdef instead of #if defined() and reorder the tg output info. Signed-off-by: Alex Shi <alex.shi@intel.com> Cc: pjt@google.com Cc: kamalesh@linux.vnet.ibm.com Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/1372417835-4698-1-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-28sched/debug: Fix formatting of /proc/<PID>/schedKamalesh Babulal1-8/+9
This patch alters format string's width, to align all statistics at par with the longest struct sched_statistic member name under /proc/<PID>/sched. Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/20130627165005.GA15583@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched/fair: Fix typo describing flags in enqueue_entityKamalesh Babulal1-1/+1
Fix spelling of 'calling' in description of se flags in enqueue_entity(). Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/20130627055418.GA18582@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched/debug: Add load-tracking statistics to taskKamalesh Babulal1-0/+6
At present we print per-entity load-tracking statistics for cfs_rq of cgroups/runqueues. Given that per task statistics is maintained, it can be used to know the contribution made by the task to its parenting cfs_rq level. This patch adds per-task load-tracking statistics to /proc/<PID>/sched. Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20130625080336.GA20175@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched: Change get_rq_runnable_load() to static and inlineAlex Shi1-2/+2
Based-on-patch-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Alex Shi <alex.shi@intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-14-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched/tg: Remove tg.load_weightAlex Shi1-1/+0
Since no one use it. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Paul Turner <pjt@google.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-13-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched/cfs_rq: Change atomic64_t removed_load to atomic_long_tAlex Shi2-5/+8
Similar to runnable_load_avg, blocked_load_avg variable, long type is enough for removed_load in 64 bit or 32 bit machine. Then we avoid the expensive atomic64 operations on 32 bit machine. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Paul Turner <pjt@google.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-12-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched/tg: Use 'unsigned long' for load variable in task groupAlex Shi3-11/+11
Since tg->load_avg is smaller than tg->load_weight, we don't need a atomic64_t variable for load_avg in 32 bit machine. The same reason for cfs_rq->tg_load_contrib. The atomic_long_t/unsigned long variable type are more efficient and convenience for them. Signed-off-by: Alex Shi <alex.shi@intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-11-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched: Change cfs_rq load avg to unsigned longAlex Shi3-8/+5
Since the 'u64 runnable_load_avg, blocked_load_avg' in cfs_rq struct are smaller than 'unsigned long' cfs_rq->load.weight. We don't need u64 vaiables to describe them. unsigned long is more efficient and convenience. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Paul Turner <pjt@google.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-10-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched: Consider runnable load average in move_tasks()Alex Shi1-9/+9
Aside from using runnable load average in background, move_tasks is also the key function in load balance. We need consider the runnable load average in it in order to make it an apple to apple load comparison. Morten had caught a div u64 bug on ARM, thanks! Thanks-to: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-8-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched: Compute runnable load avg in cpu_load and cpu_avg_load_per_taskAlex Shi2-4/+18
They are the base values in load balance, update them with rq runnable load average, then the load balance will consider runnable load avg naturally. We also try to include the blocked_load_avg as cpu load in balancing, but that cause kbuild performance drop 6% on every Intel machine, and aim7/oltp drop on some of 4 CPU sockets machines. Or only add blocked_load_avg into get_rq_runable_load, hackbench still drop a little on NHM EX. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-7-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched: Update cpu load after task_tickAlex Shi1-1/+1
To get the latest runnable info, we need do this cpuload update after task_tick. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-6-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched: Fix sleep time double accounting in enqueue entityAlex Shi1-1/+7
The woken migrated task will __synchronize_entity_decay(se); in migrate_task_rq_fair, then it needs to set `se->avg.last_runnable_update -= (-se->avg.decay_count) << 20' before update_entity_load_avg, in order to avoid sleep time is updated twice for se.avg.load_avg_contrib in both __syncchronize and update_entity_load_avg. However if the sleeping task is woken up from the same cpu, it miss the last_runnable_update before update_entity_load_avg(se, 0, 1), then the sleep time was used twice in both functions. So we need to remove the double sleep time accounting. Paul also contributed some code comments in this commit. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-5-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched: Set an initial value of runnable avg for new forked taskAlex Shi3-4/+28
We need to initialize the se.avg.{decay_count, load_avg_contrib} for a new forked task. Otherwise random values of above variables cause a mess when a new task is enqueued: enqueue_task_fair enqueue_entity enqueue_entity_load_avg and make fork balancing imbalance due to incorrect load_avg_contrib. Further more, Morten Rasmussen notice some tasks were not launched at once after created. So Paul and Peter suggest giving a start value for new task runnable avg time same as sched_slice(). PeterZ said: > So the 'problem' is that our running avg is a 'floating' average; ie. it > decays with time. Now we have to guess about the future of our newly > spawned task -- something that is nigh impossible seeing these CPU > vendors keep refusing to implement the crystal ball instruction. > > So there's two asymptotic cases we want to deal well with; 1) the case > where the newly spawned program will be 'nearly' idle for its lifetime; > and 2) the case where its cpu-bound. > > Since we have to guess, we'll go for worst case and assume its > cpu-bound; now we don't want to make the avg so heavy adjusting to the > near-idle case takes forever. We want to be able to quickly adjust and > lower our running avg. > > Now we also don't want to make our avg too light, such that it gets > decremented just for the new task not having had a chance to run yet -- > even if when it would run, it would be more cpu-bound than not. > > So what we do is we make the initial avg of the same duration as that we > guess it takes to run each task on the system at least once -- aka > sched_slice(). > > Of course we can defeat this with wakeup/fork bombs, but in the 'normal' > case it should be good enough. Paul also contributed most of the code comments in this commit. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Reviewed-by: Paul Turner <pjt@google.com> [peterz; added explanation of sched_slice() usage] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-4-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched: Move a few runnable tg variables into CONFIG_SMPAlex Shi1-0/+2
The following 2 variables are only used under CONFIG_SMP, so its better to move their definiation into CONFIG_SMP too. atomic64_t load_avg; atomic_t runnable_avg; Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-3-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for ↵Alex Shi3-36/+7
load-tracking" Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then we can use runnable load variables. Also remove 2 CONFIG_FAIR_GROUP_SCHED setting which is not in reverted patch(introduced in 9ee474f), but also need to revert. Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/51CA76A3.3050207@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-20Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds1-4/+17
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Two smaller fixes - plus a context tracking tracing fix that is a bit bigger" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tracing/context-tracking: Add preempt_schedule_context() for tracing sched: Fix clear NOHZ_BALANCE_KICK sched/x86: Construct all sibling maps if smt
2013-06-19sched: Don't mix use of typedef ctl_table and struct ctl_tableJoe Perches1-1/+1
Just use struct ctl_table. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371063336.2069.22.camel@joe-AO722 Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19sched: Remove WARN_ON(!sd) from init_sched_groups_power()Viresh Kumar1-1/+1
sd can't be NULL in init_sched_groups_power() and so checking it for NULL isn't useful. In case it is required, then also we need to rearrange the code a bit as we already accessed invalid pointer sd to get sg: sg = sd->groups. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/2bbe633cd74b431c05253a8ce61fdfd5066a531b.1370948150.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19sched: Fix memory leakage in build_sched_groups()Viresh Kumar1-2/+2
In build_sched_groups() we don't need to call get_group() for cpus which are already covered in previous iterations. Calling get_group() would mark the group used and eventually leak it since we wouldn't connect it and not find it again to free it. This will happen only in cases where sg->cpumask contained more than one cpu (For any topology level). This patch would free sg's memory for all cpus leaving the group leader as the group isn't marked used now. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/7a61e955abdcbb1dfa9fe493f11a5ec53a11ddd3.1370948150.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19sched: Use cached value of span instead of calling sched_domain_span()Viresh Kumar1-1/+1
In the beginning of build_sched_groups() we called sched_domain_span() and cached its return value in span. Few statements later we are calling it again to get the same pointer. Lets use the cached value instead as it hasn't changed in between. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/834ecd507071ad88aff039352dbc7e063dd996a7.1370948150.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19sched: Create for_each_sd_topology()Viresh Kumar1-3/+6
For loop for traversing sched_domain_topology was used at multiple placed in core.c. This patch removes code redundancy by creating for_each_sd_topology(). Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/e0e04542f54e9464bd9da54f5ccfe62ec6c4c0bc.1370861520.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19sched: Don't set sd->child to NULL when it is already NULLViresh Kumar1-1/+1
Memory for sd is allocated with kzalloc_node() which will initialize its fields with zero. In build_sched_domain() we are setting sd->child to child even if child is NULL, which isn't required. Lets do it only if child isn't NULL. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/f4753a1730051341003ad2ad29a3229c7356678e.1370861520.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19sched: Don't initialize alloc_state in build_sched_domains()Viresh Kumar1-1/+1
alloc_state will be overwritten by __visit_domain_allocation_hell() and so we don't actually need to initialize alloc_state. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/df57734a075cc5ad130e1ae498702e24f2529ab8.1370861520.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19sched: Optimize build_sched_domains() for saving first SD node for a cpuViresh Kumar1-5/+2
We are saving first scheduling domain for a cpu in build_sched_domains() by iterating over the nested sd->child list. We don't actually need to do it this way. tl will be equal to sched_domain_topology for the first iteration and so we can set *per_cpu_ptr(d.sd, i) based on that. So, save pointer to first SD while running the iteration loop over tl's. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/fc473527cbc4dfa0b8eeef2a59db74684eb59a83.1370436120.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19sched: Remove unused params of build_sched_domain()Viresh Kumar1-4/+3
build_sched_domain() never uses parameter struct s_data *d and so passing it is useless. Remove it. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/545e0b4536166a15b4475abcafe5ed0db4ad4a2c.1370436120.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19sched: Femove the useless declaration in kernel/sched/fair.cMichael Wang1-4/+0
default_cfs_period(), do_sched_cfs_period_timer(), do_sched_cfs_slack_timer() already defined previously, no need to declare again. Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/51AD8808.7020608@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19sched: Refine the code in unthrottle_cfs_rq()Michael Wang1-1/+1
Directly use rq to save some code. Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/51AD87EB.1070605@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19sched/rt: Simplify pull_rt_task() logic and remove .leaf_rt_rq_listKirill Tkhai2-67/+16
[ Peter, this is based off of some of my work, I ran it though a few tests and it passed. I also reviewed it, and added my SOB as I am somewhat a co-author to it. ] Based on the patch by Steven Rostedt from previous year: https://lkml.org/lkml/2012/4/18/517 1)Simplify pull_rt_task() logic: search in pushable tasks of dest runqueue. The only pullable tasks are the tasks which are pushable in their local rq, and no others. 2)Remove .leaf_rt_rq_list member of struct rt_rq and functions connected with it: nobody uses it since now. Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/287571370557898@web7d.yandex.ru Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19Merge branch 'sched/urgent' into sched/coreIngo Molnar1-4/+17
Merge in fixes before applying ongoing new work. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-19sched: Fix clear NOHZ_BALANCE_KICKVincent Guittot1-4/+17
I have faced a sequence where the Idle Load Balance was sometime not triggered for a while on my platform, in the following scenario: CPU 0 and CPU 1 are running tasks and CPU 2 is idle CPU 1 kicks the Idle Load Balance CPU 1 selects CPU 2 as the new Idle Load Balancer CPU 2 sets NOHZ_BALANCE_KICK for CPU 2 CPU 2 sends a reschedule IPI to CPU 2 While CPU 3 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2 CPU 2 finally wakes up, runs task A and discards the Idle Load Balance task A quickly goes back to sleep (before a tick occurs on CPU 2) CPU 2 goes back to idle with NOHZ_BALANCE_KICK set Whenever CPU 2 will be selected as the ILB, no reschedule IPI will be sent because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be performed. We must wait for the sched softirq to be raised on CPU 2 thanks to another part the kernel to come back to clear NOHZ_BALANCE_KICK. The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if we can't raise the sched_softirq for the Idle Load Balance. Change since V1: - move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB can't run on this CPU (as suggested by Peter) Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1370419991-13870-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-31sched/fair: Remove unused variable from expire_cfs_rq_runtime()Kamalesh Babulal1-1/+0
Commit 78becc2709 ("sched: Use an accessor to read the rq clock") introduces rq_clock(), which obsoletes the use of the "rq" variable in expire_cfs_rq_runtime() and triggers this build warning: kernel/sched/fair.c: In function 'expire_cfs_rq_runtime': kernel/sched/fair.c:2159:13: warning: unused variable 'rq' [-Wunused-variable] Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Turner <pjt@google.com> Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/1369904660-14169-1-git-send-email-kamalesh@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-31vtime: Use consistent clocks among nohz accountingFrederic Weisbecker2-4/+4
While computing the cputime delta of dynticks CPUs, we are mixing up clocks of differents natures: * local_clock() which takes care of unstable clock sources and fix these if needed. * sched_clock() which is the weaker version of local_clock(). It doesn't compute any fixup in case of unstable source. If the clock source is stable, those two clocks are the same and we can safely compute the difference against two random points. Otherwise it results in random deltas as sched_clock() can randomly drift away, back or forward, from local_clock(). As a consequence, some strange behaviour with unstable tsc has been observed such as non progressing constant zero cputime. (The 'top' command showing no load). Fix this by only using local_clock(), or its irq safe/remote equivalent, in vtime code. Reported-by: Mike Galbraith <efault@gmx.de> Suggested-by: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28sched: Use swap() macro in scale_stime()Stanislaw Gruszka1-3/+2
Simple cleanup. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/1367501673-6563-1-git-send-email-sgruszka@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28sched: Use an accessor to read the rq clockFrederic Weisbecker6-37/+47
Read the runqueue clock through an accessor. This prepares for adding a debugging infrastructure to detect missing or redundant calls to update_rq_clock() between a scheduler's entry and exit point. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Turner <pjt@google.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1365724262-20142-6-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28sched: Update rq clock earlier in unthrottle_cfs_rqFrederic Weisbecker1-1/+3
In this function we are making use of rq->clock right before the update of the rq clock, let's just call update_rq_clock() just before that to avoid using a stale rq clock value. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Turner <pjt@google.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1365724262-20142-5-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28sched: Update rq clock before calling check_preempt_curr()Frederic Weisbecker1-0/+2
check_preempt_curr() of fair class needs an uptodate sched clock value to update runtime stats of the current task of the target's rq. When a task is woken up, activate_task() is usually called right before ttwu_do_wakeup() unless the task is still in the runqueue. In the latter case we need to update the rq clock explicitly because activate_task() isn't here to do the job for us. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Turner <pjt@google.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1365724262-20142-4-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28sched: Update rq clock before setting fair group sharesFrederic Weisbecker1-0/+3
Because we may update the execution time in sched_group_set_shares()->update_cfs_shares()->reweight_entity()->update_curr() before reweighting the entity while setting the group shares and this requires an uptodate version of the runqueue clock. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Turner <pjt@google.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1365724262-20142-3-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28sched: Update rq clock before migrating tasks out of dying CPUFrederic Weisbecker1-0/+7
Because the sched_class::put_prev_task() callback of rt and fair classes are referring to the rq clock to update their runtime statistics. There is a missing rq clock update from the CPU hotplug notifier's entry point of the scheduler. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Turner <pjt@google.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1365724262-20142-2-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28sched: Remove redundant update_runtime notifierNeil Zhang3-44/+0
migration_call() will do all the things that update_runtime() does. So let's remove it. Furthermore, there is potential risk that the current code will catch BUG_ON at line 689 of rt.c when do cpu hotplug while there are realtime threads running because of enabling runtime twice while the rt_runtime may already changed. Signed-off-by: Neil Zhang <zhangwm@marvell.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1365685499-26515-1-git-send-email-zhangwm@marvell.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28sched/autogroup: Fix race with task_groups listGerald Schaefer1-2/+1
In autogroup_create(), a tg is allocated and added to the task_groups list. If CONFIG_RT_GROUP_SCHED is set, this tg is then modified while on the list, without locking. This can race with someone walking the list, like __enable_runtime() during CPU unplug, and result in a use-after-free bug. To fix this, move sched_online_group(), which adds the tg to the list, to the end of the autogroup_create() function after the modification. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1369411669-46971-2-git-send-email-gerald.schaefer@de.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28Merge branch 'sched/cleanups' into sched/coreIngo Molnar5-588/+605
Merge reason: these bits, formerly in sched/urgent, are too late for v3.10. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-10sched: Use this_rq() helperNathan Zimmer2-5/+3
It is a few instructions more efficent to and slightly more readable to use this_rq()-> instead of cpu_rq(smp_processor_id())-> . Size comparison of kernel/sched/fair.o: text data bss dec hex filename 27972 122 26 28120 6dd8 fair.o.before 27956 122 26 28104 6dc8 fair.o.after Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1368116643-87971-1-git-send-email-nzimmer@sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-07sched: Move update_load_*() methods from sched.h to fair.cPaul Gortmaker2-18/+18
These inlines are only used by kernel/sched/fair.c so they do not need to be present in the main kernel/sched/sched.h file. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/1366398650-31599-3-git-send-email-paul.gortmaker@windriver.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-07sched: Factor out load calculation code from sched/core.c --> sched/proc.cPaul Gortmaker4-570/+587
This large chunk of load calculation code can be easily divorced from the main core.c scheduler file, with only a couple prototypes and externs added to a kernel/sched header. Some recent commits expanded the code and the documentation of it, making it large enough to warrant separation. For example, see: 556061b, "sched/nohz: Fix rq->cpu_load[] calculations" 5aaa0b7, "sched/nohz: Fix rq->cpu_load calculations some more" 5167e8d, "sched/nohz: Rewrite and fix load-avg computation -- again" More importantly, it helps reduce the size of the main sched/core.c by yet another significant amount (~600 lines). Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/1366398650-31599-2-git-send-email-paul.gortmaker@windriver.com Signed-off-by: Ingo Molnar <mingo@kernel.org>