From 28156108fecb1f808b21d216e8ea8f0d205a530c Mon Sep 17 00:00:00 2001
From: Tianchen Ding
Date: Thu, 9 Jun 2022 07:34:11 +0800
Subject: sched: Fix the check of nr_running at queue wakelist

The commit 2ebb17717550 ("sched/core: Offload wakee task activation if it
the wakee is descheduling") checked rq->nr_running <= 1 to avoid task
stacking when WF_ON_CPU.

Per the ordering of writes to p->on_rq and p->on_cpu, observing p->on_cpu
(WF_ON_CPU) in ttwu_queue_cond() implies !p->on_rq, IOW p has gone through
the deactivate_task() in __schedule(), thus p has been accounted out of
rq->nr_running. As such, the task being the only runnable task on the rq
implies reading rq->nr_running == 0 at that point.

The benchmark result is in [1].

[1] https://lore.kernel.org/all/e34de686-4e85-bde1-9f3c-9bbc86b38627@linux.alibaba.com/

Suggested-by: Valentin Schneider
Signed-off-by: Tianchen Ding
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Valentin Schneider
Link: https://lore.kernel.org/r/20220608233412.327341-2-dtcccc@linux.alibaba.com
---
 kernel/sched/core.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

(limited to 'kernel/sched/core.c')

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bfa7452ca92e..294b9184dfe1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3829,8 +3829,12 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags)
 	 * CPU then use the wakelist to offload the task activation to
 	 * the soon-to-be-idle CPU as the current CPU is likely busy.
 	 * nr_running is checked to avoid unnecessary task stacking.
+	 *
+	 * Note that we can only get here with (wakee) p->on_rq=0,
+	 * p->on_cpu can be whatever, we've done the dequeue, so
+	 * the wakee has been accounted out of ->nr_running.
 	 */
-	if ((wake_flags & WF_ON_CPU) && cpu_rq(cpu)->nr_running <= 1)
+	if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)
 		return true;
 
 	return false;
--
cgit v1.2.3


From f3dd3f674555bd9455c5ae7fafce0696bd9931b3 Mon Sep 17 00:00:00 2001
From: Tianchen Ding
Date: Thu, 9 Jun 2022 07:34:12 +0800
Subject: sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle

The wakelist can help avoid cache bouncing and offload work from the
waker cpu. So far, using the wakelist within the same llc only happens
for WF_ON_CPU; removing this limitation can further improve wakeup
performance.

Commit 518cd6234178 ("sched: Only queue remote wakeups when crossing
cache boundaries") disabled queuing tasks on the wakelist when the cpus
share an llc, because at that time the scheduler had to send an IPI to
do ttwu_queue_wakelist. Nowadays ttwu_queue_wakelist also supports
TIF_POLLING, so this is no longer a problem when the wakee cpu is idle
polling.

Benefits:
Queuing the task on an idle cpu improves performance on the waker cpu
and utilization on the wakee cpu, and further improves locality because
the wakee cpu handles its own rq. This patch improves response time on
our real Java workloads, where wakeups happen frequently.

Consider the normal condition (CPU0 and CPU1 share the same llc).

Before this patch:

         CPU0                                        CPU1

    select_task_rq()                                 idle
    rq_lock(CPU1->rq)
    enqueue_task(CPU1->rq)
    notify CPU1 (by sending IPI or CPU1 polling)

                                                     resched()

After this patch:

         CPU0                                        CPU1

    select_task_rq()                                 idle
    add to wakelist of CPU1
    notify CPU1 (by sending IPI or CPU1 polling)

                                                     rq_lock(CPU1->rq)
                                                     enqueue_task(CPU1->rq)
                                                     resched()

CPU0 can finish its work earlier: it only needs to put the task on the
wakelist and return, while CPU1, being idle, handles its own runqueue
data.
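For orientation, here is a condensed sketch of what ttwu_queue_cond() looks
like once this change is applied. It is distilled from the hunk further
below; the hotplug check at the top of the real function is elided:

    static inline bool ttwu_queue_cond(int cpu)
    {
            /* Different LLC: always queue remotely to avoid cache bouncing. */
            if (!cpus_share_cache(smp_processor_id(), cpu))
                    return true;

            /* Queueing to the local CPU makes no sense. */
            if (cpu == smp_processor_id())
                    return false;

            /*
             * Same LLC: offload only when the wakee cpu is idle (or about
             * to become idle), i.e. nothing is accounted in its nr_running.
             */
            return !cpu_rq(cpu)->nr_running;
    }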
This patch does not change the number of IPIs. It only takes effect when
the wakee cpu is:
  1) idle polling
  2) idle not polling

For 1), there is no IPI with or without this patch.

For 2), there is always exactly one IPI, both before and after this patch.
Before this patch, the waker cpu enqueues the task and checks for
preemption; since the idle task is sure to be preempted, the waker cpu
must send a resched IPI. After this patch, the waker cpu puts the task on
the wakee cpu's wakelist and sends an IPI. (A sketch of this notify step
follows the benchmark results below.)

Benchmark:
We've tested schbench, unixbench, and hackbench on both x86 and arm64.

On x86 (Intel Xeon Platinum 8269CY):

  schbench -m 2 -t 8
    Latency percentiles (usec)         before    after
      50.0000th:                            8        6
      75.0000th:                           10        7
      90.0000th:                           11        8
      95.0000th:                           12        8
      *99.0000th:                          13       10
      99.5000th:                           15       11
      99.9000th:                           18       14

  Unixbench with full threads (104)
                                              before          after
    Dhrystone 2 using register variables  3011862938     3009935994    -0.06%
    Double-Precision Whetstone              617119.3       617298.5     0.03%
    Execl Throughput                         27667.3        27627.3    -0.14%
    File Copy 1024 bufsize 2000 maxblocks   785871.4       784906.2    -0.12%
    File Copy 256 bufsize 500 maxblocks     210113.6       212635.4     1.20%
    File Copy 4096 bufsize 8000 maxblocks  2328862.2      2320529.1    -0.36%
    Pipe Throughput                      145535622.8    145323033.2    -0.15%
    Pipe-based Context Switching           3221686.4      3583975.4    11.25%
    Process Creation                        101347.1       103345.4     1.97%
    Shell Scripts (1 concurrent)            120193.5       123977.8     3.15%
    Shell Scripts (8 concurrent)             17233.4        17138.4    -0.55%
    System Call Overhead                   5300604.8      5312213.6     0.22%

  hackbench -g 1 -l 100000
             before    after
    Time      3.246    2.251

On arm64 (Ampere Altra):

  schbench -m 2 -t 8
    Latency percentiles (usec)         before    after
      50.0000th:                           14       10
      75.0000th:                           19       14
      90.0000th:                           22       16
      95.0000th:                           23       16
      *99.0000th:                          24       17
      99.5000th:                           24       17
      99.9000th:                           28       25

  Unixbench with full threads (80)
                                              before          after
    Dhrystone 2 using register variables  3536194249     3537019613     0.02%
    Double-Precision Whetstone              629383.6       629431.6     0.01%
    Execl Throughput                         65920.5        65846.2    -0.11%
    File Copy 1024 bufsize 2000 maxblocks  1063722.8      1064026.8     0.03%
    File Copy 256 bufsize 500 maxblocks     322684.5       318724.5    -1.23%
    File Copy 4096 bufsize 8000 maxblocks  2348285.3      2328804.8    -0.83%
    Pipe Throughput                      133542875.3    131619389.8    -1.44%
    Pipe-based Context Switching           3215356.1      3576945.1    11.25%
    Process Creation                        108520.5       120184.6    10.75%
    Shell Scripts (1 concurrent)            122636.3       121888      -0.61%
    Shell Scripts (8 concurrent)             17462.1        17381.4    -0.46%
    System Call Overhead                   4429998.9      4435006.7     0.11%

  hackbench -g 1 -l 100000
             before    after
    Time      4.217    2.916

Our patch improves schbench, hackbench and the Pipe-based Context
Switching test of unixbench when idle cpus exist, and shows no obvious
regression on the other unixbench tests. This helps improve response time
in scenarios where wakeups happen frequently.
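To make the notify step concrete, here is a rough sketch under the usual
TIF_POLLING_NRFLAG handling. The helper name notify_wakee_cpu() is made up
for this illustration; set_nr_if_polling() and
arch_send_call_function_single_ipi() are real kernel primitives, but the
exact call chain through __ttwu_queue_wakelist() and kernel/smp.c is
simplified here:

    /* Sketch only: the real path goes through __ttwu_queue_wakelist()/kernel/smp.c. */
    static void notify_wakee_cpu(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);

            /*
             * Case 1): the wakee idles with TIF_POLLING_NRFLAG set.
             * Setting TIF_NEED_RESCHED is enough, the polling idle loop
             * notices it and no IPI is sent.
             */
            if (set_nr_if_polling(rq->idle))
                    return;

            /* Case 2): exactly one IPI, just as before this patch. */
            arch_send_call_function_single_ipi(cpu);
    }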
Signed-off-by: Tianchen Ding Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Valentin Schneider Link: https://lore.kernel.org/r/20220608233412.327341-3-dtcccc@linux.alibaba.com --- kernel/sched/core.c | 26 ++++++++++++++------------ kernel/sched/sched.h | 1 - 2 files changed, 14 insertions(+), 13 deletions(-) (limited to 'kernel/sched/core.c') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 294b9184dfe1..723452608bed 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3808,7 +3808,7 @@ bool cpus_share_cache(int this_cpu, int that_cpu) return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu); } -static inline bool ttwu_queue_cond(int cpu, int wake_flags) +static inline bool ttwu_queue_cond(int cpu) { /* * Do not complicate things with the async wake_list while the CPU is @@ -3824,17 +3824,21 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags) if (!cpus_share_cache(smp_processor_id(), cpu)) return true; + if (cpu == smp_processor_id()) + return false; + /* - * If the task is descheduling and the only running task on the - * CPU then use the wakelist to offload the task activation to - * the soon-to-be-idle CPU as the current CPU is likely busy. - * nr_running is checked to avoid unnecessary task stacking. + * If the wakee cpu is idle, or the task is descheduling and the + * only running task on the CPU, then use the wakelist to offload + * the task activation to the idle (or soon-to-be-idle) CPU as + * the current CPU is likely busy. nr_running is checked to + * avoid unnecessary task stacking. * * Note that we can only get here with (wakee) p->on_rq=0, * p->on_cpu can be whatever, we've done the dequeue, so * the wakee has been accounted out of ->nr_running. */ - if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running) + if (!cpu_rq(cpu)->nr_running) return true; return false; @@ -3842,10 +3846,7 @@ static inline bool ttwu_queue_cond(int cpu, int wake_flags) static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags) { - if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) { - if (WARN_ON_ONCE(cpu == smp_processor_id())) - return false; - + if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu)) { sched_clock_cpu(cpu); /* Sync clocks across CPUs */ __ttwu_queue_wakelist(p, cpu, wake_flags); return true; @@ -4167,7 +4168,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) * scheduling. */ if (smp_load_acquire(&p->on_cpu) && - ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU)) + ttwu_queue_wakelist(p, task_cpu(p), wake_flags)) goto unlock; /* @@ -4757,7 +4758,8 @@ static inline void prepare_task(struct task_struct *next) * Claim the task as running, we do this before switching to it * such that any running task will have this set. * - * See the ttwu() WF_ON_CPU case and its ordering comment. + * See the smp_load_acquire(&p->on_cpu) case in ttwu() and + * its ordering comment. 
*/ WRITE_ONCE(next->on_cpu, 1); #endif diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 01259611beb9..1e34bb4527fd 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2039,7 +2039,6 @@ static inline int task_on_rq_migrating(struct task_struct *p) #define WF_SYNC 0x10 /* Waker goes to sleep after wakeup */ #define WF_MIGRATED 0x20 /* Internal use, task got migrated */ -#define WF_ON_CPU 0x40 /* Wakee is on_cpu */ #ifdef CONFIG_SMP static_assert(WF_EXEC == SD_BALANCE_EXEC); -- cgit v1.2.3 From 700a78335fc28a59c307f420857fd2d4521549f8 Mon Sep 17 00:00:00 2001 From: Christian Göttsche Date: Wed, 15 Jun 2022 17:25:04 +0200 Subject: sched: only perform capability check on privileged operation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit sched_setattr(2) issues via kernel/sched/core.c:__sched_setscheduler() a CAP_SYS_NICE audit event unconditionally, even when the requested operation does not require that capability / is unprivileged, i.e. for reducing niceness. This is relevant in connection with SELinux, where a capability check results in a policy decision and by default a denial message on insufficient permission is issued. It can lead to three undesired cases: 1. A denial message is generated, even in case the operation was an unprivileged one and thus the syscall succeeded, creating noise. 2. To avoid the noise from 1. the policy writer adds a rule to ignore those denial messages, hiding future syscalls, where the task performs an actual privileged operation, leading to hidden limited functionality of that task. 3. To avoid the noise from 1. the policy writer adds a rule to allow the task the capability CAP_SYS_NICE, while it does not need it, violating the principle of least privilege. Conduct privilged/unprivileged categorization first and perform a capable test (and at most once) only if needed. Signed-off-by: Christian Göttsche Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20220615152505.310488-1-cgzones@googlemail.com --- kernel/sched/core.c | 138 +++++++++++++++++++++++++++++++--------------------- 1 file changed, 83 insertions(+), 55 deletions(-) (limited to 'kernel/sched/core.c') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 723452608bed..d3e2c5a7c1b7 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6974,17 +6974,29 @@ out_unlock: EXPORT_SYMBOL(set_user_nice); /* - * can_nice - check if a task can reduce its nice value + * is_nice_reduction - check if nice value is an actual reduction + * + * Similar to can_nice() but does not perform a capability check. + * * @p: task * @nice: nice value */ -int can_nice(const struct task_struct *p, const int nice) +static bool is_nice_reduction(const struct task_struct *p, const int nice) { /* Convert nice value [19,-20] to rlimit style value [1,40]: */ int nice_rlim = nice_to_rlimit(nice); - return (nice_rlim <= task_rlimit(p, RLIMIT_NICE) || - capable(CAP_SYS_NICE)); + return (nice_rlim <= task_rlimit(p, RLIMIT_NICE)); +} + +/* + * can_nice - check if a task can reduce its nice value + * @p: task + * @nice: nice value + */ +int can_nice(const struct task_struct *p, const int nice) +{ + return is_nice_reduction(p, nice) || capable(CAP_SYS_NICE); } #ifdef __ARCH_WANT_SYS_NICE @@ -7263,6 +7275,69 @@ static bool check_same_owner(struct task_struct *p) return match; } +/* + * Allow unprivileged RT tasks to decrease priority. 
+ * Only issue a capable test if needed and only once to avoid an audit + * event on permitted non-privileged operations: + */ +static int user_check_sched_setscheduler(struct task_struct *p, + const struct sched_attr *attr, + int policy, int reset_on_fork) +{ + if (fair_policy(policy)) { + if (attr->sched_nice < task_nice(p) && + !is_nice_reduction(p, attr->sched_nice)) + goto req_priv; + } + + if (rt_policy(policy)) { + unsigned long rlim_rtprio = task_rlimit(p, RLIMIT_RTPRIO); + + /* Can't set/change the rt policy: */ + if (policy != p->policy && !rlim_rtprio) + goto req_priv; + + /* Can't increase priority: */ + if (attr->sched_priority > p->rt_priority && + attr->sched_priority > rlim_rtprio) + goto req_priv; + } + + /* + * Can't set/change SCHED_DEADLINE policy at all for now + * (safest behavior); in the future we would like to allow + * unprivileged DL tasks to increase their relative deadline + * or reduce their runtime (both ways reducing utilization) + */ + if (dl_policy(policy)) + goto req_priv; + + /* + * Treat SCHED_IDLE as nice 20. Only allow a switch to + * SCHED_NORMAL if the RLIMIT_NICE would normally permit it. + */ + if (task_has_idle_policy(p) && !idle_policy(policy)) { + if (!is_nice_reduction(p, task_nice(p))) + goto req_priv; + } + + /* Can't change other user's priorities: */ + if (!check_same_owner(p)) + goto req_priv; + + /* Normal users shall not reset the sched_reset_on_fork flag: */ + if (p->sched_reset_on_fork && !reset_on_fork) + goto req_priv; + + return 0; + +req_priv: + if (!capable(CAP_SYS_NICE)) + return -EPERM; + + return 0; +} + static int __sched_setscheduler(struct task_struct *p, const struct sched_attr *attr, bool user, bool pi) @@ -7304,58 +7379,11 @@ recheck: (rt_policy(policy) != (attr->sched_priority != 0))) return -EINVAL; - /* - * Allow unprivileged RT tasks to decrease priority: - */ - if (user && !capable(CAP_SYS_NICE)) { - if (fair_policy(policy)) { - if (attr->sched_nice < task_nice(p) && - !can_nice(p, attr->sched_nice)) - return -EPERM; - } - - if (rt_policy(policy)) { - unsigned long rlim_rtprio = - task_rlimit(p, RLIMIT_RTPRIO); - - /* Can't set/change the rt policy: */ - if (policy != p->policy && !rlim_rtprio) - return -EPERM; - - /* Can't increase priority: */ - if (attr->sched_priority > p->rt_priority && - attr->sched_priority > rlim_rtprio) - return -EPERM; - } - - /* - * Can't set/change SCHED_DEADLINE policy at all for now - * (safest behavior); in the future we would like to allow - * unprivileged DL tasks to increase their relative deadline - * or reduce their runtime (both ways reducing utilization) - */ - if (dl_policy(policy)) - return -EPERM; - - /* - * Treat SCHED_IDLE as nice 20. Only allow a switch to - * SCHED_NORMAL if the RLIMIT_NICE would normally permit it. 
- */ - if (task_has_idle_policy(p) && !idle_policy(policy)) { - if (!can_nice(p, task_nice(p))) - return -EPERM; - } - - /* Can't change other user's priorities: */ - if (!check_same_owner(p)) - return -EPERM; - - /* Normal users shall not reset the sched_reset_on_fork flag: */ - if (p->sched_reset_on_fork && !reset_on_fork) - return -EPERM; - } - if (user) { + retval = user_check_sched_setscheduler(p, attr, policy, reset_on_fork); + if (retval) + return retval; + if (attr->sched_flags & SCHED_FLAG_SUGOV) return -EINVAL; -- cgit v1.2.3 From bb4479994945e9170534389a7762eb56149320ac Mon Sep 17 00:00:00 2001 From: Dietmar Eggemann Date: Tue, 21 Jun 2022 10:04:10 +0100 Subject: sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() effective_cpu_util() already has a `int cpu' parameter which allows to retrieve the CPU capacity scale factor (or maximum CPU capacity) inside this function via an arch_scale_cpu_capacity(cpu). A lot of code calling effective_cpu_util() (or the shim sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call arch_scale_cpu_capacity() already. But not having to pass it into effective_cpu_util() will make the EAS wake-up code easier, especially when the maximum CPU capacity reduced by the thermal pressure is passed through the EAS wake-up functions. Due to the asymmetric CPU capacity support of arm/arm64 architectures, arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via per_cpu(cpu_scale, cpu) on such a system. On all other architectures it is a a compile-time constant (SCHED_CAPACITY_SCALE). Signed-off-by: Dietmar Eggemann Signed-off-by: Peter Zijlstra (Intel) Acked-by: Vincent Guittot Tested-by: Lukasz Luba Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com --- drivers/powercap/dtpm_cpu.c | 33 +++++++++------------------------ drivers/thermal/cpufreq_cooling.c | 6 ++---- include/linux/sched.h | 2 +- kernel/sched/core.c | 11 ++++++----- kernel/sched/cpufreq_schedutil.c | 5 ++--- kernel/sched/fair.c | 21 ++++++++++----------- kernel/sched/sched.h | 2 +- 7 files changed, 31 insertions(+), 49 deletions(-) (limited to 'kernel/sched/core.c') diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c index f5eced0842b3..6a88eb7e9f75 100644 --- a/drivers/powercap/dtpm_cpu.c +++ b/drivers/powercap/dtpm_cpu.c @@ -71,34 +71,19 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit) static u64 scale_pd_power_uw(struct cpumask *pd_mask, u64 power) { - unsigned long max = 0, sum_util = 0; + unsigned long max, sum_util = 0; int cpu; - for_each_cpu_and(cpu, pd_mask, cpu_online_mask) { - - /* - * The capacity is the same for all CPUs belonging to - * the same perf domain, so a single call to - * arch_scale_cpu_capacity() is enough. However, we - * need the CPU parameter to be initialized by the - * loop, so the call ends up in this block. - * - * We can initialize 'max' with a cpumask_first() call - * before the loop but the bits computation is not - * worth given the arch_scale_cpu_capacity() just - * returns a value where the resulting assembly code - * will be optimized by the compiler. - */ - max = arch_scale_cpu_capacity(cpu); - sum_util += sched_cpu_util(cpu, max); - } - /* - * In the improbable case where all the CPUs of the perf - * domain are offline, 'max' will be zero and will lead to an - * illegal operation with a zero division. + * The capacity is the same for all CPUs belonging to + * the same perf domain. */ - return max ? 
(power * ((sum_util << 10) / max)) >> 10 : 0; + max = arch_scale_cpu_capacity(cpumask_first(pd_mask)); + + for_each_cpu_and(cpu, pd_mask, cpu_online_mask) + sum_util += sched_cpu_util(cpu); + + return (power * ((sum_util << 10) / max)) >> 10; } static u64 get_pd_power_uw(struct dtpm *dtpm) diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c index b8151d95a806..b263b0fde03c 100644 --- a/drivers/thermal/cpufreq_cooling.c +++ b/drivers/thermal/cpufreq_cooling.c @@ -137,11 +137,9 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev, static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu, int cpu_idx) { - unsigned long max = arch_scale_cpu_capacity(cpu); - unsigned long util; + unsigned long util = sched_cpu_util(cpu); - util = sched_cpu_util(cpu, max); - return (util * 100) / max; + return (util * 100) / arch_scale_cpu_capacity(cpu); } #else /* !CONFIG_SMP */ static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu, diff --git a/include/linux/sched.h b/include/linux/sched.h index c46f3a63b758..88b8817b827d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2257,7 +2257,7 @@ static inline bool owner_on_cpu(struct task_struct *owner) } /* Returns effective CPU energy utilization, as seen by the scheduler */ -unsigned long sched_cpu_util(int cpu, unsigned long max); +unsigned long sched_cpu_util(int cpu); #endif /* CONFIG_SMP */ #ifdef CONFIG_RSEQ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d3e2c5a7c1b7..c538a0ac4617 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7125,12 +7125,14 @@ struct task_struct *idle_task(int cpu) * required to meet deadlines. */ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, - unsigned long max, enum cpu_util_type type, + enum cpu_util_type type, struct task_struct *p) { - unsigned long dl_util, util, irq; + unsigned long dl_util, util, irq, max; struct rq *rq = cpu_rq(cpu); + max = arch_scale_cpu_capacity(cpu); + if (!uclamp_is_used() && type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) { return max; @@ -7210,10 +7212,9 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, return min(max, util); } -unsigned long sched_cpu_util(int cpu, unsigned long max) +unsigned long sched_cpu_util(int cpu) { - return effective_cpu_util(cpu, cpu_util_cfs(cpu), max, - ENERGY_UTIL, NULL); + return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL); } #endif /* CONFIG_SMP */ diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 3dbf351d12d5..1207c78f85c1 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -157,11 +157,10 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy, static void sugov_get_util(struct sugov_cpu *sg_cpu) { struct rq *rq = cpu_rq(sg_cpu->cpu); - unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu); - sg_cpu->max = max; + sg_cpu->max = arch_scale_cpu_capacity(sg_cpu->cpu); sg_cpu->bw_dl = cpu_bw_dl(rq); - sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu), max, + sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu), FREQUENCY_UTIL, NULL); } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 80be1f1a5620..6de09b26b455 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6712,12 +6712,11 @@ static long compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd) { struct cpumask *pd_mask = perf_domain_span(pd); - unsigned long 
cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask)); - unsigned long max_util = 0, sum_util = 0; - unsigned long _cpu_cap = cpu_cap; + unsigned long max_util = 0, sum_util = 0, cpu_cap; int cpu; - _cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask)); + cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask)); + cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask)); /* * The capacity state of CPUs of the current rd can be driven by CPUs @@ -6754,10 +6753,10 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd) * is already enough to scale the EM reported power * consumption at the (eventually clamped) cpu_capacity. */ - cpu_util = effective_cpu_util(cpu, util_running, cpu_cap, - ENERGY_UTIL, NULL); + cpu_util = effective_cpu_util(cpu, util_running, ENERGY_UTIL, + NULL); - sum_util += min(cpu_util, _cpu_cap); + sum_util += min(cpu_util, cpu_cap); /* * Performance domain frequency: utilization clamping @@ -6766,12 +6765,12 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd) * NOTE: in case RT tasks are running, by default the * FREQUENCY_UTIL's utilization can be max OPP. */ - cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap, - FREQUENCY_UTIL, tsk); - max_util = max(max_util, min(cpu_util, _cpu_cap)); + cpu_util = effective_cpu_util(cpu, util_freq, FREQUENCY_UTIL, + tsk); + max_util = max(max_util, min(cpu_util, cpu_cap)); } - return em_cpu_energy(pd->em_pd, max_util, sum_util, _cpu_cap); + return em_cpu_energy(pd->em_pd, max_util, sum_util, cpu_cap); } /* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 76b0027fd0c8..73ae32898f25 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2886,7 +2886,7 @@ enum cpu_util_type { }; unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, - unsigned long max, enum cpu_util_type type, + enum cpu_util_type type, struct task_struct *p); static inline unsigned long cpu_bw_dl(struct rq *rq) -- cgit v1.2.3 From ec4fc801a02d96180c597238fe87141471b70971 Mon Sep 17 00:00:00 2001 From: Dietmar Eggemann Date: Thu, 23 Jun 2022 11:11:02 +0200 Subject: sched/fair: Rename select_idle_mask to select_rq_mask On 21/06/2022 11:04, Vincent Donnefort wrote: > From: Dietmar Eggemann https://lkml.kernel.org/r/202206221253.ZVyGQvPX-lkp@intel.com discovered that this patch doesn't build anymore (on tip sched/core or linux-next) because of commit f5b2eeb499910 ("sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group()"). New version of [PATCH v11 4/7] sched/fair: Rename select_idle_mask to select_rq_mask below. -- >8 -- Decouple the name of the per-cpu cpumask select_idle_mask from its usage in select_idle_[cpu/capacity]() of the CFS run-queue selection (select_task_rq_fair()). This is to support the reuse of this cpumask in the Energy Aware Scheduling (EAS) path (find_energy_efficient_cpu()) of the CFS run-queue selection. 
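For context, the rename does not change how the per-CPU scratch mask is
used. Below is a minimal sketch of the usage pattern, distilled from the
hunks that follow; example_pick_cpu() is a made-up name, not a kernel
function:

    /* One scratch cpumask per CPU, allocated in sched_init(). */
    DEFINE_PER_CPU(cpumask_var_t, select_rq_mask);

    static int example_pick_cpu(struct task_struct *p, struct sched_domain *sd)
    {
            /* Borrow this CPU's scratch mask for the duration of the selection. */
            struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);

            /* Limit the search to CPUs of the domain the task is allowed on. */
            cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

            /* Placeholder for the real scan, see select_idle_cpu() and friends. */
            return cpumask_first(cpus);
    }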
Signed-off-by: Dietmar Eggemann Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Vincent Guittot Tested-by: Lukasz Luba Link: https://lkml.kernel.org/r/250691c7-0e2b-05ab-bedf-b245c11d9400@arm.com --- kernel/sched/core.c | 4 ++-- kernel/sched/fair.c | 10 +++++----- 2 files changed, 7 insertions(+), 7 deletions(-) (limited to 'kernel/sched/core.c') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c538a0ac4617..dd69e85b7879 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -9536,7 +9536,7 @@ static struct kmem_cache *task_group_cache __read_mostly; #endif DECLARE_PER_CPU(cpumask_var_t, load_balance_mask); -DECLARE_PER_CPU(cpumask_var_t, select_idle_mask); +DECLARE_PER_CPU(cpumask_var_t, select_rq_mask); void __init sched_init(void) { @@ -9585,7 +9585,7 @@ void __init sched_init(void) for_each_possible_cpu(i) { per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node( cpumask_size(), GFP_KERNEL, cpu_to_node(i)); - per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node( + per_cpu(select_rq_mask, i) = (cpumask_var_t)kzalloc_node( cpumask_size(), GFP_KERNEL, cpu_to_node(i)); } #endif /* CONFIG_CPUMASK_OFFSTACK */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6de09b26b455..e3f750135f78 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5894,7 +5894,7 @@ dequeue_throttle: /* Working cpumask for: load_balance, load_balance_newidle. */ DEFINE_PER_CPU(cpumask_var_t, load_balance_mask); -DEFINE_PER_CPU(cpumask_var_t, select_idle_mask); +DEFINE_PER_CPU(cpumask_var_t, select_rq_mask); #ifdef CONFIG_NO_HZ_COMMON @@ -6384,7 +6384,7 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd */ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target) { - struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask); + struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask); int i, cpu, idle_cpu = -1, nr = INT_MAX; struct sched_domain_shared *sd_share; struct rq *this_rq = this_rq(); @@ -6482,7 +6482,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target) int cpu, best_cpu = -1; struct cpumask *cpus; - cpus = this_cpu_cpumask_var_ptr(select_idle_mask); + cpus = this_cpu_cpumask_var_ptr(select_rq_mask); cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); task_util = uclamp_task_util(p); @@ -6532,7 +6532,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) } /* - * per-cpu select_idle_mask usage + * per-cpu select_rq_mask usage */ lockdep_assert_irqs_disabled(); @@ -9255,7 +9255,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) * take care of it. */ if (p->nr_cpus_allowed != NR_CPUS) { - struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask); + struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask); cpumask_and(cpus, sched_group_span(local), p->cpus_ptr); imb_numa_nr = min(cpumask_weight(cpus), sd->imb_numa_nr); -- cgit v1.2.3 From c02d5546ea34d589c83eda5055dbd727a396642b Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Wed, 29 Jun 2022 17:15:52 +0200 Subject: sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling Use try_cmpxchg instead of cmpxchg (*ptr, old, new) != old in set_nr_{and_not,if}_polling. x86 cmpxchg returns success in ZF flag, so this change saves a compare after cmpxchg. The definition of cmpxchg based fetch_or was changed in the same way as atomic_fetch_##op definitions were changed in e6790e4b5d5e97dc287f3496dd2cf2dbabdfdb35. 
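The idiom change is easiest to see in isolation. Below is a minimal
before/after sketch of a fetch_or-style helper, written as plain functions
instead of the macro and mirroring the hunk that follows:

    /* Before: cmpxchg() returns the old value, so the caller must compare it. */
    static unsigned long fetch_or_cmpxchg(unsigned long *ptr, unsigned long mask)
    {
            unsigned long old, val = *ptr;

            for (;;) {
                    old = cmpxchg(ptr, val, val | mask);
                    if (old == val)
                            break;
                    val = old;
            }
            return old;
    }

    /*
     * After: try_cmpxchg() reports success directly and, on failure, updates
     * 'val' with the current value, so the loop body collapses and x86 can
     * branch on ZF without a separate compare.
     */
    static unsigned long fetch_or_try_cmpxchg(unsigned long *ptr, unsigned long mask)
    {
            unsigned long val = *ptr;

            do {
            } while (!try_cmpxchg(ptr, &val, val | mask));

            return val;
    }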
Also declare these two functions as inline to ensure inlining. In the case of set_nr_and_not_polling, the compiler (gcc) tries to outsmart itself by constructing the boolean return value with logic operations on the fetched value, and these extra operations enlarge the function over the inlining threshold value. Signed-off-by: Uros Bizjak Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20220629151552.6015-1-ubizjak@gmail.com --- kernel/sched/core.c | 24 +++++++++--------------- 1 file changed, 9 insertions(+), 15 deletions(-) (limited to 'kernel/sched/core.c') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dd69e85b7879..c703d177f62d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -873,15 +873,11 @@ static inline void hrtick_rq_init(struct rq *rq) ({ \ typeof(ptr) _ptr = (ptr); \ typeof(mask) _mask = (mask); \ - typeof(*_ptr) _old, _val = *_ptr; \ + typeof(*_ptr) _val = *_ptr; \ \ - for (;;) { \ - _old = cmpxchg(_ptr, _val, _val | _mask); \ - if (_old == _val) \ - break; \ - _val = _old; \ - } \ - _old; \ + do { \ + } while (!try_cmpxchg(_ptr, &_val, _val | _mask)); \ + _val; \ }) #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG) @@ -890,7 +886,7 @@ static inline void hrtick_rq_init(struct rq *rq) * this avoids any races wrt polling state changes and thereby avoids * spurious IPIs. */ -static bool set_nr_and_not_polling(struct task_struct *p) +static inline bool set_nr_and_not_polling(struct task_struct *p) { struct thread_info *ti = task_thread_info(p); return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG); @@ -905,30 +901,28 @@ static bool set_nr_and_not_polling(struct task_struct *p) static bool set_nr_if_polling(struct task_struct *p) { struct thread_info *ti = task_thread_info(p); - typeof(ti->flags) old, val = READ_ONCE(ti->flags); + typeof(ti->flags) val = READ_ONCE(ti->flags); for (;;) { if (!(val & _TIF_POLLING_NRFLAG)) return false; if (val & _TIF_NEED_RESCHED) return true; - old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED); - if (old == val) + if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED)) break; - val = old; } return true; } #else -static bool set_nr_and_not_polling(struct task_struct *p) +static inline bool set_nr_and_not_polling(struct task_struct *p) { set_tsk_need_resched(p); return true; } #ifdef CONFIG_SMP -static bool set_nr_if_polling(struct task_struct *p) +static inline bool set_nr_if_polling(struct task_struct *p) { return false; } -- cgit v1.2.3 From 401e4963bf45c800e3e9ea0d3a0289d738005fd4 Mon Sep 17 00:00:00 2001 From: John Keeping Date: Fri, 8 Jul 2022 17:27:02 +0100 Subject: sched/core: Always flush pending blk_plug With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two normal priority tasks (SCHED_OTHER, nice level zero): INFO: task kworker/u8:0:8 blocked for more than 491 seconds. Not tainted 5.15.49-rt46 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000 Workqueue: writeback wb_workfn (flush-7:0) [] (__schedule) from [] (schedule+0xdc/0x134) [] (schedule) from [] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174) [] (rt_mutex_slowlock_block.constprop.0) from [] +(rt_mutex_slowlock.constprop.0+0xac/0x174) [] (rt_mutex_slowlock.constprop.0) from [] (fat_write_inode+0x34/0x54) [] (fat_write_inode) from [] (__writeback_single_inode+0x354/0x3ec) [] (__writeback_single_inode) from [] (writeback_sb_inodes+0x250/0x45c) [] (writeback_sb_inodes) from [] (__writeback_inodes_wb+0x7c/0xb8) [] (__writeback_inodes_wb) from [] (wb_writeback+0x2c8/0x2e4) [] (wb_writeback) from [] (wb_workfn+0x1a4/0x3e4) [] (wb_workfn) from [] (process_one_work+0x1fc/0x32c) [] (process_one_work) from [] (worker_thread+0x22c/0x2d8) [] (worker_thread) from [] (kthread+0x16c/0x178) [] (kthread) from [] (ret_from_fork+0x14/0x38) Exception stack(0xc10e3fb0 to 0xc10e3ff8) 3fa0: 00000000 00000000 00000000 00000000 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 INFO: task tar:2083 blocked for more than 491 seconds. Not tainted 5.15.49-rt46 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:tar state:D stack: 0 pid: 2083 ppid: 2082 flags:0x00000000 [] (__schedule) from [] (schedule+0xdc/0x134) [] (schedule) from [] (io_schedule+0x14/0x24) [] (io_schedule) from [] (bit_wait_io+0xc/0x30) [] (bit_wait_io) from [] (__wait_on_bit_lock+0x54/0xa8) [] (__wait_on_bit_lock) from [] (out_of_line_wait_on_bit_lock+0x84/0xb0) [] (out_of_line_wait_on_bit_lock) from [] (fat_mirror_bhs+0xa0/0x144) [] (fat_mirror_bhs) from [] (fat_alloc_clusters+0x138/0x2a4) [] (fat_alloc_clusters) from [] (fat_alloc_new_dir+0x34/0x250) [] (fat_alloc_new_dir) from [] (vfat_mkdir+0x58/0x148) [] (vfat_mkdir) from [] (vfs_mkdir+0x68/0x98) [] (vfs_mkdir) from [] (do_mkdirat+0xb0/0xec) [] (do_mkdirat) from [] (ret_fast_syscall+0x0/0x1c) Exception stack(0xc2e1bfa8 to 0xc2e1bff0) bfa0: 01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000 bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190 bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc Here the kworker is waiting on msdos_sb_info::s_lock which is held by tar which is in turn waiting for a buffer which is locked waiting to be flushed, but this operation is plugged in the kworker. The lock is a normal struct mutex, so tsk_is_pi_blocked() will always return false on !RT and thus the behaviour changes for RT. It seems that the intent here is to skip blk_flush_plug() in the case where a non-preemptible lock (such as a spinlock) has been converted to a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT schedule flag. But sched_submit_work() is only called from schedule() which is never called in this scenario, so the check can simply be deleted. Looking at the history of the -rt patchset, in fact this change was present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b ("sched/core: Rework the __schedule() preempt argument"). As described in [1]: The schedule process must distinguish between blocking on a regular sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock and rwlock): - rwsem and mutex must flush block requests (blk_schedule_flush_plug()) even if blocked on a lock. This can not deadlock because this also happens for non-RT. 
There should be a warning if the scheduling point is within a RCU read section. - spinlock and rwlock must not flush block requests. This will deadlock if the callback attempts to acquire a lock which is already acquired. Similarly to being preempted, there should be no warning if the scheduling point is within a RCU read section. and with the tsk_is_pi_blocked() in the scheduler path, we hit the first issue. [1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches Signed-off-by: John Keeping Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Steven Rostedt (Google) Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com --- include/linux/sched/rt.h | 8 -------- kernel/sched/core.c | 8 ++++++-- 2 files changed, 6 insertions(+), 10 deletions(-) (limited to 'kernel/sched/core.c') diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h index e5af028c08b4..994c25640e15 100644 --- a/include/linux/sched/rt.h +++ b/include/linux/sched/rt.h @@ -39,20 +39,12 @@ static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p) } extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task); extern void rt_mutex_adjust_pi(struct task_struct *p); -static inline bool tsk_is_pi_blocked(struct task_struct *tsk) -{ - return tsk->pi_blocked_on != NULL; -} #else static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task) { return NULL; } # define rt_mutex_adjust_pi(p) do { } while (0) -static inline bool tsk_is_pi_blocked(struct task_struct *tsk) -{ - return false; -} #endif extern void normalize_rt_tasks(void); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c703d177f62d..a463dbc92fcd 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6470,8 +6470,12 @@ static inline void sched_submit_work(struct task_struct *tsk) io_wq_worker_sleeping(tsk); } - if (tsk_is_pi_blocked(tsk)) - return; + /* + * spinlock and rwlock must not flush block requests. This will + * deadlock if the callback attempts to acquire a lock which is + * already acquired. + */ + SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT); /* * If we are going to sleep and we have plugged IO queued, -- cgit v1.2.3