When running an ISV workload on a large system (240 cores, SMT8), it was
observed from the perf profile that newidle_balance and enqueue_task_fair
were consuming a significant number of cycles. Perf annotate showed that most
of the time was spent accessing the overutilized field of the root domain.
Aboorva was able to simulate a similar perf profile by making some
changes to stress-ng --wait. Both newidle_balance and enqueue_task_fair
consume close to 5-7% each. Perf annotate shows that most of the cycles are
spent accessing the rd->overutilized field.
Overutilized was added for EAS (Energy Aware Scheduler) to decide whether
to do load balance or not. Simultaneous access to overutilized by
multiple CPUs leads to cache invalidations due to true sharing. Updating
overutilized is not required on non-EAS platforms. Since overutilized and
overload are part of the same cacheline, there is false sharing as well.
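As a rough illustration (a simplified stand-in, not the exact struct
root_domain layout), the two fields are adjacent ints, so they almost
always land on the same cacheline:

	#include <stddef.h>
	#include <stdio.h>

	/*
	 * Simplified stand-in for the relevant part of struct root_domain.
	 * A write to either field invalidates the shared cacheline on every
	 * other CPU that has it cached.
	 */
	struct root_domain_sketch {
		int overload;		/* written often in update_sd_lb_stats() */
		int overutilized;	/* written on enqueue/tick for EAS */
	};

	int main(void)
	{
		printf("overload @%zu, overutilized @%zu -> same cacheline\n",
		       offsetof(struct root_domain_sketch, overload),
		       offsetof(struct root_domain_sketch, overutilized));
		return 0;
	}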
Patch 1/2 is the main patch. It reduces the issue described above: with the
patch, neither function shows up in the profile. A before/after comparison
is in the changelog. With the patch, the stated problem in the ISV workload
was also solved and throughput improved.
Patch 2/2 is only code refactoring to use a helper function instead of
accessing the field directly.
A detailed perf annotate can be found in the cover letter of v1.
v3 -> v4:
- Corrected a mistake where an EAS check was missed.
v2 -> v3:
- Pierre and Dietmar suggested adding one more EAS check before calling
  cpu_overutilized. That makes sense since that value is not used anyway
  in the non-EAS case.
- Refactored the code as Dietmar suggested to avoid an additional call to
  sched_energy_enabled().
- Minor edits to the changelog.
v1 -> v2:
- Chen Yu pointed out a minor issue in the code. Corrected that code and
  updated the changelog.
v1: https://lore.kernel.org/lkml/[email protected]/
v2: https://lore.kernel.org/lkml/[email protected]/
v3: https://lore.kernel.org/lkml/[email protected]/
Shrikanth Hegde (2):
sched/fair: Add EAS checks before updating overutilized
sched/fair: Use helper function to access rd->overutilized
kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++---------------
1 file changed, 42 insertions(+), 21 deletions(-)
--
2.39.3
The overutilized field of the root domain is only used by EAS (Energy Aware
Scheduler) to decide whether to do load balance or not. It is not used if
EAS is not possible.
Currently enqueue_task_fair and task_tick_fair access, and sometimes update,
this field. In update_sd_lb_stats it is updated often. This causes cache
contention due to true sharing and burns a lot of cycles. overload and
overutilized are part of the same cacheline. Updating it often invalidates
the cacheline. That causes access to overload to slow down due to
false sharing. Hence add an EAS check before accessing/updating this field.
The EAS check is optimized out at compile time, or is a static branch
otherwise. Hence it shouldn't cost much.
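For reference, sched_energy_enabled() is defined in kernel/sched/sched.h
roughly as follows; when EAS cannot be built in, it is a compile-time false
that the compiler optimizes out, otherwise it is a static branch:

	#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
	DECLARE_STATIC_KEY_FALSE(sched_energy_present);

	static inline bool sched_energy_enabled(void)
	{
		return static_branch_unlikely(&sched_energy_present);
	}
	#else
	static inline bool sched_energy_enabled(void)
	{
		return false;
	}
	#endif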
With the patch, neither enqueue_task_fair nor newidle_balance shows up
as a hot routine in the perf profile.
6.8-rc4:
7.18% swapper [kernel.vmlinux] [k] enqueue_task_fair
6.78% s [kernel.vmlinux] [k] newidle_balance
+patch:
0.14% swapper [kernel.vmlinux] [k] enqueue_task_fair
0.00% swapper [kernel.vmlinux] [k] newidle_balance
Minor change: trace_sched_overutilized_tp expects its second argument to
be bool, so do an int to bool conversion for it.
Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Shrikanth Hegde <[email protected]>
---
kernel/sched/fair.c | 49 +++++++++++++++++++++++++++------------------
1 file changed, 30 insertions(+), 19 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a16129f9a5c..a71f8a1506e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6670,15 +6670,29 @@ static inline bool cpu_overutilized(int cpu)
return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
}
-static inline void update_overutilized_status(struct rq *rq)
+static inline void set_rd_overutilized_status(struct root_domain *rd,
+ unsigned int status)
{
- if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
- WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
- trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
- }
+ WRITE_ONCE(rd->overutilized, status);
+ trace_sched_overutilized_tp(rd, !!status);
+}
+
+static inline void check_update_overutilized_status(struct rq *rq)
+{
+ /*
+ * overutilized field is used for load balancing decisions only
+ * if energy aware scheduler is being used
+ */
+ if (!sched_energy_enabled())
+ return;
+
+ if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+ set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
}
#else
-static inline void update_overutilized_status(struct rq *rq) { }
+static inline void check_update_overutilized_status(struct rq *rq) { }
+static inline void set_rd_overutilized_status(struct root_domain *rd,
+ unsigned int status) { }
#endif
/* Runqueue only has SCHED_IDLE tasks enqueued */
@@ -6779,7 +6793,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
* and the following generally works well enough in practice.
*/
if (!task_new)
- update_overutilized_status(rq);
+ check_update_overutilized_status(rq);
enqueue_throttle:
assert_list_leaf_cfs_rq(rq);
@@ -9902,7 +9916,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (nr_running > 1)
*sg_status |= SG_OVERLOAD;
- if (cpu_overutilized(i))
+ if (sched_energy_enabled() && cpu_overutilized(i))
*sg_status |= SG_OVERUTILIZED;
#ifdef CONFIG_NUMA_BALANCING
@@ -10596,19 +10610,16 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
env->fbq_type = fbq_classify_group(&sds->busiest_stat);
if (!env->sd->parent) {
- struct root_domain *rd = env->dst_rq->rd;
-
/* update overload indicator if we are at root domain */
- WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);
+ WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
/* Update over-utilization (tipping point, U >= 0) indicator */
- WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
- trace_sched_overutilized_tp(rd, sg_status & SG_OVERUTILIZED);
- } else if (sg_status & SG_OVERUTILIZED) {
- struct root_domain *rd = env->dst_rq->rd;
-
- WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
- trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
+ if (sched_energy_enabled()) {
+ set_rd_overutilized_status(env->dst_rq->rd,
+ sg_status & SG_OVERUTILIZED);
+ }
+ } else if (sched_energy_enabled() && (sg_status & SG_OVERUTILIZED)) {
+ set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
}
update_idle_cpu_scan(env, sum_util);
@@ -12609,7 +12620,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
task_tick_numa(rq, curr);
update_misfit_status(curr, rq);
- update_overutilized_status(task_rq(curr));
+ check_update_overutilized_status(task_rq(curr));
task_tick_core(rq, curr);
}
--
2.39.3
The overutilized field is accessed directly in multiple places, so it
could use a helper function. That way one is better informed that it is
meant to be used only in the EAS case.
No change in functionality intended.
Signed-off-by: Shrikanth Hegde <[email protected]>
---
kernel/sched/fair.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a71f8a1506e4..650909a648d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6670,6 +6670,15 @@ static inline bool cpu_overutilized(int cpu)
return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
}
+/*
+ * Ensure that caller can do EAS. overutilized value
+ * make sense only if EAS is enabled
+ */
+static inline int is_rd_overutilized(struct root_domain *rd)
+{
+ return READ_ONCE(rd->overutilized);
+}
+
static inline void set_rd_overutilized_status(struct root_domain *rd,
unsigned int status)
{
@@ -6686,13 +6695,14 @@ static inline void check_update_overutilized_status(struct rq *rq)
if (!sched_energy_enabled())
return;
- if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+ if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
}
#else
static inline void check_update_overutilized_status(struct rq *rq) { }
static inline void set_rd_overutilized_status(struct root_domain *rd,
unsigned int status) { }
+static inline int is_rd_overutilized(struct root_domain *rd) { }
#endif
/* Runqueue only has SCHED_IDLE tasks enqueued */
@@ -7974,7 +7984,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
rcu_read_lock();
pd = rcu_dereference(rd->pd);
- if (!pd || READ_ONCE(rd->overutilized))
+ if (!pd || is_rd_overutilized(rd))
goto unlock;
/*
@@ -10859,7 +10869,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
if (sched_energy_enabled()) {
struct root_domain *rd = env->dst_rq->rd;
- if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
+ if (rcu_dereference(rd->pd) && !is_rd_overutilized(rd))
goto out_balanced;
}
--
2.39.3
On 03/01/24 20:47, Shrikanth Hegde wrote:
> Overutilized field of root domain is only used for EAS(energy aware scheduler)
> to decide whether to do load balance or not. It is not used if EAS
> not possible.
>
> Currently enqueue_task_fair and task_tick_fair accesses, sometime updates
> this field. In update_sd_lb_stats it is updated often. This causes cache
> contention due to true sharing and burns a lot of cycles. overload and
> overutilized are part of the same cacheline. Updating it often invalidates
> the cacheline. That causes access to overload to slow down due to
> false sharing. Hence add EAS check before accessing/updating this field.
> EAS check is optimized at compile time or it is a static branch.
> Hence it shouldn't cost much.
>
> With the patch, both enqueue_task_fair and newidle_balance don't show
> up as hot routines in perf profile.
>
> 6.8-rc4:
> 7.18% swapper [kernel.vmlinux] [k] enqueue_task_fair
> 6.78% s [kernel.vmlinux] [k] newidle_balance
> +patch:
> 0.14% swapper [kernel.vmlinux] [k] enqueue_task_fair
> 0.00% swapper [kernel.vmlinux] [k] newidle_balance
>
> Minor change: trace_sched_overutilized_tp expect that second argument to
> be bool. So do a int to bool conversion for that.
>
> Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
> Reviewed-by: Srikar Dronamraju <[email protected]>
> Signed-off-by: Shrikanth Hegde <[email protected]>
> ---
> kernel/sched/fair.c | 49 +++++++++++++++++++++++++++------------------
> 1 file changed, 30 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6a16129f9a5c..a71f8a1506e4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6670,15 +6670,29 @@ static inline bool cpu_overutilized(int cpu)
> return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> }
>
> -static inline void update_overutilized_status(struct rq *rq)
> +static inline void set_rd_overutilized_status(struct root_domain *rd,
> + unsigned int status)
> {
> - if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
> - WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
> - trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
> - }
Can we add
if (!sched_energy_enabled())
return;
here and avoid sprinkling the condition in various other places instead?
> + WRITE_ONCE(rd->overutilized, status);
> + trace_sched_overutilized_tp(rd, !!status);
> +}
> +
> +static inline void check_update_overutilized_status(struct rq *rq)
> +{
> + /*
> + * overutilized field is used for load balancing decisions only
> + * if energy aware scheduler is being used
> + */
nit: I think this comment is unnecessary but I don't mind keeping it
> + if (!sched_energy_enabled())
> + return;
> +
> + if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
> + set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
> }
> #else
> -static inline void update_overutilized_status(struct rq *rq) { }
> +static inline void check_update_overutilized_status(struct rq *rq) { }
> +static inline void set_rd_overutilized_status(struct root_domain *rd,
> + unsigned int status) { }
> #endif
>
> /* Runqueue only has SCHED_IDLE tasks enqueued */
> @@ -6779,7 +6793,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> * and the following generally works well enough in practice.
> */
> if (!task_new)
> - update_overutilized_status(rq);
> + check_update_overutilized_status(rq);
>
> enqueue_throttle:
> assert_list_leaf_cfs_rq(rq);
> @@ -9902,7 +9916,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> if (nr_running > 1)
> *sg_status |= SG_OVERLOAD;
>
> - if (cpu_overutilized(i))
> + if (sched_energy_enabled() && cpu_overutilized(i))
I think we can drop sched_energy_enabled() here if we add it to
set_rd_overutilized_status()
> *sg_status |= SG_OVERUTILIZED;
>
> #ifdef CONFIG_NUMA_BALANCING
> @@ -10596,19 +10610,16 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>
> if (!env->sd->parent) {
> - struct root_domain *rd = env->dst_rq->rd;
> -
> /* update overload indicator if we are at root domain */
> - WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);
> + WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
>
> /* Update over-utilization (tipping point, U >= 0) indicator */
> - WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
> - trace_sched_overutilized_tp(rd, sg_status & SG_OVERUTILIZED);
> - } else if (sg_status & SG_OVERUTILIZED) {
> - struct root_domain *rd = env->dst_rq->rd;
> -
> - WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
> - trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
> + if (sched_energy_enabled()) {
ditto
> + set_rd_overutilized_status(env->dst_rq->rd,
> + sg_status & SG_OVERUTILIZED);
> + }
> + } else if (sched_energy_enabled() && (sg_status & SG_OVERUTILIZED)) {
ditto
> + set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
> }
>
> update_idle_cpu_scan(env, sum_util);
> @@ -12609,7 +12620,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> task_tick_numa(rq, curr);
>
> update_misfit_status(curr, rq);
> - update_overutilized_status(task_rq(curr));
> + check_update_overutilized_status(task_rq(curr));
>
> task_tick_core(rq, curr);
> }
> --
> 2.39.3
>
On 03/01/24 20:47, Shrikanth Hegde wrote:
> Overutilized field is accessed directly in multiple places.
> So it could use a helper function. That way one might be more
> informed that it needs to be used only in case of EAS.
>
> No change in functionality intended.
>
> Signed-off-by: Shrikanth Hegde <[email protected]>
Can we do the same for rd->overload too? A set_rd_overload_status() would be
a nice addition too. Anyway.
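Presumably something along these lines (hypothetical sketch, mirroring
set_rd_overutilized_status()):

	static inline void set_rd_overload_status(struct root_domain *rd,
						  unsigned int status)
	{
		WRITE_ONCE(rd->overload, status);
	}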
Reviewed-by: Qais Yousef <[email protected]>
Thanks!
--
Qais Yousef
> ---
> kernel/sched/fair.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a71f8a1506e4..650909a648d0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6670,6 +6670,15 @@ static inline bool cpu_overutilized(int cpu)
> return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> }
>
> +/*
> + * Ensure that caller can do EAS. overutilized value
> + * make sense only if EAS is enabled
> + */
> +static inline int is_rd_overutilized(struct root_domain *rd)
> +{
> + return READ_ONCE(rd->overutilized);
> +}
> +
> static inline void set_rd_overutilized_status(struct root_domain *rd,
> unsigned int status)
> {
> @@ -6686,13 +6695,14 @@ static inline void check_update_overutilized_status(struct rq *rq)
> if (!sched_energy_enabled())
> return;
>
> - if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
> + if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
> set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
> }
> #else
> static inline void check_update_overutilized_status(struct rq *rq) { }
> static inline void set_rd_overutilized_status(struct root_domain *rd,
> unsigned int status) { }
> +static inline int is_rd_overutilized(struct root_domain *rd) { }
> #endif
>
> /* Runqueue only has SCHED_IDLE tasks enqueued */
> @@ -7974,7 +7984,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>
> rcu_read_lock();
> pd = rcu_dereference(rd->pd);
> - if (!pd || READ_ONCE(rd->overutilized))
> + if (!pd || is_rd_overutilized(rd))
> goto unlock;
>
> /*
> @@ -10859,7 +10869,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> if (sched_energy_enabled()) {
> struct root_domain *rd = env->dst_rq->rd;
>
> - if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> + if (rcu_dereference(rd->pd) && !is_rd_overutilized(rd))
> goto out_balanced;
> }
>
> --
> 2.39.3
>
On 3/4/24 12:20 AM, Qais Yousef wrote:
> On 03/01/24 20:47, Shrikanth Hegde wrote:
>> Overutilized field of root domain is only used for EAS(energy aware scheduler)
[...]
Hi Qais, Thanks for taking a look.
>> ---
>> kernel/sched/fair.c | 49 +++++++++++++++++++++++++++------------------
>> 1 file changed, 30 insertions(+), 19 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6a16129f9a5c..a71f8a1506e4 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6670,15 +6670,29 @@ static inline bool cpu_overutilized(int cpu)
>> return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
>> }
>>
>> -static inline void update_overutilized_status(struct rq *rq)
>> +static inline void set_rd_overutilized_status(struct root_domain *rd,
>> + unsigned int status)
>> {
>> - if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
>> - WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
>> - trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
>> - }
>
> Can we add
>
> if (!sched_energy_enabled())
> return;
This is very close to what I had till v2. But it was pointed out that it
would end up calling sched_energy_enabled twice in
check_update_overutilized_status.
In check_update_overutilized_status, it would be better to avoid both the
access to overutilized and the cpu_overutilized() computation if EAS is not
enabled.
I am okay with either version. Keeping sched_energy_enabled in
set_rd_overutilized_status would be less code and more readable, but it
would call sched_energy_enabled twice.
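For reference, the v2-style alternative would be roughly (sketch only):

	static inline void set_rd_overutilized_status(struct root_domain *rd,
						      unsigned int status)
	{
		/*
		 * EAS check inside the setter: callers need not repeat it,
		 * but check_update_overutilized_status() then tests the
		 * static key twice (once itself, once here).
		 */
		if (!sched_energy_enabled())
			return;

		WRITE_ONCE(rd->overutilized, status);
		trace_sched_overutilized_tp(rd, !!status);
	}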
Dietmar, Pierre,
Could you please provide your inputs here?
>
> here and avoid sprinkling the condition in other various places instead?
>
>> + WRITE_ONCE(rd->overutilized, status);
>> + trace_sched_overutilized_tp(rd, !!status);
>> +}
>> +
>> +static inline void check_update_overutilized_status(struct rq *rq)
>> +{
>> + /*
>> + * overutilized field is used for load balancing decisions only
>> + * if energy aware scheduler is being used
>> + */
>
> nit: I think this comment is unnecessary but I don't mind keeping it
>
>> + if (!sched_energy_enabled())
>> + return;
>> +
>> + if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
>> + set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
>> }
>> #else
>> -static inline void update_overutilized_status(struct rq *rq) { }
>> +static inline void check_update_overutilized_status(struct rq *rq) { }
>> +static inline void set_rd_overutilized_status(struct root_domain *rd,
>> + unsigned int status) { }
>> #endif
>>
>> /* Runqueue only has SCHED_IDLE tasks enqueued */
>> @@ -6779,7 +6793,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> * and the following generally works well enough in practice.
>> */
>> if (!task_new)
>> - update_overutilized_status(rq);
>> + check_update_overutilized_status(rq);
>>
>> enqueue_throttle:
>> assert_list_leaf_cfs_rq(rq);
>> @@ -9902,7 +9916,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> if (nr_running > 1)
>> *sg_status |= SG_OVERLOAD;
>>
>> - if (cpu_overutilized(i))
>> + if (sched_energy_enabled() && cpu_overutilized(i))
>
> I think we can drop sched_energy_enable() here if we add it to
> set_rd_overutilized_status()
We can avoid an additional call to cpu_overutilized, so we should keep it.
>
>> *sg_status |= SG_OVERUTILIZED;
>>
>> #ifdef CONFIG_NUMA_BALANCING
>> @@ -10596,19 +10610,16 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>> env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>>
>> if (!env->sd->parent) {
>> - struct root_domain *rd = env->dst_rq->rd;
>> -
>> /* update overload indicator if we are at root domain */
>> - WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);
>> + WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
>>
>> /* Update over-utilization (tipping point, U >= 0) indicator */
>> - WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
>> - trace_sched_overutilized_tp(rd, sg_status & SG_OVERUTILIZED);
>> - } else if (sg_status & SG_OVERUTILIZED) {
>> - struct root_domain *rd = env->dst_rq->rd;
>> -
>> - WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
>> - trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
>> + if (sched_energy_enabled()) {
>
> ditto
The first comment above applies to these two as well.
>> + set_rd_overutilized_status(env->dst_rq->rd,
>> + sg_status & SG_OVERUTILIZED);
>> + }
>> + } else if (sched_energy_enabled() && (sg_status & SG_OVERUTILIZED)) {
>
> ditto
>
>> + set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
>> }
>>
>> update_idle_cpu_scan(env, sum_util);
>> @@ -12609,7 +12620,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>> task_tick_numa(rq, curr);
>>
>> update_misfit_status(curr, rq);
>> - update_overutilized_status(task_rq(curr));
>> + check_update_overutilized_status(task_rq(curr));
>>
>> task_tick_core(rq, curr);
>> }
>> --
>> 2.39.3
>>
On 3/4/24 12:24 AM, Qais Yousef wrote:
> On 03/01/24 20:47, Shrikanth Hegde wrote:
>> Overutilized field is accessed directly in multiple places.
>> So it could use a helper function. That way one might be more
>> informed that it needs to be used only in case of EAS.
>>
>> No change in functionality intended.
>>
>> Signed-off-by: Shrikanth Hegde <[email protected]>
>
> Can we do the same for rd->overload too? A set_rd_overload_status() would be
> a nice addition too. Anyway.
We have some more experiments going on around overload. For example,
currently sg_status & SG_OVERLOAD is written without first checking whether
the value has changed. On large systems that are not overloaded, avoiding
the redundant write may help by reducing bus traffic.
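Something along these lines (untested sketch):

	/*
	 * Untested sketch: skip the store, and hence the cacheline
	 * invalidation on other CPUs, when the value is unchanged.
	 */
	if (READ_ONCE(rd->overload) != (sg_status & SG_OVERLOAD))
		WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);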
I will pick up this after we have some more data on the above.
>
> Reviewed-by: Qais Yousef <[email protected]>
>
Thank you.
>
> Thanks!
On 03/04/24 13:58, Shrikanth Hegde wrote:
>
>
> On 3/4/24 12:24 AM, Qais Yousef wrote:
> > On 03/01/24 20:47, Shrikanth Hegde wrote:
> >> Overutilized field is accessed directly in multiple places.
> >> So it could use a helper function. That way one might be more
> >> informed that it needs to be used only in case of EAS.
> >>
> >> No change in functionality intended.
> >>
> >> Signed-off-by: Shrikanth Hegde <[email protected]>
> >
> > Can we do the same for rd->overload too? A set_rd_overload_status() would be
> > a nice addition too. Anyway.
>
>
> We have some more experiments going around overload.
> For example, currently it is writing sg_status & SG_OVERLOAD without checking if it has
> changed first. On large systems that are not overloaded, that may help by reducing the
> bus traffic.
>
> I will pick up this after we have some more data on the above.
*thumbs up*
>
> >
> > Reviewed-by: Qais Yousef <[email protected]>
> >
>
> Thank you.
>
> >
> > Thanks!
On 03/04/24 13:54, Shrikanth Hegde wrote:
>
>
> On 3/4/24 12:20 AM, Qais Yousef wrote:
> > On 03/01/24 20:47, Shrikanth Hegde wrote:
> >> Overutilized field of root domain is only used for EAS(energy aware scheduler)
>
> [...]
>
>
> Hi Qais, Thanks for taking a look.
>
> >> ---
> >> kernel/sched/fair.c | 49 +++++++++++++++++++++++++++------------------
> >> 1 file changed, 30 insertions(+), 19 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 6a16129f9a5c..a71f8a1506e4 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6670,15 +6670,29 @@ static inline bool cpu_overutilized(int cpu)
> >> return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> >> }
> >>
> >> -static inline void update_overutilized_status(struct rq *rq)
> >> +static inline void set_rd_overutilized_status(struct root_domain *rd,
> >> + unsigned int status)
> >> {
> >> - if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
> >> - WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
> >> - trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
> >> - }
> >
> > Can we add
> >
> > if (!sched_energy_enabled())
> > return;
>
> This is very close to what i had till v2. But it was pointed out that, it
> would end up calling sched_energy_enabled twice in check_update_overutilized_status.
It's a static key. It will either patch the code to be a NOP and return, or
work normally. I don't see a problem.
> In check_update_overutilized_status, it would be better to avoid access to
> overutilized and computing cpu_overutilized if EAS is not enabled.
cpu_overutilized() could gain protection with sched_energy_enabled() too.
I think it's better to encapsulate the deps within the function.
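A sketch of that, with the check folded into the existing function:

	static inline bool cpu_overutilized(int cpu)
	{
		unsigned long rq_util_min, rq_util_max;

		/* A CPU can only be overutilized for EAS's purposes */
		if (!sched_energy_enabled())
			return false;

		rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
		rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);

		return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min,
				      rq_util_max, cpu);
	}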
>
> I am okay with either code. keeping sched_energy_enabled in set_rd_overutilized_status
> would be less code and more readable. But would call sched_energy_enabled twice.
>
> Dietmar, Pierre,
> Could you please provide your inputs here?
I prefer not to sprinkle sched_energy_enabled() at every user. But FWIW the
code looks correct to me, and these stylistic issues are not a blocker for me.
Reviewed-by: Qais Yousef <[email protected]>
>
>
> >
> > here and avoid sprinkling the condition in other various places instead?
> >
> >> + WRITE_ONCE(rd->overutilized, status);
> >> + trace_sched_overutilized_tp(rd, !!status);
> >> +}
> >> +
> >> +static inline void check_update_overutilized_status(struct rq *rq)
> >> +{
> >> + /*
> >> + * overutilized field is used for load balancing decisions only
> >> + * if energy aware scheduler is being used
> >> + */
> >
> > nit: I think this comment is unnecessary but I don't mind keeping it
> >
> >> + if (!sched_energy_enabled())
> >> + return;
> >> +
> >> + if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
> >> + set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
> >> }
> >> #else
> >> -static inline void update_overutilized_status(struct rq *rq) { }
> >> +static inline void check_update_overutilized_status(struct rq *rq) { }
> >> +static inline void set_rd_overutilized_status(struct root_domain *rd,
> >> + unsigned int status) { }
> >> #endif
> >>
> >> /* Runqueue only has SCHED_IDLE tasks enqueued */
> >> @@ -6779,7 +6793,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >> * and the following generally works well enough in practice.
> >> */
> >> if (!task_new)
> >> - update_overutilized_status(rq);
> >> + check_update_overutilized_status(rq);
> >>
> >> enqueue_throttle:
> >> assert_list_leaf_cfs_rq(rq);
> >> @@ -9902,7 +9916,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >> if (nr_running > 1)
> >> *sg_status |= SG_OVERLOAD;
> >>
> >> - if (cpu_overutilized(i))
> >> + if (sched_energy_enabled() && cpu_overutilized(i))
> >
> > I think we can drop sched_energy_enable() here if we add it to
> > set_rd_overutilized_status()
>
> we can avoid additional call to cpu_overutilized. So we should keep it.
>
> >
> >> *sg_status |= SG_OVERUTILIZED;
> >>
> >> #ifdef CONFIG_NUMA_BALANCING
> >> @@ -10596,19 +10610,16 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >> env->fbq_type = fbq_classify_group(&sds->busiest_stat);
> >>
> >> if (!env->sd->parent) {
> >> - struct root_domain *rd = env->dst_rq->rd;
> >> -
> >> /* update overload indicator if we are at root domain */
> >> - WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);
> >> + WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
> >>
> >> /* Update over-utilization (tipping point, U >= 0) indicator */
> >> - WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
> >> - trace_sched_overutilized_tp(rd, sg_status & SG_OVERUTILIZED);
> >> - } else if (sg_status & SG_OVERUTILIZED) {
> >> - struct root_domain *rd = env->dst_rq->rd;
> >> -
> >> - WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
> >> - trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
> >> + if (sched_energy_enabled()) {
> >
> > ditto
>
> First comment would apply for these two.
>
> >> + set_rd_overutilized_status(env->dst_rq->rd,
> >> + sg_status & SG_OVERUTILIZED);
> >> + }
> >> + } else if (sched_energy_enabled() && (sg_status & SG_OVERUTILIZED)) {
> >
> > ditto
> >
> >> + set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
> >> }
> >>
> >> update_idle_cpu_scan(env, sum_util);
> >> @@ -12609,7 +12620,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> >> task_tick_numa(rq, curr);
> >>
> >> update_misfit_status(curr, rq);
> >> - update_overutilized_status(task_rq(curr));
> >> + check_update_overutilized_status(task_rq(curr));
> >>
> >> task_tick_core(rq, curr);
> >> }
> >> --
> >> 2.39.3
> >>
On Fri, 1 Mar 2024 at 16:18, Shrikanth Hegde <[email protected]> wrote:
>
> Overutilized field is accessed directly in multiple places.
> So it could use a helper function. That way one might be more
> informed that it needs to be used only in case of EAS.
>
> No change in functionality intended.
>
> Signed-off-by: Shrikanth Hegde <[email protected]>
> ---
> kernel/sched/fair.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a71f8a1506e4..650909a648d0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6670,6 +6670,15 @@ static inline bool cpu_overutilized(int cpu)
> return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> }
>
> +/*
> + * Ensure that caller can do EAS. overutilized value
> + * make sense only if EAS is enabled
> + */
> +static inline int is_rd_overutilized(struct root_domain *rd)
> +{
> + return READ_ONCE(rd->overutilized);
> +}
It seems that is_rd_overutilized() is always used with
sched_energy_enabled() in the pattern:

	if (sched_energy_enabled() && !is_rd_overutilized(rd))
		do something

This pattern includes the feec() case, where select_task_rq_fair() has:

	if (sched_energy_enabled())
		feec():
		|-> if (is_rd_overutilized())
		    |-> goto unlock

which could be changed into:

	if (sched_energy_enabled() && !is_rd_overutilized(rd))
		feec()
Then you can create the function is_rd_not_overutilized() instead of
is_rd_overutilized()
-static inline int is_rd_overutilized(struct root_domain *rd)
+static inline int is_rd_not_overutilized(struct root_domain *rd)
{
- return READ_ONCE(rd->overutilized);
+ return sched_energy_enabled() && READ_ONCE(rd->overutilized);
}
and use is_rd_not_overutilized() instead
> +
> static inline void set_rd_overutilized_status(struct root_domain *rd,
> unsigned int status)
> {
> @@ -6686,13 +6695,14 @@ static inline void check_update_overutilized_status(struct rq *rq)
> if (!sched_energy_enabled())
> return;
>
> - if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
> + if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
> set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
> }
> #else
> static inline void check_update_overutilized_status(struct rq *rq) { }
> static inline void set_rd_overutilized_status(struct root_domain *rd,
> unsigned int status) { }
> +static inline int is_rd_overutilized(struct root_domain *rd) { }
It should be
static inline int is_rd_overutilized(struct root_domain *rd) { return 0; }
> #endif
>
> /* Runqueue only has SCHED_IDLE tasks enqueued */
> @@ -7974,7 +7984,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>
> rcu_read_lock();
> pd = rcu_dereference(rd->pd);
> - if (!pd || READ_ONCE(rd->overutilized))
> + if (!pd || is_rd_overutilized(rd))
> goto unlock;
>
> /*
> @@ -10859,7 +10869,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> if (sched_energy_enabled()) {
> struct root_domain *rd = env->dst_rq->rd;
>
> - if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> + if (rcu_dereference(rd->pd) && !is_rd_overutilized(rd))
> goto out_balanced;
> }
>
> --
> 2.39.3
>
On 3/5/24 3:53 PM, Qais Yousef wrote:
> On 03/04/24 13:54, Shrikanth Hegde wrote:
>>
>>
>> On 3/4/24 12:20 AM, Qais Yousef wrote:
>>> On 03/01/24 20:47, Shrikanth Hegde wrote:
>>>> Overutilized field of root domain is only used for EAS(energy aware scheduler)
>>
>> [...]
>>
>>
>> Hi Qais, Thanks for taking a look.
>>
>>>> ---
>>>> kernel/sched/fair.c | 49 +++++++++++++++++++++++++++------------------
>>>> 1 file changed, 30 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 6a16129f9a5c..a71f8a1506e4 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6670,15 +6670,29 @@ static inline bool cpu_overutilized(int cpu)
>>>> return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
>>>> }
>>>>
>>>> -static inline void update_overutilized_status(struct rq *rq)
>>>> +static inline void set_rd_overutilized_status(struct root_domain *rd,
>>>> + unsigned int status)
>>>> {
>>>> - if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
>>>> - WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
>>>> - trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
>>>> - }
>>>
>>> Can we add
>>>
>>> if (!sched_energy_enabled())
>>> return;
>>
>> This is very close to what i had till v2. But it was pointed out that, it
>> would end up calling sched_energy_enabled twice in check_update_overutilized_status.
>
> It's a static key. It will either patch the code to be a NOP and return, or
> work normally. I don't see a problem.
Yes. That's what I thought initially as well. It does make the code simpler.
I will change it to something similar to what I had in v2 in the next
version. I will wait for a while to hear of any issues with that.
>
>> In check_update_overutilized_status, it would be better to avoid access to
>> overutilized and computing cpu_overutilized if EAS is not enabled.
>
> cpu_overutilized() could gain a protection with sched_energy_enabled() too.
> I think it's better to encapsulate the deps within the function.
>
Ok, let me try to incorporate that.
>>
>> I am okay with either code. keeping sched_energy_enabled in set_rd_overutilized_status
>> would be less code and more readable. But would call sched_energy_enabled twice.
>>
>> Dietmar, Pierre,
>> Could you please provide your inputs here?
>
> I prefer not sprinkling sched_energy_enabled() for every user. But FWIW the
> code looks correct to me and these stylistic issues are not a blocker for me
>
> Reviewed-by: Qais Yousef <[email protected]>
>
>>
Thank you.
>>
On 3/5/24 7:35 PM, Vincent Guittot wrote:
> On Fri, 1 Mar 2024 at 16:18, Shrikanth Hegde <[email protected]> wrote:
>
>
> It seems that is_rd_overutilized() is always used with
> sched_energy_enabled() in the pattern:
>
> If (sched_energy_enabled() && !is_rd_overutilized(rd))
> do something
>
> This pattern includes feec() case where we have in select_task_rq_fair():
>
> If (sched_energy_enabled())
> feec():
> |-> if (is_rd_overutilized())
> |-> goto unlock
>
> which could be changed into
> If (sched_energy_enabled() && !is_rd_overutilized(rd))
> feec()
>
> Then you can create the function is_rd_not_overutilized() instead of
> is_rd_overutilized()
>
> -static inline int is_rd_overutilized(struct root_domain *rd)
> +static inline int is_rd_not_overutilized(struct root_domain *rd)
> {
> - return READ_ONCE(rd->overutilized);
> + return sched_energy_enabled() && READ_ONCE(rd->overutilized);
> }
>
> and use is_rd_not_overutilized() instead
>
Ok, makes sense. I will keep this patch as is and use the above
approach in a new patch.
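Hypothetically, the follow-up could look like this (untested; note the
added negation so the helper's name matches its semantics):

	static inline int is_rd_not_overutilized(struct root_domain *rd)
	{
		return sched_energy_enabled() && !READ_ONCE(rd->overutilized);
	}

and in select_task_rq_fair() the call site would become:

	if (is_rd_not_overutilized(this_rq()->rd))
		new_cpu = find_energy_efficient_cpu(p, prev_cpu);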
>> +
>> static inline void set_rd_overutilized_status(struct root_domain *rd,
>> unsigned int status)
>> {
>> @@ -6686,13 +6695,14 @@ static inline void check_update_overutilized_status(struct rq *rq)
>> if (!sched_energy_enabled())
>> return;
>>
>> - if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
>> + if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
>> set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
>> }
>> #else
>> static inline void check_update_overutilized_status(struct rq *rq) { }
>> static inline void set_rd_overutilized_status(struct root_domain *rd,
>> unsigned int status) { }
>> +static inline int is_rd_overutilized(struct root_domain *rd) { }
>
> It should be
> static inline int is_rd_overutilized(struct root_domain *rd) { return 0; }
Ok.
>
>> #endif
>>
>> /* Runqueue only has SCHED_IDLE tasks enqueued */
>> @@ -7974,7 +7984,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>>
>> rcu_read_lock();
>>