2024-02-23 15:08:26

by Shrikanth Hegde

Subject: [PATCH 0/2] sched/fair: Limit access to overutilized

When running an ISV workload on a large system (240 cores, SMT8), it was
observed from the perf profile that newidle_balance and enqueue_task_fair
were consuming more cycles. Perf annotate showed that most of the time
was spent accessing the overutilized field of the root domain.

Aboorva was able to simulate a similar perf profile by making some
changes to stress-ng --wait. Both newidle_balance and enqueue_task_fair
consume close to 5-7% of cycles. Perf annotate shows that most of the
cycles are spent loading rd and then rd->overutilized.

perf profile:
7.18% swapper [kernel.vmlinux] [k] enqueue_task_fair
6.78% s [kernel.vmlinux] [k] newidle_balance

perf annotate of enqueue_task_fair:
1.66 : c000000000223ba4: beq c000000000223c50 <enqueue_task_fair+0x238>
: 6789 update_overutilized_status():
: 6675 if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
95.42 : c000000000223ba8: ld r8,2752(r28)
0.08 : c000000000223bac: lwz r9,540(r8)
Debugging it further, in enqueue_task_fair:
ld r8,2752(r28) <-- loads rd
lwz r9,540(r8) <-- loads rd->overutilized.
Frequent writes to rd->overutilized from other CPUs cause cache-line
contention, and hence loading rd->overutilized can take more time.

Perf annotate of newidle_balance:
: 12333 sd = rcu_dereference_check_sched_domain(this_rq->sd);
41.54 : c000000000228070: ld r30,2760(r31)
: 12335 if (!READ_ONCE(this_rq->rd->overload) ||
0.07 : c000000000228074: lwz r9,536(r9)
Similarly, in newidle_balance,
ld r9,2752(r31) <-- loads rd
lwz r9,536(r9) <-- loads rd->overload
Though overutilized is not used in this function, writes to overutilized
can cause the load of overload to take more time, since both overload
and overutilized are part of the same cacheline.
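
As an aside, the effect is easy to reproduce in userspace. Below is a
minimal sketch (my own illustration, not kernel code; the field names
just mirror the rd fields) where a writer thread hammers one int while
a reader loads the adjacent int on the same cacheline. Inserting the
padding and timing both variants shows the difference:

/*
 * false_share.c - illustration only, not kernel code.
 * Build: gcc -O2 -pthread false_share.c -o false_share
 */
#include <pthread.h>
#include <stdio.h>

/* Two adjacent ints, mirroring rd->overload / rd->overutilized. */
static struct {
	int overload;		/* loaded by the reader thread */
	int overutilized;	/* stored by the writer thread */
	/* insert "char pad[64];" between the fields to compare */
} s;

#define ITERS 100000000L

static void *writer(void *arg)
{
	long i;

	for (i = 0; i < ITERS; i++)
		__atomic_store_n(&s.overutilized, (int)(i & 1), __ATOMIC_RELAXED);
	return NULL;
}

static void *reader(void *arg)
{
	long i, sum = 0;

	for (i = 0; i < ITERS; i++)
		sum += __atomic_load_n(&s.overload, __ATOMIC_RELAXED);
	return (void *)sum;
}

int main(void)
{
	pthread_t w, r;
	void *sum;

	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);
	pthread_join(r, &sum);
	printf("reader sum: %ld\n", (long)sum);	/* time(1) both variants */
	return 0;
}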

overutilized was added for EAS (energy aware scheduler) to choose
between EAS-aware load balancing and regular load balancing. Hence this
field should only be updated if EAS is active.

As checked, on both x86 and powerpc, overload and overutilized share the
same cacheline in rd. Updating overutilized is not required on non-EAS
platforms, hence this patch can help reduce cache contention on such
archs.
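
One way to confirm the layout on a given build (assuming a vmlinux with
debug info and the pahole tool; output trimmed, the offsets here match
the disassembly above):

$ pahole -C root_domain vmlinux | grep -E 'overload|overutilized'
	int	overload;	/*  536     4 */
	int	overutilized;	/*  540     4 */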

Patch 1/2 is the main patch. It addresses the issue described above:
with it applied, neither function shows up in the profile. A before/after
comparison is in its changelog. With the patch, the stated problem in the
ISV workload was also solved and throughput improved. The Fixes tag for
2802bf3cd936 may be removed if it causes issues with clean backports all
the way; I didn't know what the right thing to do would be here.
Patch 2/2 is only code refactoring to use a helper function instead of
accessing the field directly, so readers know it is meant to be accessed
only when EAS is in use. It depends on 1/2 being applied first.

Thanks to Aboorva Devarajan and Nysal Jan K A for helping recreate and
debug this issue, and for verifying the patch.

Shrikanth Hegde (2):
sched/fair: Add EAS checks before updating overutilized
sched/fair: Use helper function to access rd->overutilized

kernel/sched/fair.c | 50 +++++++++++++++++++++++++++++++++------------
1 file changed, 37 insertions(+), 13 deletions(-)

--
2.39.3



2024-02-23 15:08:27

by Shrikanth Hegde

Subject: [PATCH 1/2] sched/fair: Add EAS checks before updating overutilized

The overutilized field of the root domain is only used by EAS (energy
aware scheduler) to decide whether to do regular load balancing or EAS
aware load balancing. It is not used if EAS is not possible.

Currently, enqueue_task_fair and task_tick_fair access, and sometimes
update, this field. In update_sd_lb_stats it is updated often. This
causes cache contention due to load/store tearing and burns a lot of
cycles. Hence add an EAS check before updating this field. The EAS check
is optimized out at compile time or is a static branch, hence it
shouldn't cost much.
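
For reference, this is roughly what the check compiles down to (from
kernel/sched/sched.h): a compile-time constant false when EAS is
configured out, otherwise a code-patched static branch:

#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
DECLARE_STATIC_KEY_FALSE(sched_energy_present);

static inline bool sched_energy_enabled(void)
{
	return static_branch_unlikely(&sched_energy_present);
}
#else
static inline bool sched_energy_enabled(void) { return false; }
#endif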

With the patch, both enqueue_task_fair and newidle_balance no longer
show up as hot routines in the perf profile.

6.8-rc4:
7.18% swapper [kernel.vmlinux] [k] enqueue_task_fair
6.78% s [kernel.vmlinux] [k] newidle_balance
+patch:
0.14% swapper [kernel.vmlinux] [k] enqueue_task_fair
0.00% swapper [kernel.vmlinux] [k] newidle_balance

While at it, fix the update of overutilized to be either SG_OVERUTILIZED
or 0 instead. Current code can make it 0, 1 or 2. This shouldn't alter
the functionality.

Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
Signed-off-by: Shrikanth Hegde <[email protected]>
---
kernel/sched/fair.c | 36 +++++++++++++++++++++++++-----------
1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e30e2bb77a0..9529d9ef2c5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6670,15 +6670,30 @@ static inline bool cpu_overutilized(int cpu)
return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
}

-static inline void update_overutilized_status(struct rq *rq)
+static inline void update_rd_overutilized_status(struct root_domain *rd,
+ int status)
{
- if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
- WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
- trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
+ if (sched_energy_enabled()) {
+ WRITE_ONCE(rd->overutilized, status);
+ trace_sched_overutilized_tp(rd, !!status);
+ }
+}
+
+static inline void check_update_overutilized_status(struct rq *rq)
+{
+ /*
+ * overutilized field is used for load balancing decisions only
+ * if energy aware scheduler is being used
+ */
+ if (sched_energy_enabled()) {
+ if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+ update_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
}
}
#else
-static inline void update_overutilized_status(struct rq *rq) { }
+static inline void check_update_overutilized_status(struct rq *rq) { }
+static inline void update_rd_overutilized_status(struct root_domain *rd,
+ int status) { }
#endif

/* Runqueue only has SCHED_IDLE tasks enqueued */
@@ -6779,7 +6794,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
* and the following generally works well enough in practice.
*/
if (!task_new)
- update_overutilized_status(rq);
+ check_update_overutilized_status(rq);

enqueue_throttle:
assert_list_leaf_cfs_rq(rq);
@@ -10613,13 +10628,12 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);

/* Update over-utilization (tipping point, U >= 0) indicator */
- WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
- trace_sched_overutilized_tp(rd, sg_status & SG_OVERUTILIZED);
+ update_rd_overutilized_status(rd,
+ (sg_status & SG_OVERUTILIZED) ? SG_OVERUTILIZED : 0);
} else if (sg_status & SG_OVERUTILIZED) {
struct root_domain *rd = env->dst_rq->rd;

- WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
- trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
+ update_rd_overutilized_status(rd, SG_OVERUTILIZED);
}

update_idle_cpu_scan(env, sum_util);
@@ -12625,7 +12639,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
task_tick_numa(rq, curr);

update_misfit_status(curr, rq);
- update_overutilized_status(task_rq(curr));
+ check_update_overutilized_status(task_rq(curr));

task_tick_core(rq, curr);
}
--
2.39.3


2024-02-23 15:10:59

by Shrikanth Hegde

Subject: [PATCH 2/2] sched/fair: Use helper function to access rd->overutilized

The overutilized field is accessed directly in multiple places, so it
could use a helper function. That way, readers are better informed that
it is meant to be used only in the EAS case.

No change in functionality intended.

Signed-off-by: Shrikanth Hegde <[email protected]>
---
kernel/sched/fair.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9529d9ef2c5b..29a3481c55c9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6670,6 +6670,15 @@ static inline bool cpu_overutilized(int cpu)
return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
}

+/*
+ * Callers must ensure EAS is enabled: the overutilized value
+ * makes sense only if EAS is in use.
+ */
+static inline int is_rd_overutilized(struct root_domain *rd)
+{
+ return READ_ONCE(rd->overutilized);
+}
+
static inline void update_rd_overutilized_status(struct root_domain *rd,
int status)
{
@@ -6686,7 +6695,7 @@ static inline void check_update_overutilized_status(struct rq *rq)
* if energy aware scheduler is being used
*/
if (sched_energy_enabled()) {
- if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+ if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
update_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
}
}
@@ -6694,6 +6703,7 @@ static inline void check_update_overutilized_status(struct rq *rq)
static inline void check_update_overutilized_status(struct rq *rq) { }
static inline void update_rd_overutilized_status(struct root_domain *rd,
int status) { }
+static inline int is_rd_overutilized(struct root_domain *rd) { return 0; }
#endif

/* Runqueue only has SCHED_IDLE tasks enqueued */
@@ -7969,7 +7979,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

rcu_read_lock();
pd = rcu_dereference(rd->pd);
- if (!pd || READ_ONCE(rd->overutilized))
+ if (!pd || is_rd_overutilized(rd))
goto unlock;

/*
@@ -10873,7 +10883,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
if (sched_energy_enabled()) {
struct root_domain *rd = env->dst_rq->rd;

- if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
+ if (rcu_dereference(rd->pd) && !is_rd_overutilized(rd))
goto out_balanced;
}

--
2.39.3


2024-02-27 16:46:39

by Chen Yu

Subject: Re: [PATCH 1/2] sched/fair: Add EAS checks before updating overutilized

On 2024-02-23 at 20:37:06 +0530, Shrikanth Hegde wrote:
> The overutilized field of the root domain is only used by EAS (energy
> aware scheduler) to decide whether to do regular load balancing or EAS
> aware load balancing. It is not used if EAS is not possible.
>
> Currently, enqueue_task_fair and task_tick_fair access, and sometimes
> update, this field. In update_sd_lb_stats it is updated often. This
> causes cache contention due to load/store tearing and burns a lot of
> cycles.

Looks like typical cache false sharing: CPU1 updates rd->overutilized,
which invalidates the cache line when CPU2 accesses the adjacent
rd->overload.
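
(As a side note, perf c2c should be able to confirm this kind of
remote-HITM pattern; roughly:

perf c2c record -a -- sleep 5
perf c2c report

and then look for the cacheline holding rd->overload/rd->overutilized.)
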
This change looks good to me, just some minor questions:

> Hence add an EAS check before updating this field. The EAS check
> is optimized out at compile time or is a static branch, hence it
> shouldn't cost much.
>
> With the patch, both enqueue_task_fair and newidle_balance no longer
> show up as hot routines in the perf profile.
>
> 6.8-rc4:
> 7.18% swapper [kernel.vmlinux] [k] enqueue_task_fair
> 6.78% s [kernel.vmlinux] [k] newidle_balance
> +patch:
> 0.14% swapper [kernel.vmlinux] [k] enqueue_task_fair
> 0.00% swapper [kernel.vmlinux] [k] newidle_balance
>
> While at it, fix the update of overutilized to be either SG_OVERUTILIZED
> or 0 instead. Current code can make it 0, 1 or 2. This shouldn't alter
> the functionality.

Just wondering where the 1 comes from? In the current code we either write
SG_OVERUTILIZED or sg_status & SG_OVERUTILIZED.

>
> Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
> Signed-off-by: Shrikanth Hegde <[email protected]>
> ---
> kernel/sched/fair.c | 36 +++++++++++++++++++++++++-----------
> 1 file changed, 25 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8e30e2bb77a0..9529d9ef2c5b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6670,15 +6670,30 @@ static inline bool cpu_overutilized(int cpu)
> return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> }
>
> -static inline void update_overutilized_status(struct rq *rq)
> +static inline void update_rd_overutilized_status(struct root_domain *rd,
> + int status)
> {
> - if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
> - WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
> - trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
> + if (sched_energy_enabled()) {
> + WRITE_ONCE(rd->overutilized, status);
> + trace_sched_overutilized_tp(rd, !!status);

Is this !!status intentional? The original one is SG_OVERUTILIZED = 2,
now it is either 0 or 1.

thanks,
Chenyu

2024-02-27 17:50:38

by Shrikanth Hegde

Subject: Re: [PATCH 1/2] sched/fair: Add EAS checks before updating overutilized



On 2/27/24 10:15 PM, Chen Yu wrote:

> On 2024-02-23 at 20:37:06 +0530, Shrikanth Hegde wrote:
>> The overutilized field of the root domain is only used by EAS (energy
>> aware scheduler) to decide whether to do regular load balancing or EAS
>> aware load balancing. It is not used if EAS is not possible.
>>
>> Currently, enqueue_task_fair and task_tick_fair access, and sometimes
>> update, this field. In update_sd_lb_stats it is updated often. This
>> causes cache contention due to load/store tearing and burns a lot of
>> cycles.
>
> Looks like typical cache false sharing: CPU1 updates rd->overutilized,
> which invalidates the cache line when CPU2 accesses the adjacent
> rd->overload.
> This change looks good to me, just some minor questions:

Thanks for taking a look and reviewing it.

>
>> Hence add an EAS check before updating this field. The EAS check
>> is optimized out at compile time or is a static branch, hence it
>> shouldn't cost much.
>>
>> With the patch, both enqueue_task_fair and newidle_balance no longer
>> show up as hot routines in the perf profile.
>>
>> 6.8-rc4:
>> 7.18% swapper [kernel.vmlinux] [k] enqueue_task_fair
>> 6.78% s [kernel.vmlinux] [k] newidle_balance
>> +patch:
>> 0.14% swapper [kernel.vmlinux] [k] enqueue_task_fair
>> 0.00% swapper [kernel.vmlinux] [k] newidle_balance
>>
>> While at it, fix the update of overutilized to be either SG_OVERUTILIZED
>> or 0 instead. Current code can make it 0, 1 or 2. This shouldn't alter
>> the functionality.
>
> Just wondering where the 1 comes from? In the current code we either write
> SG_OVERUTILIZED or sg_status & SG_OVERUTILIZED.

Thanks for catching this, silly mistake. Because of the surrounding if
conditions I wrongly thought it could be 1.

I will correct that and send the next version soon.

>
>>
>> Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
>> Signed-off-by: Shrikanth Hegde <[email protected]>
>> ---
>> kernel/sched/fair.c | 36 +++++++++++++++++++++++++-----------
>> 1 file changed, 25 insertions(+), 11 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8e30e2bb77a0..9529d9ef2c5b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6670,15 +6670,30 @@ static inline bool cpu_overutilized(int cpu)
>> return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
>> }
>>
>> -static inline void update_overutilized_status(struct rq *rq)
>> +static inline void update_rd_overutilized_status(struct root_domain *rd,
>> + int status)
>> {
>> - if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
>> - WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
>> - trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
>> + if (sched_energy_enabled()) {
>> + WRITE_ONCE(rd->overutilized, status);
>> + trace_sched_overutilized_tp(rd, !!status);
>
> Is this !!status intentional? The original one is SG_OVERUTILIZED = 2,
> now it is either 0 or 1.
>

Yes, this is intentional: it converts the value to bool. The tracepoint
hook currently defines the second argument as bool.

include/trace/events/sched.h
DECLARE_TRACE(sched_overutilized_tp,
TP_PROTO(struct root_domain *rd, bool overutilized),
TP_ARGS(rd, overutilized));
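
So with the patch the tracepoint argument becomes:

!!0 --> false
!!SG_OVERUTILIZED --> true (even though SG_OVERUTILIZED == 2)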

> thanks,
> Chenyu