2024-03-06 10:28:49

by Shrikanth Hegde

Subject: [PATCH v5 0/3] sched/fair: Limit access to overutilized

When running an ISV workload on a large system (240 cores, SMT8), a
perf profile showed that newidle_balance and enqueue_task_fair were
consuming a noticeable number of cycles. perf annotate showed that most
of that time was spent accessing the overutilized field of the root
domain. A similar perf profile was reproduced by making some changes to
stress-ng --wait. There, newidle_balance and enqueue_task_fair each
consume close to 5-7% of cycles.

Overutilized was added for EAS (Energy Aware Scheduling) to decide
whether to do load balancing or not. Simultaneous access to
overutilized by multiple CPUs leads to cache invalidations due to true
sharing, and updating overutilized is not required on non-EAS
platforms. Since overutilized and overload are part of the same
cacheline, there is false sharing as well.
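
To illustrate the sharing, here is a trimmed sketch of struct
root_domain (field order as in kernel/sched/sched.h; unrelated fields
elided, so this is not the exact definition):

struct root_domain {
	atomic_t	refcount;
	atomic_t	rto_count;
	struct rcu_head	rcu;
	cpumask_var_t	span;
	cpumask_var_t	online;

	/* Indicate pullable load on at least one CPU */
	int		overload;

	/* Indicate one or more CPUs over-utilized (tipping point) */
	int		overutilized;

	/* ... */
};

Since overload and overutilized are adjacent ints, a CPU updating
overutilized also invalidates the cacheline for every CPU touching
overload, on top of the true sharing among the overutilized readers
and writers themselves.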

Patch 1/3 - Main patch. It mitigates the issue described above; with it
applied, neither function shows up in the profile. A before/after
comparison is in its changelog. With the patch, the stated problem in
the ISV workload was also solved and throughput improved.
Patch 2/3 - Code refactoring to use a helper function instead of
accessing the field directly. Keeping this patch separate makes patch
3/3 easier to understand. Depends on 1/3.
Patch 3/3 - Refactoring, since the pattern observed in most places is
eas && !overutilized. Changed the helper function accordingly.
Depends on 2/3.

More details can be found in cover letter of v1.

v4 -> v5:
- Added the EAS check inside the helper functions instead of sprinkling
it around the callers. The EAS check is a static branch at best, so
this keeps the code simpler (see the sketch after this list).
- Added the EAS check inside cpu_overutilized as suggested by Qais.
- Added patch 3, since most of the code follows the eas &&
!overutilized pattern, as suggested by Vincent.
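
For reference, a simplified sketch of why the EAS check is cheap:
sched_energy_enabled() is built on a static key, roughly as in
kernel/sched/sched.h (EAS builds, i.e. CONFIG_ENERGY_MODEL plus
CONFIG_CPU_FREQ_GOV_SCHEDUTIL; other builds get a stub):

#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
DECLARE_STATIC_KEY_FALSE(sched_energy_present);

static inline bool sched_energy_enabled(void)
{
	/* Code-patched branch: near-zero cost when EAS is disabled */
	return static_branch_unlikely(&sched_energy_present);
}
#else
static inline bool sched_energy_enabled(void) { return false; }
#endif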

v3 -> v4:
- Corrected a mistake where the EAS check was missed.
v2 -> v3:
- Pierre and Dietmar suggested adding one more EAS check before calling
cpu_overutilized. That makes sense, since the value is not used in the
non-EAS case anyway.
- Refactored the code as Dietmar suggested to avoid an additional call
to sched_energy_enabled().
- Minor edits to the changelog.
v1 -> v2:
- Chen Yu pointed out a minor issue in the code. Corrected it and
updated the changelog.

v1: https://lore.kernel.org/lkml/[email protected]/
v2: https://lore.kernel.org/lkml/[email protected]/
v3: https://lore.kernel.org/lkml/[email protected]/
v4: https://lore.kernel.org/lkml/[email protected]/

Shrikanth Hegde (3):
sched/fair: Add EAS checks before updating overutilized
sched/fair: Use helper function to access rd->overutilized
sched/fair: Combine EAS check with overutilized access

kernel/sched/fair.c | 86 +++++++++++++++++++++++++++++++--------------
1 file changed, 59 insertions(+), 27 deletions(-)

--
2.39.3



2024-03-06 10:31:02

by Shrikanth Hegde

Subject: [PATCH v5 2/3] sched/fair: Use helper function to access rd->overutilized

The overutilized field is accessed directly in multiple places, so it
could use a helper function. That way one is more aware that it should
be used only in the EAS case.

No change in functionality intended.

Reviewed-by: Qais Yousef <[email protected]>
Signed-off-by: Shrikanth Hegde <[email protected]>
---
 kernel/sched/fair.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 997e570d9423..e3086c8930ea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6675,6 +6675,15 @@ static inline bool cpu_overutilized(int cpu)
 	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
 }
 
+/*
+ * Ensure that the caller can do EAS. The overutilized value
+ * makes sense only if EAS is enabled.
+ */
+static inline int is_rd_overutilized(struct root_domain *rd)
+{
+	return READ_ONCE(rd->overutilized);
+}
+
 static inline void set_rd_overutilized_status(struct root_domain *rd,
 					      unsigned int status)
 {
@@ -6694,7 +6703,7 @@ static inline void check_update_overutilized_status(struct rq *rq)
 	if (!sched_energy_enabled())
 		return;
 
-	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+	if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
 		set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
 }
 #else
@@ -6708,6 +6717,11 @@ static inline void set_rd_overutilized_status(struct root_domain *rd,
 {
 	return 0;
 }
+
+static inline int is_rd_overutilized(struct root_domain *rd)
+{
+	return 0;
+}
 #endif
 
 /* Runqueue only has SCHED_IDLE tasks enqueued */
@@ -7989,7 +8003,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 
 	rcu_read_lock();
 	pd = rcu_dereference(rd->pd);
-	if (!pd || READ_ONCE(rd->overutilized))
+	if (!pd || is_rd_overutilized(rd))
 		goto unlock;
 
 	/*
@@ -10872,7 +10886,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	if (sched_energy_enabled()) {
 		struct root_domain *rd = env->dst_rq->rd;
 
-		if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
+		if (rcu_dereference(rd->pd) && !is_rd_overutilized(rd))
 			goto out_balanced;
 	}
 
--
2.39.3


2024-03-06 11:07:55

by Shrikanth Hegde

Subject: [PATCH v5 3/3] sched/fair: Combine EAS check with overutilized access

Access to overutilized is always paired with sched_energy_enabled() in
the pattern:

	if (sched_energy_enabled() && !overutilized)
		do something

So modify the helper function to return this combined check. That makes
the code more readable, as it now reads: do something when the root
domain is not overutilized.

No change in functionality intended.

Suggested-by: Vincent Guittot <[email protected]>
Signed-off-by: Shrikanth Hegde <[email protected]>
---
 kernel/sched/fair.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e3086c8930ea..62f247bdec86 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6676,12 +6676,11 @@ static inline bool cpu_overutilized(int cpu)
 }
 
 /*
- * Ensure that the caller can do EAS. The overutilized value
- * makes sense only if EAS is enabled.
+ * The overutilized value makes sense only if EAS is enabled.
  */
-static inline int is_rd_overutilized(struct root_domain *rd)
+static inline int is_rd_not_overutilized(struct root_domain *rd)
 {
-	return READ_ONCE(rd->overutilized);
+	return sched_energy_enabled() && !READ_ONCE(rd->overutilized);
 }
 
 static inline void set_rd_overutilized_status(struct root_domain *rd,
@@ -6700,10 +6699,8 @@ static inline void check_update_overutilized_status(struct rq *rq)
 	 * overutilized field is used for load balancing decisions only
 	 * if energy aware scheduler is being used
 	 */
-	if (!sched_energy_enabled())
-		return;
 
-	if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
+	if (is_rd_not_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
 		set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
 }
 #else
@@ -6718,7 +6715,7 @@ static inline void set_rd_overutilized_status(struct root_domain *rd,
 	return 0;
 }
 
-static inline int is_rd_overutilized(struct root_domain *rd)
+static inline int is_rd_not_overutilized(struct root_domain *rd)
 {
 	return 0;
 }
@@ -8003,7 +8000,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 
 	rcu_read_lock();
 	pd = rcu_dereference(rd->pd);
-	if (!pd || is_rd_overutilized(rd))
+	if (!pd)
 		goto unlock;
 
 	/*
@@ -8206,7 +8203,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	    cpumask_test_cpu(cpu, p->cpus_ptr))
 		return cpu;
 
-	if (sched_energy_enabled()) {
+	if (is_rd_not_overutilized(this_rq()->rd)) {
 		new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 		if (new_cpu >= 0)
 			return new_cpu;
@@ -10883,12 +10880,9 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	if (busiest->group_type == group_misfit_task)
 		goto force_balance;
 
-	if (sched_energy_enabled()) {
-		struct root_domain *rd = env->dst_rq->rd;
-
-		if (rcu_dereference(rd->pd) && !is_rd_overutilized(rd))
-			goto out_balanced;
-	}
+	if (is_rd_not_overutilized(env->dst_rq->rd) &&
+	    rcu_dereference(env->dst_rq->rd->pd))
+		goto out_balanced;
 
 	/* ASYM feature bypasses nice load balance check */
 	if (busiest->group_type == group_asym_packing)
--
2.39.3


2024-03-06 17:40:42

by Vincent Guittot

Subject: Re: [PATCH v5 2/3] sched/fair: Use helper function to access rd->overutilized

On Wed, 6 Mar 2024 at 11:25, Shrikanth Hegde <[email protected]> wrote:
>
> The overutilized field is accessed directly in multiple places, so it
> could use a helper function. That way one is more aware that it should
> be used only in the EAS case.
>
> No change in functionality intended.
>
> Reviewed-by: Qais Yousef <[email protected]>
> Signed-off-by: Shrikanth Hegde <[email protected]>
> ---
>  kernel/sched/fair.c | 20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 997e570d9423..e3086c8930ea 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6675,6 +6675,15 @@ static inline bool cpu_overutilized(int cpu)
>  	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
>  }
>
> +/*
> + * Ensure that the caller can do EAS. The overutilized value
> + * makes sense only if EAS is enabled.
> + */
> +static inline int is_rd_overutilized(struct root_domain *rd)
> +{
> +	return READ_ONCE(rd->overutilized);
> +}
> +
>  static inline void set_rd_overutilized_status(struct root_domain *rd,
>  					      unsigned int status)
>  {
> @@ -6694,7 +6703,7 @@ static inline void check_update_overutilized_status(struct rq *rq)
>  	if (!sched_energy_enabled())
>  		return;
>
> -	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
> +	if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
>  		set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
>  }
>  #else
> @@ -6708,6 +6717,11 @@ static inline void set_rd_overutilized_status(struct root_domain *rd,
>  {
>  	return 0;
>  }
> +
> +static inline int is_rd_overutilized(struct root_domain *rd)

I don't think that is_rd_overutilized() is called outside #ifdef
CONFIG_SMP so you can remove it.


> +{
> +	return 0;
> +}
>  #endif
>
>  /* Runqueue only has SCHED_IDLE tasks enqueued */
> @@ -7989,7 +8003,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>
>  	rcu_read_lock();
>  	pd = rcu_dereference(rd->pd);
> -	if (!pd || READ_ONCE(rd->overutilized))
> +	if (!pd || is_rd_overutilized(rd))
>  		goto unlock;
>
>  	/*
> @@ -10872,7 +10886,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  	if (sched_energy_enabled()) {
>  		struct root_domain *rd = env->dst_rq->rd;
>
> -		if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> +		if (rcu_dereference(rd->pd) && !is_rd_overutilized(rd))
>  			goto out_balanced;
>  	}
>
> --
> 2.39.3
>