On POWER8 and POWER9, the last level cache (L2) has been at the level of
a group of 8 threads (SMT8 on POWER8, a big-core comprising a pair of
SMT4 cores on POWER9). However, on POWER10, the LLC domain is at the
level of a group of SMT4 threads within the SMT8 core. Because the LLC
domain has shrunk, the probability of finding an idle CPU in the LLC
domain of the target is lower on POWER10 than on the previous generation
processors.
With commit 9538abee18cc ("powerpc/smp: Add support detecting
thread-groups sharing L2 cache"), benchmarks such as Daytrader
(https://github.com/WASdev/sample.daytrader7) show a drop in throughput
in a configuration consisting of 1 JVM spanning 6-8 big-cores on
POWER10. Analysis showed that this was because more wakeups were
happening on busy CPUs when the utilization was 60-70%. This drop in
throughput also shows up as a drop in CPU utilization. However, most
other benchmarks benefit from detecting the thread-groups that share
the L2 cache.
Current order of preference to pick an LLC while waking a wake-affine
task:
1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
that is idle.
2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
that is less heavily loaded.
In the current situation where the waker and previous CPUs are busy, but
only one of their LLCs has an idle CPU, the scheduler may end up picking
an LLC with no idle CPUs. To mitigate this, add a new step between 1 and
2 where the scheduler compares the number of idle CPUs in the waker's
and previous CPU's LLCs and picks the appropriate one.
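
The comparison in the new step boils down to comparing the busy
fractions of the two LLCs without a division, by cross-multiplying the
nr_busy counts with the LLC sizes. Roughly (a simplified sketch of the
logic in the patch below, ignoring the sync adjustment):

        /*
         * Compare tnr_busy/tllc_size vs pnr_busy/pllc_size by
         * cross-multiplication: the LLC with the smaller busy
         * fraction is considered idler.
         */
        if (tnr_busy * pllc_size < pnr_busy * tllc_size)
                target = this_cpu;              /* waker's LLC is idler */
        else if (tnr_busy * pllc_size > pnr_busy * tllc_size)
                target = prev_cpu;              /* previous LLC is idler */
        else
                target = nr_cpumask_bits;       /* tie: fall back to wake_affine_weight() */
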
The other alternative is to search for an idle CPU in the other LLC if
the current select_idle_sibling() is unable to find an idle CPU in the
preferred LLC. But that may increase the time taken to select a CPU.
                                  5.11-rc6      5.11-rc6+revert   5.11-rc6+patch
8CORE/1JVM 80USERS  throughput    6651.6        6716.3 (0.97%)    6940 (4.34%)
                    sys/user:time 59.75/23.86   61.77/24.55       60/24

8CORE/2JVM 80USERS  throughput    6425.4        6446.8 (0.33%)    6473.2 (0.74%)
                    sys/user:time 70.59/24.25   72.28/23.77       70/24

8CORE/4JVM 80USERS  throughput    5355.3        5551.2 (3.66%)    5586.6 (4.32%)
                    sys/user:time 76.74/21.79   76.54/22.73       76/22

8CORE/8JVM 80USERS  throughput    4420.6        4553.3 (3.00%)    4405.8 (-0.33%)
                    sys/user:time 79.13/20.32   78.76/21.01       79/20
Cc: LKML <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: Parth Shah <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vincent Guittot <[email protected]>
Co-developed-by: Gautham R Shenoy <[email protected]>
Signed-off-by: Gautham R Shenoy <[email protected]>
Co-developed-by: Parth Shah <[email protected]>
Signed-off-by: Parth Shah <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 2 ++
2 files changed, 41 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a8bd7b13634..d49bfcdc4a19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
}
+static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
+{
+ struct sched_domain_shared *tsds, *psds;
+ int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
+
+ tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
+ tnr_busy = atomic_read(&tsds->nr_busy_cpus);
+ tllc_size = per_cpu(sd_llc_size, this_cpu);
+
+ psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
+ pnr_busy = atomic_read(&psds->nr_busy_cpus);
+ pllc_size = per_cpu(sd_llc_size, prev_cpu);
+
+ /* No need to compare, if both LLCs are fully loaded */
+ if (pnr_busy == pllc_size && tnr_busy == pllc_size)
+ return nr_cpumask_bits;
+
+ if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
+ return this_cpu;
+
+ /* For better wakeup latency, prefer idler LLC to cache affinity */
+ diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
+ if (!diff)
+ return nr_cpumask_bits;
+ if (diff < 0)
+ return this_cpu;
+
+ return prev_cpu;
+}
+
static int wake_affine(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int prev_cpu, int sync)
{
@@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
if (sched_feat(WA_IDLE))
target = wake_affine_idle(this_cpu, prev_cpu, sync);
+ if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
+ !cpus_share_cache(this_cpu, prev_cpu))
+ target = prefer_idler_llc(this_cpu, prev_cpu, sync);
+
if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
@@ -5884,8 +5918,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
if (target == nr_cpumask_bits)
return prev_cpu;
- schedstat_inc(sd->ttwu_move_affine);
- schedstat_inc(p->se.statistics.nr_wakeups_affine);
+ if (target == this_cpu) {
+ schedstat_inc(sd->ttwu_move_affine);
+ schedstat_inc(p->se.statistics.nr_wakeups_affine);
+ }
+
return target;
}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1bc2b158fc51..e2de3ba8d5b1 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,6 +83,8 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
SCHED_FEAT(WA_IDLE, true)
SCHED_FEAT(WA_WEIGHT, true)
+SCHED_FEAT(WA_IDLER_LLC, true)
+SCHED_FEAT(WA_WAKER, false)
SCHED_FEAT(WA_BIAS, true)
/*
--
2.18.4
On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> Current order of preference to pick a LLC while waking a wake-affine
> task:
> 1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
> that is idle.
>
> 2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
> that is less lightly loaded.
>
> In the current situation where waker and previous CPUs are busy, but
> only one of its LLC has an idle CPU, Scheduler may end up picking a
> LLC
> with no idle CPUs. To mitigate this, add a new step between 1 and 2
> where Scheduler compares idle CPUs in waker and previous LLCs and
> picks
> the appropriate one.
I like that idea a lot. That could also solve some of the
issues sometimes observed on multi-node x86 systems, and
probably on the newer AMD chips with several LLCs on chip.
> + if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> + return this_cpu;
I wonder if we need to use a slightly lower threshold on
very large LLCs, both to account for the fact that the
select_idle_cpu code may not find the single idle CPU
among a dozen busy ones, and because on a system with
hyperthreading we may often be better off picking another
LLC to avoid HT contention issues?
Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
something like that?
That way we will only try to find the last 5 idle CPUs
in a 22 CPU LLC if the other LLC also has fewer than 6
idle cores.
That might increase our chances of finding an idle CPU
with SIS_PROP enabled, and might allow WA_WAKER to be
true by default.
> + /* For better wakeup latency, prefer idler LLC to cache
> affinity */
> + diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> + if (!diff)
> + return nr_cpumask_bits;
> + if (diff < 0)
> + return this_cpu;
> +
> + return prev_cpu;
> +}
--
All Rights Reversed.
* Rik van Riel <[email protected]> [2021-02-27 14:56:07]:
> > In the current situation where waker and previous CPUs are busy, but
> > only one of its LLC has an idle CPU, Scheduler may end up picking a
> > LLC
> > with no idle CPUs. To mitigate this, add a new step between 1 and 2
> > where Scheduler compares idle CPUs in waker and previous LLCs and
> > picks
> > the appropriate one.
>
> I like that idea a lot. That could also solve some of the
Thanks.
> issues sometimes observed on multi-node x86 systems, and
> probably on the newer AMD chips with several LLCs on chip.
>
Okay.
> > + if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > + return this_cpu;
>
> I wonder if we need to use a slightly lower threshold on
> very large LLCs, both to account for the fact that the
> select_idle_cpu code may not find the single idle CPU
> among a dozen busy ones, or because on a system with
> hyperthreading we may often be better off picking another
> LLC for HT contention issues?
>
> Maybe we could use "tnr_busy * 4 <
> tllc_size * 3" or
> something like that?
>
> That way we will only try to find the last 5 idle
> CPUs
> in a 22 CPU LLC if the other LLC also has fewer than 6
> idle cores.
>
> That might increase our chances of finding an idle CPU
> with SIS_PROP enabled, and might allow WA_WAKER to be
> true by default.
Agree we need to be conservative, especially if we want to make WA_WAKER
on by default. I would still like to hear from other people whether they
think it is OK to enable it by default. I wonder if enabling it by
default could cause some load imbalances, leading to more active load
balancing down the line. I haven't benchmarked with WA_WAKER enabled.
Thanks Rik for your inputs.
--
Thanks and Regards
Srikar Dronamraju
On Fri, Feb 26, 2021 at 10:10:29PM +0530, Srikar Dronamraju wrote:
> +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> +{
> + struct sched_domain_shared *tsds, *psds;
> + int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> +
> + tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> + tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> + tllc_size = per_cpu(sd_llc_size, this_cpu);
> +
> + psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> + pnr_busy = atomic_read(&psds->nr_busy_cpus);
> + pllc_size = per_cpu(sd_llc_size, prev_cpu);
> +
nr_busy_cpus is NO_HZ_COMMON, so this code that consumes it should be
too.
* Peter Zijlstra <[email protected]> [2021-03-01 16:40:33]:
> On Fri, Feb 26, 2021 at 10:10:29PM +0530, Srikar Dronamraju wrote:
> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > + struct sched_domain_shared *tsds, *psds;
> > + int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > + tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > + tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > + tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > + psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > + pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > + pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
>
> nr_busy_cpus is NO_HZ_COMMON So this code that consumes it should be
> too.
Thanks Peter, will take care of this along with the other changes,
including calling this within rcu_read_lock() and checking tsds and
psds for NULL after rcu_dereference().
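
Something along these lines, perhaps (untested sketch, only to show the
NO_HZ_COMMON/RCU handling; the checks elided below stay as in the patch):

#ifdef CONFIG_NO_HZ_COMMON
static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
{
        struct sched_domain_shared *tsds, *psds;
        int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;

        rcu_read_lock();
        tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
        psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
        if (!tsds || !psds) {
                rcu_read_unlock();
                return nr_cpumask_bits;
        }

        tnr_busy = atomic_read(&tsds->nr_busy_cpus);
        pnr_busy = atomic_read(&psds->nr_busy_cpus);
        rcu_read_unlock();

        tllc_size = per_cpu(sd_llc_size, this_cpu);
        pllc_size = per_cpu(sd_llc_size, prev_cpu);

        /* (fully-loaded and WA_WAKER checks from the patch elided here) */

        /* For better wakeup latency, prefer the idler LLC to cache affinity */
        diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
        if (!diff)
                return nr_cpumask_bits;

        return diff < 0 ? this_cpu : prev_cpu;
}
#else
static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
{
        /* nr_busy_cpus is only tracked with NO_HZ_COMMON */
        return nr_cpumask_bits;
}
#endif
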
--
Thanks and Regards
Srikar Dronamraju
* Peter Zijlstra <[email protected]> [2021-03-01 18:18:28]:
> On Mon, Mar 01, 2021 at 10:36:01PM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra <[email protected]> [2021-03-01 16:44:42]:
> >
> > > On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> > > > On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> > >
> > >
> > > > Maybe we could use "tnr_busy * 4 <
> > > > tllc_size * 3" or
> > > > something like that?
> > >
> > > How about:
> > >
> > > tnr_busy < tllc_size / topology_max_smt_threads()
> > >
> > > ?
> >
> > Isn't topology_max_smt_threads only for x86 as of today?
> > Or Am I missing out?
>
> Oh, could be, I didn't grep :/ We could have core code keep track of the
> smt count I suppose.
Could we use cpumask_weight(cpu_smt_mask(this_cpu)) instead?
--
Thanks and Regards
Srikar Dronamraju
On Tue, Mar 02, 2021 at 01:09:46PM +0530, Srikar Dronamraju wrote:
> > Oh, could be, I didn't grep :/ We could have core code keep track of the
> > smt count I suppose.
>
> Could we use cpumask_weight(cpu_smt_mask(this_cpu)) instead?
cpumask_weight() is potentially super expensive. With CPUMASK_OFFSTACK
you get at least one more cache miss, and then the bitmap might be
really long.
Best to compute the result once and use it later.
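
Something like this, say (completely untested; the per-cpu variable and
the update hook are made-up names, and this assumes CONFIG_SCHED_SMT):

DEFINE_PER_CPU(int, sd_smt_size);

static void update_smt_size(int cpu)
{
        /* computed once, e.g. wherever the sched domains get (re)built */
        per_cpu(sd_smt_size, cpu) = cpumask_weight(cpu_smt_mask(cpu));
}

and then the wakeup path just reads per_cpu(sd_smt_size, this_cpu)
instead of calling cpumask_weight() every time.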
On 26/02/2021 17:40, Srikar Dronamraju wrote:
[...]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a8bd7b13634..d49bfcdc4a19 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
> return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
> }
>
> +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> +{
> + struct sched_domain_shared *tsds, *psds;
> + int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> +
> + tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> + tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> + tllc_size = per_cpu(sd_llc_size, this_cpu);
> +
> + psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> + pnr_busy = atomic_read(&psds->nr_busy_cpus);
> + pllc_size = per_cpu(sd_llc_size, prev_cpu);
> +
> + /* No need to compare, if both LLCs are fully loaded */
> + if (pnr_busy == pllc_size && tnr_busy == pllc_size)
^
shouldn't this be tllc_size ?
> + return nr_cpumask_bits;
> +
> + if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> + return this_cpu;
> +
> + /* For better wakeup latency, prefer idler LLC to cache affinity */
> + diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> + if (!diff)
> + return nr_cpumask_bits;
> + if (diff < 0)
> + return this_cpu;
> +
> + return prev_cpu;
> +}
> +
* Peter Zijlstra <[email protected]> [2021-03-02 10:10:32]:
> On Tue, Mar 02, 2021 at 01:09:46PM +0530, Srikar Dronamraju wrote:
> > > Oh, could be, I didn't grep :/ We could have core code keep track of the
> > > smt count I suppose.
> >
> > Could we use cpumask_weight(cpu_smt_mask(this_cpu)) instead?
>
> cpumask_weight() is potentially super expensive. With CPUMASK_OFFSTACK
> you get at least one more cache miss and then the bitmap might be really
> long.
>
> Best to compute the results once and use it later.
Oh okay .. Noted.
--
Thanks and Regards
Srikar Dronamraju
On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> > + if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > + return this_cpu;
>
> I wonder if we need to use a slightly lower threshold on
> very large LLCs, both to account for the fact that the
> select_idle_cpu code may not find the single idle CPU
> among a dozen busy ones, or because on a system with
> hyperthreading we may often be better off picking another
> LLC for HT contention issues?
>
> Maybe we could use "tnr_busy * 4 <
> tllc_size * 3" or
> something like that?
How about:
tnr_busy < tllc_size / topology_max_smt_threads()
?
* Peter Zijlstra <[email protected]> [2021-03-01 16:44:42]:
> On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> > On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
>
> > > + if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > > + return this_cpu;
> >
> > I wonder if we need to use a slightly lower threshold on
> > very large LLCs, both to account for the fact that the
> > select_idle_cpu code may not find the single idle CPU
> > among a dozen busy ones, or because on a system with
> > hyperthreading we may often be better off picking another
> > LLC for HT contention issues?
> >
> > Maybe we could use "tnr_busy * 4 <
> > tllc_size * 3" or
> > something like that?
>
> How about:
>
> tnr_busy < tllc_size / topology_max_smt_threads()
>
> ?
Isn't topology_max_smt_threads() only available on x86 as of today?
Or am I missing something?
--
Thanks and Regards
Srikar Dronamraju
On Mon, Mar 01, 2021 at 10:36:01PM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <[email protected]> [2021-03-01 16:44:42]:
>
> > On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> > > On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> >
> > > > + if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > > > + return this_cpu;
> > >
> > > I wonder if we need to use a slightly lower threshold on
> > > very large LLCs, both to account for the fact that the
> > > select_idle_cpu code may not find the single idle CPU
> > > among a dozen busy ones, or because on a system with
> > > hyperthreading we may often be better off picking another
> > > LLC for HT contention issues?
> > >
> > > Maybe we could use "tnr_busy * 4 <
> > > tllc_size * 3" or
> > > something like that?
> >
> > How about:
> >
> > tnr_busy < tllc_size / topology_max_smt_threads()
> >
> > ?
>
> Isn't topology_max_smt_threads only for x86 as of today?
> Or Am I missing out?
Oh, could be, I didn't grep :/ We could have core code keep track of the
smt count I suppose.
* Dietmar Eggemann <[email protected]> [2021-03-02 10:53:06]:
> On 26/02/2021 17:40, Srikar Dronamraju wrote:
>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a8bd7b13634..d49bfcdc4a19 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
> > return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
> > }
> >
> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > + struct sched_domain_shared *tsds, *psds;
> > + int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > + tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > + tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > + tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > + psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > + pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > + pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
> > + /* No need to compare, if both LLCs are fully loaded */
> > + if (pnr_busy == pllc_size && tnr_busy == pllc_size)
>
> ^
> shouldn't this be tllc_size ?
Yes, thanks for pointing that out.
--
Thanks and Regards
Srikar Dronamraju
On Fri, 26 Feb 2021 at 17:41, Srikar Dronamraju
<[email protected]> wrote:
>
> On POWER8 and POWER9, the last level cache (L2) has been at the level of
> a group of 8 threads (SMT8 on POWER8, a big-core comprising of a pair of
> SMT4 cores on POWER9). However, on POWER10, the LLC domain is at the
> level of a group of SMT4 threads within the SMT8 core. Due to the
> shrinking in the size of the LLC domain, the probability of finding an
> idle CPU in the LLC domain of the target is lesser on POWER10 compared
> to the previous generation processors.
>
> With commit 9538abee18cc ("powerpc/smp: Add support detecting
> thread-groups sharing L2 cache") benchmarks such as Daytrader
> (https://github.com/WASdev/sample.daytrader7) show a drop in throughput
> in a configuration consisting of 1 JVM spanning across 6-8 Bigcores on
> POWER10. Analysis showed that this was because more number of wakeups
> were happening on busy CPUs when the utilization was 60-70%. This drop
> in throughput also shows up as a drop in CPU utilization. However most
> other benchmarks benefit with detecting the thread-groups that share L2
> cache.
>
> Current order of preference to pick a LLC while waking a wake-affine
> task:
> 1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
> that is idle.
>
> 2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
> that is less lightly loaded.
>
> In the current situation where waker and previous CPUs are busy, but
> only one of its LLC has an idle CPU, Scheduler may end up picking a LLC
> with no idle CPUs. To mitigate this, add a new step between 1 and 2
> where Scheduler compares idle CPUs in waker and previous LLCs and picks
> the appropriate one.
>
> The other alternative is to search for an idle CPU in the other LLC, if
> the current select_idle_sibling is unable to find an idle CPU in the
> preferred LLC. But that may increase the time to select a CPU.
>
>
> 5.11-rc6 5.11-rc6+revert 5.11-rc6+patch
> 8CORE/1JVM 80USERS throughput 6651.6 6716.3 (0.97%) 6940 (4.34%)
> sys/user:time 59.75/23.86 61.77/24.55 60/24
>
> 8CORE/2JVM 80USERS throughput 6425.4 6446.8 (0.33%) 6473.2 (0.74%)
> sys/user:time 70.59/24.25 72.28/23.77 70/24
>
> 8CORE/4JVM 80USERS throughput 5355.3 5551.2 (3.66%) 5586.6 (4.32%)
> sys/user:time 76.74/21.79 76.54/22.73 76/22
>
> 8CORE/8JVM 80USERS throughput 4420.6 4553.3 (3.00%) 4405.8 (-0.33%)
> sys/user:time 79.13/20.32 78.76/21.01 79/20
>
> Cc: LKML <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Michael Neuling <[email protected]>
> Cc: Gautham R Shenoy <[email protected]>
> Cc: Parth Shah <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Co-developed-by: Gautham R Shenoy <[email protected]>
> Signed-off-by: Gautham R Shenoy <[email protected]>
> Co-developed-by: Parth Shah <[email protected]>
> Signed-off-by: Parth Shah <[email protected]>
> Signed-off-by: Srikar Dronamraju <[email protected]>
> ---
> kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++++--
> kernel/sched/features.h | 2 ++
> 2 files changed, 41 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a8bd7b13634..d49bfcdc4a19 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
> return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
> }
>
> +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> +{
> + struct sched_domain_shared *tsds, *psds;
> + int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> +
> + tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> + tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> + tllc_size = per_cpu(sd_llc_size, this_cpu);
> +
> + psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> + pnr_busy = atomic_read(&psds->nr_busy_cpus);
> + pllc_size = per_cpu(sd_llc_size, prev_cpu);
> +
> + /* No need to compare, if both LLCs are fully loaded */
> + if (pnr_busy == pllc_size && tnr_busy == pllc_size)
> + return nr_cpumask_bits;
> +
> + if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> + return this_cpu;
Why have you chosen to favor this_cpu instead of prev_cpu, unlike wake_affine_idle()?
> +
> + /* For better wakeup latency, prefer idler LLC to cache affinity */
> + diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> + if (!diff)
> + return nr_cpumask_bits;
> + if (diff < 0)
> + return this_cpu;
> +
> + return prev_cpu;
> +}
> +
> static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> int this_cpu, int prev_cpu, int sync)
> {
> @@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> if (sched_feat(WA_IDLE))
> target = wake_affine_idle(this_cpu, prev_cpu, sync);
>
> + if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
> + !cpus_share_cache(this_cpu, prev_cpu))
> + target = prefer_idler_llc(this_cpu, prev_cpu, sync);
Could you use the same naming convention as the other functions?
wake_affine_llc, for example.
> +
> if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
> target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
>
> @@ -5884,8 +5918,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> if (target == nr_cpumask_bits)
> return prev_cpu;
>
> - schedstat_inc(sd->ttwu_move_affine);
> - schedstat_inc(p->se.statistics.nr_wakeups_affine);
> + if (target == this_cpu) {
How is this condition related to $subject ?
> + schedstat_inc(sd->ttwu_move_affine);
> + schedstat_inc(p->se.statistics.nr_wakeups_affine);
> + }
> +
> return target;
> }
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 1bc2b158fc51..e2de3ba8d5b1 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -83,6 +83,8 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
>
> SCHED_FEAT(WA_IDLE, true)
> SCHED_FEAT(WA_WEIGHT, true)
> +SCHED_FEAT(WA_IDLER_LLC, true)
> +SCHED_FEAT(WA_WAKER, false)
> SCHED_FEAT(WA_BIAS, true)
>
> /*
> --
> 2.18.4
>
* Vincent Guittot <[email protected]> [2021-03-08 14:52:39]:
> On Fri, 26 Feb 2021 at 17:41, Srikar Dronamraju
> <[email protected]> wrote:
> >
Thanks Vincent for your review comments.
> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > + struct sched_domain_shared *tsds, *psds;
> > + int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > + tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > + tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > + tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > + psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > + pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > + pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
> > + /* No need to compare, if both LLCs are fully loaded */
> > + if (pnr_busy == pllc_size && tnr_busy == pllc_size)
> > + return nr_cpumask_bits;
> > +
> > + if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > + return this_cpu;
>
> Why have you chosen to favor this_cpu instead of prev_cpu unlike for wake_idle ?
At this point, we know that the waker running on this_cpu and the wakee
which was running on prev_cpu are affine to each other, and that
this_cpu and prev_cpu don't share cache. I chose to move them close to
each other to benefit from the cache sharing. Based on feedback from
Peter and Rik, I made the check more conservative, i.e.
tnr_busy <= tllc_size/smt_weight (where smt_weight is the cpumask weight
of the SMT domain for this_cpu), i.e. if we have a free core in this LLC
domain, choose this_cpu. select_idle_sibling() should then pick an idle
cpu/core/smt within the LLC domain of this_cpu.
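
In other words, something like this (sketch only; smt_weight is assumed
to be precomputed, as discussed with Peter in the other thread):

        /* no more busy CPUs than cores in the waker's LLC */
        if (sched_feat(WA_WAKER) && tnr_busy <= tllc_size / smt_weight)
                return this_cpu;
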
Do you feel this may not be the correct option?
We are also experimenting with another option, where we call
prefer_idler_cpu after wa_weight (a rough sketch follows after the two
cases below), i.e.:
1. if wake_affine_weight chooses this_cpu but the LLC of prev_cpu has an
idle smt/CPU while there is no idle smt/CPU in the LLC of this_cpu, then
choose the idle smt/CPU in prev_cpu
2. if wake_affine_weight chooses nr_cpumask_bits (aka prev_cpu) but the
LLC of this_cpu has an idle smt/CPU while there is no idle smt/CPU in
the LLC of prev_cpu, then choose the idle smt/CPU in this_cpu
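
Roughly (very rough sketch; llc_has_idle_cpu() is a made-up helper, it
could for instance compare nr_busy_cpus against the LLC size):

        /* after wake_affine_weight() has made its choice ... */
        int chosen = (target == this_cpu) ? this_cpu : prev_cpu;
        int other = (chosen == this_cpu) ? prev_cpu : this_cpu;

        /* override the weight decision only if it picked an LLC with no idle CPU */
        if (!llc_has_idle_cpu(chosen) && llc_has_idle_cpu(other))
                target = other;
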
> > +
> > + /* For better wakeup latency, prefer idler LLC to cache affinity */
> > + diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> > + if (!diff)
> > + return nr_cpumask_bits;
> > + if (diff < 0)
> > + return this_cpu;
> > +
> > + return prev_cpu;
> > +}
> > +
> > static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > int this_cpu, int prev_cpu, int sync)
> > {
> > @@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > if (sched_feat(WA_IDLE))
> > target = wake_affine_idle(this_cpu, prev_cpu, sync);
> >
> > + if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
> > + !cpus_share_cache(this_cpu, prev_cpu))
> > + target = prefer_idler_llc(this_cpu, prev_cpu, sync);
>
> could you use the same naming convention as others function ?
> wake_affine_llc as an example
I guess you meant s/prefer_idler_llc/wake_affine_llc/
Sure. I can modify.
>
> > +
> > if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
> > target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
> >
> > @@ -5884,8 +5918,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > if (target == nr_cpumask_bits)
> > return prev_cpu;
> >
> > - schedstat_inc(sd->ttwu_move_affine);
> > - schedstat_inc(p->se.statistics.nr_wakeups_affine);
> > + if (target == this_cpu) {
>
> How is this condition related to $subject ?
Before this change, wake_affine_weight and wake_affine_idle would either
return this_cpu or nr_cpumask_bits. Just before this check, we check if
target is nr_cpumask_bits and return prev_cpu. So the stats were only
incremented when target was this_cpu.

However, with prefer_idler_llc, we may return this_cpu, prev_cpu or
nr_cpumask_bits. Now we only want to update the stats when we have
chosen to migrate the task to this_cpu. Hence I had this check.

If we use the slightly lazier approach, which is to check wa_weight
first before wa_idler_llc, then we may not need this change at all.
--
Thanks and Regards
Srikar Dronamraju
On Wed, 10 Mar 2021 at 06:53, Srikar Dronamraju
<[email protected]> wrote:
>
> * Vincent Guittot <[email protected]> [2021-03-08 14:52:39]:
>
> > On Fri, 26 Feb 2021 at 17:41, Srikar Dronamraju
> > <[email protected]> wrote:
> > >
>
> Thanks Vincent for your review comments.
>
> > > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > > +{
> > > + struct sched_domain_shared *tsds, *psds;
> > > + int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > > +
> > > + tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > > + tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > > + tllc_size = per_cpu(sd_llc_size, this_cpu);
> > > +
> > > + psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > > + pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > > + pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > > +
> > > + /* No need to compare, if both LLCs are fully loaded */
> > > + if (pnr_busy == pllc_size && tnr_busy == pllc_size)
> > > + return nr_cpumask_bits;
> > > +
> > > + if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > > + return this_cpu;
> >
> > Why have you chosen to favor this_cpu instead of prev_cpu unlike for wake_idle ?
>
> At this point, we know the waker running on this_cpu and wakee which was
> running on prev_cpu are affine to each other and this_cpu and prev_cpu dont
> share cache. I chose to move them close to each other to benefit from the
> cache sharing. Based on feedback from Peter and Rik, I made the check more
> conservative i.e tnr_busy <= tllc_size/smt_weight (where smt_weight is the
> cpumask weight of smt domain for this_cpu) i.e if we have a free core in
Yeah, makes sense.
> this llc domain, chose this_cpu. select_idle_sibling() should pick an idle
> cpu/core/smt within the llc domain for this_cpu.
>
> Do you feel, this may not be the correct option?
I was worried that we would end up pulling tasks into the same LLC, but
the condition above and wake_wide() should prevent such behavior.
>
> We are also experimenting with another option, were we call prefer_idler_cpu
> after wa_weight. I.e
> 1. if wake_affine_weight choses this_cpu but llc in prev_cpu has an idle
> smt/CPU but there are no idle smt/CPU in this_cpu, then chose idle smt/CPU
> in prev_cpu
> 2. if wake_affine_weight choses nr_cpumask(aka prev_cpu) but llc in this_cpu
> has an idle smt/CPU but there are no idle smt/CPU in prev_cpu, then chose
> idle smt/CPU in this_cpu
>
>
> > > +
> > > + /* For better wakeup latency, prefer idler LLC to cache affinity */
> > > + diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> > > + if (!diff)
> > > + return nr_cpumask_bits;
> > > + if (diff < 0)
> > > + return this_cpu;
> > > +
> > > + return prev_cpu;
> > > +}
> > > +
> > > static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > > int this_cpu, int prev_cpu, int sync)
> > > {
> > > @@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > > if (sched_feat(WA_IDLE))
> > > target = wake_affine_idle(this_cpu, prev_cpu, sync);
> > >
> > > + if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
> > > + !cpus_share_cache(this_cpu, prev_cpu))
> > > + target = prefer_idler_llc(this_cpu, prev_cpu, sync);
> >
> > could you use the same naming convention as others function ?
> > wake_affine_llc as an example
>
> I guess you meant s/prefer_idler_llc/wake_affine_llc/
yes
> Sure. I can modify.
>
> >
> > > +
> > > if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
> > > target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
> > >
> > > @@ -5884,8 +5918,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > > if (target == nr_cpumask_bits)
> > > return prev_cpu;
> > >
> > > - schedstat_inc(sd->ttwu_move_affine);
> > > - schedstat_inc(p->se.statistics.nr_wakeups_affine);
> > > + if (target == this_cpu) {
> >
> > How is this condition related to $subject ?
>
> Before this change, wake_affine_weight and wake_affine_idle would either
> return this_cpu or nr_cpumask_bits. Just before this check, we check if
> target is nr_cpumask_bits and return prev_cpu. So the stats were only
> incremented when target was this_cpu.
>
> However with prefer_idler_llc, we may return this_cpu, prev_cpu or
> nr_cpumask_bits. Now we only to update stats when we have chosen to migrate
> the task to this_cpu. Hence I had this check.
OK, got it.

Maybe return earlier in this case, like for the
if (target == nr_cpumask_bits) check above.
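
Something like this (untested), reusing the tail of wake_affine() from
your patch:

        if (target == nr_cpumask_bits || target == prev_cpu)
                return prev_cpu;

        schedstat_inc(sd->ttwu_move_affine);
        schedstat_inc(p->se.statistics.nr_wakeups_affine);

        return target;
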
>
> If we use the slightly lazier approach which is check for wa_weight first
> before wa_idler_llc, then we may not need this change at all.
>
> --
> Thanks and Regards
> Srikar Dronamraju