Subject: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

The following commit has been merged into the sched/core branch of tip:

Commit-ID: c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
Gitweb: https://git.kernel.org/tip/c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
Author: Peter Zijlstra <[email protected]>
AuthorDate: Tue, 30 May 2023 13:20:46 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 30 May 2023 22:46:27 +02:00

sched/fair: Multi-LLC select_idle_sibling()

Tejun reported that when he targets workqueues towards a specific LLC
on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
significant idle time.

This is, of course, because of how select_idle_sibling() will not
consider anything outside of the local LLC, and since all these tasks
are short running the periodic idle load balancer is ineffective.

And while it is good to keep work cache local, it is better to not
have significant idle time. Therefore, have select_idle_sibling() try
other LLCs inside the same node when the local one comes up empty.

Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
2 files changed, 39 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0c..0172458 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
}

/*
+ * For the multiple-LLC per node case, make sure to try the other LLC's if the
+ * local LLC comes up empty.
+ */
+static int
+select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ struct sched_domain *parent = sd->parent;
+ struct sched_group *sg;
+
+ /* Make sure to not cross nodes. */
+ if (!parent || parent->flags & SD_NUMA)
+ return -1;
+
+ sg = parent->groups;
+ do {
+ int cpu = cpumask_first(sched_group_span(sg));
+ struct sched_domain *sd_child;
+
+ sd_child = per_cpu(sd_llc, cpu);
+ if (sd_child != sd) {
+ int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
+
+ sg = sg->next;
+ } while (sg != parent->groups);
+
+ return -1;
+}
+
+/*
* Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
* the task fits. If no CPU is big enough, but there are idle ones, try to
* maximize capacity.
@@ -7199,6 +7231,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;

+ if (sched_feat(SIS_NODE)) {
+ i = select_idle_node(p, sd, target);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
+
return target;
}

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c..9e390eb 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_NODE, true)

/*
* Issue a WARN when we do multiple update_rq_clock() calls


2023-06-01 04:09:23

by Abel Wu

Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

On 5/31/23 8:04 PM, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> Gitweb: https://git.kernel.org/tip/c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> Author: Peter Zijlstra <[email protected]>
> AuthorDate: Tue, 30 May 2023 13:20:46 +02:00
> Committer: Peter Zijlstra <[email protected]>
> CommitterDate: Tue, 30 May 2023 22:46:27 +02:00
>
> sched/fair: Multi-LLC select_idle_sibling()
>
> Tejun reported that when he targets workqueues towards a specific LLC
> on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
> significant idle time.
>
> This is, of course, because of how select_idle_sibling() will not
> consider anything outside of the local LLC, and since all these tasks
> are short running the periodic idle load balancer is ineffective.
>
> And while it is good to keep work cache local, it is better to not
> have significant idle time. Therefore, have select_idle_sibling() try
> other LLCs inside the same node when the local one comes up empty.
>
> Reported-by: Tejun Heo <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
> kernel/sched/features.h | 1 +
> 2 files changed, 39 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0c..0172458 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> }
>
> /*
> + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> + * local LLC comes up empty.
> + */
> +static int
> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + struct sched_domain *parent = sd->parent;
> + struct sched_group *sg;
> +
> + /* Make sure to not cross nodes. */
> + if (!parent || parent->flags & SD_NUMA)
> + return -1;
> +
> + sg = parent->groups;
> + do {
> + int cpu = cpumask_first(sched_group_span(sg));
> + struct sched_domain *sd_child;
> +
> + sd_child = per_cpu(sd_llc, cpu);
> + if (sd_child != sd) {

Since sd_llc is cpu private, I think it should be:

if (!cpus_share_cache(cpu, target))
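
A minimal sketch of how the loop body might then read, assuming the
rest of select_idle_node() from the patch above stays as-is:

        do {
                int cpu = cpumask_first(sched_group_span(sg));
                struct sched_domain *sd_child = per_cpu(sd_llc, cpu);

                /* Skip the LLC that @target itself belongs to. */
                if (!cpus_share_cache(cpu, target)) {
                        int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
                        if ((unsigned)i < nr_cpumask_bits)
                                return i;
                }

                sg = sg->next;
        } while (sg != parent->groups);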

> + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
> +
> + sg = sg->next;
> + } while (sg != parent->groups);
> +
> + return -1;
> +}
> +
> +/*
> * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> * the task fits. If no CPU is big enough, but there are idle ones, try to
> * maximize capacity.
> @@ -7199,6 +7231,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> + if (sched_feat(SIS_NODE)) {
> + i = select_idle_node(p, sd, target);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
> +
> return target;
> }
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c..9e390eb 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> */
> SCHED_FEAT(SIS_PROP, false)
> SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_NODE, true)
>
> /*
> * Issue a WARN when we do multiple update_rq_clock() calls

2023-06-01 08:14:06

by Peter Zijlstra

Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

On Thu, Jun 01, 2023 at 11:41:14AM +0800, Abel Wu wrote:
> On 5/31/23 8:04 PM, tip-bot2 for Peter Zijlstra wrote:
> > The following commit has been merged into the sched/core branch of tip:
> >
> > Commit-ID: c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> > Gitweb: https://git.kernel.org/tip/c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> > Author: Peter Zijlstra <[email protected]>
> > AuthorDate: Tue, 30 May 2023 13:20:46 +02:00
> > Committer: Peter Zijlstra <[email protected]>
> > CommitterDate: Tue, 30 May 2023 22:46:27 +02:00
> >
> > sched/fair: Multi-LLC select_idle_sibling()
> >
> > Tejun reported that when he targets workqueues towards a specific LLC
> > on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
> > significant idle time.
> >
> > This is, of course, because of how select_idle_sibling() will not
> > consider anything outside of the local LLC, and since all these tasks
> > are short running the periodic idle load balancer is ineffective.
> >
> > And while it is good to keep work cache local, it is better to not
> > have significant idle time. Therefore, have select_idle_sibling() try
> > other LLCs inside the same node when the local one comes up empty.
> >
> > Reported-by: Tejun Heo <[email protected]>
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > ---
> > kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
> > kernel/sched/features.h | 1 +
> > 2 files changed, 39 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 48b6f0c..0172458 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > }
> > /*
> > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > + * local LLC comes up empty.
> > + */
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > + struct sched_domain *parent = sd->parent;
> > + struct sched_group *sg;
> > +
> > + /* Make sure to not cross nodes. */
> > + if (!parent || parent->flags & SD_NUMA)
> > + return -1;
> > +
> > + sg = parent->groups;
> > + do {
> > + int cpu = cpumask_first(sched_group_span(sg));
> > + struct sched_domain *sd_child;
> > +
> > + sd_child = per_cpu(sd_llc, cpu);
> > + if (sd_child != sd) {
>
> Since sd_llc is cpu private, I think it should be:
>
> if (!cpus_share_cache(cpu, target))

Hmm, yes.. either that or ensure sd is from the first cpu in its mask.
Let me go fix that.
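
For the latter, something like so perhaps (sketch only; sd_ref is just
an illustrative name):

        /*
         * Take the reference LLC domain from the first CPU spanned by @sd,
         * so the pointer comparison is stable no matter which CPU's sd_llc
         * copy got passed in.
         */
        struct sched_domain *sd_ref =
                per_cpu(sd_llc, cpumask_first(sched_domain_span(sd)));
        ...
        sd_child = per_cpu(sd_llc, cpu);
        if (sd_child != sd_ref) {
                /* scan this LLC as before */
        }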

Thanks!

2023-06-01 09:59:24

by K Prateek Nayak

Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

Hello Peter,

Sharing some initial benchmark results with the patch below.

tl;dr

- Hackbench starts off well but performance drops as the number of groups
increases.

- schbench (old), tbench, and netperf see improvement but there is a band of
outlier results when the system is fully loaded or slightly overloaded.

- Stream and ycsb-mongodb don't mind the extra search.

- SPECjbb (with default scheduler tunables) and DeathStarBench are not
very happy.

On 5/31/2023 5:34 PM, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> Gitweb: https://git.kernel.org/tip/c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> Author: Peter Zijlstra <[email protected]>
> AuthorDate: Tue, 30 May 2023 13:20:46 +02:00
> Committer: Peter Zijlstra <[email protected]>
> CommitterDate: Tue, 30 May 2023 22:46:27 +02:00
>
> sched/fair: Multi-LLC select_idle_sibling()
>
> Tejun reported that when he targets workqueues towards a specific LLC
> on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
> significant idle time.
>
> This is, of course, because of how select_idle_sibling() will not
> consider anything outside of the local LLC, and since all these tasks
> are short running the periodic idle load balancer is ineffective.
>
> And while it is good to keep work cache local, it is better to not
> have significant idle time. Therefore, have select_idle_sibling() try
> other LLCs inside the same node when the local one comes up empty.

Tests were run on a dual socket 3rd Generation EPYC server (2 x 64C/128T)
running in NPS1 mode. Following is the simplified machine topology:

NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.

DIE0: 0-63, 128-191
    MC0: 0-7, 128-135
        SMT0: 0,128
        SMT1: 1,129
        ...
        SMT7: 7,135
    MC1: 8-15, 136-143
        SMT8: 8,136
        SMT9: 9,137
        ...
        SMT15: 15,143
    ...
    MC7: 56-63, 184-191
        SMT56: 56,184
        SMT57: 57,185
        ...
        SMT63: 63,191

DIE1: 64-127, 192-255
    MC8: 64-71, 192-199
        SMT64: 64,192
        SMT65: 65,193
        ...
        SMT71: 71,199
    MC9: 72-79, 200-207
        SMT72: 72,200
        SMT73: 73,201
        ...
        SMT79: 79,207
    ...
    MC15: 120-127, 248-255
        SMT120: 120,248
        SMT121: 121,249
        ...
        SMT127: 127,255

Since the patch extends the idle CPU search to one domain above MC in
case of an unsuccessful search, for the above topology the DIE domain
becomes the wake domain, with a potential 128 CPUs (8 MC domains x 16
CPUs each) to be searched.
Following are the benchmark results:

o Kernel Versions

- tip - tip:sched/core at commit e2a1f85bf9f5 ("sched/psi: Avoid
resetting the min update period when it is unnecessary")

- peter-next-level - tip:sched/core + this patch

o Benchmark Results

Note: Benchmarks were run with boost enabled and C2 disabled to minimize
other external factors.

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test: tip peter-next-level
1-groups: 3.92 (0.00 pct) 4.05 (-3.31 pct)
2-groups: 4.58 (0.00 pct) 3.84 (16.15 pct)
4-groups: 4.99 (0.00 pct) 3.98 (20.24 pct)
8-groups: 5.67 (0.00 pct) 6.05 (-6.70 pct) * Overloaded
16-groups: 7.88 (0.00 pct) 10.56 (-34.01 pct) * Overloaded

~~~~~~~~~~~~~~~~~~
~ schbench (Old) ~
~~~~~~~~~~~~~~~~~~

o NPS1

#workers: tip peter-next-level
1: 26.00 (0.00 pct) 24.00 (7.69 pct)
2: 27.00 (0.00 pct) 24.00 (11.11 pct)
4: 31.00 (0.00 pct) 28.00 (9.67 pct)
8: 36.00 (0.00 pct) 33.00 (8.33 pct)
16: 49.00 (0.00 pct) 47.00 (4.08 pct)
32: 80.00 (0.00 pct) 81.00 (-1.25 pct)
64: 169.00 (0.00 pct) 169.00 (0.00 pct)
128: 343.00 (0.00 pct) 365.00 (-6.41 pct) * Fully Loaded
256: 42048.00 (0.00 pct) 35392.00 (15.82 pct)
512: 95104.00 (0.00 pct) 88704.00 (6.72 pct)

~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients: tip peter-next-level
1 452.49 (0.00 pct) 457.94 (1.20 pct)
2 862.44 (0.00 pct) 879.99 (2.03 pct)
4 1604.27 (0.00 pct) 1618.87 (0.91 pct)
8 2966.77 (0.00 pct) 3040.90 (2.49 pct)
16 5176.70 (0.00 pct) 5292.29 (2.23 pct)
32 8205.24 (0.00 pct) 8949.12 (9.06 pct)
64 13956.71 (0.00 pct) 14461.42 (3.61 pct)
128 24005.50 (0.00 pct) 26052.75 (8.52 pct)
256 32457.61 (0.00 pct) 21999.41 (-32.22 pct) * Overloaded
512 34345.24 (0.00 pct) 41166.39 (19.86 pct)
1024 33432.92 (0.00 pct) 40900.84 (22.33 pct)

~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

- 10 Runs:

Test: tip peter-next-level
Copy: 271317.35 (0.00 pct) 292440.22 (7.78 pct)
Scale: 205533.77 (0.00 pct) 203362.60 (-1.05 pct)
Add: 221624.62 (0.00 pct) 225850.83 (1.90 pct)
Triad: 228500.68 (0.00 pct) 225885.25 (-1.14 pct)

- 100 Runs:

Test: tip peter-next-level
Copy: 317381.65 (0.00 pct) 318827.08 (0.45 pct)
Scale: 214145.00 (0.00 pct) 206213.69 (-3.70 pct)
Add: 239243.29 (0.00 pct) 229791.67 (-3.95 pct)
Triad: 249477.76 (0.00 pct) 236843.06 (-5.06 pct)

~~~~~~~~~~~~~~~~~~~~
~ netperf - TCP_RR ~
~~~~~~~~~~~~~~~~~~~~

o NPS1

Test: tip peter-next-level
1-clients: 102839.97 (0.00 pct) 103540.33 (0.68 pct)
2-clients: 98428.08 (0.00 pct) 100431.67 (2.03 pct)
4-clients: 92298.45 (0.00 pct) 94800.51 (2.71 pct)
8-clients: 85618.41 (0.00 pct) 89130.14 (4.10 pct)
16-clients: 78722.18 (0.00 pct) 79715.38 (1.26 pct)
32-clients: 73610.75 (0.00 pct) 72801.41 (-1.09 pct)
64-clients: 55285.07 (0.00 pct) 56184.38 (1.62 pct)
128-clients: 31176.92 (0.00 pct) 32830.06 (5.30 pct)
256-clients: 20011.44 (0.00 pct) 15135.39 (-24.36 pct) * Overloaded

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

tip peter-next-level
Hmean unixbench-dhry2reg-1 41322625.19 ( 0.00%) 41224388.33 ( -0.24%)
Hmean unixbench-dhry2reg-512 6252491108.60 ( 0.00%) 6240160851.68 ( -0.20%)
Amean unixbench-syscall-1 2501398.27 ( 0.00%) 2577323.43 * -3.04%*
Amean unixbench-syscall-512 8120524.00 ( 0.00%) 7512955.87 * 7.48%*
Hmean unixbench-pipe-1 2359346.02 ( 0.00%) 2392308.62 * 1.40%*
Hmean unixbench-pipe-512 338790322.61 ( 0.00%) 337711432.92 ( -0.32%)
Hmean unixbench-spawn-1 4261.52 ( 0.00%) 4164.90 ( -2.27%)
Hmean unixbench-spawn-512 64328.93 ( 0.00%) 62257.64 * -3.22%*
Hmean unixbench-execl-1 3677.73 ( 0.00%) 3652.08 ( -0.70%)
Hmean unixbench-execl-512 11984.83 ( 0.00%) 13585.65 * 13.36%*

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1

tip: 131070.33 (var: 2.84%)
peter-next-level: 131070.33 (var: 2.84%) (0.00%)

~~~~~~~~~~~~~~~~~~~~~~~
~ SPECjbb - Multi-JVM ~
~~~~~~~~~~~~~~~~~~~~~~~

o NPS1

- Default Scheduler Tunables

kernel max-jOPS critical-jOPS
tip 100.00% 100.00%
peter-next-level 94.45% (-5.55%) 98.25% (-1.75%)

- Modified Scheduler Tunables

kernel max-jOPS critical-jOPS
tip 100.00% 100.00%
peter-next-level 100.00% (0.00%) 102.41% (2.41%)

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

Pinning Scaling tip peter-next-level
1 CCD 1 100.00% 100.30% (%diff: 0.30%)
2 CCD 2 100.00% 100.17% (%diff: 0.17%)
4 CCD 4 100.00% 99.60% (%diff: -0.40%)
8 CCD 8 100.00% 92.05% (%diff: -7.95%) *

---

Based on the above data, the results seem to be mostly positive for
the microbenchmarks but not so much for SPECjbb and DeathStarBench,
which have high utilization. There is also a band of outliers when the
system is fully loaded or overloaded (~2 tasks per rq) for some of
the microbenchmarks.

I wonder if extending SIS_UTIL for SIS_NODE would help some of these
cases but I've not tried tinkering with it yet. I'll continue
testing on other NPS modes which would decrease the search scope.
I'll also try running the same bunch of workloads on an even larger
4th Generation EPYC server to see if the behavior there is similar.

Let me know if you need any data from my test system for any
specific workload. I'll be more than happy to get it for you :)

>
> Reported-by: Tejun Heo <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
> kernel/sched/features.h | 1 +
> 2 files changed, 39 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0c..0172458 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> }
>
> /*
> + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> + * local LLC comes up empty.
> + */
> +static int
> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + struct sched_domain *parent = sd->parent;
> + struct sched_group *sg;
> +
> + /* Make sure to not cross nodes. */
> + if (!parent || parent->flags & SD_NUMA)
> + return -1;
> +
> + sg = parent->groups;
> + do {
> + int cpu = cpumask_first(sched_group_span(sg));
> + struct sched_domain *sd_child;
> +
> + sd_child = per_cpu(sd_llc, cpu);
> + if (sd_child != sd) {
> + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
> +
> + sg = sg->next;
> + } while (sg != parent->groups);
> +
> + return -1;
> +}
> +
> +/*
> * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> * the task fits. If no CPU is big enough, but there are idle ones, try to
> * maximize capacity.
> @@ -7199,6 +7231,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> + if (sched_feat(SIS_NODE)) {
> + i = select_idle_node(p, sd, target);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
> +
> return target;
> }
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c..9e390eb 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> */
> SCHED_FEAT(SIS_PROP, false)
> SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_NODE, true)
>
> /*
> * Issue a WARN when we do multiple update_rq_clock() calls

--
Thanks and Regards,
Prateek

2023-06-01 11:55:42

by Peter Zijlstra

Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

On Thu, Jun 01, 2023 at 03:03:39PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> Sharing some initial benchmark results with the patch below.
>
> tl;dr
>
> - Hackbench starts off well but performance drops as the number of groups
> increases.
>
> - schbench (old), tbench, netperf see improvement but there is a band of
> outlier results when system is fully loaded or slightly overloaded.
>
> - Stream and ycsb-mongodb don't mind the extra search.
>
> - SPECjbb (with default scheduler tunables) and DeathStarBench are not
> very happy.

Figures :/ Every time something like this is changed someone gets to be
sad..

> Tests were run on a dual socket 3rd Generation EPYC server(2 x64C/128T)
> running in NPS1 mode. Following it the simplified machine topology:

Right, Zen3 8 cores / LLC, 64 cores total give 8 LLC per node.

> ~~~~~~~~~~~~~~~~~~~~~~~
> ~ SPECjbb - Multi-JVM ~
> ~~~~~~~~~~~~~~~~~~~~~~~
>
> o NPS1
>
> - Default Scheduler Tunables
>
> kernel max-jOPS critical-jOPS
> tip 100.00% 100.00%
> peter-next-level 94.45% (-5.55%) 98.25% (-1.75%)
>
> - Modified Scheduler Tunables
>
> kernel max-jOPS critical-jOPS
> tip 100.00% 100.00%
> peter-next-level 100.00% (0.00%) 102.41% (2.41%)

I'm slightly confused; either the default or the tuned one is better. Given
it's counting ops, I'm thinking higher is better, so isn't this an
improvement in the tuned case?

> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
>
> Pinning Scaling tip peter-next-level
> 1 CCD 1 100.00% 100.30% (%diff: 0.30%)
> 2 CCD 2 100.00% 100.17% (%diff: 0.17%)
> 4 CCD 4 100.00% 99.60% (%diff: -0.40%)
> 8 CCD 8 100.00% 92.05% (%diff: -7.95%) *

Right, so that's a definite loss.

> I wonder if extending SIS_UTIL for SIS_NODE would help some of these
> cases but I've not tried tinkering with it yet. I'll continue
> testing on other NPS modes which would decrease the search scope.
> I'll also try running the same bunch of workloads on an even larger
> 4th Generation EPYC server to see if the behavior there is similar.

> > /*
> > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > + * local LLC comes up empty.
> > + */
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > + struct sched_domain *parent = sd->parent;
> > + struct sched_group *sg;
> > +
> > + /* Make sure to not cross nodes. */
> > + if (!parent || parent->flags & SD_NUMA)
> > + return -1;
> > +
> > + sg = parent->groups;
> > + do {
> > + int cpu = cpumask_first(sched_group_span(sg));
> > + struct sched_domain *sd_child;
> > +
> > + sd_child = per_cpu(sd_llc, cpu);
> > + if (sd_child != sd) {
> > + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);

Given how SIS_UTIL is inside select_idle_cpu() it should already be
effective here, no?
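
(For reference, the SIS_UTIL gate at the top of select_idle_cpu() is
roughly the below -- paraphrased, not a verbatim quote -- so an
overloaded remote LLC should already bail out early:)

        if (sched_feat(SIS_UTIL)) {
                sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
                if (sd_share) {
                        /* because !--nr is the condition to stop scan */
                        nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
                        /* overloaded LLC is unlikely to have idle cpu/core */
                        if (nr == 1)
                                return -1;
                }
        }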

> > + if ((unsigned)i < nr_cpumask_bits)
> > + return i;
> > + }
> > +
> > + sg = sg->next;
> > + } while (sg != parent->groups);
> > +
> > + return -1;
> > +}

This DeathStarBench thing seems to suggest that scanning up to 4 CCDs
isn't too much of a bother; so perhaps something like so?

(on top of tip/sched/core from just a few hours ago, as I had to 'fix'
this patch and force pushed the thing)

And yeah, random hacks and heuristics here :/ Does there happen to be
additional topology that could aid us here? Does the CCD fabric itself
have a distance metric we can use?

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22e0a249e0a8..f1d6ed973410 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7036,6 +7036,7 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
{
struct sched_domain *parent = sd->parent;
struct sched_group *sg;
+ int nr = 4;

/* Make sure to not cross nodes. */
if (!parent || parent->flags & SD_NUMA)
@@ -7050,6 +7051,9 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
test_idle_cores(cpu), cpu);
if ((unsigned)i < nr_cpumask_bits)
return i;
+
+ if (!--nr)
+ return -1;
}

sg = sg->next;

2023-06-01 12:11:29

by Peter Zijlstra

Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

On Thu, Jun 01, 2023 at 01:13:26PM +0200, Peter Zijlstra wrote:
>
> This DeathStarBench thing seems to suggest that scanning up to 4 CCDs
> isn't too much of a bother; so perhaps something like so?
>
> (on top of tip/sched/core from just a few hours ago, as I had to 'fix'
> this patch and force pushed the thing)
>
> And yeah, random hacks and heuristics here :/ Does there happen to be
> additional topology that could aid us here? Does the CCD fabric itself
> have a distance metric we can use?

https://www.anandtech.com/show/16529/amd-epyc-milan-review/4

Specifically:

https://images.anandtech.com/doci/16529/Bounce-7763.png

That seems to suggest there are some very minor distance effects in the
CCD fabric. I didn't read the article too closely, but you'll note that
the first 4 CCDs have inter-CCD latency < 100 while the rest has > 100.

Could you also test on a Zen2 Epyc, does that require nr=8 instead of 4?
Should we perhaps write it like: 32 / llc_size ?
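
That is, something like the below, where llc_size is shorthand for the
number of cores per LLC rather than an existing kernel variable:

        /*
         * Sketch: scale the number of sibling LLCs scanned by LLC size;
         * Zen2 (4 cores/LLC) would scan up to 8, Zen3 (8 cores/LLC) up to 4.
         */
        int nr = max(1, 32 / llc_size);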

The Zen2 picture:

https://images.anandtech.com/doci/16315/Bounce-7742.png

Shows a more pronounced CCD fabric topology: you can really see the 2
CCX inside the CCD, but also there are two lighter green squares around
the CCDs themselves.




2023-06-01 16:47:31

by Yu Chen

Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

On Thu, Jun 1, 2023 at 8:11 PM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Jun 01, 2023 at 03:03:39PM +0530, K Prateek Nayak wrote:
[...]
> > I wonder if extending SIS_UTIL for SIS_NODE would help some of these
> > cases but I've not tried tinkering with it yet. I'll continue
> > testing on other NPS modes which would decrease the search scope.
> > I'll also try running the same bunch of workloads on an even larger
> > 4th Generation EPYC server to see if the behavior there is similar.
>
> > > /*
> > > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > > + * local LLC comes up empty.
> > > + */
> > > +static int
> > > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > > +{
> > > + struct sched_domain *parent = sd->parent;
> > > + struct sched_group *sg;
> > > +
> > > + /* Make sure to not cross nodes. */
> > > + if (!parent || parent->flags & SD_NUMA)
> > > + return -1;
> > > +
> > > + sg = parent->groups;
> > > + do {
> > > + int cpu = cpumask_first(sched_group_span(sg));
> > > + struct sched_domain *sd_child;
> > > +
> > > + sd_child = per_cpu(sd_llc, cpu);
> > > + if (sd_child != sd) {
> > > + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
>
> Given how SIS_UTIL is inside select_idle_cpu() it should already be
> effective here, no?
>
I'm thinking of this scenario: when the system is overloaded and
SIS_NODE is disabled, SIS_UTIL could scan, for example, 4 CPUs and
terminate, and the task wakes up on the local LLC. When SIS_NODE is
enabled, it could scan 4 * number_of_llc_domains CPUs; the more CPUs it
scans, the more likely it is to find an idle one. So this seems to be a
question of which type of wakee prefers a non-idle CPU in the local LLC
versus an idle CPU in a remote LLC, which likely depends on the working
set and task duration.
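
As a rough worked example on the Zen3 NPS1 box above (8 LLCs per node):
a SIS_UTIL budget of 4 CPUs per LLC becomes up to 8 * 4 = 32 CPUs
scanned per wakeup once every LLC in the node is busy, so the
per-wakeup cost grows with the number of sibling LLCs even though each
individual LLC scan stays bounded.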


thanks,
Chenyu

2023-06-02 03:26:50

by K Prateek Nayak

Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

Hello Peter,

Thank you for taking a look at the report.

On 6/1/2023 4:43 PM, Peter Zijlstra wrote:
> On Thu, Jun 01, 2023 at 03:03:39PM +0530, K Prateek Nayak wrote:
>> Hello Peter,
>>
>> Sharing some initial benchmark results with the patch below.
>>
>> tl;dr
>>
>> - Hackbench starts off well but performance drops as the number of groups
>> increases.
>>
>> - schbench (old), tbench, netperf see improvement but there is a band of
>> outlier results when system is fully loaded or slightly overloaded.
>>
>> - Stream and ycsb-mongodb don't mind the extra search.
>>
>> - SPECjbb (with default scheduler tunables) and DeathStarBench are not
>> very happy.
>
> Figures :/ Every time something like this is changed someone gets to be
> sad..
>
>> Tests were run on a dual socket 3rd Generation EPYC server(2 x64C/128T)
>> running in NPS1 mode. Following it the simplified machine topology:
>
> Right, Zen3 8 cores / LLC, 64 cores total give 8 LLC per node.

Yes, correct!

>
>> ~~~~~~~~~~~~~~~~~~~~~~~
>> ~ SPECjbb - Multi-JVM ~
>> ~~~~~~~~~~~~~~~~~~~~~~~
>>
>> o NPS1
>>
>> - Default Scheduler Tunables
>>
>> kernel max-jOPS critical-jOPS
>> tip 100.00% 100.00%
>> peter-next-level 94.45% (-5.55%) 98.25% (-1.75%)
>>
>> - Modified Scheduler Tunables
>>
>> kernel max-jOPS critical-jOPS
>> tip 100.00% 100.00%
>> peter-next-level 100.00% (0.00%) 102.41% (2.41%)
>
> I'm slightly confused, either the default or the tuned is better. Given
> it's counting ops, I'm thinking higher is more better, so isn't this an
> improvement in the tuned case?

Default is bad. I believe migrating across the LLC boundary is not that
great from a cache-efficiency perspective here. Setting the tunables
makes tasks run for longer, and in that case, being able to find an
idle CPU seems to be more beneficial.

>
>> ~~~~~~~~~~~~~~~~~~
>> ~ DeathStarBench ~
>> ~~~~~~~~~~~~~~~~~~
>>
>> Pinning Scaling tip peter-next-level
>> 1 CCD 1 100.00% 100.30% (%diff: 0.30%)
>> 2 CCD 2 100.00% 100.17% (%diff: 0.17%)
>> 4 CCD 4 100.00% 99.60% (%diff: -0.40%)
>> 8 CCD 8 100.00% 92.05% (%diff: -7.95%) *
>
> Right, so that's a definite loss.
>
>> I wonder if extending SIS_UTIL for SIS_NODE would help some of these
>> cases but I've not tried tinkering with it yet. I'll continue
>> testing on other NPS modes which would decrease the search scope.
>> I'll also try running the same bunch of workloads on an even larger
>> 4th Generation EPYC server to see if the behavior there is similar.
>
>>> /*
>>> + * For the multiple-LLC per node case, make sure to try the other LLC's if the
>>> + * local LLC comes up empty.
>>> + */
>>> +static int
>>> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
>>> +{
>>> + struct sched_domain *parent = sd->parent;
>>> + struct sched_group *sg;
>>> +
>>> + /* Make sure to not cross nodes. */
>>> + if (!parent || parent->flags & SD_NUMA)
>>> + return -1;
>>> +
>>> + sg = parent->groups;
>>> + do {
>>> + int cpu = cpumask_first(sched_group_span(sg));
>>> + struct sched_domain *sd_child;
>>> +
>>> + sd_child = per_cpu(sd_llc, cpu);
>>> + if (sd_child != sd) {
>>> + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
>
> Given how SIS_UTIL is inside select_idle_cpu() it should already be
> effective here, no?

True, but if the entire higher domain is busy, iterating over the groups
itself adds to the scheduling latency only to fall back to the target.
Wondering if keeping track of the largest "sd_llc_shared->nr_idle_scan"
and the corresponding group, and starting the search from there, makes sense.
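
A rough sketch of that idea (untested, just to illustrate the ordering;
the pre-pass over the groups is written out explicitly):

        /*
         * Pick the sibling LLC whose SIS_UTIL budget (nr_idle_scan) is
         * currently the largest and try it first, on the theory that it
         * is the least busy and the most likely to still have an idle CPU.
         */
        struct sched_group *best = NULL;
        int best_nr = 0;

        sg = parent->groups;
        do {
                int cpu = cpumask_first(sched_group_span(sg));
                struct sched_domain_shared *sd_share =
                        rcu_dereference(per_cpu(sd_llc_shared, cpu));

                if (sd_share && !cpus_share_cache(cpu, target)) {
                        int nr = READ_ONCE(sd_share->nr_idle_scan);

                        if (nr > best_nr) {
                                best_nr = nr;
                                best = sg;
                        }
                }
                sg = sg->next;
        } while (sg != parent->groups);

        if (best) {
                int cpu = cpumask_first(sched_group_span(best));
                int i = select_idle_cpu(p, per_cpu(sd_llc, cpu),
                                        test_idle_cores(cpu), cpu);

                if ((unsigned)i < nr_cpumask_bits)
                        return i;
        }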

>
>>> + if ((unsigned)i < nr_cpumask_bits)
>>> + return i;
>>> + }
>>> +
>>> + sg = sg->next;
>>> + } while (sg != parent->groups);
>>> +
>>> + return -1;
>>> +}
>
> This DeathStarBench thing seems to suggest that scanning up to 4 CCDs
> isn't too much of a bother; so perhaps something like so?
>
> (on top of tip/sched/core from just a few hours ago, as I had to 'fix'
> this patch and force pushed the thing)
>
> And yeah, random hacks and heuristics here :/ Does there happen to be
> additional topology that could aid us here? Does the CCD fabric itself
> have a distance metric we can use?
>

We do not have a CCD-to-CCD distance metric unfortunately :(
However, NPS modes should mitigate some of the problems of the larger
search space (while adding additional NUMA complexity), which I still
need to test.

> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 22e0a249e0a8..f1d6ed973410 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7036,6 +7036,7 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> {
> struct sched_domain *parent = sd->parent;
> struct sched_group *sg;
> + int nr = 4;
>
> /* Make sure to not cross nodes. */
> if (!parent || parent->flags & SD_NUMA)
> @@ -7050,6 +7051,9 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> test_idle_cores(cpu), cpu);
> if ((unsigned)i < nr_cpumask_bits)
> return i;
> +
> + if (!--nr)
> + return -1;
> }
>
> sg = sg->next;

--
Thanks and Regards,
Prateek

2023-06-05 15:29:54

by Marek Szyprowski

Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

On 31.05.2023 14:04, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> Gitweb: https://git.kernel.org/tip/c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> Author: Peter Zijlstra <[email protected]>
> AuthorDate: Tue, 30 May 2023 13:20:46 +02:00
> Committer: Peter Zijlstra <[email protected]>
> CommitterDate: Tue, 30 May 2023 22:46:27 +02:00
>
> sched/fair: Multi-LLC select_idle_sibling()
>
> Tejun reported that when he targets workqueues towards a specific LLC
> on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
> significant idle time.
>
> This is, of course, because of how select_idle_sibling() will not
> consider anything outside of the local LLC, and since all these tasks
> are short running the periodic idle load balancer is ineffective.
>
> And while it is good to keep work cache local, it is better to not
> have significant idle time. Therefore, have select_idle_sibling() try
> other LLCs inside the same node when the local one comes up empty.
>
> Reported-by: Tejun Heo <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

This patch landed in today's linux next-20230605 as commit c5214e13ad60
("sched/fair: Multi-LLC select_idle_sibling()"). Unfortunately it causes
a regression on my ARM 64-bit Exynos5433-based TM2e test board during the
CPU hotplug tests. From time to time I get a NULL pointer dereference.
Reverting $subject on top of linux-next fixes the issue. Let me know if
I can help somehow with debugging this issue. Here is a complete log (I've
intentionally kept all the stack dumps, although they don't look very
relevant...):

# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; done
Unable to handle kernel NULL pointer dereference at virtual address
0000000000000090

======================================================
WARNING: possible circular locking dependency detected
6.4.0-rc1+ #13640 Not tainted
------------------------------------------------------
cpuhp/6/43 is trying to acquire lock:
ffff80000ab65598 (console_owner){..-.}-{0:0}, at:
console_flush_all+0x1ac/0x4fc

but task is already holding lock:
ffff00002836ed48 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x58/0x46c

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (&p->pi_lock){-.-.}-{2:2}:
       _raw_spin_lock_irqsave+0x60/0x88
       try_to_wake_up+0x58/0x46c
       default_wake_function+0x14/0x20
       autoremove_wake_function+0x18/0x44
       __wake_up_common+0x94/0x170
       __wake_up_common_lock+0x7c/0xcc
       __wake_up+0x18/0x24
       tty_wakeup+0x34/0x70
       tty_port_default_wakeup+0x20/0x38
       tty_port_tty_wakeup+0x18/0x24
       uart_write_wakeup+0x18/0x28
       s3c24xx_serial_tx_chars+0x20c/0x218
       s3c64xx_serial_handle_irq+0x9c/0xe0
       __handle_irq_event_percpu+0xb0/0x2d4
       handle_irq_event+0x4c/0xb8
       handle_fasteoi_irq+0xa4/0x198
       generic_handle_domain_irq+0x2c/0x44
       gic_handle_irq+0x44/0xc4
       call_on_irq_stack+0x24/0x4c
       do_interrupt_handler+0x80/0x84
       el1_interrupt+0x34/0x64
       el1h_64_irq_handler+0x18/0x24
       el1h_64_irq+0x64/0x68
       default_idle_call+0x9c/0x150
       do_idle+0x230/0x294
       cpu_startup_entry+0x28/0x2c
       rest_init+0x100/0x190
       arch_post_acpi_subsys_init+0x0/0x8
       start_kernel+0x594/0x684
       __primary_switched+0xbc/0xc4

-> #2 (&tty->write_wait){-.-.}-{2:2}:
       _raw_spin_lock_irqsave+0x60/0x88
       __wake_up_common_lock+0x5c/0xcc
       __wake_up+0x18/0x24
       tty_wakeup+0x34/0x70
       tty_port_default_wakeup+0x20/0x38
       tty_port_tty_wakeup+0x18/0x24
       uart_write_wakeup+0x18/0x28
       s3c24xx_serial_tx_chars+0x20c/0x218
       s3c64xx_serial_handle_irq+0x9c/0xe0
       __handle_irq_event_percpu+0xb0/0x2d4
       handle_irq_event+0x4c/0xb8
       handle_fasteoi_irq+0xa4/0x198
       generic_handle_domain_irq+0x2c/0x44
       gic_handle_irq+0x44/0xc4
       call_on_irq_stack+0x24/0x4c
       do_interrupt_handler+0x80/0x84
       el1_interrupt+0x34/0x64
       el1h_64_irq_handler+0x18/0x24
       el1h_64_irq+0x64/0x68
       default_idle_call+0x9c/0x150
       do_idle+0x230/0x294
       cpu_startup_entry+0x28/0x2c
       rest_init+0x100/0x190
       arch_post_acpi_subsys_init+0x0/0x8
       start_kernel+0x594/0x684
       __primary_switched+0xbc/0xc4

-> #1 (&port_lock_key){-.-.}-{2:2}:
       _raw_spin_lock_irqsave+0x60/0x88
       s3c24xx_serial_console_write+0xfc/0x124
       console_flush_all+0x208/0x4fc
       console_unlock+0x5c/0x14c
       vprintk_emit+0x15c/0x3b0
       vprintk_default+0x38/0x44
       vprintk+0xc0/0xe4
       _printk+0x5c/0x84
       register_console+0x1f4/0x420
       uart_add_one_port+0x50c/0x53c
       s3c24xx_serial_probe+0x34c/0x72c
       platform_probe+0x68/0xd8
       really_probe+0x148/0x2b4
       __driver_probe_device+0x78/0x12c
       driver_probe_device+0xd8/0x160
       __driver_attach+0x9c/0x1ac
       bus_for_each_dev+0x74/0xd4
       driver_attach+0x24/0x30
       bus_add_driver+0xe4/0x1e8
       driver_register+0x60/0x128
       __platform_driver_register+0x28/0x34
       samsung_serial_init+0x30/0x8c
       do_one_initcall+0x74/0x2f0
       kernel_init_freeable+0x288/0x4d8
       kernel_init+0x24/0x1dc
       ret_from_fork+0x10/0x20

-> #0 (console_owner){..-.}-{0:0}:
       __lock_acquire+0x13d0/0x217c
       lock_acquire+0x1e8/0x310
       console_flush_all+0x1f4/0x4fc
       console_unlock+0x5c/0x14c
       vprintk_emit+0x15c/0x3b0
       vprintk_default+0x38/0x44
       vprintk+0xc0/0xe4
       _printk+0x5c/0x84
       die_kernel_fault+0x48/0x37c
       __do_kernel_fault+0xd8/0x19c
       do_page_fault+0xac/0x6d8
       do_translation_fault+0xac/0xb8
       do_mem_abort+0x44/0x94
       el1_abort+0x44/0x70
       el1h_64_sync_handler+0xd8/0xe4
       el1h_64_sync+0x64/0x68
       __bitmap_and+0x4c/0x78
       select_task_rq_fair+0x724/0x1a30
       try_to_wake_up+0x17c/0x46c
       wake_up_process+0x18/0x24
       complete+0x58/0x8c
       __kthread_parkme+0x74/0xc8
       kthread_parkme+0x20/0x44
       smpboot_thread_fn+0x118/0x2a0
       kthread+0x124/0x128
       ret_from_fork+0x10/0x20

other info that might help us debug this:

Chain exists of:
  console_owner --> &tty->write_wait --> &p->pi_lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&p->pi_lock);
                               lock(&tty->write_wait);
                               lock(&p->pi_lock);
  lock(console_owner);

 *** DEADLOCK ***

5 locks held by cpuhp/6/43:
 #0: ffff000023e68440 (&x->wait){....}-{2:2}, at: complete+0x24/0x8c
 #1: ffff00002836ed48 (&p->pi_lock){-.-.}-{2:2}, at:
try_to_wake_up+0x58/0x46c
 #2: ffff80000abd6ac0 (rcu_read_lock){....}-{1:2}, at:
select_task_rq_fair+0x114/0x1a30
 #3: ffff80000ab65390 (console_lock){+.+.}-{0:0}, at:
vprintk_default+0x38/0x44
 #4: ffff80000ab65440 (console_srcu){....}-{0:0}, at:
console_flush_all+0x7c/0x4fc

stack backtrace:
CPU: 6 PID: 43 Comm: cpuhp/6 Not tainted 6.4.0-rc1+ #13640
Hardware name: Samsung TM2E board (DT)
Call trace:
 dump_backtrace+0x98/0xf0
 show_stack+0x18/0x24
 dump_stack_lvl+0x60/0xac
 dump_stack+0x18/0x24
 print_circular_bug+0x26c/0x348
 check_noncircular+0x134/0x148
 __lock_acquire+0x13d0/0x217c
 lock_acquire+0x1e8/0x310
 console_flush_all+0x1f4/0x4fc
 console_unlock+0x5c/0x14c
 vprintk_emit+0x15c/0x3b0
 vprintk_default+0x38/0x44
 vprintk+0xc0/0xe4
 _printk+0x5c/0x84
 die_kernel_fault+0x48/0x37c
 __do_kernel_fault+0xd8/0x19c
 do_page_fault+0xac/0x6d8
 do_translation_fault+0xac/0xb8
 do_mem_abort+0x44/0x94
 el1_abort+0x44/0x70
 el1h_64_sync_handler+0xd8/0xe4
 el1h_64_sync+0x64/0x68
 __bitmap_and+0x4c/0x78
 select_task_rq_fair+0x724/0x1a30
 try_to_wake_up+0x17c/0x46c
 wake_up_process+0x18/0x24
 complete+0x58/0x8c
 __kthread_parkme+0x74/0xc8
 kthread_parkme+0x20/0x44
 smpboot_thread_fn+0x118/0x2a0
 kthread+0x124/0x128
 ret_from_fork+0x10/0x20
Mem abort info:
  ESR = 0x0000000096000006
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x06: level 2 translation fault
Data abort info:
  ISV = 0, ISS = 0x00000006
  CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=000000002783e000
[0000000000000090] pgd=080000002738f003, p4d=080000002738f003,
pud=0800000027a24003, pmd=0000000000000000
Internal error: Oops: 0000000096000006 [#1] PREEMPT SMP
Modules linked in: brcmfmac_wcc cpufreq_powersave cpufreq_conservative
brcmfmac brcmutil cfg80211 crct10dif_ce hci_uart btqca btbcm bluetooth
s5p_jpeg exynos_gsc s3fwrn5_i2c s3fwrn5 s5p_mfc nci v4l2_mem2mem
ecdh_generic nfc ecc videobuf2_dma_contig videobuf2_memops
videobuf2_v4l2 videodev rfkill panfrost videobuf2_common
drm_shmem_helper gpu_sched mc ip_tables x_tables ipv6

CPU: 6 PID: 43 Comm: cpuhp/6 Not tainted 6.4.0-rc1+ #13640
Hardware name: Samsung TM2E board (DT)
pstate: 000000c5 (nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __bitmap_and+0x4c/0x78
lr : select_idle_cpu+0x64/0x450
sp : ffff80000bd83b50
x29: ffff80000bd83b50 x28: ffff80000a152ad8 x27: ffff00002836e500
x26: ffff00002814f600 x25: ffff80000ab43e78 x24: 0000000000000000
x23: ffff80000ab3f000 x22: 0000000000000000 x21: ffff80000ab43e78
x20: 0000000000000000 x19: 0000000000000000 x18: ffff8000099ac098
x17: 0000000000000000 x16: 0000000000000067 x15: 0000000000000001
x14: 0000000000000000 x13: 00000000000000d8 x12: 0000000000000000
x11: 0000000000000001 x10: ffff80000b7c6e90 x9 : 0000000000000000
x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
x5 : 0000000000000020 x4 : 00000000000000ff x3 : 00000000fffffff8
x2 : ffff00002836e7e0 x1 : 0000000000000090 x0 : ffff0000d5fc2ad8
Call trace:
 __bitmap_and+0x4c/0x78
 select_task_rq_fair+0x724/0x1a30
 try_to_wake_up+0x17c/0x46c
 wake_up_process+0x18/0x24
 complete+0x58/0x8c
 __kthread_parkme+0x74/0xc8
 kthread_parkme+0x20/0x44
 smpboot_thread_fn+0x118/0x2a0
 kthread+0x124/0x128
 ret_from_fork+0x10/0x20
Code: 2a0803e8 4b0303e3 92800004 9ac32484 (f8687823)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x8c0004,1c780800,0000421b
Memory Limit: none
---[ end Kernel panic - not syncing: Oops: Fatal exception ]---


> ---
> kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
> kernel/sched/features.h | 1 +
> 2 files changed, 39 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0c..0172458 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> }
>
> /*
> + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> + * local LLC comes up empty.
> + */
> +static int
> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + struct sched_domain *parent = sd->parent;
> + struct sched_group *sg;
> +
> + /* Make sure to not cross nodes. */
> + if (!parent || parent->flags & SD_NUMA)
> + return -1;
> +
> + sg = parent->groups;
> + do {
> + int cpu = cpumask_first(sched_group_span(sg));
> + struct sched_domain *sd_child;
> +
> + sd_child = per_cpu(sd_llc, cpu);
> + if (sd_child != sd) {
> + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
> +
> + sg = sg->next;
> + } while (sg != parent->groups);
> +
> + return -1;
> +}
> +
> +/*
> * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> * the task fits. If no CPU is big enough, but there are idle ones, try to
> * maximize capacity.
> @@ -7199,6 +7231,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> + if (sched_feat(SIS_NODE)) {
> + i = select_idle_node(p, sd, target);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
> +
> return target;
> }
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c..9e390eb 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> */
> SCHED_FEAT(SIS_PROP, false)
> SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_NODE, true)
>
> /*
> * Issue a WARN when we do multiple update_rq_clock() calls

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland


2023-06-05 18:11:27

by Peter Zijlstra

Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

On Mon, Jun 05, 2023 at 05:25:30PM +0200, Marek Szyprowski wrote:
> On 31.05.2023 14:04, tip-bot2 for Peter Zijlstra wrote:
> > The following commit has been merged into the sched/core branch of tip:
> >
> > Commit-ID: c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> > Gitweb: https://git.kernel.org/tip/c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> > Author: Peter Zijlstra <[email protected]>
> > AuthorDate: Tue, 30 May 2023 13:20:46 +02:00
> > Committer: Peter Zijlstra <[email protected]>
> > CommitterDate: Tue, 30 May 2023 22:46:27 +02:00
> >
> > sched/fair: Multi-LLC select_idle_sibling()
> >
> > Tejun reported that when he targets workqueues towards a specific LLC
> > on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
> > significant idle time.
> >
> > This is, of course, because of how select_idle_sibling() will not
> > consider anything outside of the local LLC, and since all these tasks
> > are short running the periodic idle load balancer is ineffective.
> >
> > And while it is good to keep work cache local, it is better to not
> > have significant idle time. Therefore, have select_idle_sibling() try
> > other LLCs inside the same node when the local one comes up empty.
> >
> > Reported-by: Tejun Heo <[email protected]>
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>
> This patch landed in today's linux next-20230605 as commit c5214e13ad60
> ("sched/fair: Multi-LLC select_idle_sibling()"). Unfortunately it causes
> regression on my ARM 64bit Exynos5433-based TM2e test board during the
> CPU hotplug tests. From time to time I get the NULL pointer dereference.
> Reverting $subject on top of linux-next fixes the issue. Let me know if
> I can help somehow debugging this issue. Here is a complete log (I've
> intentionally kept all the stack dumps, although they don't look very
> relevant...):

Moo... OK, since our friends from AMD need some tuning on this anyway,
I'm going to pull the patch entirely. And we'll try again once they've
sorted out the best way to do this.



2023-06-05 19:17:01

by Peter Zijlstra

Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

On Mon, Jun 05, 2023 at 05:25:30PM +0200, Marek Szyprowski wrote:

> Unfortunately it causes
> regression on my ARM 64bit Exynos5433-based TM2e test board during the
> CPU hotplug tests.

Can you elucidate an ARM illiterate on the actual topology of that
machine?


> CPU: 6 PID: 43 Comm: cpuhp/6 Not tainted 6.4.0-rc1+ #13640
> Hardware name: Samsung TM2E board (DT)
> pstate: 000000c5 (nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : __bitmap_and+0x4c/0x78
> lr : select_idle_cpu+0x64/0x450

Btw, where is lr at? Is that perhaps per_cpu(sd_llc) being NULL or
something?


> > ---
> > kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
> > kernel/sched/features.h | 1 +
> > 2 files changed, 39 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 48b6f0c..0172458 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > }
> >
> > /*
> > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > + * local LLC comes up empty.
> > + */
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > + struct sched_domain *parent = sd->parent;
> > + struct sched_group *sg;
> > +
> > + /* Make sure to not cross nodes. */
> > + if (!parent || parent->flags & SD_NUMA)
> > + return -1;
> > +
> > + sg = parent->groups;
> > + do {
> > + int cpu = cpumask_first(sched_group_span(sg));
> > + struct sched_domain *sd_child;
> > +
> > + sd_child = per_cpu(sd_llc, cpu);

IOW, sd_child end up NULL?

> > + if (sd_child != sd) {
> > + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> > + if ((unsigned)i < nr_cpumask_bits)
> > + return i;
> > + }
> > +
> > + sg = sg->next;
> > + } while (sg != parent->groups);
> > +
> > + return -1;
> > +}

2023-06-05 22:36:10

by Marek Szyprowski

Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

On 05.06.2023 21:07, Peter Zijlstra wrote:
> On Mon, Jun 05, 2023 at 05:25:30PM +0200, Marek Szyprowski wrote:
>
>> Unfortunately it causes
>> regression on my ARM 64bit Exynos5433-based TM2e test board during the
>> CPU hotplug tests.
> Can you elucidate an ARM illiterate on the actual topology of that
> machine?

Please check arch/arm64/boot/dts/exynos/exynos5433.dtsi. This is a typical
ARM big.LITTLE machine with 4 'big' (Cortex-A57 in this case) cores in
one cluster and another 4 'LITTLE' (Cortex-A53) cores in the other.


>> CPU: 6 PID: 43 Comm: cpuhp/6 Not tainted 6.4.0-rc1+ #13640
>> Hardware name: Samsung TM2E board (DT)
>> pstate: 000000c5 (nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> pc : __bitmap_and+0x4c/0x78
>> lr : select_idle_cpu+0x64/0x450
> Btw, where is lr at? Is that perhaps per_cpu(sd_llc) being NULL or
> something?

If I get it right:

# aarch64-linux-gnu-objdump -Sld --start-address=0xffff8000080e7064 vmlinux

ffff8000080e7064 <select_idle_cpu>:
...
select_idle_cpu():
kernel/sched/fair.c:6987
                sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
ffff8000080e70c8:       f8747b21        ldr     x1, [x25, x20, lsl #3]
ffff8000080e70cc:       f0010340        adrp    x0, ffff80000a152000
<kvm_hyp_ctxt+0x7a0>
ffff8000080e70d0:       91302000        add     x0, x0, #0xc08
ffff8000080e70d4:       f90047e0        str     x0, [sp, #136]
ffff8000080e70d8:       f8616814        ldr     x20, [x0, x1]
ffff8000080e70dc:       9442c570        bl      ffff80000919869c
<debug_lockdep_rcu_enabled>
ffff8000080e70e0:       350017a0        cbnz    w0, ffff8000080e73d4
<select_idle_cpu+0x370>

This kvm_hyp_ctxt smells a little bad here, because this board boots
directly to EL1, so no hyp/kvm is used. This is the relevant dmesg part:

--->8---
smp: Brought up 1 node, 8 CPUs
SMP: Total of 8 processors activated.
CPU features: detected: 32-bit EL0 Support
CPU features: detected: 32-bit EL1 Support
CPU features: detected: CRC32 instructions
CPU: All CPU(s) started at EL1

--->8---


>>> ---
>>> kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
>>> kernel/sched/features.h | 1 +
>>> 2 files changed, 39 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 48b6f0c..0172458 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>> }
>>>
>>> /*
>>> + * For the multiple-LLC per node case, make sure to try the other LLC's if the
>>> + * local LLC comes up empty.
>>> + */
>>> +static int
>>> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
>>> +{
>>> + struct sched_domain *parent = sd->parent;
>>> + struct sched_group *sg;
>>> +
>>> + /* Make sure to not cross nodes. */
>>> + if (!parent || parent->flags & SD_NUMA)
>>> + return -1;
>>> +
>>> + sg = parent->groups;
>>> + do {
>>> + int cpu = cpumask_first(sched_group_span(sg));
>>> + struct sched_domain *sd_child;
>>> +
>>> + sd_child = per_cpu(sd_llc, cpu);
> IOW, sd_child end up NULL?
>
>>> + if (sd_child != sd) {
>>> + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
>>> + if ((unsigned)i < nr_cpumask_bits)
>>> + return i;
>>> + }
>>> +
>>> + sg = sg->next;
>>> + } while (sg != parent->groups);
>>> +
>>> + return -1;
>>> +}

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland


2023-06-06 08:59:29

by Chen Yu

Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

On 2023-06-05 at 21:07:46 +0200, Peter Zijlstra wrote:
> On Mon, Jun 05, 2023 at 05:25:30PM +0200, Marek Szyprowski wrote:
>
> > Unfortunately it causes
> > regression on my ARM 64bit Exynos5433-based TM2e test board during the
> > CPU hotplug tests.
>
> Can you elucidate an ARM illiterate on the actual topology of that
> machine?
>
>
> > CPU: 6 PID: 43 Comm: cpuhp/6 Not tainted 6.4.0-rc1+ #13640
> > Hardware name: Samsung TM2E board (DT)
> > pstate: 000000c5 (nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > pc : __bitmap_and+0x4c/0x78
> > lr : select_idle_cpu+0x64/0x450
>
> Btw, where is lr at? Is that perhaps per_cpu(sd_llc) being NULL or
> something?
>
>
> > > ---
> > > kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
> > > kernel/sched/features.h | 1 +
> > > 2 files changed, 39 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 48b6f0c..0172458 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > > }
> > >
> > > /*
> > > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > > + * local LLC comes up empty.
> > > + */
> > > +static int
> > > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > > +{
> > > + struct sched_domain *parent = sd->parent;
> > > + struct sched_group *sg;
> > > +
> > > + /* Make sure to not cross nodes. */
> > > + if (!parent || parent->flags & SD_NUMA)
> > > + return -1;
> > > +
> > > + sg = parent->groups;
> > > + do {
> > > + int cpu = cpumask_first(sched_group_span(sg));
> > > + struct sched_domain *sd_child;
> > > +
> > > + sd_child = per_cpu(sd_llc, cpu);
>
> IOW, sd_child end up NULL?
>
Just wondering if we can use rcu_dereference(per_cpu(sd_llc, cpu)),
since CPU offline/online could modify the sd_llc pointer. (We use
rcu_dereference() for sd_llc in other places.)
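
A sketch of what that might look like in select_idle_node(), folding in
the earlier cpus_share_cache() suggestion and assuming a NULL sd_llc
during hotplug is the failure mode seen above:

        sd_child = rcu_dereference(per_cpu(sd_llc, cpu));
        if (sd_child && !cpus_share_cache(cpu, target)) {
                int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);

                if ((unsigned)i < nr_cpumask_bits)
                        return i;
        }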

thanks,
Chenyu