The following commit has been merged into the sched/core branch of tip:
Commit-ID: c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
Gitweb: https://git.kernel.org/tip/c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
Author: Peter Zijlstra <[email protected]>
AuthorDate: Tue, 30 May 2023 13:20:46 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 30 May 2023 22:46:27 +02:00
sched/fair: Multi-LLC select_idle_sibling()
Tejun reported that when he targets workqueues towards a specific LLC
on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
significant idle time.
This is, of course, because of how select_idle_sibling() will not
consider anything outside of the local LLC, and since all these tasks
are short running the periodic idle load balancer is ineffective.
And while it is good to keep work cache local, it is better to not
have significant idle time. Therefore, have select_idle_sibling() try
other LLCs inside the same node when the local one comes up empty.
Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
2 files changed, 39 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0c..0172458 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
}
/*
+ * For the multiple-LLC per node case, make sure to try the other LLC's if the
+ * local LLC comes up empty.
+ */
+static int
+select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ struct sched_domain *parent = sd->parent;
+ struct sched_group *sg;
+
+ /* Make sure to not cross nodes. */
+ if (!parent || parent->flags & SD_NUMA)
+ return -1;
+
+ sg = parent->groups;
+ do {
+ int cpu = cpumask_first(sched_group_span(sg));
+ struct sched_domain *sd_child;
+
+ sd_child = per_cpu(sd_llc, cpu);
+ if (sd_child != sd) {
+ int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
+
+ sg = sg->next;
+ } while (sg != parent->groups);
+
+ return -1;
+}
+
+/*
* Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
* the task fits. If no CPU is big enough, but there are idle ones, try to
* maximize capacity.
@@ -7199,6 +7231,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;
+ if (sched_feat(SIS_NODE)) {
+ i = select_idle_node(p, sd, target);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
+
return target;
}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c..9e390eb 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_NODE, true)
/*
* Issue a WARN when we do multiple update_rq_clock() calls
On 5/31/23 8:04 PM, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> Gitweb: https://git.kernel.org/tip/c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> Author: Peter Zijlstra <[email protected]>
> AuthorDate: Tue, 30 May 2023 13:20:46 +02:00
> Committer: Peter Zijlstra <[email protected]>
> CommitterDate: Tue, 30 May 2023 22:46:27 +02:00
>
> sched/fair: Multi-LLC select_idle_sibling()
>
> Tejun reported that when he targets workqueues towards a specific LLC
> on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
> significant idle time.
>
> This is, of course, because of how select_idle_sibling() will not
> consider anything outside of the local LLC, and since all these tasks
> are short running the periodic idle load balancer is ineffective.
>
> And while it is good to keep work cache local, it is better to not
> have significant idle time. Therefore, have select_idle_sibling() try
> other LLCs inside the same node when the local one comes up empty.
>
> Reported-by: Tejun Heo <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
> kernel/sched/features.h | 1 +
> 2 files changed, 39 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0c..0172458 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> }
>
> /*
> + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> + * local LLC comes up empty.
> + */
> +static int
> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + struct sched_domain *parent = sd->parent;
> + struct sched_group *sg;
> +
> + /* Make sure to not cross nodes. */
> + if (!parent || parent->flags & SD_NUMA)
> + return -1;
> +
> + sg = parent->groups;
> + do {
> + int cpu = cpumask_first(sched_group_span(sg));
> + struct sched_domain *sd_child;
> +
> + sd_child = per_cpu(sd_llc, cpu);
> + if (sd_child != sd) {
Since sd_llc is cpu private, I think it should be:
if (!cpus_share_cache(cpu, target))
> + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
> +
> + sg = sg->next;
> + } while (sg != parent->groups);
> +
> + return -1;
> +}
> +
> +/*
> * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> * the task fits. If no CPU is big enough, but there are idle ones, try to
> * maximize capacity.
> @@ -7199,6 +7231,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> + if (sched_feat(SIS_NODE)) {
> + i = select_idle_node(p, sd, target);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
> +
> return target;
> }
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c..9e390eb 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> */
> SCHED_FEAT(SIS_PROP, false)
> SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_NODE, true)
>
> /*
> * Issue a WARN when we do multiple update_rq_clock() calls
On Thu, Jun 01, 2023 at 11:41:14AM +0800, Abel Wu wrote:
> On 5/31/23 8:04 PM, tip-bot2 for Peter Zijlstra wrote:
> > The following commit has been merged into the sched/core branch of tip:
> >
> > Commit-ID: c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> > Gitweb: https://git.kernel.org/tip/c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> > Author: Peter Zijlstra <[email protected]>
> > AuthorDate: Tue, 30 May 2023 13:20:46 +02:00
> > Committer: Peter Zijlstra <[email protected]>
> > CommitterDate: Tue, 30 May 2023 22:46:27 +02:00
> >
> > sched/fair: Multi-LLC select_idle_sibling()
> >
> > Tejun reported that when he targets workqueues towards a specific LLC
> > on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
> > significant idle time.
> >
> > This is, of course, because of how select_idle_sibling() will not
> > consider anything outside of the local LLC, and since all these tasks
> > are short running the periodic idle load balancer is ineffective.
> >
> > And while it is good to keep work cache local, it is better to not
> > have significant idle time. Therefore, have select_idle_sibling() try
> > other LLCs inside the same node when the local one comes up empty.
> >
> > Reported-by: Tejun Heo <[email protected]>
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > ---
> > kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
> > kernel/sched/features.h | 1 +
> > 2 files changed, 39 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 48b6f0c..0172458 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > }
> > /*
> > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > + * local LLC comes up empty.
> > + */
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > + struct sched_domain *parent = sd->parent;
> > + struct sched_group *sg;
> > +
> > + /* Make sure to not cross nodes. */
> > + if (!parent || parent->flags & SD_NUMA)
> > + return -1;
> > +
> > + sg = parent->groups;
> > + do {
> > + int cpu = cpumask_first(sched_group_span(sg));
> > + struct sched_domain *sd_child;
> > +
> > + sd_child = per_cpu(sd_llc, cpu);
> > + if (sd_child != sd) {
>
> Since sd_llc is cpu private, I think it should be:
>
> if (!cpus_share_cache(cpu, target))
Hmm, yes.. either that or ensure sd is from the first cpu in its mask.
Let me go fix that.
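For illustration, an untested sketch of the two options being discussed (the
first is the cpus_share_cache() check, the second keeps the pointer comparison
but makes 'sd' come from the first CPU of the local LLC's span); the fix that
actually gets pushed may well look different:

	/* Option 1: use cpus_share_cache() instead of comparing sd pointers. */
	sg = parent->groups;
	do {
		int cpu = cpumask_first(sched_group_span(sg));

		if (!cpus_share_cache(cpu, target)) {
			struct sched_domain *sd_child = per_cpu(sd_llc, cpu);
			int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);

			if ((unsigned)i < nr_cpumask_bits)
				return i;
		}

		sg = sg->next;
	} while (sg != parent->groups);

	/*
	 * Option 2: keep 'sd_child != sd', but make sure 'sd' is the sd_llc
	 * instance of the first CPU in the local LLC's span, so both sides
	 * of the comparison come from the same per-CPU copy:
	 *
	 *	sd = per_cpu(sd_llc, cpumask_first(sched_domain_span(sd)));
	 */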
Thanks!
Hello Peter,
Sharing some initial benchmark results with the patch below.
tl;dr
- Hackbench starts off well but performance drops as the number of groups
increases.
- schbench (old), tbench, netperf see improvement but there is a band of
outlier results when system is fully loaded or slightly overloaded.
- Stream and ycsb-mongodb don't mind the extra search.
- SPECjbb (with default scheduler tunables) and DeathStarBench are not
very happy.
On 5/31/2023 5:34 PM, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> Gitweb: https://git.kernel.org/tip/c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> Author: Peter Zijlstra <[email protected]>
> AuthorDate: Tue, 30 May 2023 13:20:46 +02:00
> Committer: Peter Zijlstra <[email protected]>
> CommitterDate: Tue, 30 May 2023 22:46:27 +02:00
>
> sched/fair: Multi-LLC select_idle_sibling()
>
> Tejun reported that when he targets workqueues towards a specific LLC
> on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
> significant idle time.
>
> This is, of course, because of how select_idle_sibling() will not
> consider anything outside of the local LLC, and since all these tasks
> are short running the periodic idle load balancer is ineffective.
>
> And while it is good to keep work cache local, it is better to not
> have significant idle time. Therefore, have select_idle_sibling() try
> other LLCs inside the same node when the local one comes up empty.
Tests were run on a dual socket 3rd Generation EPYC server (2 x 64C/128T)
running in NPS1 mode. Following is the simplified machine topology:
NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.
DIE0: 0-63, 128-191
MC0: 0-7, 128-135
SMT0: 0,128
SMT1: 1,129
...
SMT7: 7,135
MC1: 8-15, 136-143
SMT8: 8,136
SMT9: 9,137
...
SMT15: 15,143
...
MC7: 56-63, 184-191
SMT56: 56,184
SMT57: 57,185
...
SMT63: 63,191
DIE1: 64-127, 192-255
MC8: 64-71, 192-199
SMT64: 64,192
SMT65: 65,193
...
SMT71: 71,199
MC9: 72-79, 200-207
SMT72: 72,200
SMT73: 73,201
...
SMT79: 79,207
...
MC15: 120-127, 248-255
SMT120: 120,248
SMT121: 121,249
...
SMT127: 127,255
Since the patch extends the idle CPU search to one domain above MC in
case of an unsuccessful search, for the above topology the DIE domain
becomes the wake domain, with potentially 128 CPUs to be searched.
Following are the benchmark results:
o Kernel Versions
- tip - tip:sched/core at commit e2a1f85bf9f5 ("sched/psi:
Avoid resetting the min update period when it is
unnecessary")
- peter-next-level - tip:sched/core + this patch
o Benchmark Results
Note: Benchmarks were run with boost enabled and C2 disabled to minimize
other external factors.
~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~
o NPS1
Test: tip peter-next-level
1-groups: 3.92 (0.00 pct) 4.05 (-3.31 pct)
2-groups: 4.58 (0.00 pct) 3.84 (16.15 pct)
4-groups: 4.99 (0.00 pct) 3.98 (20.24 pct)
8-groups: 5.67 (0.00 pct) 6.05 (-6.70 pct) * Overloaded
16-groups: 7.88 (0.00 pct) 10.56 (-34.01 pct) * Overloaded
~~~~~~~~~~~~~~~~~~
~ schbench (Old) ~
~~~~~~~~~~~~~~~~~~
o NPS1
#workers: tip peter-next-level
1: 26.00 (0.00 pct) 24.00 (7.69 pct)
2: 27.00 (0.00 pct) 24.00 (11.11 pct)
4: 31.00 (0.00 pct) 28.00 (9.67 pct)
8: 36.00 (0.00 pct) 33.00 (8.33 pct)
16: 49.00 (0.00 pct) 47.00 (4.08 pct)
32: 80.00 (0.00 pct) 81.00 (-1.25 pct)
64: 169.00 (0.00 pct) 169.00 (0.00 pct)
128: 343.00 (0.00 pct) 365.00 (-6.41 pct) * Fully Loaded
256: 42048.00 (0.00 pct) 35392.00 (15.82 pct)
512: 95104.00 (0.00 pct) 88704.00 (6.72 pct)
~~~~~~~~~~
~ tbench ~
~~~~~~~~~~
o NPS1
Clients: tip peter-next-level
1 452.49 (0.00 pct) 457.94 (1.20 pct)
2 862.44 (0.00 pct) 879.99 (2.03 pct)
4 1604.27 (0.00 pct) 1618.87 (0.91 pct)
8 2966.77 (0.00 pct) 3040.90 (2.49 pct)
16 5176.70 (0.00 pct) 5292.29 (2.23 pct)
32 8205.24 (0.00 pct) 8949.12 (9.06 pct)
64 13956.71 (0.00 pct) 14461.42 (3.61 pct)
128 24005.50 (0.00 pct) 26052.75 (8.52 pct)
256 32457.61 (0.00 pct) 21999.41 (-32.22 pct) * Overloaded
512 34345.24 (0.00 pct) 41166.39 (19.86 pct)
1024 33432.92 (0.00 pct) 40900.84 (22.33 pct)
~~~~~~~~~~
~ stream ~
~~~~~~~~~~
o NPS1
- 10 Runs:
Test: tip peter-next-level
Copy: 271317.35 (0.00 pct) 292440.22 (7.78 pct)
Scale: 205533.77 (0.00 pct) 203362.60 (-1.05 pct)
Add: 221624.62 (0.00 pct) 225850.83 (1.90 pct)
Triad: 228500.68 (0.00 pct) 225885.25 (-1.14 pct)
- 100 Runs:
Test: tip peter-next-level
Copy: 317381.65 (0.00 pct) 318827.08 (0.45 pct)
Scale: 214145.00 (0.00 pct) 206213.69 (-3.70 pct)
Add: 239243.29 (0.00 pct) 229791.67 (-3.95 pct)
Triad: 249477.76 (0.00 pct) 236843.06 (-5.06 pct)
~~~~~~~~~~~~~~~~~~~~
~ netperf - TCP_RR ~
~~~~~~~~~~~~~~~~~~~~
o NPS1
Test: tip peter-next-level
1-clients: 102839.97 (0.00 pct) 103540.33 (0.68 pct)
2-clients: 98428.08 (0.00 pct) 100431.67 (2.03 pct)
4-clients: 92298.45 (0.00 pct) 94800.51 (2.71 pct)
8-clients: 85618.41 (0.00 pct) 89130.14 (4.10 pct)
16-clients: 78722.18 (0.00 pct) 79715.38 (1.26 pct)
32-clients: 73610.75 (0.00 pct) 72801.41 (-1.09 pct)
64-clients: 55285.07 (0.00 pct) 56184.38 (1.62 pct)
128-clients: 31176.92 (0.00 pct) 32830.06 (5.30 pct)
256-clients: 20011.44 (0.00 pct) 15135.39 (-24.36 pct) * Overloaded
~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~
o NPS1
tip peter-next-level
Hmean unixbench-dhry2reg-1 41322625.19 ( 0.00%) 41224388.33 ( -0.24%)
Hmean unixbench-dhry2reg-512 6252491108.60 ( 0.00%) 6240160851.68 ( -0.20%)
Amean unixbench-syscall-1 2501398.27 ( 0.00%) 2577323.43 * -3.04%*
Amean unixbench-syscall-512 8120524.00 ( 0.00%) 7512955.87 * 7.48%*
Hmean unixbench-pipe-1 2359346.02 ( 0.00%) 2392308.62 * 1.40%*
Hmean unixbench-pipe-512 338790322.61 ( 0.00%) 337711432.92 ( -0.32%)
Hmean unixbench-spawn-1 4261.52 ( 0.00%) 4164.90 ( -2.27%)
Hmean unixbench-spawn-512 64328.93 ( 0.00%) 62257.64 * -3.22%*
Hmean unixbench-execl-1 3677.73 ( 0.00%) 3652.08 ( -0.70%)
Hmean unixbench-execl-512 11984.83 ( 0.00%) 13585.65 * 13.36%*
~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~
o NPS1
tip: 131070.33 (var: 2.84%)
peter-next-level: 131070.33 (var: 2.84%) (0.00%)
~~~~~~~~~~~~~~~~~~~~~~~
~ SPECjbb - Multi-JVM ~
~~~~~~~~~~~~~~~~~~~~~~~
o NPS1
- Default Scheduler Tunables
kernel max-jOPS critical-jOPS
tip 100.00% 100.00%
peter-next-level 94.45% (-5.55%) 98.25% (-1.75%)
- Modified Scheduler Tunables
kernel max-jOPS critical-jOPS
tip 100.00% 100.00%
peter-next-level 100.00% (0.00%) 102.41% (2.41%)
~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~
Pinning Scaling tip peter-next-level
1 CCD 1 100.00% 100.30% (%diff: 0.30%)
2 CCD 2 100.00% 100.17% (%diff: 0.17%)
4 CCD 4 100.00% 99.60% (%diff: -0.40%)
8 CCD 8 100.00% 92.05% (%diff: -7.95%) *
---
Based on the above data, the results seem to be mostly positive for
the microbenchmarks but not so much for SPECjbb and DeathStarBench,
which have high utilization. There is also a band of outliers when the
system is fully loaded or overloaded (~2 tasks per rq) for some of
the microbenchmarks.
I wonder if extending SIS_UTIL for SIS_NODE would help some of these
cases but I've not tried tinkering with it yet. I'll continue
testing on other NPS modes which would decrease the search scope.
I'll also try running the same bunch of workloads on an even larger
4th Generation EPYC server to see if the behavior there is similar.
Let me know if you need any data from my test system for any
specific workload. I'll be more than happy to get it for you :)
>
> Reported-by: Tejun Heo <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
> kernel/sched/features.h | 1 +
> 2 files changed, 39 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0c..0172458 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> }
>
> /*
> + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> + * local LLC comes up empty.
> + */
> +static int
> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + struct sched_domain *parent = sd->parent;
> + struct sched_group *sg;
> +
> + /* Make sure to not cross nodes. */
> + if (!parent || parent->flags & SD_NUMA)
> + return -1;
> +
> + sg = parent->groups;
> + do {
> + int cpu = cpumask_first(sched_group_span(sg));
> + struct sched_domain *sd_child;
> +
> + sd_child = per_cpu(sd_llc, cpu);
> + if (sd_child != sd) {
> + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
> +
> + sg = sg->next;
> + } while (sg != parent->groups);
> +
> + return -1;
> +}
> +
> +/*
> * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> * the task fits. If no CPU is big enough, but there are idle ones, try to
> * maximize capacity.
> @@ -7199,6 +7231,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> + if (sched_feat(SIS_NODE)) {
> + i = select_idle_node(p, sd, target);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
> +
> return target;
> }
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c..9e390eb 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> */
> SCHED_FEAT(SIS_PROP, false)
> SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_NODE, true)
>
> /*
> * Issue a WARN when we do multiple update_rq_clock() calls
--
Thanks and Regards,
Prateek
On Thu, Jun 01, 2023 at 03:03:39PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> Sharing some initial benchmark results with the patch below.
>
> tl;dr
>
> - Hackbench starts off well but performance drops as the number of groups
> increases.
>
> - schbench (old), tbench, netperf see improvement but there is a band of
> outlier results when system is fully loaded or slightly overloaded.
>
> - Stream and ycsb-mongodb don't mind the extra search.
>
> - SPECjbb (with default scheduler tunables) and DeathStarBench are not
> very happy.
Figures :/ Every time something like this is changed someone gets to be
sad..
> Tests were run on a dual socket 3rd Generation EPYC server (2 x 64C/128T)
> running in NPS1 mode. Following is the simplified machine topology:
Right, Zen3 8 cores / LLC, 64 cores total give 8 LLC per node.
> ~~~~~~~~~~~~~~~~~~~~~~~
> ~ SPECjbb - Multi-JVM ~
> ~~~~~~~~~~~~~~~~~~~~~~~
>
> o NPS1
>
> - Default Scheduler Tunables
>
> kernel max-jOPS critical-jOPS
> tip 100.00% 100.00%
> peter-next-level 94.45% (-5.55%) 98.25% (-1.75%)
>
> - Modified Scheduler Tunables
>
> kernel max-jOPS critical-jOPS
> tip 100.00% 100.00%
> peter-next-level 100.00% (0.00%) 102.41% (2.41%)
I'm slightly confused, either the default or the tuned is better. Given
it's counting ops, I'm thinking higher is more better, so isn't this an
improvement in the tuned case?
> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
>
> Pinning Scaling tip peter-next-level
> 1 CCD 1 100.00% 100.30% (%diff: 0.30%)
> 2 CCD 2 100.00% 100.17% (%diff: 0.17%)
> 4 CCD 4 100.00% 99.60% (%diff: -0.40%)
> 8 CCD 8 100.00% 92.05% (%diff: -7.95%) *
Right, so that's a definite loss.
> I wonder if extending SIS_UTIL for SIS_NODE would help some of these
> cases but I've not tried tinkering with it yet. I'll continue
> testing on other NPS modes which would decrease the search scope.
> I'll also try running the same bunch of workloads on an even larger
> 4th Generation EPYC server to see if the behavior there is similar.
> > /*
> > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > + * local LLC comes up empty.
> > + */
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > + struct sched_domain *parent = sd->parent;
> > + struct sched_group *sg;
> > +
> > + /* Make sure to not cross nodes. */
> > + if (!parent || parent->flags & SD_NUMA)
> > + return -1;
> > +
> > + sg = parent->groups;
> > + do {
> > + int cpu = cpumask_first(sched_group_span(sg));
> > + struct sched_domain *sd_child;
> > +
> > + sd_child = per_cpu(sd_llc, cpu);
> > + if (sd_child != sd) {
> > + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
Given how SIS_UTIL is inside select_idle_cpu() it should already be
effective here, no?
> > + if ((unsigned)i < nr_cpumask_bits)
> > + return i;
> > + }
> > +
> > + sg = sg->next;
> > + } while (sg != parent->groups);
> > +
> > + return -1;
> > +}
This DeathStarBench thing seems to suggest that scanning up to 4 CCDs
isn't too much of a bother; so perhaps something like so?
(on top of tip/sched/core from just a few hours ago, as I had to 'fix'
this patch and force pushed the thing)
And yeah, random hacks and heuristics here :/ Does there happen to be
additional topology that could aid us here? Does the CCD fabric itself
have a distance metric we can use?
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22e0a249e0a8..f1d6ed973410 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7036,6 +7036,7 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
{
struct sched_domain *parent = sd->parent;
struct sched_group *sg;
+ int nr = 4;
/* Make sure to not cross nodes. */
if (!parent || parent->flags & SD_NUMA)
@@ -7050,6 +7051,9 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
test_idle_cores(cpu), cpu);
if ((unsigned)i < nr_cpumask_bits)
return i;
+
+ if (!--nr)
+ return -1;
}
sg = sg->next;
On Thu, Jun 01, 2023 at 01:13:26PM +0200, Peter Zijlstra wrote:
>
> This DeathStarBench thing seems to suggest that scanning up to 4 CCDs
> isn't too much of a bother; so perhaps something like so?
>
> (on top of tip/sched/core from just a few hours ago, as I had to 'fix'
> this patch and force pushed the thing)
>
> And yeah, random hacks and heuristics here :/ Does there happen to be
> additional topology that could aid us here? Does the CCD fabric itself
> have a distance metric we can use?
https://www.anandtech.com/show/16529/amd-epyc-milan-review/4
Specifically:
https://images.anandtech.com/doci/16529/Bounce-7763.png
That seems to suggest there are some very minor distance effects in the
CCD fabric. I didn't read the article too closely, but you'll note that
the first 4 CCDs have inter-CCD latency < 100 while the rest has > 100.
Could you also test on a Zen2 Epyc, does that require nr=8 instead of 4?
Should we perhaps write it like: 32 / llc_size ?
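For illustration, a tiny untested sketch of that idea on top of the nr = 4
hack above; the assumption here is that llc_size means cores per LLC (so
Zen3's 8 cores/LLC gives nr = 4 and Zen2's 4 cores/LLC gives nr = 8), and
llc_size itself is just a placeholder:

	/*
	 * Sketch only: scale the number of sibling LLCs scanned inversely
	 * with the LLC size, so small-LLC parts scan more siblings than
	 * large-LLC parts:
	 *
	 *	Zen3: 32 / 8 = 4 sibling LLCs
	 *	Zen2: 32 / 4 = 8 sibling LLCs
	 */
	int nr = max(1, 32 / llc_size);	/* llc_size: cores per LLC (placeholder) */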
The Zen2 picture:
https://images.anandtech.com/doci/16315/Bounce-7742.png
Shows a more pronounced CCD fabric topology: you can really see the 2
CCXs inside the CCD, but also there are two lighter green squares around
the CCDs themselves.
On Thu, Jun 1, 2023 at 8:11 PM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Jun 01, 2023 at 03:03:39PM +0530, K Prateek Nayak wrote:
[...]
> > I wonder if extending SIS_UTIL for SIS_NODE would help some of these
> > cases but I've not tried tinkering with it yet. I'll continue
> > testing on other NPS modes which would decrease the search scope.
> > I'll also try running the same bunch of workloads on an even larger
> > 4th Generation EPYC server to see if the behavior there is similar.
>
> > > /*
> > > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > > + * local LLC comes up empty.
> > > + */
> > > +static int
> > > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > > +{
> > > + struct sched_domain *parent = sd->parent;
> > > + struct sched_group *sg;
> > > +
> > > + /* Make sure to not cross nodes. */
> > > + if (!parent || parent->flags & SD_NUMA)
> > > + return -1;
> > > +
> > > + sg = parent->groups;
> > > + do {
> > > + int cpu = cpumask_first(sched_group_span(sg));
> > > + struct sched_domain *sd_child;
> > > +
> > > + sd_child = per_cpu(sd_llc, cpu);
> > > + if (sd_child != sd) {
> > > + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
>
> Given how SIS_UTIL is inside select_idle_cpu() it should already be
> effective here, no?
>
I'm thinking of this scenario: when the system is overloaded and
SIS_NODE is disabled, SIS_UTIL could scan, for example, 4 CPUs and then
terminate, waking the task up on the local LLC. When SIS_NODE is
enabled, it could scan 4 * number_of_llc_domains CPUs. The more CPUs it
scans, the more likely it is to find an idle CPU.

This seems to be a question of which type of wakee prefers a non-idle
CPU in the local LLC over an idle CPU in a remote LLC. It seems to
depend on the working set and task duration.
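For illustration (rough numbers, not measured): on the NPS1 topology above,
with 8 LLCs per node and SIS_UTIL limiting the scan to, say, 4 CPUs per LLC,
enabling SIS_NODE can push the worst-case search for a single wakeup from
about 4 CPUs to roughly 4 * 8 = 32 CPUs.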
thanks,
Chenyu
Hello Peter,
Thank you for taking a look at the report.
On 6/1/2023 4:43 PM, Peter Zijlstra wrote:
> On Thu, Jun 01, 2023 at 03:03:39PM +0530, K Prateek Nayak wrote:
>> Hello Peter,
>>
>> Sharing some initial benchmark results with the patch below.
>>
>> tl;dr
>>
>> - Hackbench starts off well but performance drops as the number of groups
>> increases.
>>
>> - schbench (old), tbench, netperf see improvement but there is a band of
>> outlier results when system is fully loaded or slightly overloaded.
>>
>> - Stream and ycsb-mongodb don't mind the extra search.
>>
>> - SPECjbb (with default scheduler tunables) and DeathStarBench are not
>> very happy.
>
> Figures :/ Every time something like this is changed someone gets to be
> sad..
>
>> Tests were run on a dual socket 3rd Generation EPYC server (2 x 64C/128T)
>> running in NPS1 mode. Following is the simplified machine topology:
>
> Right, Zen3 8 cores / LLC, 64 cores total give 8 LLC per node.
Yes, correct!
>
>> ~~~~~~~~~~~~~~~~~~~~~~~
>> ~ SPECjbb - Multi-JVM ~
>> ~~~~~~~~~~~~~~~~~~~~~~~
>>
>> o NPS1
>>
>> - Default Scheduler Tunables
>>
>> kernel max-jOPS critical-jOPS
>> tip 100.00% 100.00%
>> peter-next-level 94.45% (-5.55%) 98.25% (-1.75%)
>>
>> - Modified Scheduler Tunables
>>
>> kernel max-jOPS critical-jOPS
>> tip 100.00% 100.00%
>> peter-next-level 100.00% (0.00%) 102.41% (2.41%)
>
> I'm slightly confused, either the default or the tuned is better. Given
> it's counting ops, I'm thinking higher is more better, so isn't this an
> improvement in the tuned case?
Default is bad. I believe migrating across the LLC boundary is not that
great from a cache-efficiency perspective here. Setting the tunables
makes tasks run for longer, and in that case, being able to find an idle
CPU seems to be more beneficial.
>
>> ~~~~~~~~~~~~~~~~~~
>> ~ DeathStarBench ~
>> ~~~~~~~~~~~~~~~~~~
>>
>> Pinning Scaling tip peter-next-level
>> 1 CCD 1 100.00% 100.30% (%diff: 0.30%)
>> 2 CCD 2 100.00% 100.17% (%diff: 0.17%)
>> 4 CCD 4 100.00% 99.60% (%diff: -0.40%)
>> 8 CCD 8 100.00% 92.05% (%diff: -7.95%) *
>
> Right, so that's a definite loss.
>
>> I wonder if extending SIS_UTIL for SIS_NODE would help some of these
>> cases but I've not tried tinkering with it yet. I'll continue
>> testing on other NPS modes which would decrease the search scope.
>> I'll also try running the same bunch of workloads on an even larger
>> 4th Generation EPYC server to see if the behavior there is similar.
>
>>> /*
>>> + * For the multiple-LLC per node case, make sure to try the other LLC's if the
>>> + * local LLC comes up empty.
>>> + */
>>> +static int
>>> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
>>> +{
>>> + struct sched_domain *parent = sd->parent;
>>> + struct sched_group *sg;
>>> +
>>> + /* Make sure to not cross nodes. */
>>> + if (!parent || parent->flags & SD_NUMA)
>>> + return -1;
>>> +
>>> + sg = parent->groups;
>>> + do {
>>> + int cpu = cpumask_first(sched_group_span(sg));
>>> + struct sched_domain *sd_child;
>>> +
>>> + sd_child = per_cpu(sd_llc, cpu);
>>> + if (sd_child != sd) {
>>> + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
>
> Given how SIS_UTIL is inside select_idle_cpu() it should already be
> effective here, no?
True, but if the entire higher domain is busy, iterating over the groups
itself adds to the scheduling latency, only to fall back to the target.
I'm wondering if keeping track of the largest "sd_llc_shared->nr_idle_scan"
and the corresponding group, and starting from there, makes sense.
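For illustration, a rough, untested sketch of that idea; to keep it short it
only tries the single sibling LLC whose nr_idle_scan is largest, rather than
starting a full walk from it:

static int
select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
{
	struct sched_domain *parent = sd->parent;
	struct sched_group *sg, *best_sg = NULL;
	int best_nr = 0;

	/* Make sure to not cross nodes. */
	if (!parent || parent->flags & SD_NUMA)
		return -1;

	/* Pick the sibling LLC that SIS_UTIL considers the least busy. */
	sg = parent->groups;
	do {
		int cpu = cpumask_first(sched_group_span(sg));
		struct sched_domain_shared *sds;

		sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
		if (sds && !cpus_share_cache(cpu, target) &&
		    READ_ONCE(sds->nr_idle_scan) > best_nr) {
			best_nr = READ_ONCE(sds->nr_idle_scan);
			best_sg = sg;
		}

		sg = sg->next;
	} while (sg != parent->groups);

	if (best_sg) {
		int cpu = cpumask_first(sched_group_span(best_sg));
		struct sched_domain *sd_child = rcu_dereference(per_cpu(sd_llc, cpu));

		if (sd_child)
			return select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
	}

	return -1;
}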
>
>>> + if ((unsigned)i < nr_cpumask_bits)
>>> + return i;
>>> + }
>>> +
>>> + sg = sg->next;
>>> + } while (sg != parent->groups);
>>> +
>>> + return -1;
>>> +}
>
> This DeathStarBench thing seems to suggest that scanning up to 4 CCDs
> isn't too much of a bother; so perhaps something like so?
>
> (on top of tip/sched/core from just a few hours ago, as I had to 'fix'
> this patch and force pushed the thing)
>
> And yeah, random hacks and heuristics here :/ Does there happen to be
> additional topology that could aid us here? Does the CCD fabric itself
> have a distance metric we can use?
>
We do not have a CCD-to-CCD distance metric, unfortunately :(
However, NPS modes should mitigate some of the problems of the larger
search space (while adding extra NUMA complexity), which I still need to
test.
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 22e0a249e0a8..f1d6ed973410 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7036,6 +7036,7 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> {
> struct sched_domain *parent = sd->parent;
> struct sched_group *sg;
> + int nr = 4;
>
> /* Make sure to not cross nodes. */
> if (!parent || parent->flags & SD_NUMA)
> @@ -7050,6 +7051,9 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> test_idle_cores(cpu), cpu);
> if ((unsigned)i < nr_cpumask_bits)
> return i;
> +
> + if (!--nr)
> + return -1;
> }
>
> sg = sg->next;
--
Thanks and Regards,
Prateek
On 31.05.2023 14:04, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> Gitweb: https://git.kernel.org/tip/c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> Author: Peter Zijlstra <[email protected]>
> AuthorDate: Tue, 30 May 2023 13:20:46 +02:00
> Committer: Peter Zijlstra <[email protected]>
> CommitterDate: Tue, 30 May 2023 22:46:27 +02:00
>
> sched/fair: Multi-LLC select_idle_sibling()
>
> Tejun reported that when he targets workqueues towards a specific LLC
> on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
> significant idle time.
>
> This is, of course, because of how select_idle_sibling() will not
> consider anything outside of the local LLC, and since all these tasks
> are short running the periodic idle load balancer is ineffective.
>
> And while it is good to keep work cache local, it is better to not
> have significant idle time. Therefore, have select_idle_sibling() try
> other LLCs inside the same node when the local one comes up empty.
>
> Reported-by: Tejun Heo <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
This patch landed in today's linux-next (next-20230605) as commit c5214e13ad60
("sched/fair: Multi-LLC select_idle_sibling()"). Unfortunately it causes a
regression on my ARM 64-bit Exynos5433-based TM2e test board during the
CPU hotplug tests. From time to time I get a NULL pointer dereference.
Reverting $subject on top of linux-next fixes the issue. Let me know if
I can help somehow with debugging this issue. Here is a complete log (I've
intentionally kept all the stack dumps, although they don't look very
relevant...):
# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; done
Unable to handle kernel NULL pointer dereference at virtual address
0000000000000090
======================================================
WARNING: possible circular locking dependency detected
6.4.0-rc1+ #13640 Not tainted
------------------------------------------------------
cpuhp/6/43 is trying to acquire lock:
ffff80000ab65598 (console_owner){..-.}-{0:0}, at:
console_flush_all+0x1ac/0x4fc
but task is already holding lock:
ffff00002836ed48 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x58/0x46c
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&p->pi_lock){-.-.}-{2:2}:
_raw_spin_lock_irqsave+0x60/0x88
try_to_wake_up+0x58/0x46c
default_wake_function+0x14/0x20
autoremove_wake_function+0x18/0x44
__wake_up_common+0x94/0x170
__wake_up_common_lock+0x7c/0xcc
__wake_up+0x18/0x24
tty_wakeup+0x34/0x70
tty_port_default_wakeup+0x20/0x38
tty_port_tty_wakeup+0x18/0x24
uart_write_wakeup+0x18/0x28
s3c24xx_serial_tx_chars+0x20c/0x218
s3c64xx_serial_handle_irq+0x9c/0xe0
__handle_irq_event_percpu+0xb0/0x2d4
handle_irq_event+0x4c/0xb8
handle_fasteoi_irq+0xa4/0x198
generic_handle_domain_irq+0x2c/0x44
gic_handle_irq+0x44/0xc4
call_on_irq_stack+0x24/0x4c
do_interrupt_handler+0x80/0x84
el1_interrupt+0x34/0x64
el1h_64_irq_handler+0x18/0x24
el1h_64_irq+0x64/0x68
default_idle_call+0x9c/0x150
do_idle+0x230/0x294
cpu_startup_entry+0x28/0x2c
rest_init+0x100/0x190
arch_post_acpi_subsys_init+0x0/0x8
start_kernel+0x594/0x684
__primary_switched+0xbc/0xc4
-> #2 (&tty->write_wait){-.-.}-{2:2}:
_raw_spin_lock_irqsave+0x60/0x88
__wake_up_common_lock+0x5c/0xcc
__wake_up+0x18/0x24
tty_wakeup+0x34/0x70
tty_port_default_wakeup+0x20/0x38
tty_port_tty_wakeup+0x18/0x24
uart_write_wakeup+0x18/0x28
s3c24xx_serial_tx_chars+0x20c/0x218
s3c64xx_serial_handle_irq+0x9c/0xe0
__handle_irq_event_percpu+0xb0/0x2d4
handle_irq_event+0x4c/0xb8
handle_fasteoi_irq+0xa4/0x198
generic_handle_domain_irq+0x2c/0x44
gic_handle_irq+0x44/0xc4
call_on_irq_stack+0x24/0x4c
do_interrupt_handler+0x80/0x84
el1_interrupt+0x34/0x64
el1h_64_irq_handler+0x18/0x24
el1h_64_irq+0x64/0x68
default_idle_call+0x9c/0x150
do_idle+0x230/0x294
cpu_startup_entry+0x28/0x2c
rest_init+0x100/0x190
arch_post_acpi_subsys_init+0x0/0x8
start_kernel+0x594/0x684
__primary_switched+0xbc/0xc4
-> #1 (&port_lock_key){-.-.}-{2:2}:
_raw_spin_lock_irqsave+0x60/0x88
s3c24xx_serial_console_write+0xfc/0x124
console_flush_all+0x208/0x4fc
console_unlock+0x5c/0x14c
vprintk_emit+0x15c/0x3b0
vprintk_default+0x38/0x44
vprintk+0xc0/0xe4
_printk+0x5c/0x84
register_console+0x1f4/0x420
uart_add_one_port+0x50c/0x53c
s3c24xx_serial_probe+0x34c/0x72c
platform_probe+0x68/0xd8
really_probe+0x148/0x2b4
__driver_probe_device+0x78/0x12c
driver_probe_device+0xd8/0x160
__driver_attach+0x9c/0x1ac
bus_for_each_dev+0x74/0xd4
driver_attach+0x24/0x30
bus_add_driver+0xe4/0x1e8
driver_register+0x60/0x128
__platform_driver_register+0x28/0x34
samsung_serial_init+0x30/0x8c
do_one_initcall+0x74/0x2f0
kernel_init_freeable+0x288/0x4d8
kernel_init+0x24/0x1dc
ret_from_fork+0x10/0x20
-> #0 (console_owner){..-.}-{0:0}:
__lock_acquire+0x13d0/0x217c
lock_acquire+0x1e8/0x310
console_flush_all+0x1f4/0x4fc
console_unlock+0x5c/0x14c
vprintk_emit+0x15c/0x3b0
vprintk_default+0x38/0x44
vprintk+0xc0/0xe4
_printk+0x5c/0x84
die_kernel_fault+0x48/0x37c
__do_kernel_fault+0xd8/0x19c
do_page_fault+0xac/0x6d8
do_translation_fault+0xac/0xb8
do_mem_abort+0x44/0x94
el1_abort+0x44/0x70
el1h_64_sync_handler+0xd8/0xe4
el1h_64_sync+0x64/0x68
__bitmap_and+0x4c/0x78
select_task_rq_fair+0x724/0x1a30
try_to_wake_up+0x17c/0x46c
wake_up_process+0x18/0x24
complete+0x58/0x8c
__kthread_parkme+0x74/0xc8
kthread_parkme+0x20/0x44
smpboot_thread_fn+0x118/0x2a0
kthread+0x124/0x128
ret_from_fork+0x10/0x20
other info that might help us debug this:
Chain exists of:
console_owner --> &tty->write_wait --> &p->pi_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&p->pi_lock);
lock(&tty->write_wait);
lock(&p->pi_lock);
lock(console_owner);
*** DEADLOCK ***
5 locks held by cpuhp/6/43:
#0: ffff000023e68440 (&x->wait){....}-{2:2}, at: complete+0x24/0x8c
#1: ffff00002836ed48 (&p->pi_lock){-.-.}-{2:2}, at:
try_to_wake_up+0x58/0x46c
#2: ffff80000abd6ac0 (rcu_read_lock){....}-{1:2}, at:
select_task_rq_fair+0x114/0x1a30
#3: ffff80000ab65390 (console_lock){+.+.}-{0:0}, at:
vprintk_default+0x38/0x44
#4: ffff80000ab65440 (console_srcu){....}-{0:0}, at:
console_flush_all+0x7c/0x4fc
stack backtrace:
CPU: 6 PID: 43 Comm: cpuhp/6 Not tainted 6.4.0-rc1+ #13640
Hardware name: Samsung TM2E board (DT)
Call trace:
dump_backtrace+0x98/0xf0
show_stack+0x18/0x24
dump_stack_lvl+0x60/0xac
dump_stack+0x18/0x24
print_circular_bug+0x26c/0x348
check_noncircular+0x134/0x148
__lock_acquire+0x13d0/0x217c
lock_acquire+0x1e8/0x310
console_flush_all+0x1f4/0x4fc
console_unlock+0x5c/0x14c
vprintk_emit+0x15c/0x3b0
vprintk_default+0x38/0x44
vprintk+0xc0/0xe4
_printk+0x5c/0x84
die_kernel_fault+0x48/0x37c
__do_kernel_fault+0xd8/0x19c
do_page_fault+0xac/0x6d8
do_translation_fault+0xac/0xb8
do_mem_abort+0x44/0x94
el1_abort+0x44/0x70
el1h_64_sync_handler+0xd8/0xe4
el1h_64_sync+0x64/0x68
__bitmap_and+0x4c/0x78
select_task_rq_fair+0x724/0x1a30
try_to_wake_up+0x17c/0x46c
wake_up_process+0x18/0x24
complete+0x58/0x8c
__kthread_parkme+0x74/0xc8
kthread_parkme+0x20/0x44
smpboot_thread_fn+0x118/0x2a0
kthread+0x124/0x128
ret_from_fork+0x10/0x20
Mem abort info:
ESR = 0x0000000096000006
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x06: level 2 translation fault
Data abort info:
ISV = 0, ISS = 0x00000006
CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=000000002783e000
[0000000000000090] pgd=080000002738f003, p4d=080000002738f003,
pud=0800000027a24003, pmd=0000000000000000
Internal error: Oops: 0000000096000006 [#1] PREEMPT SMP
Modules linked in: brcmfmac_wcc cpufreq_powersave cpufreq_conservative
brcmfmac brcmutil cfg80211 crct10dif_ce hci_uart btqca btbcm bluetooth
s5p_jpeg exynos_gsc s3fwrn5_i2c s3fwrn5 s5p_mfc nci v4l2_mem2mem
ecdh_generic nfc ecc videobuf2_dma_contig videobuf2_memops
videobuf2_v4l2 videodev rfkill panfrost videobuf2_common
drm_shmem_helper gpu_sched mc ip_tables x_tables ipv6
CPU: 6 PID: 43 Comm: cpuhp/6 Not tainted 6.4.0-rc1+ #13640
Hardware name: Samsung TM2E board (DT)
pstate: 000000c5 (nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __bitmap_and+0x4c/0x78
lr : select_idle_cpu+0x64/0x450
sp : ffff80000bd83b50
x29: ffff80000bd83b50 x28: ffff80000a152ad8 x27: ffff00002836e500
x26: ffff00002814f600 x25: ffff80000ab43e78 x24: 0000000000000000
x23: ffff80000ab3f000 x22: 0000000000000000 x21: ffff80000ab43e78
x20: 0000000000000000 x19: 0000000000000000 x18: ffff8000099ac098
x17: 0000000000000000 x16: 0000000000000067 x15: 0000000000000001
x14: 0000000000000000 x13: 00000000000000d8 x12: 0000000000000000
x11: 0000000000000001 x10: ffff80000b7c6e90 x9 : 0000000000000000
x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
x5 : 0000000000000020 x4 : 00000000000000ff x3 : 00000000fffffff8
x2 : ffff00002836e7e0 x1 : 0000000000000090 x0 : ffff0000d5fc2ad8
Call trace:
__bitmap_and+0x4c/0x78
select_task_rq_fair+0x724/0x1a30
try_to_wake_up+0x17c/0x46c
wake_up_process+0x18/0x24
complete+0x58/0x8c
__kthread_parkme+0x74/0xc8
kthread_parkme+0x20/0x44
smpboot_thread_fn+0x118/0x2a0
kthread+0x124/0x128
ret_from_fork+0x10/0x20
Code: 2a0803e8 4b0303e3 92800004 9ac32484 (f8687823)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x8c0004,1c780800,0000421b
Memory Limit: none
---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
> ---
> kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
> kernel/sched/features.h | 1 +
> 2 files changed, 39 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0c..0172458 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> }
>
> /*
> + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> + * local LLC comes up empty.
> + */
> +static int
> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + struct sched_domain *parent = sd->parent;
> + struct sched_group *sg;
> +
> + /* Make sure to not cross nodes. */
> + if (!parent || parent->flags & SD_NUMA)
> + return -1;
> +
> + sg = parent->groups;
> + do {
> + int cpu = cpumask_first(sched_group_span(sg));
> + struct sched_domain *sd_child;
> +
> + sd_child = per_cpu(sd_llc, cpu);
> + if (sd_child != sd) {
> + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
> +
> + sg = sg->next;
> + } while (sg != parent->groups);
> +
> + return -1;
> +}
> +
> +/*
> * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> * the task fits. If no CPU is big enough, but there are idle ones, try to
> * maximize capacity.
> @@ -7199,6 +7231,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> + if (sched_feat(SIS_NODE)) {
> + i = select_idle_node(p, sd, target);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
> +
> return target;
> }
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c..9e390eb 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> */
> SCHED_FEAT(SIS_PROP, false)
> SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_NODE, true)
>
> /*
> * Issue a WARN when we do multiple update_rq_clock() calls
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
On Mon, Jun 05, 2023 at 05:25:30PM +0200, Marek Szyprowski wrote:
> On 31.05.2023 14:04, tip-bot2 for Peter Zijlstra wrote:
> > The following commit has been merged into the sched/core branch of tip:
> >
> > Commit-ID: c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> > Gitweb: https://git.kernel.org/tip/c7dfd6b9122d29d0e9a4587ab470c0564d7f92ab
> > Author: Peter Zijlstra <[email protected]>
> > AuthorDate: Tue, 30 May 2023 13:20:46 +02:00
> > Committer: Peter Zijlstra <[email protected]>
> > CommitterDate: Tue, 30 May 2023 22:46:27 +02:00
> >
> > sched/fair: Multi-LLC select_idle_sibling()
> >
> > Tejun reported that when he targets workqueues towards a specific LLC
> > on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
> > significant idle time.
> >
> > This is, of course, because of how select_idle_sibling() will not
> > consider anything outside of the local LLC, and since all these tasks
> > are short running the periodic idle load balancer is ineffective.
> >
> > And while it is good to keep work cache local, it is better to not
> > have significant idle time. Therefore, have select_idle_sibling() try
> > other LLCs inside the same node when the local one comes up empty.
> >
> > Reported-by: Tejun Heo <[email protected]>
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>
> This patch landed in today's linux-next (next-20230605) as commit c5214e13ad60
> ("sched/fair: Multi-LLC select_idle_sibling()"). Unfortunately it causes a
> regression on my ARM 64-bit Exynos5433-based TM2e test board during the
> CPU hotplug tests. From time to time I get a NULL pointer dereference.
> Reverting $subject on top of linux-next fixes the issue. Let me know if
> I can help somehow with debugging this issue. Here is a complete log (I've
> intentionally kept all the stack dumps, although they don't look very
> relevant...):
Moo... OK, since our friends from AMD need some tuning on this anyway,
I'm going to pull the patch entirely. And we'll try again once they've
sorted out the best way to do this.
On Mon, Jun 05, 2023 at 05:25:30PM +0200, Marek Szyprowski wrote:
> Unfortunately it causes a regression on my ARM 64-bit Exynos5433-based
> TM2e test board during the CPU hotplug tests.
Can you elucidate an ARM illiterate on the actual topology of that
machine?
> CPU: 6 PID: 43 Comm: cpuhp/6 Not tainted 6.4.0-rc1+ #13640
> Hardware name: Samsung TM2E board (DT)
> pstate: 000000c5 (nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : __bitmap_and+0x4c/0x78
> lr : select_idle_cpu+0x64/0x450
Btw, where is lr at? Is that perhaps per_cpu(sd_llc) being NULL or
something?
> > ---
> > kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
> > kernel/sched/features.h | 1 +
> > 2 files changed, 39 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 48b6f0c..0172458 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > }
> >
> > /*
> > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > + * local LLC comes up empty.
> > + */
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > + struct sched_domain *parent = sd->parent;
> > + struct sched_group *sg;
> > +
> > + /* Make sure to not cross nodes. */
> > + if (!parent || parent->flags & SD_NUMA)
> > + return -1;
> > +
> > + sg = parent->groups;
> > + do {
> > + int cpu = cpumask_first(sched_group_span(sg));
> > + struct sched_domain *sd_child;
> > +
> > + sd_child = per_cpu(sd_llc, cpu);
IOW, sd_child ends up NULL?
> > + if (sd_child != sd) {
> > + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> > + if ((unsigned)i < nr_cpumask_bits)
> > + return i;
> > + }
> > +
> > + sg = sg->next;
> > + } while (sg != parent->groups);
> > +
> > + return -1;
> > +}
On 05.06.2023 21:07, Peter Zijlstra wrote:
> On Mon, Jun 05, 2023 at 05:25:30PM +0200, Marek Szyprowski wrote:
>
>> Unfortunately it causes a regression on my ARM 64-bit Exynos5433-based
>> TM2e test board during the CPU hotplug tests.
> Can you elucidate an ARM illiterate on the actual topology of that
> machine?
Please check arch/arm64/boot/dts/exynos/exynos5433.dtsi. This is a typical
ARM big.LITTLE machine with 4 'big' (Cortex-A57 in this case) cores in
one cluster and another 4 'LITTLE' (Cortex-A53) cores in the other.
>> CPU: 6 PID: 43 Comm: cpuhp/6 Not tainted 6.4.0-rc1+ #13640
>> Hardware name: Samsung TM2E board (DT)
>> pstate: 000000c5 (nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> pc : __bitmap_and+0x4c/0x78
>> lr : select_idle_cpu+0x64/0x450
> Btw, where is lr at? Is that perhaps per_cpu(sd_llc) being NULL or
> something?
If I get it right:
# aarch64-linux-gnu-objdump -Sld --start-address=0xffff8000080e7064 vmlinux
ffff8000080e7064 <select_idle_cpu>:
...
select_idle_cpu():
kernel/sched/fair.c:6987
sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
ffff8000080e70c8: f8747b21 ldr x1, [x25, x20, lsl #3]
ffff8000080e70cc: f0010340 adrp x0, ffff80000a152000
<kvm_hyp_ctxt+0x7a0>
ffff8000080e70d0: 91302000 add x0, x0, #0xc08
ffff8000080e70d4: f90047e0 str x0, [sp, #136]
ffff8000080e70d8: f8616814 ldr x20, [x0, x1]
ffff8000080e70dc: 9442c570 bl ffff80000919869c
<debug_lockdep_rcu_enabled>
ffff8000080e70e0: 350017a0 cbnz w0, ffff8000080e73d4
<select_idle_cpu+0x370>
This kvm_hyp_ctxt smells a little bad here, because this board boots
directly to EL1, so no hyp/kvm is used. This is the relevant dmesg part:
--->8---
smp: Brought up 1 node, 8 CPUs
SMP: Total of 8 processors activated.
CPU features: detected: 32-bit EL0 Support
CPU features: detected: 32-bit EL1 Support
CPU features: detected: CRC32 instructions
CPU: All CPU(s) started at EL1
--->8---
>>> ---
>>> kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
>>> kernel/sched/features.h | 1 +
>>> 2 files changed, 39 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 48b6f0c..0172458 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>> }
>>>
>>> /*
>>> + * For the multiple-LLC per node case, make sure to try the other LLC's if the
>>> + * local LLC comes up empty.
>>> + */
>>> +static int
>>> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
>>> +{
>>> + struct sched_domain *parent = sd->parent;
>>> + struct sched_group *sg;
>>> +
>>> + /* Make sure to not cross nodes. */
>>> + if (!parent || parent->flags & SD_NUMA)
>>> + return -1;
>>> +
>>> + sg = parent->groups;
>>> + do {
>>> + int cpu = cpumask_first(sched_group_span(sg));
>>> + struct sched_domain *sd_child;
>>> +
>>> + sd_child = per_cpu(sd_llc, cpu);
> IOW, sd_child end up NULL?
>
>>> + if (sd_child != sd) {
>>> + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
>>> + if ((unsigned)i < nr_cpumask_bits)
>>> + return i;
>>> + }
>>> +
>>> + sg = sg->next;
>>> + } while (sg != parent->groups);
>>> +
>>> + return -1;
>>> +}
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
On 2023-06-05 at 21:07:46 +0200, Peter Zijlstra wrote:
> On Mon, Jun 05, 2023 at 05:25:30PM +0200, Marek Szyprowski wrote:
>
> > Unfortunately it causes a regression on my ARM 64-bit Exynos5433-based
> > TM2e test board during the CPU hotplug tests.
>
> Can you elucidate an ARM illiterate on the actual topology of that
> machine?
>
>
> > CPU: 6 PID: 43 Comm: cpuhp/6 Not tainted 6.4.0-rc1+ #13640
> > Hardware name: Samsung TM2E board (DT)
> > pstate: 000000c5 (nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > pc : __bitmap_and+0x4c/0x78
> > lr : select_idle_cpu+0x64/0x450
>
> Btw, where is lr at? Is that perhaps per_cpu(sd_llc) being NULL or
> something?
>
>
> > > ---
> > > kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
> > > kernel/sched/features.h | 1 +
> > > 2 files changed, 39 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 48b6f0c..0172458 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > > }
> > >
> > > /*
> > > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > > + * local LLC comes up empty.
> > > + */
> > > +static int
> > > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > > +{
> > > + struct sched_domain *parent = sd->parent;
> > > + struct sched_group *sg;
> > > +
> > > + /* Make sure to not cross nodes. */
> > > + if (!parent || parent->flags & SD_NUMA)
> > > + return -1;
> > > +
> > > + sg = parent->groups;
> > > + do {
> > > + int cpu = cpumask_first(sched_group_span(sg));
> > > + struct sched_domain *sd_child;
> > > +
> > > + sd_child = per_cpu(sd_llc, cpu);
>
> IOW, sd_child end up NULL?
>
Just wondering if we should use rcu_dereference(per_cpu(sd_llc, cpu)) here,
since CPU offline/online could modify the sd_llc pointer. (We use
rcu_dereference for sd_llc in other places.)
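For illustration, an untested sketch combining that with a NULL check for the
hotplug case suspected above; whether missing RCU/NULL handling of sd_llc is
really what bites the Exynos board is still an open question:

	sg = parent->groups;
	do {
		int cpu = cpumask_first(sched_group_span(sg));
		struct sched_domain *sd_child;

		/*
		 * sd_llc is updated during CPU hotplug; dereference it under
		 * RCU and tolerate it being NULL for a (partially) offlined
		 * CPU instead of handing a stale/NULL domain to
		 * select_idle_cpu().
		 */
		sd_child = rcu_dereference(per_cpu(sd_llc, cpu));
		if (sd_child && !cpus_share_cache(cpu, target)) {
			int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);

			if ((unsigned)i < nr_cpumask_bits)
				return i;
		}

		sg = sg->next;
	} while (sg != parent->groups);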
thanks,
Chenyu