Subject: [PATCH] sched: fair: don't depend on wake_wide if waker and wakee are already in same LLC

when waker and wakee are already in the same LLC, it is pointless to worry
about the competition caused by pulling wakee to waker's LLC domain.

Signed-off-by: Barry Song <[email protected]>
---
kernel/sched/fair.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3248e24a90b0..cfb1bd47acc3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6795,7 +6795,15 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
new_cpu = prev_cpu;
}

- want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
+ /*
+ * we use wake_wide to make smarter pull and avoid cruel
+ * competition because of jam-packed tasks in waker's LLC
+ * domain. But if waker and wakee have been already in
+ * same LLC domain, it seems it is pointless to depend
+ * on wake_wide
+ */
+ want_affine = (cpus_share_cache(cpu, prev_cpu) || !wake_wide(p)) &&
+ cpumask_test_cpu(cpu, p->cpus_ptr);
}

rcu_read_lock();
--
2.25.1


2021-05-26 13:08:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: fair: don't depend on wake_wide if waker and wakee are already in same LLC


$subject is weird; sched/fair: is the right tag, and then start with a
capital letter.

On Wed, May 26, 2021 at 09:10:57PM +1200, Barry Song wrote:
> when waker and wakee are already in the same LLC, it is pointless to worry
> about the competition caused by pulling wakee to waker's LLC domain.

But there's more than LLC.

> Signed-off-by: Barry Song <[email protected]>
> ---
> kernel/sched/fair.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3248e24a90b0..cfb1bd47acc3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6795,7 +6795,15 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> new_cpu = prev_cpu;
> }
>
> - want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
> + /*
> + * we use wake_wide to make smarter pull and avoid cruel
> + * competition because of jam-packed tasks in waker's LLC
> + * domain. But if waker and wakee have been already in
> + * same LLC domain, it seems it is pointless to depend
> + * on wake_wide
> + */
> + want_affine = (cpus_share_cache(cpu, prev_cpu) || !wake_wide(p)) &&
> + cpumask_test_cpu(cpu, p->cpus_ptr);
> }

And no supportive numbers...

Subject: RE: [PATCH] sched: fair: don't depend on wake_wide if waker and wakee are already in same LLC



> -----Original Message-----
> From: Peter Zijlstra [mailto:[email protected]]
> Sent: Thursday, May 27, 2021 12:16 AM
> To: Song Bao Hua (Barry Song) <[email protected]>
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; yangyicong
> <[email protected]>; tangchengchang <[email protected]>;
> Linuxarm <[email protected]>
> Subject: Re: [PATCH] sched: fair: don't depend on wake_wide if waker and wakee
> are already in same LLC
>
>
> $subject is weird; sched/fair: is the right tag, and then start with a
> capital letter.
>
> On Wed, May 26, 2021 at 09:10:57PM +1200, Barry Song wrote:
> > when waker and wakee are already in the same LLC, it is pointless to worry
> > about the competition caused by pulling wakee to waker's LLC domain.
>
> But there's more than LLC.

I suppose other concerns might be about the "idle" and "load" of
waker's cpu and wakee's prev_cpu. Here even though we disable
wake_wide(), wake_affine() still has chance to select wakee's
prev_cpu rather than pulling to waker. So disabling wake_wide()
doesn't mean we will 100% pull.

static int wake_affine(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int prev_cpu, int sync)
{
int target = nr_cpumask_bits;

if (sched_feat(WA_IDLE))
target = wake_affine_idle(this_cpu, prev_cpu, sync);

if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

if (target == nr_cpumask_bits)
return prev_cpu;

..
return target;
}

Furthermore, select_idle_sibling() can also pick wakee's prev_cpu
if it is idle:

static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
...

/*
* If the previous CPU is cache affine and idle, don't be stupid:
*/
if (prev != target && cpus_share_cache(prev, target) &&
(available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
asym_fits_capacity(task_util, prev))
return prev;
...
}

Except those, could you please give me some clue about what else
you have concerns on?

>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > kernel/sched/fair.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 3248e24a90b0..cfb1bd47acc3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6795,7 +6795,15 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu,
> int wake_flags)
> > new_cpu = prev_cpu;
> > }
> >
> > - want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
> > + /*
> > + * we use wake_wide to make smarter pull and avoid cruel
> > + * competition because of jam-packed tasks in waker's LLC
> > + * domain. But if waker and wakee have been already in
> > + * same LLC domain, it seems it is pointless to depend
> > + * on wake_wide
> > + */
> > + want_affine = (cpus_share_cache(cpu, prev_cpu) || !wake_wide(p)) &&
> > + cpumask_test_cpu(cpu, p->cpus_ptr);
> > }
>
> And no supportive numbers...

Sorry for the confusion.

I actually put some supportive numbers at the below thread which
derived this patch:
https://lore.kernel.org/lkml/[email protected]/

when I tried to give Dietmar some pgbench data in that thread,
I found in kunpeng920, while software ran in one die/numa with
24cores sharing LLC, disabling wake_wide() brought the best
pgbench result.

llc_as_factor don't_use_wake_wide
Hmean 1 10869.27 ( 0.00%) 10723.08 * -1.34%*
Hmean 8 19580.59 ( 0.00%) 19469.34 * -0.57%*
Hmean 12 29643.56 ( 0.00%) 29520.16 * -0.42%*
Hmean 24 43194.47 ( 0.00%) 43774.78 * 1.34%*
Hmean 32 40163.23 ( 0.00%) 40742.93 * 1.44%*
Hmean 48 42249.29 ( 0.00%) 48329.00 * 14.39%*

The test was done by https://github.com/gormanm/mmtests
and
./run-mmtests.sh --config ./configs/config-db-pgbench-timed-ro-medium test_tag

Commit "sched: Implement smarter wake-affine logic"
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=62470419
says pgbench can improve by wake_wide(), but I've actually
seen the opposite result while waker and wakee are already
in one LLC.

Not quite sure if it is specific to kunpeng920, perhaps
I need to run the same test on some x86 machines.

Thanks
Barry

2021-05-27 12:18:04

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] sched: fair: don't depend on wake_wide if waker and wakee are already in same LLC

On Wed, May 26, 2021 at 09:38:19PM +0000, Song Bao Hua (Barry Song) wrote:
> > And no supportive numbers...
>
> Sorry for the confusion.
>
> I actually put some supportive numbers at the below thread which
> derived this patch:
> https://lore.kernel.org/lkml/[email protected]/
>
> when I tried to give Dietmar some pgbench data in that thread,
> I found in kunpeng920, while software ran in one die/numa with
> 24cores sharing LLC, disabling wake_wide() brought the best
> pgbench result.
>
> llc_as_factor don't_use_wake_wide
> Hmean 1 10869.27 ( 0.00%) 10723.08 * -1.34%*
> Hmean 8 19580.59 ( 0.00%) 19469.34 * -0.57%*
> Hmean 12 29643.56 ( 0.00%) 29520.16 * -0.42%*
> Hmean 24 43194.47 ( 0.00%) 43774.78 * 1.34%*
> Hmean 32 40163.23 ( 0.00%) 40742.93 * 1.44%*
> Hmean 48 42249.29 ( 0.00%) 48329.00 * 14.39%*
>
> The test was done by https://github.com/gormanm/mmtests
> and
> ./run-mmtests.sh --config ./configs/config-db-pgbench-timed-ro-medium test_tag
>

Out of curiousity, I briefly tested this on a Zen2 machine which also
has multiple LLCs per node. Only tbench4 was executed and I cancelled
the other tests because of results like this

tbench4
5.13.0-rc2 5.13.0-rc2
vanilla sched-nowakewidellc-v1r1
Hmean 1 349.34 ( 0.00%) 334.18 * -4.34%*
Hmean 2 668.49 ( 0.00%) 659.12 * -1.40%*
Hmean 4 1307.90 ( 0.00%) 1274.35 * -2.57%*
Hmean 8 2482.08 ( 0.00%) 2377.84 * -4.20%*
Hmean 16 4460.06 ( 0.00%) 4656.28 * 4.40%*
Hmean 32 9463.76 ( 0.00%) 8909.61 * -5.86%*
Hmean 64 15865.30 ( 0.00%) 19682.77 * 24.06%*
Hmean 128 24350.06 ( 0.00%) 21593.20 * -11.32%*
Hmean 256 39593.90 ( 0.00%) 31389.33 * -20.72%*
Hmean 512 37851.54 ( 0.00%) 30260.23 * -20.06%*

--
Mel Gorman
SUSE Labs

Subject: RE: [PATCH] sched: fair: don't depend on wake_wide if waker and wakee are already in same LLC



> -----Original Message-----
> From: Mel Gorman [mailto:[email protected]]
> Sent: Friday, May 28, 2021 12:14 AM
> To: Song Bao Hua (Barry Song) <[email protected]>
> Cc: Peter Zijlstra <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> yangyicong <[email protected]>; tangchengchang
> <[email protected]>; Linuxarm <[email protected]>
> Subject: Re: [PATCH] sched: fair: don't depend on wake_wide if waker and wakee
> are already in same LLC
>
> On Wed, May 26, 2021 at 09:38:19PM +0000, Song Bao Hua (Barry Song) wrote:
> > > And no supportive numbers...
> >
> > Sorry for the confusion.
> >
> > I actually put some supportive numbers at the below thread which
> > derived this patch:
> >
> https://lore.kernel.org/lkml/[email protected]
> m/
> >
> > when I tried to give Dietmar some pgbench data in that thread,
> > I found in kunpeng920, while software ran in one die/numa with
> > 24cores sharing LLC, disabling wake_wide() brought the best
> > pgbench result.
> >
> > llc_as_factor don't_use_wake_wide
> > Hmean 1 10869.27 ( 0.00%) 10723.08 * -1.34%*
> > Hmean 8 19580.59 ( 0.00%) 19469.34 * -0.57%*
> > Hmean 12 29643.56 ( 0.00%) 29520.16 * -0.42%*
> > Hmean 24 43194.47 ( 0.00%) 43774.78 * 1.34%*
> > Hmean 32 40163.23 ( 0.00%) 40742.93 * 1.44%*
> > Hmean 48 42249.29 ( 0.00%) 48329.00 * 14.39%*
> >
> > The test was done by https://github.com/gormanm/mmtests
> > and
> > ./run-mmtests.sh --config ./configs/config-db-pgbench-timed-ro-medium
> test_tag
> >
>
> Out of curiousity, I briefly tested this on a Zen2 machine which also
> has multiple LLCs per node. Only tbench4 was executed and I cancelled
> the other tests because of results like this
>
> tbench4
> 5.13.0-rc2 5.13.0-rc2
> vanilla sched-nowakewidellc-v1r1
> Hmean 1 349.34 ( 0.00%) 334.18 * -4.34%*
> Hmean 2 668.49 ( 0.00%) 659.12 * -1.40%*
> Hmean 4 1307.90 ( 0.00%) 1274.35 * -2.57%*
> Hmean 8 2482.08 ( 0.00%) 2377.84 * -4.20%*
> Hmean 16 4460.06 ( 0.00%) 4656.28 * 4.40%*
> Hmean 32 9463.76 ( 0.00%) 8909.61 * -5.86%*
> Hmean 64 15865.30 ( 0.00%) 19682.77 * 24.06%*
> Hmean 128 24350.06 ( 0.00%) 21593.20 * -11.32%*
> Hmean 256 39593.90 ( 0.00%) 31389.33 * -20.72%*
> Hmean 512 37851.54 ( 0.00%) 30260.23 * -20.06%*

Thanks, Mel.

I guess a major difference between your testing and mine
is that I was actually running tests on CPUs sharing LLC
rather than in CPUs which are crossing several LLCs.

In the tested machine kunpeng920, 24 cores share LLC and
become one NUMA, though we have 4 numa, the benchmark was
running in one LLC domain and one NUMA only. This is supposed
to benefit those use cases using "numactl -N x tasks" to bind
tasks, which is quite common.

I also tried to reproduce tbench4 in my desktop with intel
i9 10cores(20 threads) sharing only one 20MB LLC:

$ lstopo -c
Machine (15GB total) cpuset=0x000fffff
Package L#0 cpuset=0x000fffff
NUMANode L#0 (P#0 15GB) cpuset=0x000fffff
L3 L#0 (20MB) cpuset=0x000fffff
L2 L#0 (256KB) cpuset=0x00000401
L1d L#0 (32KB) cpuset=0x00000401
L1i L#0 (32KB) cpuset=0x00000401
Core L#0 cpuset=0x00000401
PU L#0 (P#0) cpuset=0x00000001
PU L#1 (P#10) cpuset=0x00000400
L2 L#1 (256KB) cpuset=0x00000802
L1d L#1 (32KB) cpuset=0x00000802
L1i L#1 (32KB) cpuset=0x00000802
Core L#1 cpuset=0x00000802
PU L#2 (P#1) cpuset=0x00000002
PU L#3 (P#11) cpuset=0x00000800
L2 L#2 (256KB) cpuset=0x00001004
L1d L#2 (32KB) cpuset=0x00001004
L1i L#2 (32KB) cpuset=0x00001004
Core L#2 cpuset=0x00001004
PU L#4 (P#2) cpuset=0x00000004
PU L#5 (P#12) cpuset=0x00001000
L2 L#3 (256KB) cpuset=0x00002008
L1d L#3 (32KB) cpuset=0x00002008
L1i L#3 (32KB) cpuset=0x00002008
Core L#3 cpuset=0x00002008
PU L#6 (P#3) cpuset=0x00000008
PU L#7 (P#13) cpuset=0x00002000
L2 L#4 (256KB) cpuset=0x00004010
L1d L#4 (32KB) cpuset=0x00004010
L1i L#4 (32KB) cpuset=0x00004010
Core L#4 cpuset=0x00004010
PU L#8 (P#4) cpuset=0x00000010
PU L#9 (P#14) cpuset=0x00004000
L2 L#5 (256KB) cpuset=0x00008020
L1d L#5 (32KB) cpuset=0x00008020
L1i L#5 (32KB) cpuset=0x00008020
Core L#5 cpuset=0x00008020
PU L#10 (P#5) cpuset=0x00000020
PU L#11 (P#15) cpuset=0x00008000
L2 L#6 (256KB) cpuset=0x00010040
L1d L#6 (32KB) cpuset=0x00010040
L1i L#6 (32KB) cpuset=0x00010040
Core L#6 cpuset=0x00010040
PU L#12 (P#6) cpuset=0x00000040
PU L#13 (P#16) cpuset=0x00010000
L2 L#7 (256KB) cpuset=0x00020080
L1d L#7 (32KB) cpuset=0x00020080
L1i L#7 (32KB) cpuset=0x00020080
Core L#7 cpuset=0x00020080
PU L#14 (P#7) cpuset=0x00000080
PU L#15 (P#17) cpuset=0x00020000
L2 L#8 (256KB) cpuset=0x00040100
L1d L#8 (32KB) cpuset=0x00040100
L1i L#8 (32KB) cpuset=0x00040100
Core L#8 cpuset=0x00040100
PU L#16 (P#8) cpuset=0x00000100
PU L#17 (P#18) cpuset=0x00040000
L2 L#9 (256KB) cpuset=0x00080200
L1d L#9 (32KB) cpuset=0x00080200
L1i L#9 (32KB) cpuset=0x00080200
Core L#9 cpuset=0x00080200
PU L#18 (P#9) cpuset=0x00000200
PU L#19 (P#19) cpuset=0x00080000

The benchmark of tbenchs is still positive:

tbench4

5.13-rc4 5.13-rc4
disable-llc-wakewide/

Hmean 1 514.87 ( 0.00%) 505.17 * -1.88%*
Hmean 2 914.45 ( 0.00%) 918.45 * 0.44%*
Hmean 4 1483.81 ( 0.00%) 1485.38 * 0.11%*
Hmean 8 2211.62 ( 0.00%) 2236.02 * 1.10%*
Hmean 16 2129.80 ( 0.00%) 2450.81 * 15.07%*
Hmean 32 5098.35 ( 0.00%) 5085.20 * -0.26%*
Hmean 64 4797.62 ( 0.00%) 4801.34 * 0.08%*
Hmean 80 4802.89 ( 0.00%) 4780.40 * -0.47%*

I guess something which work across several LLC domains
cause performance regression.

I wonder how your test will be like if you pin the testing
to CPUs within one LLC?

Thanks
Barry

2021-06-01 08:00:55

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] sched: fair: don't depend on wake_wide if waker and wakee are already in same LLC

On Mon, May 31, 2021 at 10:21:55PM +0000, Song Bao Hua (Barry Song) wrote:
> The benchmark of tbenchs is still positive:
>
> tbench4
>
> 5.13-rc4 5.13-rc4
> disable-llc-wakewide/
>
> Hmean 1 514.87 ( 0.00%) 505.17 * -1.88%*
> Hmean 2 914.45 ( 0.00%) 918.45 * 0.44%*
> Hmean 4 1483.81 ( 0.00%) 1485.38 * 0.11%*
> Hmean 8 2211.62 ( 0.00%) 2236.02 * 1.10%*
> Hmean 16 2129.80 ( 0.00%) 2450.81 * 15.07%*
> Hmean 32 5098.35 ( 0.00%) 5085.20 * -0.26%*
> Hmean 64 4797.62 ( 0.00%) 4801.34 * 0.08%*
> Hmean 80 4802.89 ( 0.00%) 4780.40 * -0.47%*
>
> I guess something which work across several LLC domains
> cause performance regression.
>
> I wonder how your test will be like if you pin the testing
> to CPUs within one LLC?
>

While I could do this, what would be the benefit? Running within one LLC
would be running the test in one small fraction of the entire machine as
the machine has multiple LLCs per NUMA node. A patch dealing with how the
scheduler works with respect to LLC should take different configurations
into consideration as best as possible.

--
Mel Gorman
SUSE Labs

Subject: RE: [PATCH] sched: fair: don't depend on wake_wide if waker and wakee are already in same LLC



> -----Original Message-----
> From: Mel Gorman [mailto:[email protected]]
> Sent: Tuesday, June 1, 2021 7:59 PM
> To: Song Bao Hua (Barry Song) <[email protected]>
> Cc: Peter Zijlstra <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> yangyicong <[email protected]>; tangchengchang
> <[email protected]>; Linuxarm <[email protected]>
> Subject: Re: [PATCH] sched: fair: don't depend on wake_wide if waker and wakee
> are already in same LLC
>
> On Mon, May 31, 2021 at 10:21:55PM +0000, Song Bao Hua (Barry Song) wrote:
> > The benchmark of tbenchs is still positive:
> >
> > tbench4
> >
> > 5.13-rc4 5.13-rc4
> > disable-llc-wakewide/
> >
> > Hmean 1 514.87 ( 0.00%) 505.17 * -1.88%*
> > Hmean 2 914.45 ( 0.00%) 918.45 * 0.44%*
> > Hmean 4 1483.81 ( 0.00%) 1485.38 * 0.11%*
> > Hmean 8 2211.62 ( 0.00%) 2236.02 * 1.10%*
> > Hmean 16 2129.80 ( 0.00%) 2450.81 * 15.07%*
> > Hmean 32 5098.35 ( 0.00%) 5085.20 * -0.26%*
> > Hmean 64 4797.62 ( 0.00%) 4801.34 * 0.08%*
> > Hmean 80 4802.89 ( 0.00%) 4780.40 * -0.47%*
> >
> > I guess something which work across several LLC domains
> > cause performance regression.
> >
> > I wonder how your test will be like if you pin the testing
> > to CPUs within one LLC?
> >
>
> While I could do this, what would be the benefit? Running within one LLC
> would be running the test in one small fraction of the entire machine as
> the machine has multiple LLCs per NUMA node. A patch dealing with how the
> scheduler works with respect to LLC should take different configurations
> into consideration as best as possible.

I do agree with this. And I do admit this patch is lacking of
consideration and testing of supporting various configurations.
But more input of numbers will be helpful on figuring out a better
solution which can either extend to wider configurations or shrink
to some specific machines like those whose whole numa share
LLC or desktops whose all cpus share LLC in v2. eg:
My pc with the newest i9 intel has all 10 cpus(20 threads) sharing
LLC.

>
> --
> Mel Gorman
> SUSE Labs

Thanks
Barry