2023-05-15 17:30:47

by Chen Yu

Subject: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

[Problem Statement]

High-core-count systems running context-switch-heavy workloads suffer
performance degradation due to Cache-to-Cache (C2C) latency.

The will-it-scale context_switch1 test case exposes the issue. The
test platform has 2 x 56C/112T, 224 CPUs in total. To evaluate the
C2C overhead within one LLC, will-it-scale was run with only 1 socket/node
online, so 56C/112T CPUs were available during the test.

will-it-scale launches 112 instances. Each instance consists of 2 tasks,
and each pair of tasks does ping-pong scheduling via pipe_read() and
pipe_write(). No task is bound to any CPU.
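
(For reference, each pair behaves roughly like the sketch below. This is a
hypothetical stand-in, not the actual will-it-scale source; the real
benchmark counts iterations per second across all pairs.)

#include <unistd.h>

int main(void)
{
	int ab[2], ba[2];	/* one pipe per direction */
	char c = 'x';

	if (pipe(ab) < 0 || pipe(ba) < 0)
		return 1;

	if (fork() == 0) {		/* wakee: echo each byte back */
		close(ab[1]);
		close(ba[0]);
		while (read(ab[0], &c, 1) == 1)
			write(ba[1], &c, 1);
		_exit(0);
	}

	close(ab[0]);
	close(ba[1]);
	for (long i = 0; i < 1000000; i++) {	/* waker */
		write(ab[1], &c, 1);	/* wake the wakee ... */
		read(ba[0], &c, 1);	/* ... then sleep until it answers */
	}
	close(ab[1]);			/* wakee sees EOF and exits */
	return 0;
}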

According to the perf profile, update_cfs_group() and update_load_avg()
consume a large share of cycles:

20.26%    19.89%  [kernel.kallsyms]          [k] update_cfs_group
13.53%    12.15%  [kernel.kallsyms]          [k] update_load_avg

The perf c2c result indicates a high average cost of C2C overhead
from HITM events between the reader, update_cfs_group(), and the writer,
update_load_avg(): both contend for the cache line holding tg->load_avg.
This issue has been investigated and root-caused by Aaron Lu[1], and it
becomes more severe when there are many cross-core task migrations
during wakeup.
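
The sharing pattern can be reproduced with a hypothetical userspace demo
(illustration only, not kernel code): one thread keeps writing a shared
counter while a thread on another core keeps reading it, bouncing the
cache line the same way update_load_avg() (writer) and update_cfs_group()
(reader) bounce tg->load_avg:

#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

static atomic_long load;	/* stands in for tg->load_avg */
static atomic_int stop;

static void *writer(void *unused)	/* update_load_avg() analogue */
{
	while (!atomic_load(&stop))
		atomic_fetch_add(&load, 1);	/* line goes Modified here */
	return NULL;
}

static void *reader(void *unused)	/* update_cfs_group() analogue */
{
	long sum = 0;

	while (!atomic_load(&stop))
		sum += atomic_load(&load);	/* HITM: snoops the dirty line */
	return (void *)sum;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	sleep(10);	/* run "perf c2c record -a" meanwhile */
	atomic_store(&stop, 1);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return 0;
}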

[Proposal]

Scan for an idle sibling within the SMT domain first.

If, in a previous context-switch cycle, the waker and the wakee woke up
each other, then it is likely that they share resources, and
the wakee can be put on an idle sibling next to the waker to avoid the C2C
overhead.
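
For reference, last_wakee is maintained by record_wakee() on every wakeup,
roughly as below (simplified from fair.c), so the condition
current->last_wakee == p && p->last_wakee == current holds exactly when
the two tasks most recently woke each other:

static void record_wakee(struct task_struct *p)
{
	/* Decay the flip count about once per jiffy. */
	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
		current->wakee_flips >>= 1;
		current->wakee_flip_decay_ts = jiffies;
	}

	/* Switching to a new wakee means the waker is fanning out. */
	if (current->last_wakee != p) {
		current->last_wakee = p;
		current->wakee_flips++;
	}
}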

Mike Galbraith helped me test SIS_CURRENT[2], which wakes up a short task
on the current CPU. Although SIS_CURRENT brings improvement on high-end
platforms, it raises the risk of stacking tasks and hurting latency on
low-end systems. Such systems have fewer CPUs in the LLC, and the C2C
reduction cannot offset the latency cost on those platforms. Thanks to
Mike for providing a lot of test data and for suggesting choosing an idle
CPU sharing the L2 to mitigate C2C. Tim and Len also mentioned that we do
have cluster domains, and maybe the cluster wakeup patch from Yicong and
Barry could be leveraged to mitigate C2C[3]. However, in the current code
the SMT domain does not have the SD_CLUSTER flag, so that patch cannot be
reused for now.

The current patch only deals with the SMT domain, but since C2C is mainly
about cache synchronization between cores sharing an L2, the cluster-based
wakeup could be enhanced to cover the SMT domain as well.

[Benchmark]

The baseline is on sched/core branch on top of
commit a6fcdd8d95f7 ("sched/debug: Correct printing for rq->nr_uninterruptible")

Tested the will-it-scale context_switch1 case; it shows good improvement
on both a server and a desktop:

Intel(R) Xeon(R) Platinum 8480+, Sapphire Rapids 2 x 56C/112T = 224 CPUs
context_switch1_processes -s 100 -t 112 -n
baseline                   SIS_PAIR
1.0                        +68.13%

Intel Core(TM) i9-10980XE, Cascade Lake 18C/36T
context_switch1_processes -s 100 -t 18 -n
baseline                   SIS_PAIR
1.0                        +45.2%


[Limitations]
This patch only brings benefits when there is an idle sibling in the SMT
domain. If every CPU in the system is saturated, this patch makes no
difference (unlike SIS_CURRENT). A more complete approach would detect the
saturated case and try its best to put cache-friendly task pairs within
one core.

Before starting the full test, suggestions on whether this is the right
direction would be appreciated. Thanks.

[1] https://lore.kernel.org/lkml/20230327053955.GA570404@ziqianlu-desk2/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/

Signed-off-by: Chen Yu <[email protected]>
---
kernel/sched/fair.c | 15 +++++++++++++++
kernel/sched/features.h | 1 +
2 files changed, 16 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..e65028dcd6a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7125,6 +7125,21 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
 
+	/*
+	 * If the waker and the wakee are good friends to each other,
+	 * putting them within the same SMT domain could reduce C2C
+	 * overhead. SMT idle sibling should be preferred to wakee's
+	 * previous CPU, because the latter could still have the risk of C2C
+	 * overhead.
+	 */
+	if (sched_feat(SIS_PAIR) && sched_smt_active() &&
+	    current->last_wakee == p && p->last_wakee == current) {
+		i = select_idle_smt(p, smp_processor_id());
+
+		if ((unsigned int)i < nr_cpumask_bits)
+			return i;
+	}
+
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..86b5c4f16199 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_PAIR, true)

/*
* Issue a WARN when we do multiple update_rq_clock() calls
--
2.25.1



2023-05-16 06:50:12

by Mike Galbraith

Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

On Tue, 2023-05-16 at 09:11 +0800, Chen Yu wrote:
> [Problem Statement]
>
...

> 20.26%    19.89%  [kernel.kallsyms]          [k] update_cfs_group
> 13.53%    12.15%  [kernel.kallsyms]          [k] update_load_avg

Yup, that's a serious problem, but...

> [Benchmark]
>
> The baseline is on sched/core branch on top of
> commit a6fcdd8d95f7 ("sched/debug: Correct printing for rq->nr_uninterruptible")
>
> Tested the will-it-scale context_switch1 case; it shows good improvement
> on both a server and a desktop:
>
> Intel(R) Xeon(R) Platinum 8480+, Sapphire Rapids 2 x 56C/112T = 224 CPUs
> context_switch1_processes -s 100 -t 112 -n
> baseline                   SIS_PAIR
> 1.0                        +68.13%
>
> Intel Core(TM) i9-10980XE, Cascade Lake 18C/36T
> context_switch1_processes -s 100 -t 18 -n
> baseline                   SIS_PAIR
> 1.0                        +45.2%

git@homer: ./context_switch1_processes -s 100 -t 8 -n
(running in an autogroup)

PerfTop: 30853 irqs/sec kernel:96.8% exact: 96.8% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 8 CPUs)
------------------------------------------------------------------------------------------------------------

5.72% [kernel] [k] switch_mm_irqs_off
4.23% [kernel] [k] __update_load_avg_se
3.76% [kernel] [k] __update_load_avg_cfs_rq
3.70% [kernel] [k] __schedule
3.65% [kernel] [k] entry_SYSCALL_64
3.22% [kernel] [k] enqueue_task_fair
2.91% [kernel] [k] update_curr
2.67% [kernel] [k] select_task_rq_fair
2.60% [kernel] [k] pipe_read
2.55% [kernel] [k] __switch_to
2.54% [kernel] [k] __calc_delta
2.44% [kernel] [k] dequeue_task_fair
2.38% [kernel] [k] reweight_entity
2.13% [kernel] [k] pipe_write
1.96% [kernel] [k] restore_fpregs_from_fpstate
1.93% [kernel] [k] select_idle_smt
1.77% [kernel] [k] update_load_avg <==
1.73% [kernel] [k] native_sched_clock
1.66% [kernel] [k] try_to_wake_up
1.52% [kernel] [k] _raw_spin_lock_irqsave
1.47% [kernel] [k] update_min_vruntime
1.42% [kernel] [k] update_cfs_group <==
1.36% [kernel] [k] vfs_write
1.32% [kernel] [k] prepare_to_wait_event

...not one with global scope. My little i7-4790 can play ping-pong all
day long, as can untold numbers of other boxen around the globe.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..e65028dcd6a6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7125,6 +7125,21 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>             asym_fits_cpu(task_util, util_min, util_max, target))
>                 return target;
>  
> +       /*
> +        * If the waker and the wakee are good friends to each other,
> +        * putting them within the same SMT domain could reduce C2C
> +        * overhead. SMT idle sibling should be preferred to wakee's
> +        * previous CPU, because the latter could still have the risk of C2C
> +        * overhead.
> +        */
> +       if (sched_feat(SIS_PAIR) && sched_smt_active() &&
> +           current->last_wakee == p && p->last_wakee == current) {
> +               i = select_idle_smt(p, smp_processor_id());
> +
> +               if ((unsigned int)i < nr_cpumask_bits)
> +                       return i;
> +       }
> +
>         /*
>          * If the previous CPU is cache affine and idle, don't be stupid:
>          */

Global scope solutions for non-global issues tend to not work out.  

Below is a sample of potential scaling wreckage for boxen that are NOT
akin to the one you're watching turn caches into silicon based pudding.

Note the *_RR numbers. Those poked me in the eye because they closely
resemble pipe ping-pong, all fun and games with about as close to zero
work other than scheduling as network-land can get, but for my box, SMT
was the third best option of three.

You just can't beat idle core selection when it comes to getting work
done, which is why SIS evolved to select cores first.

Your box and ilk need help that treats the disease and not the symptom,
or barring that, help that precisely targets boxen having the disease.

-Mike

10 seconds of 1 netperf client/server instance, no knobs twiddled.

TCP_SENDFILE-1 stacked Avg: 65387
TCP_SENDFILE-1 cross-smt Avg: 65658
TCP_SENDFILE-1 cross-core Avg: 96318

TCP_STREAM-1 stacked Avg: 44322
TCP_STREAM-1 cross-smt Avg: 42390
TCP_STREAM-1 cross-core Avg: 77850

TCP_MAERTS-1 stacked Avg: 36636
TCP_MAERTS-1 cross-smt Avg: 42333
TCP_MAERTS-1 cross-core Avg: 74122

UDP_STREAM-1 stacked Avg: 52618
UDP_STREAM-1 cross-smt Avg: 55298
UDP_STREAM-1 cross-core Avg: 97415

TCP_RR-1 stacked Avg: 242606
TCP_RR-1 cross-smt Avg: 140863
TCP_RR-1 cross-core Avg: 219400

UDP_RR-1 stacked Avg: 282253
UDP_RR-1 cross-smt Avg: 202062
UDP_RR-1 cross-core Avg: 288620

2023-05-16 09:10:31

by Chen Yu

Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

On 2023-05-16 at 08:23:35 +0200, Mike Galbraith wrote:
> On Tue, 2023-05-16 at 09:11 +0800, Chen Yu wrote:
> > [Problem Statement]
> >
> ...
>
> > 20.26%    19.89%  [kernel.kallsyms]          [k] update_cfs_group
> > 13.53%    12.15%  [kernel.kallsyms]          [k] update_load_avg
>
> Yup, that's a serious problem, but...
>
> > [Benchmark]
> >
> > The baseline is on sched/core branch on top of
> > commit a6fcdd8d95f7 ("sched/debug: Correct printing for rq->nr_uninterruptible")
> >
> > Tested the will-it-scale context_switch1 case; it shows good improvement
> > on both a server and a desktop:
> >
> > Intel(R) Xeon(R) Platinum 8480+, Sapphire Rapids 2 x 56C/112T = 224 CPUs
> > context_switch1_processes -s 100 -t 112 -n
> > baseline                   SIS_PAIR
> > 1.0                        +68.13%
> >
> > Intel Core(TM) i9-10980XE, Cascade Lake 18C/36T
> > context_switch1_processes -s 100 -t 18 -n
> > baseline                   SIS_PAIR
> > 1.0                        +45.2%
>
> git@homer: ./context_switch1_processes -s 100 -t 8 -n
> (running in an autogroup)
>
> PerfTop: 30853 irqs/sec kernel:96.8% exact: 96.8% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 8 CPUs)
> ------------------------------------------------------------------------------------------------------------
>
> 5.72% [kernel] [k] switch_mm_irqs_off
> 4.23% [kernel] [k] __update_load_avg_se
> 3.76% [kernel] [k] __update_load_avg_cfs_rq
> 3.70% [kernel] [k] __schedule
> 3.65% [kernel] [k] entry_SYSCALL_64
> 3.22% [kernel] [k] enqueue_task_fair
> 2.91% [kernel] [k] update_curr
> 2.67% [kernel] [k] select_task_rq_fair
> 2.60% [kernel] [k] pipe_read
> 2.55% [kernel] [k] __switch_to
> 2.54% [kernel] [k] __calc_delta
> 2.44% [kernel] [k] dequeue_task_fair
> 2.38% [kernel] [k] reweight_entity
> 2.13% [kernel] [k] pipe_write
> 1.96% [kernel] [k] restore_fpregs_from_fpstate
> 1.93% [kernel] [k] select_idle_smt
> 1.77% [kernel] [k] update_load_avg <==
> 1.73% [kernel] [k] native_sched_clock
> 1.66% [kernel] [k] try_to_wake_up
> 1.52% [kernel] [k] _raw_spin_lock_irqsave
> 1.47% [kernel] [k] update_min_vruntime
> 1.42% [kernel] [k] update_cfs_group <==
> 1.36% [kernel] [k] vfs_write
> 1.32% [kernel] [k] prepare_to_wait_event
>
> ...not one with global scope. My little i7-4790 can play ping-pong all
> day long, as can untold numbers of other boxen around the globe.
>
That is true; on smaller systems, the C2C overhead is not that high.
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 48b6f0ca13ac..e65028dcd6a6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7125,6 +7125,21 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > 	    asym_fits_cpu(task_util, util_min, util_max, target))
> > 		return target;
> > 
> > +	/*
> > +	 * If the waker and the wakee are good friends to each other,
> > +	 * putting them within the same SMT domain could reduce C2C
> > +	 * overhead. SMT idle sibling should be preferred to wakee's
> > +	 * previous CPU, because the latter could still have the risk of C2C
> > +	 * overhead.
> > +	 */
> > +	if (sched_feat(SIS_PAIR) && sched_smt_active() &&
> > +	    current->last_wakee == p && p->last_wakee == current) {
> > +		i = select_idle_smt(p, smp_processor_id());
> > +
> > +		if ((unsigned int)i < nr_cpumask_bits)
> > +			return i;
> > +	}
> > +
> > 	/*
> > 	 * If the previous CPU is cache affine and idle, don't be stupid:
> > 	 */
>
> Global scope solutions for non-global issues tend to not work out.
>
> Below is a sample of potential scaling wreckage for boxen that are NOT
> akin to the one you're watching turn caches into silicon based pudding.
>
> Note the *_RR numbers. Those poked me in the eye because they closely
> resemble pipe ping-pong, all fun and games with about as close to zero
> work other than scheduling as network-land can get, but for my box, SMT
> was the third best option of three.
>
> You just can't beat idle core selection when it comes to getting work
> done, which is why SIS evolved to select cores first.
>
There could be some corner cases. Under some conditions, choosing an idle
CPU within the local core might be better than selecting a new idle core.
The tricky part is that SMT is quite special: SMT siblings share the L2,
but they also compete for other critical resources. For short tasks with
a close relationship to each other, putting them together on one core (on
a high-core-count system) could sometimes bring benefit. The short
duration means the task pair has less chance to compete for the execution
units shared by the SMT siblings. But the short-duration threshold depends
on the number of CPUs in the LLC.
> Your box and ilk need help that treats the disease and not the symptom,
> or barring that, help that precisely targets boxen having the disease.
>
IMO this issue could be generic: the cost of C2C grows roughly as O(sqrt(n)).
In theory, on a system with a large number of CPUs in the LLC and with SMT
enabled, the issue is easy to detect.

As an example, I did not choose a super big system but a desktop
i9-10980XE, and launched 2 pairs of ping-pong tasks.

Each pair of tasks is bound to 1 dedicated core:
./context_switch1_processes -s 30 -t 2
average:956883

No CPU affinity for the tasks:
./context_switch1_processes -s 30 -t 2 -n
average:849209

We can see that waking up the wakee on the local core brings benefits on this platform.

For comparison, I launched the same test on my laptop, an i5-8300H with
4 cores/8 CPUs. I did not see any difference when running 2 pairs of
will-it-scale, but I did notice an improvement when wakees are woken up on
the local core with 4 pairs (I guess the C2C reduction accumulates):

Each pair of tasks is bound to 1 dedicated core:
./context_switch1_processes -s 30 -t 4
average:731965

No CPU affinity for the tasks:
./context_switch1_processes -s 30 -t 4 -n
average:644337


thanks,
Chenyu

> -Mike
>
> 10 seconds of 1 netperf client/server instance, no knobs twiddled.
>
> TCP_SENDFILE-1 stacked Avg: 65387
> TCP_SENDFILE-1 cross-smt Avg: 65658
> TCP_SENDFILE-1 cross-core Avg: 96318
>
> TCP_STREAM-1 stacked Avg: 44322
> TCP_STREAM-1 cross-smt Avg: 42390
> TCP_STREAM-1 cross-core Avg: 77850
>
> TCP_MAERTS-1 stacked Avg: 36636
> TCP_MAERTS-1 cross-smt Avg: 42333
> TCP_MAERTS-1 cross-core Avg: 74122
>
> UDP_STREAM-1 stacked Avg: 52618
> UDP_STREAM-1 cross-smt Avg: 55298
> UDP_STREAM-1 cross-core Avg: 97415
>
> TCP_RR-1 stacked Avg: 242606
> TCP_RR-1 cross-smt Avg: 140863
> TCP_RR-1 cross-core Avg: 219400
>
> UDP_RR-1 stacked Avg: 282253
> UDP_RR-1 cross-smt Avg: 202062
> UDP_RR-1 cross-core Avg: 288620

2023-05-16 12:05:18

by Mike Galbraith

Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

On Tue, 2023-05-16 at 16:41 +0800, Chen Yu wrote:
> >
> IMO this issue could be generic: the cost of C2C grows roughly as O(sqrt(n)).
> In theory, on a system with a large number of CPUs in the LLC and with SMT
> enabled, the issue is easy to detect.
>
> As an example, I did not choose a super big system but a desktop
> i9-10980XE, and launched 2 pairs of ping-pong tasks.
>
> Each pair of tasks is bound to 1 dedicated core:
> ./context_switch1_processes -s 30 -t 2
> average:956883
>
> No CPU affinity for the tasks:
> ./context_switch1_processes -s 30 -t 2 -n
> average:849209
>
> We can see that waking up the wakee on the local core brings benefits on this platform.

Sure, tiny ping-pong balls fly really fast between siblings sharing an L2. The
players being truly synchronous, there's not a whole lot of room for
resource competition, and they do about as close to nothing but
schedule as you can get.

That's not a workload, much less one worthy of special consideration.
It is a useful tool though, exposed a big socket issue, good job tool.
Changing task placement strategy in response to that issue makes zero
sense to me. There are many ways to schedule and wake other tasks at
high frequency, all use the same paths. Reduce the pain that big box
sees when playing high speed ping-pong, and real workloads will profit
in boxen large and small. More in large than small, but nobody loses,
everybody wins.

-Mike

2023-05-17 17:18:00

by Chen Yu

Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

On 2023-05-16 at 13:51:26 +0200, Mike Galbraith wrote:
> On Tue, 2023-05-16 at 16:41 +0800, Chen Yu wrote:
> > >
> > IMO this issue could be generic: the cost of C2C grows roughly as O(sqrt(n)).
> > In theory, on a system with a large number of CPUs in the LLC and with SMT
> > enabled, the issue is easy to detect.
> >
> > As an example, I did not choose a super big system but a desktop
> > i9-10980XE, and launched 2 pairs of ping-pong tasks.
> >
> > Each pair of tasks is bound to 1 dedicated core:
> > ./context_switch1_processes -s 30 -t 2
> > average:956883
> >
> > No CPU affinity for the tasks:
> > ./context_switch1_processes -s 30 -t 2 -n
> > average:849209
> >
> > We can see that waking up the wakee on the local core brings benefits on this platform.
>
> Sure, tiny ping-pong balls fly really fast in a shared L2 siblings. The
> players being truly synchronous, there's not a whole lot of room for
> resource competition, and they do about as close to nothing but
> schedule as you can get.
>
> That's not a workload, much less one worthy of special consideration.
> It is a useful tool though, exposed a big socket issue, good job tool.
> Changing task placement strategy in response to that issue makes zero
> sense to me. There are many ways to schedule and wake other tasks at
> high frequency, all use the same paths. Reduce the pain that big box
> sees when playing high speed ping-pong, and real workloads will profit
> in boxen large and small. More in large than small, but nobody loses,
> everybody wins.
>
I'm thinking of two directions based on the current patch:

1. Check the task duration: if it is a high-speed ping-pong pair, let the
   wakee search for an idle SMT sibling on the current core.

   This strategy gives the best overall performance improvement, but
   the short-task-duration tweak based on the online CPU number would be
   an obstacle.

Or

2. Honor the idle core.
   That is to say, if there is an idle core in the system, choose that
   idle core first. Otherwise, fall back to searching for an idle SMT
   sibling rather than choosing an idle CPU in a random half-busy core.

   This strategy could partially mitigate the C2C overhead without
   breaking the idle-core-first strategy. So I gave it a try; with the
   above change, I did see some improvement when the system is around
   half busy (after all, has_idle_core has to be false):


Core(TM) i9-10980XE Cascade Lake, 18C/36T
netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 9-threads 1.00 ( 0.50) +0.48 ( 0.39)
TCP_RR 18-threads 1.00 ( 0.39) +10.95 ( 0.31)
TCP_RR 27-threads 1.00 ( 0.31) +8.82 ( 0.17)
TCP_RR 36-threads 1.00 ( 0.86) +0.02 ( 0.28)
TCP_RR 45-threads 1.00 ( 4.17) +0.22 ( 4.02)
TCP_RR 54-threads 1.00 ( 1.28) -0.32 ( 1.66)
TCP_RR 63-threads 1.00 ( 3.21) +0.28 ( 2.76)
TCP_RR 72-threads 1.00 ( 0.36) -0.38 ( 0.61)
UDP_RR 9-threads 1.00 ( 0.47) -0.19 ( 0.61)
UDP_RR 18-threads 1.00 ( 0.12) +8.30 ( 0.12)
UDP_RR 27-threads 1.00 ( 3.70) +9.62 ( 0.21)
UDP_RR 36-threads 1.00 ( 0.37) +0.02 ( 0.19)
UDP_RR 45-threads 1.00 ( 13.52) -0.76 ( 17.59)
UDP_RR 54-threads 1.00 ( 11.44) +0.41 ( 1.43)
UDP_RR 63-threads 1.00 ( 17.26) +0.42 ( 3.00)
UDP_RR 72-threads 1.00 ( 0.36) -0.29 ( 7.90)

TCP_MAERTS 9-threads 1.00 ( 0.11) +0.04 ( 0.06)
TCP_MAERTS 18-threads 1.00 ( 0.26) +1.60 ( 0.07)
TCP_MAERTS 27-threads 1.00 ( 0.03) -0.02 ( 0.01)
TCP_MAERTS 36-threads 1.00 ( 0.01) -0.01 ( 0.01)
TCP_MAERTS 45-threads 1.00 ( 0.01) +0.00 ( 0.01)
TCP_MAERTS 54-threads 1.00 ( 0.00) +0.00 ( 0.01)
TCP_MAERTS 63-threads 1.00 ( 0.01) +0.00 ( 0.01)
TCP_MAERTS 72-threads 1.00 ( 0.01) +0.00 ( 0.00)
TCP_SENDFILE 9-threads 1.00 ( 0.10) +0.34 ( 0.25)
TCP_SENDFILE 18-threads 1.00 ( 0.01) -0.02 ( 0.03)
TCP_SENDFILE 27-threads 1.00 ( 0.02) -0.01 ( 0.01)
TCP_SENDFILE 36-threads 1.00 ( 0.00) -0.00 ( 0.00)
TCP_SENDFILE 45-threads 1.00 ( 0.00) +0.00 ( 0.00)
TCP_SENDFILE 54-threads 1.00 ( 0.00) +0.00 ( 0.00)
TCP_SENDFILE 63-threads 1.00 ( 0.00) +0.00 ( 0.00)
TCP_SENDFILE 72-threads 1.00 ( 0.00) -0.00 ( 0.00)
TCP_STREAM 9-threads 1.00 ( 0.01) -0.12 ( 0.01)
TCP_STREAM 18-threads 1.00 ( 0.37) +1.46 ( 0.10)
TCP_STREAM 27-threads 1.00 ( 0.01) -0.01 ( 0.02)
TCP_STREAM 36-threads 1.00 ( 0.01) +0.00 ( 0.01)
TCP_STREAM 45-threads 1.00 ( 0.01) +0.00 ( 0.01)
TCP_STREAM 54-threads 1.00 ( 0.01) +0.00 ( 0.01)
TCP_STREAM 63-threads 1.00 ( 0.01) +0.00 ( 0.01)
TCP_STREAM 72-threads 1.00 ( 0.01) -0.00 ( 0.01)


UDP_STREAM 9-threads 1.00 ( 99.99) +1.40 ( 99.99)
UDP_STREAM 18-threads 1.00 (101.21) +7.33 (100.51)
UDP_STREAM 27-threads 1.00 ( 99.98) +0.03 ( 99.98)
UDP_STREAM 36-threads 1.00 ( 99.97) +0.16 ( 99.97)
UDP_STREAM 45-threads 1.00 (100.06) -0.00 (100.06)
UDP_STREAM 54-threads 1.00 (100.02) +0.10 (100.02)
UDP_STREAM 63-threads 1.00 (100.25) +0.05 (100.25)
UDP_STREAM 72-threads 1.00 ( 99.89) -0.03 ( 99.87)

The UDP_STREAM results seem quite unstable; ignore them for now...

Xeon(R) Platinum 8480+ Sapphire Rapids, 56C/112T, 1 socket online
netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 28-threads 1.00 ( 2.19) -0.32 ( 2.26)
TCP_RR 56-threads 1.00 ( 4.09) -0.22 ( 3.85)
TCP_RR 84-threads 1.00 ( 0.23) +10.11 ( 0.27)
TCP_RR 112-threads 1.00 ( 0.16) +5.12 ( 0.19)
TCP_RR 140-threads 1.00 ( 9.58) -0.08 ( 10.26)
TCP_RR 168-threads 1.00 ( 10.96) -0.61 ( 11.98)
TCP_RR 196-threads 1.00 ( 14.20) -1.31 ( 14.34)
TCP_RR 224-threads 1.00 ( 9.70) -0.84 ( 9.80)
UDP_RR 28-threads 1.00 ( 0.99) -0.29 ( 1.06)
UDP_RR 56-threads 1.00 ( 6.68) +0.96 ( 7.26)
UDP_RR 84-threads 1.00 ( 15.05) +10.37 ( 24.45)
UDP_RR 112-threads 1.00 ( 8.55) +3.98 ( 12.17)
UDP_RR 140-threads 1.00 ( 14.55) +0.25 ( 12.53)
UDP_RR 168-threads 1.00 ( 12.87) -0.81 ( 18.60)
UDP_RR 196-threads 1.00 ( 16.61) +0.10 ( 17.12)
UDP_RR 224-threads 1.00 ( 15.81) +0.00 ( 14.11)

Similarly, there are improvements when the system is above half busy.

[Limitations]
Per the test, this patch only brings benefits when there is an idle sibling
in the SMT domain and no idle core available. This is a trade-off: do not
break latency on low-end platforms, while mitigating C2C on high-end ones.



diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7d2613ab392c..572d663065e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7126,6 +7126,23 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
 
+	/*
+	 * If the waker and the wakee are good friends to each other,
+	 * putting them within the same SMT domain could reduce C2C
+	 * overhead. But this only applies when there is no idle core
+	 * available. SMT idle sibling should be preferred to wakee's
+	 * previous CPU, because the latter could still have the risk of C2C
+	 * overhead.
+	 *
+	 */
+	if (sched_feat(SIS_PAIR) && sched_smt_active() && !has_idle_core &&
+	    current->last_wakee == p && p->last_wakee == current) {
+		i = select_idle_smt(p, smp_processor_id());
+
+		if ((unsigned int)i < nr_cpumask_bits)
+			return i;
+	}
+
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..86b5c4f16199 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_PAIR, true)

/*
* Issue a WARN when we do multiple update_rq_clock() calls
--
2.25.1


2023-05-17 19:59:22

by Mike Galbraith

Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

On Thu, 2023-05-18 at 00:57 +0800, Chen Yu wrote:
> >
> I'm thinking of two directions based on current patch:
>
> 1. Check the task duration: if it is a high-speed ping-pong pair, let the
>    wakee search for an idle SMT sibling on the current core.
>
>    This strategy gives the best overall performance improvement, but
>    the short-task-duration tweak based on the online CPU number would be
>    an obstacle.

Duration is pretty useless, as it says nothing about concurrency.
Taking the 500us metric as an example, one pipe ping-pong can meet
that, and toss up to nearly 50% of throughput out the window if you
stack based only on duration.

> Or
>
> 2. Honor the idle core.
>    That is to say, if there is an idle core in the system, choose that
>    idle core first. Otherwise, fall back to searching for an idle SMT
>    sibling rather than choosing an idle CPU in a random half-busy core.
>
>    This strategy could partially mitigate the C2C overhead without
>    breaking the idle-core-first strategy. So I gave it a try; with the
>    above change, I did see some improvement when the system is around
>    half busy (after all, has_idle_core has to be false):

If mitigation is the goal, and until the next iteration of socket
growth that's not a waste of effort, continuing to honor idle core is
the only option that has a ghost of a chance.

That said, I don't like the waker/wakee-have-met heuristic much either,
because tasks having woken one another before can just as well mean they met
at a sleeping lock; it does not necessarily imply latency-bound IPC.

I haven't met a heuristic I like, and that includes the ones I invent.
The smarter you try to make them, the more precious fast path cycles
they eat, and there's a never ending supply of holes in the damn things
that want plugging. A prime example was the SIS_CURRENT heuristic self
destructing in my box, rendering that patch a not quite free noop :)

-Mike

2023-05-18 04:02:27

by K Prateek Nayak

Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

Hello Chenyu,

I'll do some light testing with some benchmarks and share results on the
thread, but meanwhile I have a few observations about the implementation.

On 5/17/2023 10:27 PM, Chen Yu wrote:
> On 2023-05-16 at 13:51:26 +0200, Mike Galbraith wrote:
>> On Tue, 2023-05-16 at 16:41 +0800, Chen Yu wrote:
>> [..snip..]
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7d2613ab392c..572d663065e3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7126,6 +7126,23 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> 	    asym_fits_cpu(task_util, util_min, util_max, target))
> 		return target;
>
> +	/*
> +	 * If the waker and the wakee are good friends to each other,
> +	 * putting them within the same SMT domain could reduce C2C
> +	 * overhead. But this only applies when there is no idle core
> +	 * available. SMT idle sibling should be preferred to wakee's
> +	 * previous CPU, because the latter could still have the risk of C2C
> +	 * overhead.
> +	 *
> +	 */
> +	if (sched_feat(SIS_PAIR) && sched_smt_active() && !has_idle_core &&

"has_idle_core" is not populated at this point and will always be false
from the initialization. Should there be a:

has_idle_core = test_idle_cores(? /* Which CPU? */);
if (sched_feat(SIS_PAIR) ...) {
...
}
has_idle_core = false;

?: "has_idle_core" is currently used in select_idle_sibling() from the
perspective of the target MC. Does switching target to current core
(which may not be on the same MC) if target MC does not have an idle core
make sense in all scenarios?

> +	    current->last_wakee == p && p->last_wakee == current) {
> +		i = select_idle_smt(p, smp_processor_id());

Also wondering if an asym_fits_cpu() check is needed in some way here.
Consider a case where the waker is on a weaker-capacity CPU but the wakee
previously ran on a stronger-capacity CPU. It might be worthwhile
to wake the wakee on its previous CPU if the current CPU does not fit
the task's utilization, and move the pair to the CPU with larger
capacity during the next wakeup. wake_affine_weight() would select
a target based on load and capacity considerations, but here we
switch the wakeup target to a thread on the current core.

Wondering if the capacity details are already considered in the path?
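
Something like the below is what I had in mind (untested sketch on top of
your diff, ignoring the has_idle_core population issue above; task_util,
util_min and util_max are the values already computed earlier in
select_idle_sibling()):

	if (sched_feat(SIS_PAIR) && sched_smt_active() && !has_idle_core &&
	    asym_fits_cpu(task_util, util_min, util_max, smp_processor_id()) &&
	    current->last_wakee == p && p->last_wakee == current) {
		i = select_idle_smt(p, smp_processor_id());

		if ((unsigned int)i < nr_cpumask_bits)
			return i;
	}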

> +
> +		if ((unsigned int)i < nr_cpumask_bits)
> +			return i;
> +	}
> +
> 	/*
> 	 * If the previous CPU is cache affine and idle, don't be stupid:
> 	 */
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c76bd3..86b5c4f16199 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> */
> SCHED_FEAT(SIS_PROP, false)
> SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_PAIR, true)
>
> /*
> * Issue a WARN when we do multiple update_rq_clock() calls

--
Thanks and Regards,
Prateek

2023-05-18 04:06:35

by Chen Yu

Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

On 2023-05-17 at 21:52:21 +0200, Mike Galbraith wrote:
> On Thu, 2023-05-18 at 00:57 +0800, Chen Yu wrote:
> > >
> > I'm thinking of two directions based on current patch:
> >
> > 1. Check the task duration: if it is a high-speed ping-pong pair, let the
> >    wakee search for an idle SMT sibling on the current core.
> >
> >    This strategy gives the best overall performance improvement, but
> >    the short-task-duration tweak based on the online CPU number would be
> >    an obstacle.
>
> Duration is pretty useless, as it says nothing about concurrency.
> Taking the 500us metric as an example, one pipe ping-pong can meet
> that, and toss up to nearly 50% of throughput out the window if you
> stack based only on duration.
>
> > Or
> >
> > 2. Honor the idle core.
> >    That is to say, if there is an idle core in the system, choose that
> >    idle core first. Otherwise, fall back to searching for an idle SMT
> >    sibling rather than choosing an idle CPU in a random half-busy core.
> >
> >    This strategy could partially mitigate the C2C overhead without
> >    breaking the idle-core-first strategy. So I gave it a try; with the
> >    above change, I did see some improvement when the system is around
> >    half busy (after all, has_idle_core has to be false):
>
> If mitigation is the goal, and until the next iteration of socket
> growth that's not a waste of effort, continuing to honor idle core is
> the only option that has a ghost of a chance.
>
> That said, I don't like the waker/wakee have met heuristic much either,
> because tasks waking one another before can just as well mean they met
> at a sleeping lock, it does not necessarily imply latency bound IPC.
>
Yes, for a sleeping-lock case it does not matter whether the wakee is woken
on an idle sibling or on an idle CPU of another half-busy core. But for a
pair sharing data, it could bring benefit.
> I haven't met a heuristic I like, and that includes the ones I invent.
> The smarter you try to make them, the more precious fast path cycles
> they eat, and there's a never ending supply of holes in the damn things
> that want plugging. A prime example was the SIS_CURRENT heuristic self
> destructing in my box, rendering that patch a not quite free noop :)
>
Yes.. SIS_CURRENT is not a universal win.


thanks,
Chenyu

2023-05-18 04:40:41

by Chen Yu

Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

Hi Prateek,
On 2023-05-18 at 09:00:38 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> I'll do some light testing with some benchmarks and share results on the
> thread but meanwhile I have a few observations with the implementation.
>
Thanks for reviewing this change.
> On 5/17/2023 10:27 PM, Chen Yu wrote:
> > On 2023-05-16 at 13:51:26 +0200, Mike Galbraith wrote:
> >> On Tue, 2023-05-16 at 16:41 +0800, Chen Yu wrote:
> >> [..snip..]
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 7d2613ab392c..572d663065e3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7126,6 +7126,23 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > asym_fits_cpu(task_util, util_min, util_max, target))
> > return target;
> >
> > + /*
> > + * If the waker and the wakee are good friends to each other,
> > + * putting them within the same SMT domain could reduce C2C
> > + * overhead. But this only applies when there is no idle core
> > + * available. SMT idle sibling should be preferred to wakee's
> > + * previous CPU, because the latter could still have the risk of C2C
> > + * overhead.
> > + *
> > + */
> > + if (sched_feat(SIS_PAIR) && sched_smt_active() && !has_idle_core &&
>
> "has_idle_core" is not populated at this point and will always be false
> from the initialization. Should there be a:
>
> has_idle_core = test_idle_cores(? /* Which CPU? */);
Yes, you are right. I have 2 patches; the first one checks has_idle_core
at the beginning, but I forgot to send it out and only sent the second one.
> if (sched_feat(SIS_PAIR) ...) {
> ...
> }
> has_idle_core = false;
>
> ?: "has_idle_core" is currently used in select_idle_sibling() from the
> perspective of the target MC. Does switching target to current core
> (which may not be on the same MC) if target MC does not have an idle core
> make sense in all scenarios?
Right, we should check whether the target equals the current CPU. Since I
tested with 1 socket online, this issue was not exposed.
>
> > + current->last_wakee == p && p->last_wakee == current) {
> > + i = select_idle_smt(p, smp_processor_id());
>
> Also wondering if an asym_fits_cpu() check is needed in some way here.
> Consider a case where the waker is on a weaker-capacity CPU but the wakee
> previously ran on a stronger-capacity CPU. It might be worthwhile
> to wake the wakee on its previous CPU if the current CPU does not fit
> the task's utilization, and move the pair to the CPU with larger
> capacity during the next wakeup. wake_affine_weight() would select
> a target based on load and capacity considerations, but here we
> switch the wakeup target to a thread on the current core.
>
> Wondering if the capacity details are already considered in the path?
>
Good point. I guess what you mean is that the target could be another CPU
rather than the current one, so there should be a check whether the target
equals the current CPU. Let me refine the patch and run a test.

thanks,
Chenyu
> > +
> > + if ((unsigned int)i < nr_cpumask_bits)
> > + return i;
> > + }
> > +
> > /*
> > * If the previous CPU is cache affine and idle, don't be stupid:
> > */
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index ee7f23c76bd3..86b5c4f16199 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> > */
> > SCHED_FEAT(SIS_PROP, false)
> > SCHED_FEAT(SIS_UTIL, true)
> > +SCHED_FEAT(SIS_PAIR, true)
> >
> > /*
> > * Issue a WARN when we do multiple update_rq_clock() calls
>
> --
> Thanks and Regards,
> Prateek

2023-05-18 10:49:26

by K Prateek Nayak

Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

Hello Chenyu,

On 5/18/2023 9:47 AM, Chen Yu wrote:
> Hi Prateek,
> On 2023-05-18 at 09:00:38 +0530, K Prateek Nayak wrote:
>> [..snip..]
>>> + /*
>>> + * If the waker and the wakee are good friends to each other,
>>> + * putting them within the same SMT domain could reduce C2C
>>> + * overhead. But this only applies when there is no idle core
> >> + * available. SMT idle sibling should be preferred to wakee's
>>> + * previous CPU, because the latter could still have the risk of C2C
>>> + * overhead.
>>> + *
>>> + */
>>> + if (sched_feat(SIS_PAIR) && sched_smt_active() && !has_idle_core &&
>>
>> "has_idle_core" is not populated at this point and will always be false
>> from the initialization. Should there be a:
>>
>> has_idle_core = test_idle_cores(? /* Which CPU? */);
> Yes, you are right. I have 2 patches; the first one checks has_idle_core
> at the beginning, but I forgot to send it out and only sent the second one.
>> if (sched_feat(SIS_PAIR) ...) {
>> ...
>> }
>> has_idle_core = false;
>>
>> ?: "has_idle_core" is currently used in select_idle_sibling() from the
>> perspective of the target MC. Does switching target to current core
>> (which may not be on the same MC) if target MC does not have an idle core
>> make sense in all scenarios?
> Right, we should check whether the target equals the current CPU. Since I
> tested with 1 socket online, this issue was not exposed.

Ah! I see. Yup, a single socket (with one MC) will not run into any issues
with the current implementation.

>>
>>> + current->last_wakee == p && p->last_wakee == current) {
>>> + i = select_idle_smt(p, smp_processor_id());
>>
>> Also wondering if an asym_fits_cpu() check is needed in some way here.
>> Consider a case where the waker is on a weaker-capacity CPU but the wakee
>> previously ran on a stronger-capacity CPU. It might be worthwhile
>> to wake the wakee on its previous CPU if the current CPU does not fit
>> the task's utilization, and move the pair to the CPU with larger
>> capacity during the next wakeup. wake_affine_weight() would select
>> a target based on load and capacity considerations, but here we
>> switch the wakeup target to a thread on the current core.
>>
>> Wondering if the capacity details are already considered in the path?
>>
> Good point. I guess what you mean is that the target could be another CPU
> rather than the current one, so there should be a check whether the target
> equals the current CPU.

Yup. That should handle the asymmetric-capacity condition too, but I'm
wondering if it makes the case too narrow to see the same benefit.

Can you perhaps try "cpus_share_cache(target, smp_processor_id())"
instead of "target == smp_processor_id()"? We use similar
logic to test whether p->recent_used_cpu is a good target.

This will be equivalent to your current implementation for a single
socket with one LLC, and for the dual-socket or multiple-LLC case,
we can be sure "has_idle_core" indicates the status of the MC which
is shared by both the target and the current CPU.
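
i.e., roughly the following (untested, and assuming has_idle_core is
populated via test_idle_cores() first, as discussed earlier):

	if (sched_feat(SIS_PAIR) && sched_smt_active() && !has_idle_core &&
	    cpus_share_cache(target, smp_processor_id()) &&
	    current->last_wakee == p && p->last_wakee == current) {
		i = select_idle_smt(p, smp_processor_id());

		if ((unsigned int)i < nr_cpumask_bits)
			return i;
	}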

> Let me refine the patch and have a test.
>

I'll hold off queuing a full test run until then.

> thanks,
> Chenyu
>
> [..snip..]
--
Thanks and Regards,
Prateek

2023-05-19 11:37:56

by Mike Galbraith

Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

On Thu, 2023-05-18 at 11:41 +0800, Chen Yu wrote:
> On 2023-05-17 at 21:52:21 +0200, Mike Galbraith wrote:
> >
> > That said, I don't like the waker/wakee have met heuristic much either,
> > because tasks waking one another before can just as well mean they met
> > at a sleeping lock, it does not necessarily imply latency bound IPC.
> >
> Yes, for a sleeping-lock case it does not matter whether the wakee is woken
> on an idle sibling or on an idle CPU of another half-busy core. But for a
> pair sharing data, it could bring benefit.

That reply keeps bouncing about in my head, annoying me enough that I'm
going to reply to it so I can finally stop thinking about pipe ping-pong
and the risks of big-socket-only mitigation patches.

The object that inspired SIS_CURRENT, which then morphed into SIS_PAIR,
is in effect a mutex. The numbers derived from operation of that mutex
are not really relevant to IPC, or to context switches for that matter
(says me;), they're all about memory access cost deltas.

-Mike

2023-05-22 04:48:41

by Chen Yu

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

On 2023-05-18 at 15:56:12 +0530, K Prateek Nayak wrote:
[snip]
> >>
> >> Also wondering if an asym_fits_cpu() check is needed in some way here.
> >> Consider a case where the waker is on a weaker-capacity CPU but the wakee
> >> previously ran on a stronger-capacity CPU. It might be worthwhile
> >> to wake the wakee on its previous CPU if the current CPU does not fit
> >> the task's utilization, and move the pair to the CPU with larger
> >> capacity during the next wakeup. wake_affine_weight() would select
> >> a target based on load and capacity considerations, but here we
> >> switch the wakeup target to a thread on the current core.
> >>
> >> Wondering if the capacity details are already considered in the path?
> >>
> > Good point, I guess what you mean is that, target could be other CPU rather than
> > the current one, there should be a check if the target equals to current CPU.
>
> Yup. That should handle the asymmetric-capacity condition too, but I'm
> wondering if it makes the case too narrow to see the same benefit.
>
> Can you perhaps try "cpus_share_cache(target, smp_processor_id())"
> instead of "target == smp_processor_id()"? We use similar
> logic to test whether p->recent_used_cpu is a good target.
>
> This will be equivalent to your current implementation for a single
> socket with one LLC, and for the dual-socket or multiple-LLC case,
> we can be sure "has_idle_core" indicates the status of the MC which
> is shared by both the target and the current CPU.
>
Right, in this way we can avoid the issue where the target and the current
CPU are in different LLCs and has_idle_core does not reflect that.
And asym_fits_cpu() might also be needed to check whether the task fits.
> > Let me refine the patch and have a test.
> >
>
> I'll hold off queuing a full test run until then.
>
Thank you. I'm also thinking of removing the last_wakee check, so there
is not much heuristic involved. I'll do some investigation.

Meanwhile, I looked back at Yicong's proposal on waking up a task
on the local cluster first. It did show some improvement on Jacobsville;
I guess that could also be a chance to reduce C2C latency.

thanks,
Chenyu
> > thanks,
> > Chenyu
> >
> > [..snip..]
> --
> Thanks and Regards,
> Prateek

2023-05-22 07:34:45

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

On Mon, 2023-05-22 at 12:10 +0800, Chen Yu wrote:
>
> Meanwhile, I looked back at Yicong's proposal on waking up task
> on local cluster first. It did show some improvement on Jacobsville,
> I guess that could also be a chance to reduce C2C latency.

Something else to consider is that communication data comes in many
chunk sizes and volumes, and cache footprints can easily be entirely
local constructs unrelated to any transferred data.

At one extreme of the huge spectrum of possibilities, a couple less
than brilliant tasks playing high speed ping-pong can bounce all over a
box with zero consequences, but for a pair more akin to say Einstein
and Bohr pondering chalkboards full of mind bending math and meeting
occasionally at the water cooler to exchange snarky remarks, needlessly
bouncing them about forces them to repopulate their chalkboards, and the
C2C traffic you try to avoid, you generate via the bounce.

Try as you may, methinks you're unlikely to succeed at avoiding C2C in
a box where roughly half of all paths are C2C. What tasks have in
their pockets and what they'll do with a CPU at any point in time is
unknown and unknowable by the scheduler, dooming pinpoint placement
accuracy as a goal.

-Mike

2023-05-25 07:50:30

by Chen Yu

[permalink] [raw]
Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

On 2023-05-22 at 09:10:33 +0200, Mike Galbraith wrote:
> On Mon, 2023-05-22 at 12:10 +0800, Chen Yu wrote:
> >
> > Meanwhile, I looked back at Yicong's proposal on waking up task
> > on local cluster first. It did show some improvement on Jacobsville,
> > I guess that could also be a chance to reduce C2C latency.
>
> Something else to consider is that communication data comes in many
> size chunks and volumes, and cache footprints can easily be entirely
> local constructs unrelated to any transferred data.
>
> At one extreme of the huge spectrum of possibilities, a couple less
> than brilliant tasks playing high speed ping-pong can bounce all over a
> box with zero consequences, but for a pair more akin to say Einstein
> and Bohr pondering chalkboards full of mind bending math and meeting
> occasionally at the water cooler to exchange snarky remarks, needlessly
> bouncing them about forces them to repopulate their chalkboards, and the
> C2C traffic you try to avoid, you generate via the bounce.
>
> Try as you may, methinks you're unlikely to succeed at avoiding C2C in
> a box where roughly half of all paths are C2C. What tasks have in
> their pockets and what they'll do with a CPU at any point in time is
> unknown and unknowable by the scheduler, dooming pinpoint placement
> accuracy as a goal.
>
I guess what you mean is that, for a wakee with a large local cache
footprint, it is not a good idea to wake the wakee on a remote core,
because in that way the wakee has to repopulate the cache from scratch.
Yes, the problem is that the scheduler currently lacks a metric
to indicate the task's working set, i.e. per-task cache-footprint tracking
(although we have NUMA balancing to calculate per-task per-node statistics).
Provided with such a cache-aware metric, the wakee could be put on a candidate
CPU whose cache locality (either LLC or L2) is friendly to the wakee.
Because there is no such accurate metric, the heuristic seems to be a
compromise for predicting task placement.

The C2C overhead was mainly caused by accessing the global tg->load_avg,
so besides wakeup placement there should also be other ways to mitigate
C2C, such as reducing the frequency of accessing tg->load_avg.
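
For example, the write side already applies a crude filter: the per-cfs_rq
delta is only folded into the shared atomic when it exceeds 1/64 of the
last contribution (roughly as below, simplified from update_tg_load_avg()).
Making that filter stricter, or batching the updates, would reduce the
writes to the contended line:

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

	/* Only touch the contended tg->load_avg cache line when the
	 * local change is large enough to matter. */
	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}
}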

Besides that, while studying the history of wake_wide(), I suddenly
found that 10 years ago Michael proposed exactly the same strategy:
check if tasks A and B are waking up each other; if they are, put them
together, otherwise spread them to different LLCs:
https://lkml.org/lkml/2013/3/6/73
And this version finally evolved into what wake_wide() looks like today,
via your patch:
https://marc.info/?l=linux-kernel&m=143688840122477
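
For reference, today's wake_wide() looks roughly like this (simplified);
note that the factor is the LLC size, which is exactly what an SMT or L2
variant would have to replace:

static int wake_wide(struct task_struct *p)
{
	unsigned int master = current->wakee_flips;
	unsigned int slave = p->wakee_flips;
	int factor = __this_cpu_read(sd_llc_size);

	if (master < slave)
		swap(master, slave);
	/* Wake affine unless the flip counts indicate a 1:N fan-out
	 * wider than what the LLC can absorb. */
	if (slave < factor || master < slave * factor)
		return 0;
	return 1;
}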
If I understand correctly, since wake_wide() can decide whether to wake up
the task on an idle CPU in the local LLC or a remote LLC, does it also
make sense to extend wake_wide() to the SMT domain and the L2 domain?
Say, if wake_wide(p, nr_smt) returns true, find an idle CPU in a remote
SMT domain; otherwise scan for an idle CPU in the local SMT domain. In that
case, does the has_idle_core check matter?

thanks,
Chenyu

2023-05-25 09:53:03

by Mike Galbraith

Subject: Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

On Thu, 2023-05-25 at 15:47 +0800, Chen Yu wrote:
> On 2023-05-22 at 09:10:33 +0200, Mike Galbraith wrote:
> >
> > At one extreme of the huge spectrum of possibilities, a couple less
> > than brilliant tasks playing high speed ping-pong can bounce all over a
> > box with zero consequences, but for a pair more akin to say Einstein
> > and Bohr pondering chalkboards full of mind bending math and meeting
> > occasionally at the water cooler to exchange snarky remarks, needlessly
> > bouncing them about forces them to repopulate their chalkboards, and the
> > C2C traffic you try to avoid, you generate via the bounce.
> >
> I guess what you mean is that, for a wakee with a large local cache
> footprint, it is not a good idea to wake the wakee on a remote core,
> because in that way the wakee has to repopulate the cache from scratch.

Yeah, and all variations in between.

> Yes, the problem is that the scheduler currently lacks a metric
> to indicate the task's working set, i.e. per-task cache-footprint tracking
> (although we have NUMA balancing to calculate per-task per-node statistics).
> Provided with such a cache-aware metric, the wakee could be put on a candidate
> CPU whose cache locality (either LLC or L2) is friendly to the wakee.
> Because there is no such accurate metric, the heuristic seems to be a
> compromise for predicting task placement.

Nah, it's a dart toss. With a box full of net blaster tools, the odds
may even be favorable, but who knows what the wild will do.

> The C2C overhead was mainly caused by accessing the global tg->load_avg,
> so besides wakeup placement there should also be other ways to mitigate
> C2C, such as reducing the frequency of accessing tg->load_avg.

Attacking that is the only thing that makes any sense to me.

> Besides that, while studying the history of wake_wide(), I suddenly
> found that 10 years ago Michael proposed exactly the same strategy:
> check if tasks A and B are waking up each other; if they are, put them
> together, otherwise spread them to different LLCs:
> https://lkml.org/lkml/2013/3/6/73
> And this version finally evolved into what wake_wide() looks like today,
> via your patch:

Yeah, I've touched that, but it's still busted. I watched firefox
burst wake a way too big thread pool, but since worker-bees collected
zero flips, the heuristic says all is well, move along ginormous swarm.

-Mike