2023-06-21 08:29:42

by Deng Pan

Subject: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing

When running the UnixBench/Pipe-based Context Switching case, we observed
heavy false sharing on 'load_avg' against rt_se and rt_rq when
CONFIG_RT_GROUP_SCHED is enabled.

The Pipe-based Context Switching case is a typical sleep/wakeup scenario,
in which load_avg is frequently loaded and stored while, at the same time,
rt_se and rt_rq are frequently loaded. Unfortunately, they are in the same
cacheline.

This change re-lays out the structure:
1. Move rt_se and rt_rq to a 2nd cacheline.
2. Keep the 'parent' field in the 2nd cacheline since it is also accessed
very often when cgroups are nested; thanks to Tim Chen for providing the
insight.

Tested on an Intel Icelake 2-socket 80C/160T platform, based on v6.4-rc5.

With this change, the Pipe-based Context Switching 160-parallel score is
improved by ~9.6%, perf record shows that rt_se and rt_rq access cycles
are reduced from ~14.5% to ~0.3%, and perf c2c shows that the false
sharing is resolved as expected:

Baseline:
=================================================
Shared Cache Line Distribution Pareto
=================================================
----------------------------------------------------------------------
0 1031 3927 3322 50 0 0xff284d17b5c0fa00
----------------------------------------------------------------------
63.72% 65.16% 0.00% 0.00% 0.00% 0x0 1 1 0xffffffffa134934e 4247 3249 4057 13874 160 [k] update_cfs_group [kernel.kallsyms] update_cfs_group+78 0 1
7.47% 3.23% 98.43% 0.00% 0.00% 0x0 1 1 0xffffffffa13478ac 12034 13166 7699 8149 160 [k] update_load_avg [kernel.kallsyms] update_load_avg+940 0 1
0.58% 0.18% 0.39% 98.00% 0.00% 0x0 1 1 0xffffffffa13478b4 40713 44343 33768 158 95 [k] update_load_avg [kernel.kallsyms] update_load_avg+948 0 1
0.00% 0.08% 1.17% 0.00% 0.00% 0x0 1 1 0xffffffffa1348076 0 14303 6006 75 61 [k] __update_blocked_fair [kernel.kallsyms] __update_blocked_fair+998 0 1
0.19% 0.03% 0.00% 0.00% 0.00% 0x0 1 1 0xffffffffa1349355 30718 2820 23693 246 117 [k] update_cfs_group [kernel.kallsyms] update_cfs_group+85 0 1
0.00% 0.00% 0.00% 2.00% 0.00% 0x0 1 1 0xffffffffa134807e 0 0 24401 2 2 [k] __update_blocked_fair [kernel.kallsyms] __update_blocked_fair+1006 0 1
14.16% 16.30% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffffa133c5c7 5101 4028 4839 7354 160 [k] set_task_cpu [kernel.kallsyms] set_task_cpu+279 0 1
0.00% 0.03% 0.00% 0.00% 0.00% 0x8 1 1 0xffffffffa133c5ce 0 18646 25195 30 28 [k] set_task_cpu [kernel.kallsyms] set_task_cpu+286 0 1
13.87% 14.97% 0.00% 0.00% 0.00% 0x10 1 1 0xffffffffa133c5b5 4138 3738 5608 6321 160 [k] set_task_cpu [kernel.kallsyms] set_task_cpu+261 0 1
0.00% 0.03% 0.00% 0.00% 0.00% 0x10 1 1 0xffffffffa133c5bc 0 6321 26398 149 88 [k] set_task_cpu [kernel.kallsyms] set_task_cpu+268 0 1

With this change:
=================================================
Shared Cache Line Distribution Pareto
=================================================
----------------------------------------------------------------------
0 1118 3340 3118 57 0 0xff1d6ca01ecc5e80
----------------------------------------------------------------------
91.59% 94.46% 0.00% 0.00% 0.00% 0x0 1 1 0xffffffff8914934e 4710 4211 5158 14218 160 [k] update_cfs_group [kernel.kallsyms] update_cfs_group+78 0 1
7.42% 4.82% 97.98% 0.00% 0.00% 0x0 1 1 0xffffffff891478ac 15225 14713 8593 7858 160 [k] update_load_avg [kernel.kallsyms] update_load_avg+940 0 1
0.81% 0.66% 0.58% 98.25% 0.00% 0x0 1 1 0xffffffff891478b4 38486 44799 33123 186 107 [k] update_load_avg [kernel.kallsyms] update_load_avg+948 0 1
0.18% 0.06% 0.00% 0.00% 0.00% 0x0 1 1 0xffffffff89149355 20077 32046 22302 388 144 [k] update_cfs_group [kernel.kallsyms] update_cfs_group+85 0 1
0.00% 0.00% 1.41% 0.00% 0.00% 0x0 1 1 0xffffffff89148076 0 0 6804 85 64 [k] __update_blocked_fair [kernel.kallsyms] __update_blocked_fair+998 0 1
0.00% 0.00% 0.03% 1.75% 0.00% 0x0 1 1 0xffffffff8914807e 0 0 26581 3 3 [k] __update_blocked_fair [kernel.kallsyms] __update_blocked_fair+1006 0 1

Besides the above, hackbench, netperf and schbench were also tested; no
obvious regression was detected.

hackbench
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 0.87) -0.95 ( 1.72)
process-pipe 2-groups 1.00 ( 0.57) +9.11 ( 14.44)
process-pipe 4-groups 1.00 ( 0.64) +6.77 ( 2.50)
process-pipe 8-groups 1.00 ( 0.28) -4.39 ( 2.02)
process-sockets 1-groups 1.00 ( 2.37) +1.13 ( 0.76)
process-sockets 2-groups 1.00 ( 7.83) -3.41 ( 4.78)
process-sockets 4-groups 1.00 ( 2.24) +0.71 ( 2.13)
process-sockets 8-groups 1.00 ( 0.39) +1.05 ( 0.19)
threads-pipe 1-groups 1.00 ( 1.85) -2.22 ( 0.66)
threads-pipe 2-groups 1.00 ( 2.36) +3.48 ( 6.44)
threads-pipe 4-groups 1.00 ( 3.07) -7.92 ( 5.82)
threads-pipe 8-groups 1.00 ( 1.00) +2.68 ( 1.28)
threads-sockets 1-groups 1.00 ( 0.34) +1.19 ( 1.96)
threads-sockets 2-groups 1.00 ( 6.24) -4.88 ( 2.10)
threads-sockets 4-groups 1.00 ( 2.26) +0.41 ( 1.58)
threads-sockets 8-groups 1.00 ( 0.46) +0.07 ( 2.19)

netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 40-threads 1.00 ( 0.78) -0.18 ( 1.80)
TCP_RR 80-threads 1.00 ( 0.72) -1.62 ( 0.84)
TCP_RR 120-threads 1.00 ( 0.74) -0.35 ( 0.99)
TCP_RR 160-threads 1.00 ( 30.79) -1.75 ( 29.57)
TCP_RR 200-threads 1.00 ( 17.45) -2.89 ( 16.64)
TCP_RR 240-threads 1.00 ( 27.73) -2.46 ( 19.35)
TCP_RR 280-threads 1.00 ( 32.76) -3.00 ( 30.65)
TCP_RR 320-threads 1.00 ( 41.73) -3.14 ( 37.84)
UDP_RR 40-threads 1.00 ( 1.21) +0.02 ( 1.68)
UDP_RR 80-threads 1.00 ( 0.33) -0.47 ( 9.59)
UDP_RR 120-threads 1.00 ( 12.38) +0.30 ( 13.42)
UDP_RR 160-threads 1.00 ( 29.10) +8.17 ( 34.51)
UDP_RR 200-threads 1.00 ( 21.04) -1.72 ( 20.96)
UDP_RR 240-threads 1.00 ( 38.11) -2.54 ( 38.15)
UDP_RR 280-threads 1.00 ( 31.56) -0.73 ( 32.70)
UDP_RR 320-threads 1.00 ( 41.54) -2.00 ( 44.39)

schbench
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 4.16) +3.53 ( 0.86)
normal 2-mthreads 1.00 ( 2.86) +1.69 ( 2.91)
normal 4-mthreads 1.00 ( 4.97) -6.53 ( 8.20)
normal 8-mthreads 1.00 ( 0.86) -0.70 ( 0.54)

Reviewed-by: Tim Chen <[email protected]>
Signed-off-by: Deng Pan <[email protected]>
---
V1 -> V2:
- Add comment in data structure
- More data support in commit log

kernel/sched/sched.h | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec7b3e0a2b20..4fbd4b3a4bdd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -389,6 +389,19 @@ struct task_group {
#endif
#endif

+ struct rcu_head rcu;
+ struct list_head list;
+
+ struct list_head siblings;
+ struct list_head children;
+
+ /*
+ * To reduce false sharing, current layout is optimized to make
+ * sure load_avg is in a different cacheline from parent, rt_se
+ * and rt_rq.
+ */
+ struct task_group *parent;
+
#ifdef CONFIG_RT_GROUP_SCHED
struct sched_rt_entity **rt_se;
struct rt_rq **rt_rq;
@@ -396,13 +409,6 @@ struct task_group {
struct rt_bandwidth rt_bandwidth;
#endif

- struct rcu_head rcu;
- struct list_head list;
-
- struct task_group *parent;
- struct list_head siblings;
- struct list_head children;
-
#ifdef CONFIG_SCHED_AUTOGROUP
struct autogroup *autogroup;
#endif
--
2.39.3



2023-06-26 05:58:36

by Aaron Lu

Subject: Re: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing

On Wed, Jun 21, 2023 at 04:14:25PM +0800, Deng Pan wrote:
> When running UnixBench/Pipe-based Context Switching case, we observed
> high false sharing for accessing ‘load_avg’ against rt_se and rt_rq
> when config CONFIG_RT_GROUP_SCHED is turned on.
>
> Pipe-based Context Switching case is a typical sleep/wakeup scenario,
> in which load_avg is frequenly loaded and stored, at the meantime, rt_se
> and rt_rq are frequently loaded. Unfortunately, they are in the same
> cacheline.
>
> This change re-layouts the structure:
> 1. Move rt_se and rt_rq to a 2nd cacheline.
> 2. Keep ‘parent’ field in the 2nd cacheline since it's also accessed
> very often when cgroups are nested, thanks Tim Chen for providing the
> insight.
>
> Tested on Intel Icelake 2 sockets 80C/160T platform, based on v6.4-rc5.
>
> With this change, Pipe-based Context Switching 160 parallel score is
> improved ~9.6%, perf record tool reports rt_se and rt_rq access cycles
> are reduced from ~14.5% to ~0.3%, perf c2c tool shows the false-sharing
> is resolved as expected:

I also gave it a run on an Icelake and saw similar things when
CONFIG_RT_GROUP_SCHED is on.

For hackbench/pipe/thread, set_task_cpu() dropped from 1.67% to 0.51%
according to perf cycles; for netperf/nr_client=nr_cpu/UDP_RR,
set_task_cpu() dropped from 5.06% to 1.08%.

The patch looks good to me, just a nit below.

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ec7b3e0a2b20..4fbd4b3a4bdd 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -389,6 +389,19 @@ struct task_group {
> #endif
> #endif
>
> + struct rcu_head rcu;
> + struct list_head list;
> +
> + struct list_head siblings;
> + struct list_head children;
> +
> + /*
> + * To reduce false sharing, current layout is optimized to make
> + * sure load_avg is in a different cacheline from parent, rt_se
> + * and rt_rq.
> + */
> + struct task_group *parent;
> +

I wonder if we can simply do:

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec7b3e0a2b20..31b73e8d9568 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -385,7 +385,9 @@ struct task_group {
* it in its own cacheline separated from the fields above which
* will also be accessed at each tick.
*/
- atomic_long_t load_avg ____cacheline_aligned;
+ struct {
+ atomic_long_t load_avg;
+ } ____cacheline_aligned_in_smp;
#endif
#endif

This way it can make sure there is no false sharing with load_avg no
matter how the layout of this structure changes in the future.

Your patch has the advantage of not adding any more padding and thus saves
some space; the example code above has the advantage that there is no need
to worry about future changes breaking the expected alignment, but it does
make the structure a little larger (704 -> 768).
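
For anyone wondering where the 704 -> 768 comes from: wrapping a field in a
cacheline-aligned anonymous struct both starts it on a fresh cacheline and
pads it out to a full one, so the fields after it also move down. Below is a
stand-alone userspace sketch (not kernel code, illustrative names only) that
shows the mechanism:

#include <stdio.h>
#include <stddef.h>

#define CACHELINE 64

struct tg_plain {                  /* load_avg shares a line with rt_se/rt_rq */
    long cfs_stuff[4];
    long load_avg;
    void *rt_se;
    void *rt_rq;
};

struct tg_wrapped {                /* roughly what the aligned anonymous struct does */
    long cfs_stuff[4];
    struct {
        long load_avg;             /* starts a fresh cacheline ... */
    } __attribute__((aligned(CACHELINE)));
    void *rt_se;                   /* ... and these land on the next one */
    void *rt_rq;
};

int main(void)
{
    printf("plain:   load_avg@%zu rt_se@%zu size=%zu\n",
           offsetof(struct tg_plain, load_avg),
           offsetof(struct tg_plain, rt_se), sizeof(struct tg_plain));
    printf("wrapped: load_avg@%zu rt_se@%zu size=%zu\n",
           offsetof(struct tg_wrapped, load_avg),
           offsetof(struct tg_wrapped, rt_se), sizeof(struct tg_wrapped));
    return 0;
}

The real task_group layout can be double-checked with something like
"pahole -C task_group kernel/sched/core.o", assuming pahole and debug info
are available.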

Thanks,
Aaron

> #ifdef CONFIG_RT_GROUP_SCHED
> struct sched_rt_entity **rt_se;
> struct rt_rq **rt_rq;
> @@ -396,13 +409,6 @@ struct task_group {
> struct rt_bandwidth rt_bandwidth;
> #endif
>
> - struct rcu_head rcu;
> - struct list_head list;
> -
> - struct task_group *parent;
> - struct list_head siblings;
> - struct list_head children;
> -
> #ifdef CONFIG_SCHED_AUTOGROUP
> struct autogroup *autogroup;
> #endif
> --
> 2.39.3
>

2023-06-26 08:04:03

by Chen Yu

Subject: Re: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing

On 2023-06-26 at 13:47:56 +0800, Aaron Lu wrote:
> On Wed, Jun 21, 2023 at 04:14:25PM +0800, Deng Pan wrote:
> > When running UnixBench/Pipe-based Context Switching case, we observed
> > high false sharing for accessing ‘load_avg’ against rt_se and rt_rq
> > when config CONFIG_RT_GROUP_SCHED is turned on.
> >
> > Pipe-based Context Switching case is a typical sleep/wakeup scenario,
> > in which load_avg is frequenly loaded and stored, at the meantime, rt_se
> > and rt_rq are frequently loaded. Unfortunately, they are in the same
> > cacheline.
> >
> > This change re-layouts the structure:
> > 1. Move rt_se and rt_rq to a 2nd cacheline.
> > 2. Keep ‘parent’ field in the 2nd cacheline since it's also accessed
> > very often when cgroups are nested, thanks Tim Chen for providing the
> > insight.
> >
> > Tested on Intel Icelake 2 sockets 80C/160T platform, based on v6.4-rc5.
> >
> > With this change, Pipe-based Context Switching 160 parallel score is
> > improved ~9.6%, perf record tool reports rt_se and rt_rq access cycles
> > are reduced from ~14.5% to ~0.3%, perf c2c tool shows the false-sharing
> > is resolved as expected:
>
> I also give it a run on an Icelake and saw similar things when
> CONFIG_RT_GROUP_SCHED is on.
>
> For hackbench/pipe/thread, set_task_cpu() dropped from 1.67% to 0.51%
> according to perf cycle; for netperf/nr_client=nr_cpu/UDP_RR,
> set_task_cpu() dropped from 5.06% to 1.08%.
>
> The patch looks good to me, just a nit below.
>
I also saw overall netperf throughput improvements on Sapphire Rapids
with CONFIG_RT_GROUP_SCHED set; this platform suffers from C2C bouncing,
so this patch helps a lot.

netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 56-threads 1.00 ( 1.61) +2.20 ( 1.39)
TCP_RR 112-threads 1.00 ( 2.71) -0.75 ( 2.29)
TCP_RR 168-threads 1.00 ( 4.39) -14.26 ( 4.99)
TCP_RR 224-threads 1.00 ( 4.21) -5.52 ( 5.07)
TCP_RR 280-threads 1.00 ( 1.89) +246.41 ( 61.31)
TCP_RR 336-threads 1.00 ( 53.49) +164.89 ( 21.45)
TCP_RR 392-threads 1.00 ( 42.46) +162.16 ( 31.33)
TCP_RR 448-threads 1.00 ( 44.61) +113.64 ( 41.74)
UDP_RR 56-threads 1.00 ( 3.63) -1.27 ( 3.73)
UDP_RR 112-threads 1.00 ( 7.83) -4.16 ( 16.57)
UDP_RR 168-threads 1.00 ( 18.08) -16.54 ( 17.27)
UDP_RR 224-threads 1.00 ( 12.60) -5.77 ( 12.79)
UDP_RR 280-threads 1.00 ( 9.37) -0.57 ( 15.75)
UDP_RR 336-threads 1.00 ( 14.87) +200.81 ( 34.90)
UDP_RR 392-threads 1.00 ( 38.85) -10.15 ( 46.04)
UDP_RR 448-threads 1.00 ( 35.06) -8.93 ( 55.56)

> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index ec7b3e0a2b20..4fbd4b3a4bdd 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -389,6 +389,19 @@ struct task_group {
> > #endif
> > #endif
> >
> > + struct rcu_head rcu;
> > + struct list_head list;
> > +
> > + struct list_head siblings;
> > + struct list_head children;
> > +
> > + /*
> > + * To reduce false sharing, current layout is optimized to make
> > + * sure load_avg is in a different cacheline from parent, rt_se
> > + * and rt_rq.
> > + */
> > + struct task_group *parent;
> > +
>
> I wonder if we can simply do:
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ec7b3e0a2b20..31b73e8d9568 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -385,7 +385,9 @@ struct task_group {
> * it in its own cacheline separated from the fields above which
> * will also be accessed at each tick.
> */
> - atomic_long_t load_avg ____cacheline_aligned;
> + struct {
> + atomic_long_t load_avg;
> + } ____cacheline_aligned_in_smp;
> #endif
> #endif
>
> This way it can make sure there is no false sharing with load_avg no
> matter how the layout of this structure changes in the future.
>
> Your patch has the advantage of not adding any more padding, thus saves
> some space; the example code above has the advantage of no need to worry
> about future changes that might break the expected alignment, but it
> does make the structure size a little larger(704 -> 768).
>
It looks reasonable that we then don't need to adjust the layout in the
future. Besides the cache line alignment: if the task is not an rt one,
why do we have to touch those fields at all? I wonder if the following
change can avoid that:

thanks,
Chenyu

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec7b3e0a2b20..067f1310bad2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1958,8 +1958,10 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
#endif

#ifdef CONFIG_RT_GROUP_SCHED
- p->rt.rt_rq = tg->rt_rq[cpu];
- p->rt.parent = tg->rt_se[cpu];
+ if (p->sched_class = &rt_sched_class) {
+ p->rt.rt_rq = tg->rt_rq[cpu];
+ p->rt.parent = tg->rt_se[cpu];
+ }
#endif
}

--
2.25.1

> Thanks,
> Aaron
>
> > #ifdef CONFIG_RT_GROUP_SCHED
> > struct sched_rt_entity **rt_se;
> > struct rt_rq **rt_rq;
> > @@ -396,13 +409,6 @@ struct task_group {
> > struct rt_bandwidth rt_bandwidth;
> > #endif
> >
> > - struct rcu_head rcu;
> > - struct list_head list;
> > -
> > - struct task_group *parent;
> > - struct list_head siblings;
> > - struct list_head children;
> > -
> > #ifdef CONFIG_SCHED_AUTOGROUP
> > struct autogroup *autogroup;
> > #endif
> > --
> > 2.39.3
> >

2023-06-26 13:04:16

by Deng Pan

Subject: RE: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing



> -----Original Message-----
> From: Lu, Aaron <[email protected]>
> Sent: Monday, June 26, 2023 1:48 PM
> To: Deng, Pan <[email protected]>
> Cc: Chen, Tim C <[email protected]>; [email protected];
> [email protected]; [email protected]; Li, Tianyou
> <[email protected]>; Ma, Yu <[email protected]>; Zhu, Lipeng
> <[email protected]>; Chen, Yu C <[email protected]>; Tim Chen
> <[email protected]>
> Subject: Re: [PATCH v2] sched/task_group: Re-layout structure to reduce false
> sharing
>
> On Wed, Jun 21, 2023 at 04:14:25PM +0800, Deng Pan wrote:
> > When running UnixBench/Pipe-based Context Switching case, we observed
> > high false sharing for accessing ‘load_avg’ against rt_se and rt_rq
> > when config CONFIG_RT_GROUP_SCHED is turned on.
> >
> > Pipe-based Context Switching case is a typical sleep/wakeup scenario,
> > in which load_avg is frequenly loaded and stored, at the meantime,
> > rt_se and rt_rq are frequently loaded. Unfortunately, they are in the
> > same cacheline.
> >
> > This change re-layouts the structure:
> > 1. Move rt_se and rt_rq to a 2nd cacheline.
> > 2. Keep ‘parent’ field in the 2nd cacheline since it's also accessed
> > very often when cgroups are nested, thanks Tim Chen for providing the
> > insight.
> >
> > Tested on Intel Icelake 2 sockets 80C/160T platform, based on v6.4-rc5.
> >
> > With this change, Pipe-based Context Switching 160 parallel score is
> > improved ~9.6%, perf record tool reports rt_se and rt_rq access cycles
> > are reduced from ~14.5% to ~0.3%, perf c2c tool shows the
> > false-sharing is resolved as expected:
>
> I also give it a run on an Icelake and saw similar things when
> CONFIG_RT_GROUP_SCHED is on.
>
> For hackbench/pipe/thread, set_task_cpu() dropped from 1.67% to 0.51%
> according to perf cycle; for netperf/nr_client=nr_cpu/UDP_RR,
> set_task_cpu() dropped from 5.06% to 1.08%.

Thank you very much Aaron!

>
> The patch looks good to me, just a nit below.
>
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index
> > ec7b3e0a2b20..4fbd4b3a4bdd 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -389,6 +389,19 @@ struct task_group { #endif #endif
> >
> > + struct rcu_head rcu;
> > + struct list_head list;
> > +
> > + struct list_head siblings;
> > + struct list_head children;
> > +
> > + /*
> > + * To reduce false sharing, current layout is optimized to make
> > + * sure load_avg is in a different cacheline from parent, rt_se
> > + * and rt_rq.
> > + */
> > + struct task_group *parent;
> > +
>
> I wonder if we can simply do:
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index
> ec7b3e0a2b20..31b73e8d9568 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -385,7 +385,9 @@ struct task_group {
> * it in its own cacheline separated from the fields above which
> * will also be accessed at each tick.
> */
> - atomic_long_t load_avg ____cacheline_aligned;
> + struct {
> + atomic_long_t load_avg;
> + } ____cacheline_aligned_in_smp;
> #endif
> #endif
>
> This way it can make sure there is no false sharing with load_avg no matter
> how the layout of this structure changes in the future.
>
> Your patch has the advantage of not adding any more padding, thus saves
> some space; the example code above has the advantage of no need to worry
> about future changes that might break the expected alignment, but it does
> make the structure size a little larger(704 -> 768).
>

Right, the original purpose was to keep the structure size. Actually I don't
have a strong preference here between keeping the size and the
maintainability; I would like to know the maintainers' insight :)

Thanks
Pan

> Thanks,
> Aaron
>
> > #ifdef CONFIG_RT_GROUP_SCHED
> > struct sched_rt_entity **rt_se;
> > struct rt_rq **rt_rq;
> > @@ -396,13 +409,6 @@ struct task_group {
> > struct rt_bandwidth rt_bandwidth;
> > #endif
> >
> > - struct rcu_head rcu;
> > - struct list_head list;
> > -
> > - struct task_group *parent;
> > - struct list_head siblings;
> > - struct list_head children;
> > -
> > #ifdef CONFIG_SCHED_AUTOGROUP
> > struct autogroup *autogroup;
> > #endif
> > --
> > 2.39.3
> >

2023-06-26 13:05:09

by Aaron Lu

Subject: Re: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing

On Mon, Jun 26, 2023 at 03:52:17PM +0800, Chen Yu wrote:
> Besides the cache line alignment, if the task is not a rt one,
> why do we have to touch that, I wonder if the following change can avoid that:
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ec7b3e0a2b20..067f1310bad2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1958,8 +1958,10 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
> #endif
>
> #ifdef CONFIG_RT_GROUP_SCHED
> - p->rt.rt_rq = tg->rt_rq[cpu];
> - p->rt.parent = tg->rt_se[cpu];
> + if (p->sched_class = &rt_sched_class) {
== :-)

> + p->rt.rt_rq = tg->rt_rq[cpu];
> + p->rt.parent = tg->rt_se[cpu];
> + }
> #endif
> }

If a task starts life as a SCHED_NORMAL one and then after some time
it's changed to an RT one, then during its next ttwu(), if it didn't
migrate, set_task_rq() will not be called and p->rt.rt_rq will remain
NULL, which will cause problems when this task gets enqueued as an rt
one.

The following diff seems to cure this issue:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c7db597e8175..8c57148e668c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7801,6 +7801,20 @@ static int __sched_setscheduler(struct task_struct *p,
}
__setscheduler_uclamp(p, attr);

+#ifdef CONFIG_RT_GROUP_SCHED
+ /*
+ * Make sure when this task becomes a rt one,
+ * its rt fields have valid value.
+ */
+ if (rt_prio(newprio)) {
+ struct task_group *tg = task_group(p);
+ int cpu = cpu_of(rq);
+
+ p->rt.rt_rq = tg->rt_rq[cpu];
+ p->rt.parent = tg->rt_se[cpu];
+ }
+#endif
+
if (queued) {
/*
* We enqueue to tail when the priority of a task is

But I'm not sure if it's worth the trouble.

Thanks,
Aaron

2023-06-26 17:30:04

by Tim Chen

Subject: Re: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing

On Mon, 2023-06-26 at 15:52 +0800, Chen Yu wrote:
>
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ec7b3e0a2b20..067f1310bad2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1958,8 +1958,10 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
> #endif
>
> #ifdef CONFIG_RT_GROUP_SCHED
> - p->rt.rt_rq = tg->rt_rq[cpu];
> - p->rt.parent = tg->rt_se[cpu];
> + if (p->sched_class = &rt_sched_class) {
> + p->rt.rt_rq = tg->rt_rq[cpu];
> + p->rt.parent = tg->rt_se[cpu];
> + }

If we avoid the assignment for non-rt tasks, do you see it further reduce
the c2c bouncing?

Tim


2023-06-27 08:34:53

by Chen Yu

Subject: Re: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing

On 2023-06-26 at 10:09:10 -0700, Tim Chen wrote:
> On Mon, 2023-06-26 at 15:52 +0800, Chen Yu wrote:
> >
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index ec7b3e0a2b20..067f1310bad2 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1958,8 +1958,10 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
> > #endif
> >
> > #ifdef CONFIG_RT_GROUP_SCHED
> > - p->rt.rt_rq = tg->rt_rq[cpu];
> > - p->rt.parent = tg->rt_se[cpu];
> > + if (p->sched_class = &rt_sched_class) {
> > + p->rt.rt_rq = tg->rt_rq[cpu];
> > + p->rt.parent = tg->rt_se[cpu];
> > + }
>
> If we avoid the assignment for non-rt task, do you see it further improves
> c2c bounce?
>
I did not see a noticeable extra throughput improvement from netperf with
the above change + Aaron's fix + Pan's cacheline alignment adjustment.

thanks,
Chenyu

2023-06-27 08:36:07

by Chen Yu

Subject: Re: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing

On 2023-06-26 at 20:53:35 +0800, Aaron Lu wrote:
> On Mon, Jun 26, 2023 at 03:52:17PM +0800, Chen Yu wrote:
> > Besides the cache line alignment, if the task is not a rt one,
> > why do we have to touch that, I wonder if the following change can avoid that:
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index ec7b3e0a2b20..067f1310bad2 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1958,8 +1958,10 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
> > #endif
> >
> > #ifdef CONFIG_RT_GROUP_SCHED
> > - p->rt.rt_rq = tg->rt_rq[cpu];
> > - p->rt.parent = tg->rt_se[cpu];
> > + if (p->sched_class = &rt_sched_class) {
> == :-)
>
Yes.
> > + p->rt.rt_rq = tg->rt_rq[cpu];
> > + p->rt.parent = tg->rt_se[cpu];
> > + }
> > #endif
> > }
>
> If a task starts life as a SCHED_NORMAL one and then after some time
> it's changed to a RT one, then during its next ttwu(), if it didn't
> migrate, then set_task_rq() will not be called and p->rt.rt_rq will
> keep as NULL which will cause problem when this task gets enqueued as
> a rt one.
>
Yeah, this case should be covered to avoid the NULL pointer issue.
It seems that set_task_rq() is widely used to assign a task's rt fields
regardless of what context the task is in, to avoid an extra check.
> The follow diff seems to cure this issue:
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c7db597e8175..8c57148e668c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7801,6 +7801,20 @@ static int __sched_setscheduler(struct task_struct *p,
> }
> __setscheduler_uclamp(p, attr);
>
> +#ifdef CONFIG_RT_GROUP_SCHED
> + /*
> + * Make sure when this task becomes a rt one,
> + * its rt fields have valid value.
> + */
> + if (rt_prio(newprio)) {
> + struct task_group *tg = task_group(p);
> + int cpu = cpu_of(rq);
> +
> + p->rt.rt_rq = tg->rt_rq[cpu];
> + p->rt.parent = tg->rt_se[cpu];
> + }
> +#endif
> +
> if (queued) {
> /*
> * We enqueue to tail when the priority of a task is
>
> But I'm not sure if it's worth the trouble.
>
Yeah, the cacheline alignment change seems to be simpler.

thanks,
Chenyu
> Thanks,
> Aaron

2023-06-27 10:34:58

by Peter Zijlstra

Subject: Re: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing

On Mon, Jun 26, 2023 at 01:47:56PM +0800, Aaron Lu wrote:

> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index ec7b3e0a2b20..4fbd4b3a4bdd 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -389,6 +389,19 @@ struct task_group {
> > #endif
> > #endif
> >
> > + struct rcu_head rcu;
> > + struct list_head list;
> > +
> > + struct list_head siblings;
> > + struct list_head children;
> > +
> > + /*
> > + * To reduce false sharing, current layout is optimized to make
> > + * sure load_avg is in a different cacheline from parent, rt_se
> > + * and rt_rq.
> > + */

That comment is misleading I think; you don't particularly care about
those fields more than any other active fields that would cause false
sharing.

> > + struct task_group *parent;
> > +
>
> I wonder if we can simply do:
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ec7b3e0a2b20..31b73e8d9568 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -385,7 +385,9 @@ struct task_group {
> * it in its own cacheline separated from the fields above which
> * will also be accessed at each tick.
> */
> - atomic_long_t load_avg ____cacheline_aligned;
> + struct {
> + atomic_long_t load_avg;
> + } ____cacheline_aligned_in_smp;
> #endif
> #endif
>
> This way it can make sure there is no false sharing with load_avg no
> matter how the layout of this structure changes in the future.

This. Also, ISTR there was a series to split this atomic across nodes;
whatever happened to that, and can we still measure an improvement over
this with that approach?

2023-06-27 10:42:38

by Peter Zijlstra

Subject: Re: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing

On Mon, Jun 26, 2023 at 08:53:35PM +0800, Aaron Lu wrote:
> On Mon, Jun 26, 2023 at 03:52:17PM +0800, Chen Yu wrote:
> > Besides the cache line alignment, if the task is not a rt one,
> > why do we have to touch that, I wonder if the following change can avoid that:
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index ec7b3e0a2b20..067f1310bad2 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1958,8 +1958,10 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
> > #endif
> >
> > #ifdef CONFIG_RT_GROUP_SCHED
> > - p->rt.rt_rq = tg->rt_rq[cpu];
> > - p->rt.parent = tg->rt_se[cpu];
> > + if (p->sched_class = &rt_sched_class) {
> == :-)
>
> > + p->rt.rt_rq = tg->rt_rq[cpu];
> > + p->rt.parent = tg->rt_se[cpu];
> > + }
> > #endif
> > }
>
> If a task starts life as a SCHED_NORMAL one and then after some time
> it's changed to a RT one, then during its next ttwu(), if it didn't
> migrate, then set_task_rq() will not be called and p->rt.rt_rq will
> keep as NULL which will cause problem when this task gets enqueued as
> a rt one.
>
> The follow diff seems to cure this issue:
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c7db597e8175..8c57148e668c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7801,6 +7801,20 @@ static int __sched_setscheduler(struct task_struct *p,
> }
> __setscheduler_uclamp(p, attr);
>
> +#ifdef CONFIG_RT_GROUP_SCHED
> + /*
> + * Make sure when this task becomes a rt one,
> + * its rt fields have valid value.
> + */
> + if (rt_prio(newprio)) {
> + struct task_group *tg = task_group(p);
> + int cpu = cpu_of(rq);
> +
> + p->rt.rt_rq = tg->rt_rq[cpu];
> + p->rt.parent = tg->rt_se[cpu];
> + }
> +#endif
> +
> if (queued) {
> /*
> * We enqueue to tail when the priority of a task is
>
> But I'm not sure if it's worth the trouble.

Not sufficient; you can become RT through PI and not pass through
__sched_setscheduler().

The common code-path in this case would be check_class_changed(), which is
called for both PI and __sched_setscheduler().

Anyway, not against this per-se, but RT_GROUP_SCHED is utter shite and
nobody should be using it. Also, if there's no measurable performance
gain (as stated elsewhere IIRC) we shouldn't be adding complexity.

2023-06-27 15:36:24

by Deng Pan

Subject: RE: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing



> -----Original Message-----
> From: Chen, Yu C <[email protected]>
> Sent: Tuesday, June 27, 2023 4:08 PM
> To: Tim Chen <[email protected]>
> Cc: Lu, Aaron <[email protected]>; Peter Zijlstra <[email protected]>;
> Vincent Guittot <[email protected]>; Ingo Molnar
> <[email protected]>; Juri Lelli <[email protected]>; Deng, Pan
> <[email protected]>; Chen, Tim C <[email protected]>; linux-
> [email protected]; Li, Tianyou <[email protected]>; Ma, Yu
> <[email protected]>; Zhu, Lipeng <[email protected]>
> Subject: Re: [PATCH v2] sched/task_group: Re-layout structure to reduce
> false sharing
>
> On 2023-06-26 at 10:09:10 -0700, Tim Chen wrote:
> > On Mon, 2023-06-26 at 15:52 +0800, Chen Yu wrote:
> > >
> > >
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index
> > > ec7b3e0a2b20..067f1310bad2 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -1958,8 +1958,10 @@ static inline void set_task_rq(struct
> > > task_struct *p, unsigned int cpu) #endif
> > >
> > > #ifdef CONFIG_RT_GROUP_SCHED
> > > - p->rt.rt_rq = tg->rt_rq[cpu];
> > > - p->rt.parent = tg->rt_se[cpu];
> > > + if (p->sched_class = &rt_sched_class) {
> > > + p->rt.rt_rq = tg->rt_rq[cpu];
> > > + p->rt.parent = tg->rt_se[cpu];
> > > + }
> >
> > If we avoid the assignment for non-rt task, do you see it further
> > improves c2c bounce?
> >
> I did not see noticeable extra throughput improvement from netperf, with
> above change + Aaron's fix + Pan's cacheline alignment adjustment.

Echoing Yu: the cacheline alignment adjustment resolved the set_task_rq c2c
bouncing with load_avg. The remaining cycles of the "assignment for non-rt
task" are small according to perf record data, which should be why there is
no extra throughput improvement.

Thanks
Pan
>
> thanks,
> Chenyu

2023-06-27 16:49:22

by Deng Pan

Subject: RE: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing



> -----Original Message-----
> From: Peter Zijlstra <[email protected]>
> Sent: Tuesday, June 27, 2023 6:15 PM
> To: Lu, Aaron <[email protected]>
> Cc: Deng, Pan <[email protected]>; Chen, Tim C <[email protected]>;
> [email protected]; [email protected]; Li, Tianyou
> <[email protected]>; Ma, Yu <[email protected]>; Zhu, Lipeng
> <[email protected]>; Chen, Yu C <[email protected]>; Tim Chen
> <[email protected]>
> Subject: Re: [PATCH v2] sched/task_group: Re-layout structure to reduce false
> sharing
>
> On Mon, Jun 26, 2023 at 01:47:56PM +0800, Aaron Lu wrote:
>
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index
> > > ec7b3e0a2b20..4fbd4b3a4bdd 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -389,6 +389,19 @@ struct task_group { #endif #endif
> > >
> > > + struct rcu_head rcu;
> > > + struct list_head list;
> > > +
> > > + struct list_head siblings;
> > > + struct list_head children;
> > > +
> > > + /*
> > > + * To reduce false sharing, current layout is optimized to make
> > > + * sure load_avg is in a different cacheline from parent, rt_se
> > > + * and rt_rq.
> > > + */
>
> That comment is misleading I think; you don't particularly care about those
> fields more than any other active fields that would cause false sharing.
>

How about this one:
/*
 * load_avg can also cause cacheline bouncing with parent, rt_se
 * and rt_rq; the current layout is optimized to make sure they are
 * in different cachelines.
 */

> > > + struct task_group *parent;
> > > +
> >
> > I wonder if we can simply do:
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index
> > ec7b3e0a2b20..31b73e8d9568 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -385,7 +385,9 @@ struct task_group {
> > * it in its own cacheline separated from the fields above which
> > * will also be accessed at each tick.
> > */
> > - atomic_long_t load_avg ____cacheline_aligned;
> > + struct {
> > + atomic_long_t load_avg;
> > + } ____cacheline_aligned_in_smp;
> > #endif
> > #endif
> >
> > This way it can make sure there is no false sharing with load_avg no
> > matter how the layout of this structure changes in the future.
>
> This. Also, ISTR there was a series to split this atomic across nodes; whatever
> happend to that, and can we still measure an improvement over this with that
> approach?

I just ran the UnixBench context-switching case on 1 node with 40C/80T.
Without this change, perf c2c data shows the c2c bouncing is still there,
and perf record shows set_task_cpu takes ~4.5% of overall cycles. With this
change, that false sharing is resolved and the set_task_cpu cycles drop to
0.5%.
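
(Roughly the kind of commands behind numbers like these; the exact UnixBench
and perf invocations used may differ, and the workload command below is only
a placeholder:)

  perf c2c record -a -- <UnixBench pipe-based context switching run>
  perf c2c report --stats

  perf record -a -- <same workload>
  perf report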

Thanks
Pan

2023-06-28 08:43:25

by Aaron Lu

Subject: Re: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing

On Tue, Jun 27, 2023 at 12:14:37PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 26, 2023 at 01:47:56PM +0800, Aaron Lu wrote:
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index ec7b3e0a2b20..31b73e8d9568 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -385,7 +385,9 @@ struct task_group {
> > * it in its own cacheline separated from the fields above which
> > * will also be accessed at each tick.
> > */
> > - atomic_long_t load_avg ____cacheline_aligned;
> > + struct {
> > + atomic_long_t load_avg;
> > + } ____cacheline_aligned_in_smp;
> > #endif
> > #endif
> >
> > This way it can make sure there is no false sharing with load_avg no
> > matter how the layout of this structure changes in the future.
>
> This. Also, ISTR there was a series to split this atomic across nodes;

Yes.

> whatever happend to that,

After collecting more data, the test summary is:
- on SPR, hackbench time reduced ~8% and netperf(UDP_RR/nr_thread=100%)
performance increased ~50%;
- on Icelake, performance regressed about 1%-2% for postgres_sysbench
and hackbench, netperf has no performance change;
- on Cascade Lake, netperf/UDP_RR/nr_thread=50% sees performance
drop ~3%; others have no performance change.

So it is a win for SPR and has small regressions for Icelake and Cascade
Lake. Daniel tested on AMD machines and he also saw some minor
regressions. The win for SPR is most likely because the per-node
tg->load_avg patch turned all CPUs contending on one cacheline into
contending on two cachelines, and that helped SPR: when many CPUs contend
on the same cacheline, SPR is likely to enter the "Ingress Queue Overflow"
state, which is bad for performance. I also did some experiments on a
2-socket SPR placing the two per-node tg->load_avg counters on the same
node and saw a similar performance improvement.

Based on the test results and the reason why SPR sees improvement, I
didn't continue to push it.

Another thing: after making tg->load_avg per node, update_tg_load_avg()
becomes strictly local, which is good, but update_cfs_group() needs to read
the counters on every node, so it will still cause the per-node
tg->load_avg to bounce across nodes, and update_cfs_group() is called very
frequently. I suppose that's where those small regressions come from, but
the solution is not obvious.
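
For readers who didn't follow that series, a rough sketch of the per-node
idea being discussed (the field and helper names here are made up for
illustration; this is not the actual patch):

	/* in struct task_group: one counter per NUMA node instead of one
	 * global atomic; allocated with kcalloc(nr_node_ids, ...) at tg
	 * creation */
	atomic_long_t	*node_load_avg;

/* the writer side stays node-local, so update_tg_load_avg() no longer
 * bounces a shared line across sockets */
static inline void tg_add_load_avg(struct task_group *tg, long delta)
{
	atomic_long_add(delta, &tg->node_load_avg[numa_node_id()]);
}

/* the reader side (what update_cfs_group() needs) still has to sum every
 * node's counter, which is where the remaining cross-node traffic comes
 * from */
static inline long tg_read_load_avg(struct task_group *tg)
{
	long sum = 0;
	int node;

	for_each_node(node)
		sum += atomic_long_read(&tg->node_load_avg[node]);

	return sum;
}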

> and can we still measure an improvement over this with that approach?

Let me re-run those tests and see how things change.

In my previous tests I didn't turn on CONFIG_RT_GROUP_SCHED. To test
this, I suppose I'll turn CONFIG_RT_GROUP_SCHED on and apply the change
here that puts tg->load_avg in a dedicated cacheline, then see how
performance changes with the "Make tg->load_avg per node" patch. Will
report back once done.

Thanks,
Aaron

2023-06-30 09:47:17

by Aaron Lu

Subject: Re: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing

On Wed, Jun 28, 2023 at 01:18:34PM +0800, Aaron Lu wrote:
> On Tue, Jun 27, 2023 at 12:14:37PM +0200, Peter Zijlstra wrote:
> > and can we still measure an improvement over this with that approach?
>
> Let me re-run those tests and see how things change.
>
> In my previous tests I didn't turn on CONFIG_RT_GROUP_SCHED. To test
> this, I suppose I'll turn CONFIG_RT_GROUP_SCHED on and apply this change
> here that made tg->load_avg in a dedicated cacheline, then see how
> performances change with the "Make tg->load_avg per node" patch. Will
> report back once done.

The test summary is:
- On 2sockets/112cores/224threads SPR, it's still overall a win.
Transactions of postgres_sysbench improved 47.7%, hackbench improved
13.5% and netperf improved 39.5%;
- On 2sockets/64cores/128threads Icelake, hackbench and netperf have
improvement while postgres_sysbench transaction slightly dropped.
hackbench improved 6.2%, netperf improved 20.3% and transactions of
postgres_sysbench dropped 1.2%;
- On 2sockets/48cores/96threads CascadeLake, hackbench and netperf are
roughly flat.

Below are detailed results:

SPR: 2socket/112cores/224threads

postgres_sysbench/1instance/100%(nr_client=nr_cpu)
kernel transactions(higher is better)
aligned 89623.85±0.35%
per_node 132401.37±0.83%

hackbench/pipe/threads
kernel time(less is better)
aligned 47.43±0.48%
per_node 41.02±0.69%

netperf/UDP_RR/100%(nr_client=nr_cpu)
kernel throughput(higher is better)
aligned 9415.97±3.81%
per_node 13131.24±2.67%

ICL: 2sockets/64cores/128threads

postgres_sysbench/1instance/75%
kernel transactions
aligned 62291.58±0.64%
per_node 61561.40±0.39%

hackbench/pipe/threads
kernel time
aligned 41.66±0.04%
per_node 39.07±0.36%

netperf/UDP_RR/100%
kernel throughput
aligned 21365.01±3.32%
per_node 25692.05±2.03%

CSL: 2sockets/48cores/96threads

hackbench/pipe/threads
kernel time
aligned: 48.78±0.61%
per_node: 48.95±1.06

netperf/UDP_RR/100%
kernel throughput
aligned 25853.82±11.46%
per_node 25264.38±0.85%

I think I'll spin a new version for the "Make tg->load_avg per-node"
patch with all the information I collected.

2023-07-06 14:43:22

by Deng Pan

Subject: RE: [PATCH v2] sched/task_group: Re-layout structure to reduce false sharing

Hi Peter,

> -----Original Message-----
> From: Deng, Pan
> Sent: Wednesday, June 28, 2023 12:13 AM
> To: Peter Zijlstra <[email protected]>; Lu, Aaron <[email protected]>
> Cc: Chen, Tim C <[email protected]>; [email protected]; linux-
> [email protected]; Li, Tianyou <[email protected]>; Ma, Yu
> <[email protected]>; Zhu, Lipeng <[email protected]>; Chen, Yu C
> <[email protected]>; Tim Chen <[email protected]>
> Subject: RE: [PATCH v2] sched/task_group: Re-layout structure to reduce
> false sharing
>
>
>
> > -----Original Message-----
> > From: Peter Zijlstra <[email protected]>
> > Sent: Tuesday, June 27, 2023 6:15 PM
> > To: Lu, Aaron <[email protected]>
> > Cc: Deng, Pan <[email protected]>; Chen, Tim C
> > <[email protected]>; [email protected];
> > [email protected]; Li, Tianyou <[email protected]>; Ma,
> > Yu <[email protected]>; Zhu, Lipeng <[email protected]>; Chen, Yu C
> > <[email protected]>; Tim Chen <[email protected]>
> > Subject: Re: [PATCH v2] sched/task_group: Re-layout structure to
> > reduce false sharing
> >
> > On Mon, Jun 26, 2023 at 01:47:56PM +0800, Aaron Lu wrote:
> >
> > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index
> > > > ec7b3e0a2b20..4fbd4b3a4bdd 100644
> > > > --- a/kernel/sched/sched.h
> > > > +++ b/kernel/sched/sched.h
> > > > @@ -389,6 +389,19 @@ struct task_group { #endif #endif
> > > >
> > > > + struct rcu_head rcu;
> > > > + struct list_head list;
> > > > +
> > > > + struct list_head siblings;
> > > > + struct list_head children;
> > > > +
> > > > + /*
> > > > + * To reduce false sharing, current layout is optimized to make
> > > > + * sure load_avg is in a different cacheline from parent, rt_se
> > > > + * and rt_rq.
> > > > + */
> >
> > That comment is misleading I think; you don't particularly care about
> > those fields more than any other active fields that would cause false
> sharing.
> >
>
> How about this one:
> /*
> * load_avg can also cause cacheline bouncing with parent, rt_se
> * and rt_rq, current layout is optimized to make sure they are in
> * different cachelines.
> */
>
Does it work for you? Please feel free to share any suggestions.

> > > > + struct task_group *parent;
> > > > +
> > >
> > > I wonder if we can simply do:
> > >
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index
> > > ec7b3e0a2b20..31b73e8d9568 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -385,7 +385,9 @@ struct task_group {
> > > * it in its own cacheline separated from the fields above which
> > > * will also be accessed at each tick.
> > > */
> > > - atomic_long_t load_avg ____cacheline_aligned;
> > > + struct {
> > > + atomic_long_t load_avg;
> > > + } ____cacheline_aligned_in_smp;
> > > #endif
> > > #endif
> > >
> > > This way it can make sure there is no false sharing with load_avg no
> > > matter how the layout of this structure changes in the future.
> >
> > This. Also, ISTR there was a series to split this atomic across nodes;
> > whatever happend to that, and can we still measure an improvement over
> > this with that approach?
>
> I just ran unixbench context-switching in 1 node with 40C/80T, without this
> change perf c2c data shows c2c bouncing is still there, perf record data
> shows set_task_cpu takes ~4.5% overall cycles. With this change, that false-
> sharing is resolved, and set_task_cpu cycles drop to 0.5%.
>
I mean this change benefits even the single-NUMA-node case.
Aaron posted his performance data for "split atomic across nodes" on top of
this patch at https://lore.kernel.org/lkml/20230630093500.GA579792@ziqianlu-dell/;
they look complementary, so is it possible to merge this change first?

Thanks
Pan

> Thanks
> Pan