2020-09-14 10:05:59

by Vincent Guittot

Subject: [PATCH 4/4] sched/fair: reduce busy load balance interval

The busy_factor, which increases the load balance interval when a cpu is busy,
is set to 32 by default. This value generates some huge load balance intervals
on a large system like the THX2, made of 2 nodes x 28 cores x 4 threads.
For such a system, the interval increases from 112ms to 3584ms at MC level,
and from 224ms to 7168ms at NUMA level.
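
For reference, these figures follow from busy interval ~ sd_weight (in ms) *
busy_factor, roughly what the kernel's get_sd_balance_interval() computes for
a busy CPU before clamping. A minimal user-space sketch of that arithmetic
(an illustration only, not kernel code):

#include <stdio.h>

/*
 * Sketch of the balance interval arithmetic discussed above: the
 * per-domain interval starts at min_interval = sd_weight (in ms) and
 * is multiplied by busy_factor when the CPU is busy.
 */
static unsigned long busy_interval_ms(unsigned long sd_weight,
				      unsigned long busy_factor)
{
	return sd_weight * busy_factor;
}

int main(void)
{
	/* THX2: 2 nodes x 28 cores x 4 threads */
	unsigned long mc_weight = 28 * 4;	/* 112 CPUs at MC level */
	unsigned long numa_weight = 2 * 28 * 4;	/* 224 CPUs at NUMA level */

	printf("MC:   idle %lums, busy(32) %lums, busy(16) %lums\n",
	       mc_weight, busy_interval_ms(mc_weight, 32),
	       busy_interval_ms(mc_weight, 16));
	printf("NUMA: idle %lums, busy(32) %lums, busy(16) %lums\n",
	       numa_weight, busy_interval_ms(numa_weight, 32),
	       busy_interval_ms(numa_weight, 16));
	return 0;
}

With busy_factor = 16, the busy intervals become 1792ms at MC level and
3584ms at NUMA level.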

Even on smaller systems, a lower busy factor has shown an improvement in the
fair distribution of running time, so reduce it for all levels.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/topology.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1a84b778755d..a8477c9e8569 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1336,7 +1336,7 @@ sd_init(struct sched_domain_topology_level *tl,
*sd = (struct sched_domain){
.min_interval = sd_weight,
.max_interval = 2*sd_weight,
- .busy_factor = 32,
+ .busy_factor = 16,
.imbalance_pct = 117,

.cache_nice_tries = 0,
--
2.17.1


2020-09-15 09:14:40

by Jiang Biao

Subject: Re: [PATCH 4/4] sched/fair: reduce busy load balance interval

Hi, Vincent

On Mon, 14 Sep 2020 at 18:07, Vincent Guittot
<[email protected]> wrote:
>
> The busy_factor, which increases load balance interval when a cpu is busy,
> is set to 32 by default. This value generates some huge LB interval on
> large system like the THX2 made of 2 node x 28 cores x 4 threads.
> For such system, the interval increases from 112ms to 3584ms at MC level.
> And from 228ms to 7168ms at NUMA level.
Agreed that the interval is too big for that case.
But would it become too small for an AMD environment (like ROME) with 8 CPUs
at MC level (a CCX), if we reduce busy_factor?
For that case, the interval would be reduced from 256ms to 128ms.
Or should we define a MIN_INTERVAL for the MC level to avoid too small an interval?
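
Purely as an illustration of that MIN_INTERVAL idea (the name, the 256ms
floor and the helper below are hypothetical, not existing kernel code), a
user-space sketch of such a clamp could look like:

#include <stdio.h>

/* Hypothetical floor for the busy interval at MC level (made-up value). */
#define MC_MIN_BUSY_INTERVAL_MS	256UL

/*
 * Clamp the busy balance interval at MC level so that lowering
 * busy_factor cannot shrink it below the floor above.
 */
static unsigned long mc_busy_interval_ms(unsigned long sd_weight,
					 unsigned long busy_factor)
{
	unsigned long interval = sd_weight * busy_factor;

	return interval < MC_MIN_BUSY_INTERVAL_MS ?
	       MC_MIN_BUSY_INTERVAL_MS : interval;
}

int main(void)
{
	/* ROME CCX: 8 CPUs at MC level, as in the example above */
	printf("busy_factor 32: %lums\n", mc_busy_interval_ms(8, 32));
	printf("busy_factor 16: %lums\n", mc_busy_interval_ms(8, 16));
	return 0;
}

With the hypothetical 256ms floor, both busy_factor values give 256ms for
the 8-CPU MC level.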

Thx.
Regards,
Jiang

>
> Even on smaller system, a lower busy factor has shown improvement on the
> fair distribution of the running time so let reduce it for all.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/topology.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 1a84b778755d..a8477c9e8569 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1336,7 +1336,7 @@ sd_init(struct sched_domain_topology_level *tl,
> *sd = (struct sched_domain){
> .min_interval = sd_weight,
> .max_interval = 2*sd_weight,
> - .busy_factor = 32,
> + .busy_factor = 16,
> .imbalance_pct = 117,
>
> .cache_nice_tries = 0,
> --
> 2.17.1
>

2020-09-15 09:32:19

by Vincent Guittot

Subject: Re: [PATCH 4/4] sched/fair: reduce busy load balance interval

On Tue, 15 Sep 2020 at 11:11, Jiang Biao <[email protected]> wrote:
>
> Hi, Vincent
>
> On Mon, 14 Sep 2020 at 18:07, Vincent Guittot
> <[email protected]> wrote:
> >
> > The busy_factor, which increases load balance interval when a cpu is busy,
> > is set to 32 by default. This value generates some huge LB interval on
> > large system like the THX2 made of 2 node x 28 cores x 4 threads.
> > For such system, the interval increases from 112ms to 3584ms at MC level.
> > And from 228ms to 7168ms at NUMA level.
> Agreed that the interval is too big for that case.
> But would it be too small for an AMD environment(like ROME) with 8cpu
> at MC level(CCX), if we reduce busy_factor?

Are you sure that this is too small? As mentioned in the commit
message below, I tested it on a small system (2x4 cores Arm64) and I
have seen some improvements.

> For that case, the interval could be reduced from 256ms to 128ms.
> Or should we define an MIN_INTERVAL for MC level to avoid too small interval?

What would be too small an interval?

Before this patch, for a level with 8 cores:
when idle, the interval is 8ms and increases to 256ms when busy.
After the patch:
when idle, the interval is still 8ms and increases to 128ms when busy.

Regards,
Vincent

>
> Thx.
> Regards,
> Jiang
>
> >
> > Even on smaller system, a lower busy factor has shown improvement on the
> > fair distribution of the running time so let reduce it for all.
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/topology.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 1a84b778755d..a8477c9e8569 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1336,7 +1336,7 @@ sd_init(struct sched_domain_topology_level *tl,
> > *sd = (struct sched_domain){
> > .min_interval = sd_weight,
> > .max_interval = 2*sd_weight,
> > - .busy_factor = 32,
> > + .busy_factor = 16,
> > .imbalance_pct = 117,
> >
> > .cache_nice_tries = 0,
> > --
> > 2.17.1
> >

2020-09-15 19:13:08

by Valentin Schneider

Subject: Re: [PATCH 4/4] sched/fair: reduce busy load balance interval


On 14/09/20 11:03, Vincent Guittot wrote:
> The busy_factor, which increases load balance interval when a cpu is busy,
> is set to 32 by default. This value generates some huge LB interval on
> large system like the THX2 made of 2 node x 28 cores x 4 threads.
> For such system, the interval increases from 112ms to 3584ms at MC level.
> And from 228ms to 7168ms at NUMA level.
>
> Even on smaller system, a lower busy factor has shown improvement on the
> fair distribution of the running time so let reduce it for all.
>

ISTR you mentioned taking this one step further and making
(interval * busy_factor) scale logarithmically with the number of CPUs to
avoid reaching outrageous numbers. Did you experiment with that already?
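
Purely to illustrate the idea (the formula below is made up for this sketch
and has not been posted anywhere), a logarithmic scaling could look like:

#include <stdio.h>

/* user-space stand-in for the kernel's ilog2() */
static unsigned int ilog2_ul(unsigned long v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/*
 * Hypothetical shape of the scaling: let the busy multiplier shrink
 * logarithmically with the domain weight so that the busy interval
 * stops growing linearly with the number of CPUs.
 */
static unsigned long busy_interval_ms(unsigned long sd_weight)
{
	unsigned int shift = ilog2_ul(sd_weight);
	unsigned long busy_factor = 32 / (shift ? shift : 1);

	if (!busy_factor)
		busy_factor = 1;

	return sd_weight * busy_factor;
}

int main(void)
{
	unsigned long weights[] = { 2, 4, 8, 112, 224 };
	unsigned int i;

	for (i = 0; i < sizeof(weights) / sizeof(weights[0]); i++)
		printf("weight %3lu -> busy interval %4lums\n",
		       weights[i], busy_interval_ms(weights[i]));
	return 0;
}

For the THX2 weights above this gives 560ms at MC (weight 112) and 896ms at
NUMA (weight 224) instead of 3584ms/7168ms, but note that weights 2 and 4
already collide at 64ms, which is the kind of issue Vincent raises later in
the thread.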

> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/topology.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 1a84b778755d..a8477c9e8569 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1336,7 +1336,7 @@ sd_init(struct sched_domain_topology_level *tl,
> *sd = (struct sched_domain){
> .min_interval = sd_weight,
> .max_interval = 2*sd_weight,
> - .busy_factor = 32,
> + .busy_factor = 16,
> .imbalance_pct = 117,
>
> .cache_nice_tries = 0,

2020-09-16 00:50:38

by Vincent Guittot

Subject: Re: [PATCH 4/4] sched/fair: reduce busy load balance interval

On Tue, 15 Sep 2020 at 13:36, Jiang Biao <[email protected]> wrote:
>
> Hi, Vincent
>
> On Tue, 15 Sep 2020 at 17:28, Vincent Guittot
> <[email protected]> wrote:
> >
> > On Tue, 15 Sep 2020 at 11:11, Jiang Biao <[email protected]> wrote:
> > >
> > > Hi, Vincent
> > >
> > > On Mon, 14 Sep 2020 at 18:07, Vincent Guittot
> > > <[email protected]> wrote:
> > > >
> > > > The busy_factor, which increases load balance interval when a cpu is busy,
> > > > is set to 32 by default. This value generates some huge LB interval on
> > > > large system like the THX2 made of 2 node x 28 cores x 4 threads.
> > > > For such system, the interval increases from 112ms to 3584ms at MC level.
> > > > And from 228ms to 7168ms at NUMA level.
> > > Agreed that the interval is too big for that case.
> > > But would it be too small for an AMD environment(like ROME) with 8cpu
> > > at MC level(CCX), if we reduce busy_factor?
> >
> > Are you sure that this is too small ? As mentioned in the commit
> > message below, I tested it on small system (2x4 cores Arm64) and i
> > have seen some improvements
> Not so sure. :)
> Small interval means more frequent balances and more cost consumed for
> balancing, especially for pinned vm cases.

If you are running only pinned threads, the interval can increase
above 512ms, which means 8sec after applying the busy factor.
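(For reference: 512ms * 16 = 8192ms, hence the ~8sec. The 512ms presumably
comes from the balance interval being doubled on failed all-pinned balances,
capped around MAX_PINNED_INTERVAL in fair.c.)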

> For our case, we have AMD ROME servers made of 2node x 48cores x
> 2thread, and 8c at MC level(within a CCX). The 256ms interval seems a
> little too big for us, compared to Intel Cascadlake CPU with 48c at MC

So IIUC your topology is:
2 nodes at NUMA
6 CCX at DIE level
8 cores per CCX at MC
2 threads per core at SMT

> level, whose balance interval is 1536ms. 128ms seems a little more
> waste. :)

The 256ms/128ms interval only covers the 8 cores, whereas the 1536ms
interval covers the whole 48 cores.

> I guess more balance costs may hurt the throughput of sysbench like
> benchmark.. Just a guess.
>
> >
> > > For that case, the interval could be reduced from 256ms to 128ms.
> > > Or should we define an MIN_INTERVAL for MC level to avoid too small interval?
> >
> > What would be a too small interval ?
> That's hard to say. :)
> My guess is just for large server system cases.
>
> Thanks.
> Regards,
> Jiang

2020-09-16 00:52:35

by Jiang Biao

Subject: Re: [PATCH 4/4] sched/fair: reduce busy load balance interval

Hi, Vincent

On Tue, 15 Sep 2020 at 17:28, Vincent Guittot
<[email protected]> wrote:
>
> On Tue, 15 Sep 2020 at 11:11, Jiang Biao <[email protected]> wrote:
> >
> > Hi, Vincent
> >
> > On Mon, 14 Sep 2020 at 18:07, Vincent Guittot
> > <[email protected]> wrote:
> > >
> > > The busy_factor, which increases load balance interval when a cpu is busy,
> > > is set to 32 by default. This value generates some huge LB interval on
> > > large system like the THX2 made of 2 node x 28 cores x 4 threads.
> > > For such system, the interval increases from 112ms to 3584ms at MC level.
> > > And from 228ms to 7168ms at NUMA level.
> > Agreed that the interval is too big for that case.
> > But would it be too small for an AMD environment(like ROME) with 8cpu
> > at MC level(CCX), if we reduce busy_factor?
>
> Are you sure that this is too small ? As mentioned in the commit
> message below, I tested it on small system (2x4 cores Arm64) and i
> have seen some improvements
Not so sure. :)
A smaller interval means more frequent balancing and more cost spent on
balancing, especially for pinned VM cases.
In our case, we have AMD ROME servers made of 2 nodes x 48 cores x
2 threads, with 8 CPUs at MC level (within a CCX). The 256ms interval seems a
little too big for us compared to an Intel Cascade Lake CPU with 48 CPUs at MC
level, whose balance interval is 1536ms. 128ms seems a little wasteful. :)
I guess more balancing cost may hurt the throughput of sysbench-like
benchmarks. Just a guess.
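(For reference, these figures are consistent with busy interval ~ sd_weight
in ms * busy_factor: ROME MC is 8 * 32 = 256ms, or 8 * 16 = 128ms with this
patch, while Cascade Lake MC is 48 * 32 = 1536ms.)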

>
> > For that case, the interval could be reduced from 256ms to 128ms.
> > Or should we define an MIN_INTERVAL for MC level to avoid too small interval?
>
> What would be a too small interval ?
That's hard to say. :)
My guess only concerns large server system cases.

Thanks.
Regards,
Jiang

2020-09-16 01:17:46

by Jiang Biao

Subject: Re: [PATCH 4/4] sched/fair: reduce busy load balance interval

Hi,

On Tue, 15 Sep 2020 at 20:43, Vincent Guittot
<[email protected]> wrote:
>
> On Tue, 15 Sep 2020 at 13:36, Jiang Biao <[email protected]> wrote:
> >
> > Hi, Vincent
> >
> > On Tue, 15 Sep 2020 at 17:28, Vincent Guittot
> > <[email protected]> wrote:
> > >
> > > On Tue, 15 Sep 2020 at 11:11, Jiang Biao <[email protected]> wrote:
> > > >
> > > > Hi, Vincent
> > > >
> > > > On Mon, 14 Sep 2020 at 18:07, Vincent Guittot
> > > > <[email protected]> wrote:
> > > > >
> > > > > The busy_factor, which increases load balance interval when a cpu is busy,
> > > > > is set to 32 by default. This value generates some huge LB interval on
> > > > > large system like the THX2 made of 2 node x 28 cores x 4 threads.
> > > > > For such system, the interval increases from 112ms to 3584ms at MC level.
> > > > > And from 228ms to 7168ms at NUMA level.
> > > > Agreed that the interval is too big for that case.
> > > > But would it be too small for an AMD environment(like ROME) with 8cpu
> > > > at MC level(CCX), if we reduce busy_factor?
> > >
> > > Are you sure that this is too small ? As mentioned in the commit
> > > message below, I tested it on small system (2x4 cores Arm64) and i
> > > have seen some improvements
> > Not so sure. :)
> > Small interval means more frequent balances and more cost consumed for
> > balancing, especially for pinned vm cases.
>
> If you are running only pinned threads, the interval can increase
> above 512ms which means 8sec after applying the busy factor
Yep. :)

>
> > For our case, we have AMD ROME servers made of 2node x 48cores x
> > 2thread, and 8c at MC level(within a CCX). The 256ms interval seems a
> > little too big for us, compared to Intel Cascadlake CPU with 48c at MC
>
> so IIUC your topology is :
> 2 nodes at NUMA
> 6 CCX at DIE level
> 8 cores per CCX at MC
> 2 threads per core at SMT
Yes.

>
> > level, whose balance interval is 1536ms. 128ms seems a little more
> > waste. :)
>
> the 256ms/128ms interval only looks at 8 cores whereas the 1536
> intervall looks for the whole 48 cores
Yes. The real problem for us is that the CPU count difference between MC
and DIE level is too big (8 vs. 96): 3072ms at DIE level is too big (reducing
busy_factor is good enough there), while 128ms at MC level seems a little
wasteful (if busy_factor is reduced).
No objection to this patch, though; it still looks OK for us.

Thx.
Regards,
Jiang

2020-09-16 07:03:21

by Vincent Guittot

Subject: Re: [PATCH 4/4] sched/fair: reduce busy load balance interval

On Tue, 15 Sep 2020 at 21:04, Valentin Schneider
<[email protected]> wrote:
>
>
> On 14/09/20 11:03, Vincent Guittot wrote:
> > The busy_factor, which increases load balance interval when a cpu is busy,
> > is set to 32 by default. This value generates some huge LB interval on
> > large system like the THX2 made of 2 node x 28 cores x 4 threads.
> > For such system, the interval increases from 112ms to 3584ms at MC level.
> > And from 228ms to 7168ms at NUMA level.
> >
> > Even on smaller system, a lower busy factor has shown improvement on the
> > fair distribution of the running time so let reduce it for all.
> >
>
> ISTR you mentioned taking this one step further and making
> (interval * busy_factor) scale logarithmically with the number of CPUs to
> avoid reaching outrageous numbers. Did you experiment with that already?

Yes, I have tried the logarithmic scaling but it didn't give any
benefit compared to this solution for the fairness problem, and it
impacted other use cases because it impacts the idle interval. It also
adds more constraints to the computation of the interval and
busy_factor, because we can end up with the same interval for 2
consecutive levels.

That being said, it might be useful for other cases but I haven't looked
further into this.
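(As a concrete illustration of that last point, with a busy interval of the
hypothetical form sd_weight * (32 / ilog2(sd_weight)), a weight-2 level and
a weight-4 level both end up at 64ms: 2 * 32/1 and 4 * 32/2.)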

>
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/topology.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 1a84b778755d..a8477c9e8569 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1336,7 +1336,7 @@ sd_init(struct sched_domain_topology_level *tl,
> > *sd = (struct sched_domain){
> > .min_interval = sd_weight,
> > .max_interval = 2*sd_weight,
> > - .busy_factor = 32,
> > + .busy_factor = 16,
> > .imbalance_pct = 117,
> >
> > .cache_nice_tries = 0,

2020-09-16 08:37:57

by Valentin Schneider

Subject: Re: [PATCH 4/4] sched/fair: reduce busy load balance interval


On 16/09/20 08:02, Vincent Guittot wrote:
> On Tue, 15 Sep 2020 at 21:04, Valentin Schneider
> <[email protected]> wrote:
>>
>>
>> On 14/09/20 11:03, Vincent Guittot wrote:
>> > The busy_factor, which increases load balance interval when a cpu is busy,
>> > is set to 32 by default. This value generates some huge LB interval on
>> > large system like the THX2 made of 2 node x 28 cores x 4 threads.
>> > For such system, the interval increases from 112ms to 3584ms at MC level.
>> > And from 228ms to 7168ms at NUMA level.
>> >
>> > Even on smaller system, a lower busy factor has shown improvement on the
>> > fair distribution of the running time so let reduce it for all.
>> >
>>
>> ISTR you mentioned taking this one step further and making
>> (interval * busy_factor) scale logarithmically with the number of CPUs to
>> avoid reaching outrageous numbers. Did you experiment with that already?
>
> Yes I have tried the logarithmically scale but It didn't give any
> benefit compared to this solution for the fairness problem but
> impacted other use cases because it impacts idle interval and it also
> adds more constraints in the computation of the interval and
> busy_factor because we can end up with the same interval for 2
> consecutive levels .
>

Right, I suppose we could frob a topology level index in there to prevent
that if we really wanted to...

> That being said, it might be useful for other cases but i haven't look
> further for this
>

Fair enough!

>>
>> > Signed-off-by: Vincent Guittot <[email protected]>
>> > ---
>> > kernel/sched/topology.c | 2 +-
>> > 1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> > index 1a84b778755d..a8477c9e8569 100644
>> > --- a/kernel/sched/topology.c
>> > +++ b/kernel/sched/topology.c
>> > @@ -1336,7 +1336,7 @@ sd_init(struct sched_domain_topology_level *tl,
>> > *sd = (struct sched_domain){
>> > .min_interval = sd_weight,
>> > .max_interval = 2*sd_weight,
>> > - .busy_factor = 32,
>> > + .busy_factor = 16,
>> > .imbalance_pct = 117,
>> >
>> > .cache_nice_tries = 0,