2023-10-03 20:25:43

by Julia Lawall

Subject: EEVDF and NUMA balancing

Is it expected that the commit e8f331bcc270 should have an impact on the
frequency of NUMA balancing?

The NAS benchmark ua.C.x (NPB3.4-OMP,
https://github.com/mbdevpl/nas-parallel-benchmarks.git) on a 4-socket
Intel Xeon 6130 suffers from some NUMA moves that leave some sockets with
too few threads and other sockets with too many threads. Prior to the
commit e8f331bcc270, this was corrected by subsequent load balancing,
leading to run times of 20-40 seconds (around 20 seconds can be achieved
if one just turns NUMA balancing off). After commit e8f331bcc270, the
running time can go up to 150 seconds. In the worst case, I have seen a
core remain idle for 75 seconds. It seems that the load balancer at the
NUMA domain level is not able to do anything, because when a core on the
overloaded socket has multiple threads, they are tasks that were NUMA
balanced to the socket, and thus should not leave. So the "busiest" core
chosen by find_busiest_queue doesn't actually contain any stealable
threads. Maybe it could be worth stealing from a core that has only one
task in this case, in hopes that the tasks that are tied to a socket will
spread out better across it if more space is available?

An example run is attached. The cores are renumbered according to the
sockets, so there is an overload on socket 1 and an underload on socket 2.

julia


Attachments:
ua.C.x_yeti-2_ge8f331bcc270_performance_18_socketorder.pdf (323.51 kB)

2023-10-03 21:52:19

by Peter Zijlstra

Subject: Re: EEVDF and NUMA balancing

On Tue, Oct 03, 2023 at 10:25:08PM +0200, Julia Lawall wrote:
> Is it expected that the commit e8f331bcc270 should have an impact on the
> frequency of NUMA balancing?

Definitely not expected. The only effect of that commit was supposed to
be the runqueue order of tasks. I'll go stare at it in the morning --
definitely too late for critical thinking atm.

Thanks!

> The NAS benchmark ua.C.x (NPB3.4-OMP,
> https://github.com/mbdevpl/nas-parallel-benchmarks.git) on a 4-socket
> Intel Xeon 6130 suffers from some NUMA moves that leave some sockets with
> too few threads and other sockets with too many threads. Prior to the
> commit e8f331bcc270, this was corrected by subsequent load balancing,
> leading to run times of 20-40 seconds (around 20 seconds can be achieved
> if one just turns NUMA balancing off). After commit e8f331bcc270, the
> running time can go up to 150 seconds. In the worst case, I have seen a
> core remain idle for 75 seconds. It seems that the load balancer at the
> NUMA domain level is not able to do anything, because when a core on the
> overloaded socket has multiple threads, they are tasks that were NUMA
> balanced to the socket, and thus should not leave. So the "busiest" core
> chosen by find_busiest_queue doesn't actually contain any stealable
> threads. Maybe it could be worth stealing from a core that has only one
> task in this case, in hopes that the tasks that are tied to a socket will
> spread out better across it if more space is available?
>
> An example run is attached. The cores are renumbered according to the
> sockets, so there is an overload on socket 1 and an underload on socket 2.
>
> julia


2023-10-04 12:02:16

by Julia Lawall

Subject: Re: EEVDF and NUMA balancing



On Tue, 3 Oct 2023, Peter Zijlstra wrote:

> On Tue, Oct 03, 2023 at 10:25:08PM +0200, Julia Lawall wrote:
> > Is it expected that the commit e8f331bcc270 should have an impact on the
> > frequency of NUMA balancing?
>
> Definitely not expected. The only effect of that commit was supposed to
> be the runqueue order of tasks. I'll go stare at it in the morning --
> definitely too late for critical thinking atm.

Maybe it's just randomly making a bad situation worse rather than directly
introducing a problem. There is a high standard deviation in the
performance. Here are some results with hyperfine. The general trends
are reproducible.

julia

Parent of e8f331bcc270, and typical of earlier commits:

::::::::::::::
ua.C.x_yeti-4_g76cae9dbe185_performance.json
::::::::::::::
{
"results": [
{
"command": "./ua.C.x",
"mean": 30.404105904309993,
"stddev": 6.453760260515126,
"median": 29.533294615035,
"user": 3858.47296929,
"system": 11.516864580000004,
"min": 21.987556851035002,
"max": 50.464735263034996,
"times": [
34.413034851035,
27.065085820035,
26.838279920035,
26.351314604035,
32.374011336035,
25.954025885035,
23.035775634035,
44.235798762034996,
31.300110969035,
23.880906093035,
50.464735263034996,
35.448494361034996,
27.299214444035,
27.225401613035,
25.065921751035,
25.729637724035,
21.987556851035002,
26.925861508035002,
29.757618969035,
33.824266792035,
23.601111060035,
27.949622236035,
33.836797180035,
31.107119088035,
34.467454332035,
25.538367186035,
44.052246282035,
36.811265399034994,
25.450476009035,
23.805947650035,
32.977559361035,
33.023708943035,
30.331184650035002,
31.707529155035,
30.281404379035,
43.624723016035,
29.552102609035,
29.514486621035,
26.272782395035,
23.081295470035002
]
}
]
}
::::::::::::::
ua.C.x_yeti-4_ge8f331bcc270_performance.json
::::::::::::::
{
"results": [
{
"command": "./ua.C.x",
"mean": 39.475254171930004,
"stddev": 23.25418332945763,
"median": 32.146023067405,
"user": 4990.425470314998,
"system": 10.6357894,
"min": 21.404253416405,
"max": 142.348752034405,
"times": [
39.670084545405,
22.450176801405,
33.077489706405,
65.853454333405,
23.453408823405,
24.179283189404998,
59.538350766404996,
27.435145718405,
22.806777380405,
44.347348933405,
26.028480016405,
24.918487113405,
105.289569793405,
32.857970958405,
31.176198789405,
39.639462769405,
38.234222138405,
41.646424303405,
31.434075176405,
25.651942354404998,
42.029314429405,
26.871583034405,
62.334539310405,
142.348752034405,
23.912191729405,
24.219083951405,
22.243050782405,
22.957280548405,
35.763612381405,
30.797416492405,
50.024712290405,
25.385043529405,
27.676768642404998,
49.878477271404996,
30.451312037405,
35.842247874405,
49.171212633405,
48.880110438405,
47.130850438405,
21.404253416405
]
}
]
}




>
> Thanks!
>
> > The NAS benchmark ua.C.x (NPB3.4-OMP,
> > https://github.com/mbdevpl/nas-parallel-benchmarks.git) on a 4-socket
> > Intel Xeon 6130 suffers from some NUMA moves that leave some sockets with
> > too few threads and other sockets with too many threads. Prior to the
> > commit e8f331bcc270, this was corrected by subsequent load balancing,
> > leading to run times of 20-40 seconds (around 20 seconds can be achieved
> > if one just turns NUMA balancing off). After commit e8f331bcc270, the
> > running time can go up to 150 seconds. In the worst case, I have seen a
> > core remain idle for 75 seconds. It seems that the load balancer at the
> > NUMA domain level is not able to do anything, because when a core on the
> > overloaded socket has multiple threads, they are tasks that were NUMA
> > balanced to the socket, and thus should not leave. So the "busiest" core
> > chosen by find_busiest_queue doesn't actually contain any stealable
> > threads. Maybe it could be worth stealing from a core that has only one
> > task in this case, in hopes that the tasks that are tied to a socket will
> > spread out better across it if more space is available?
> >
> > An example run is attached. The cores are renumbered according to the
> > sockets, so there is an overload on socket 1 and an underload on socket 2.
> >
> > julia
>
>
>

2023-10-04 12:06:40

by Peter Zijlstra

Subject: Re: EEVDF and NUMA balancing

On Wed, Oct 04, 2023 at 02:01:26PM +0200, Julia Lawall wrote:
>
>
> On Tue, 3 Oct 2023, Peter Zijlstra wrote:
>
> > On Tue, Oct 03, 2023 at 10:25:08PM +0200, Julia Lawall wrote:
> > > Is it expected that the commit e8f331bcc270 should have an impact on the
> > > frequency of NUMA balancing?
> >
> > Definitely not expected. The only effect of that commit was supposed to
> > be the runqueue order of tasks. I'll go stare at it in the morning --
> > definitely too late for critical thinking atm.
>
> Maybe it's just randomly making a bad situation worse rather than directly
> introducing a problem. There is a high standard deviation in the
> performance. Here are some results with hyperfine. The general trends
> are reproducible.

OK, I'm still busy trying to bring a 4 socket machine up-to-date...
gawd I hate the boot times on those machines :/

But yeah, I was thinking similar things, I really can't spot an obvious
fail in that commit.

I'll go have a poke once the darn machine is willing to submit :-)

2023-10-04 16:25:04

by Julia Lawall

Subject: Re: EEVDF and NUMA balancing



On Wed, 4 Oct 2023, Peter Zijlstra wrote:

> On Wed, Oct 04, 2023 at 02:01:26PM +0200, Julia Lawall wrote:
> >
> >
> > On Tue, 3 Oct 2023, Peter Zijlstra wrote:
> >
> > > On Tue, Oct 03, 2023 at 10:25:08PM +0200, Julia Lawall wrote:
> > > > Is it expected that the commit e8f331bcc270 should have an impact on the
> > > > frequency of NUMA balancing?
> > >
> > > Definitely not expected. The only effect of that commit was supposed to
> > > be the runqueue order of tasks. I'll go stare at it in the morning --
> > > definitely too late for critical thinking atm.
> >
> > Maybe it's just randomly making a bad situation worse rather than directly
> > introducing a problem. There is a high standard deviation in the
> > performance. Here are some results with hyperfine. The general trends
> > are reproducible.
>
> OK,. I'm still busy trying to bring a 4 socket machine up-to-date...
> gawd I hate the boot times on those machines :/
>
> But yeah, I was thinking similar things, I really can't spot an obvious
> fail in that commit.
>
> I'll go have a poke once the darn machine is willing to submit :-)

I tried a two-socket machine, but in 50 runs the problem doesn't show up.

The commit e8f331bcc270 starts with

- if (sched_feat(PLACE_LAG) && cfs_rq->nr_running > 1) {
+ if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {

This seemed like a big change - cfs_rq->nr_running > 1 should be rarely
true in ua, while cfs_rq->nr_running should always be true. Adding back
the > 1 and simply replacing the test by 0 both had no effect, though.

julia

2023-10-04 17:48:22

by Peter Zijlstra

Subject: Re: EEVDF and NUMA balancing

On Wed, Oct 04, 2023 at 06:24:39PM +0200, Julia Lawall wrote:
>
>
> On Wed, 4 Oct 2023, Peter Zijlstra wrote:
>
> > On Wed, Oct 04, 2023 at 02:01:26PM +0200, Julia Lawall wrote:
> > >
> > >
> > > On Tue, 3 Oct 2023, Peter Zijlstra wrote:
> > >
> > > > On Tue, Oct 03, 2023 at 10:25:08PM +0200, Julia Lawall wrote:
> > > > > Is it expected that the commit e8f331bcc270 should have an impact on the
> > > > > frequency of NUMA balancing?
> > > >
> > > > Definitely not expected. The only effect of that commit was supposed to
> > > > be the runqueue order of tasks. I'll go stare at it in the morning --
> > > > definitely too late for critical thinking atm.
> > >
> > > Maybe it's just randomly making a bad situation worse rather than directly
> > > introducing a problem. There is a high standard deviation in the
> > > performance. Here are some results with hyperfine. The general trends
> > > are reproducible.
> >
> > OK,. I'm still busy trying to bring a 4 socket machine up-to-date...
> > gawd I hate the boot times on those machines :/
> >
> > But yeah, I was thinking similar things, I really can't spot an obvious
> > fail in that commit.
> >
> > I'll go have a poke once the darn machine is willing to submit :-)
>
> I tried a two-socket machine, but in 50 runs the problem doesn't show up.

I've had to re-install the 4 socket thing -- lost the day to this
trainwreck :/ Because obviously the BMC needs Java and that all don't
work anymore -- so I had to go sit next to the jet-engine thing with a
keyboard and monitor.

I'll go build the benchmark thing tomorrow, if I can figure out how that
works, this NAS stuff looked 'special'. Nothing simple like ./configure;
make -j$lots :/

> The commit e8f331bcc270 starts with
>
> - if (sched_feat(PLACE_LAG) && cfs_rq->nr_running > 1) {
> + if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
>
> This seemed like a big change - cfs_rq->nr_running > 1 should be rarely
> true in ua, while cfs_rq->nr_running should always be true. Adding back
> the > 1 and simply replacing the test by 0 both had no effect, though.

Yeah, this is because I flip the order of place_entity() and
nr_running++ around later in the patch. Previously it would increment
before place, now it does place before increment.
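
As a stand-alone toy illustration of why that makes the two tests
equivalent (everything below is invented for the sketch, it is not the
kernel code):

#include <stdio.h>

static int nr_running;

/* old ordering: increment first, placement then tested nr_running > 1 */
static void enqueue_old(void)
{
        nr_running++;
        printf("old (nr_running > 1): apply lag placement = %d\n",
               nr_running > 1);
}

/* new ordering: placement tests nr_running != 0, then the increment */
static void enqueue_new(void)
{
        printf("new (nr_running):     apply lag placement = %d\n",
               nr_running != 0);
        nr_running++;
}

int main(void)
{
        /* empty runqueue: both variants skip the lag adjustment */
        nr_running = 0; enqueue_old();
        nr_running = 0; enqueue_new();

        /* one task already queued: both variants apply it */
        nr_running = 1; enqueue_old();
        nr_running = 1; enqueue_new();
        return 0;
}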

2023-10-04 18:05:05

by Julia Lawall

Subject: Re: EEVDF and NUMA balancing

> I'll go build the benchmark thing tomorrow, if I can figure out how that
> works, this NAS stuff looked 'special'. Nothing simple like ./configure;
> make -j$lots :/

Starting from git clone, I had to do:

cd NPB3.4-OMP
mkdir bin
cd config
cp make.def.template make.def
cd ..
make ua CLASS=C

You also need gfortran to be installed.


>
> > The commit e8f331bcc270 starts with
> >
> > - if (sched_feat(PLACE_LAG) && cfs_rq->nr_running > 1) {
> > + if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
> >
> > This seemed like a big change - cfs_rq->nr_running > 1 should be rarely
> > true in ua, while cfs_rq->nr_running should always be true. Adding back
> > the > 1 and simply replacing the test by 0 both had no effect, though.
>
> Yeah, this is because I flip the order of place_entity() and
> nr_running++ around later in the patch. Previously it would increment
> before place, now it does place before increment.

Ah, ok, not likely the source of the problem then.

Thanks,
julia

2023-10-04 18:16:13

by Ingo Molnar

Subject: Re: EEVDF and NUMA balancing


* Julia Lawall <[email protected]> wrote:

>
>
> On Wed, 4 Oct 2023, Peter Zijlstra wrote:
>
> > On Wed, Oct 04, 2023 at 02:01:26PM +0200, Julia Lawall wrote:
> > >
> > >
> > > On Tue, 3 Oct 2023, Peter Zijlstra wrote:
> > >
> > > > On Tue, Oct 03, 2023 at 10:25:08PM +0200, Julia Lawall wrote:
> > > > > Is it expected that the commit e8f331bcc270 should have an impact on the
> > > > > frequency of NUMA balancing?
> > > >
> > > > Definitely not expected. The only effect of that commit was supposed to
> > > > be the runqueue order of tasks. I'll go stare at it in the morning --
> > > > definitely too late for critical thinking atm.
> > >
> > > Maybe it's just randomly making a bad situation worse rather than directly
> > > introducing a problem. There is a high standard deviation in the
> > > performance. Here are some results with hyperfine. The general trends
> > > are reproducible.
> >
> > OK,. I'm still busy trying to bring a 4 socket machine up-to-date...
> > gawd I hate the boot times on those machines :/
> >
> > But yeah, I was thinking similar things, I really can't spot an obvious
> > fail in that commit.
> >
> > I'll go have a poke once the darn machine is willing to submit :-)
>
> I tried a two-socket machine, but in 50 runs the problem doesn't show up.
>
> The commit e8f331bcc270 starts with
>
> - if (sched_feat(PLACE_LAG) && cfs_rq->nr_running > 1) {
> + if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
>
> This seemed like a big change - cfs_rq->nr_running > 1 should be rarely
> true in ua, while cfs_rq->nr_running should always be true. Adding back
> the > 1 and simply replacing the test by 0 both had no effect, though.

BTW., in terms of statistical reliability, one of the biggest ...
stochastic elements of scheduler balancing is wakeup-preemption - which
you can turn off via:

echo NO_WAKEUP_PREEMPTION > /debug/sched/features

or:

echo NO_WAKEUP_PREEMPTION > /sys/kernel/debug/sched/features

If you can measure a performance regression with WAKEUP_PREEMPTION turned
off in *both* kernels, there's likely a material change (regression) in the
quality of NUMA load-balancing.

If it goes away or changes dramatically with WAKEUP_PREEMPTION off, then
I'd pin this effect to EEVDF causing timing changes that are subtly
shifting NUMA & SMP balancing decisions past some critical threshold that
is detrimental to this particular workload.

( Obviously both are regressions we care about - but doing this test would
help categorize the nature of the regression. )

Thanks,

Ingo

2023-10-04 18:20:15

by Julia Lawall

Subject: Re: EEVDF and NUMA balancing



On Wed, 4 Oct 2023, Ingo Molnar wrote:

>
> * Julia Lawall <[email protected]> wrote:
>
> >
> >
> > On Wed, 4 Oct 2023, Peter Zijlstra wrote:
> >
> > > On Wed, Oct 04, 2023 at 02:01:26PM +0200, Julia Lawall wrote:
> > > >
> > > >
> > > > On Tue, 3 Oct 2023, Peter Zijlstra wrote:
> > > >
> > > > > On Tue, Oct 03, 2023 at 10:25:08PM +0200, Julia Lawall wrote:
> > > > > > Is it expected that the commit e8f331bcc270 should have an impact on the
> > > > > > frequency of NUMA balancing?
> > > > >
> > > > > Definitely not expected. The only effect of that commit was supposed to
> > > > > be the runqueue order of tasks. I'll go stare at it in the morning --
> > > > > definitely too late for critical thinking atm.
> > > >
> > > > Maybe it's just randomly making a bad situation worse rather than directly
> > > > introducing a problem. There is a high standard deviation in the
> > > > performance. Here are some results with hyperfine. The general trends
> > > > are reproducible.
> > >
> > > OK,. I'm still busy trying to bring a 4 socket machine up-to-date...
> > > gawd I hate the boot times on those machines :/
> > >
> > > But yeah, I was thinking similar things, I really can't spot an obvious
> > > fail in that commit.
> > >
> > > I'll go have a poke once the darn machine is willing to submit :-)
> >
> > I tried a two-socket machine, but in 50 runs the problem doesn't show up.
> >
> > The commit e8f331bcc270 starts with
> >
> > - if (sched_feat(PLACE_LAG) && cfs_rq->nr_running > 1) {
> > + if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
> >
> > This seemed like a big change - cfs_rq->nr_running > 1 should be rarely
> > true in ua, while cfs_rq->nr_running should always be true. Adding back
> > the > 1 and simply replacing the test by 0 both had no effect, though.
>
> BTW., in terms of statistical reliability, one of the biggest ...
> stochastic elements of scheduler balancing is wakeup-preemption - which
> you can turn off via:
>
> echo NO_WAKEUP_PREEMPTION > /debug/sched/features
>
> or:
>
> echo NO_WAKEUP_PREEMPTION > /sys/kernel/debug/sched/features
>
> If you can measure a performance regression with WAKEUP_PREEMPTION turned
> off in *both* kernels, there's likely a material change (regression) in the
> quality of NUMA load-balancing.
>
> If it goes away or changes dramatically with WAKEUP_PREEMPTION off, then
> I'd pin this effect to EEVDF causing timing changes that are subtly
> shifting NUMA & SMP balancing decisions past some critical threshold that
> is detrimental to this particular workload.
>
> ( Obviously both are regressions we care about - but doing this test would
> help categorize the nature of the regression. )

Thanks for the suggestion. I will try that,

julia

2023-10-04 19:49:10

by Julia Lawall

Subject: Re: EEVDF and NUMA balancing



On Wed, 4 Oct 2023, Ingo Molnar wrote:

>
> * Julia Lawall <[email protected]> wrote:
>
> >
> >
> > On Wed, 4 Oct 2023, Peter Zijlstra wrote:
> >
> > > On Wed, Oct 04, 2023 at 02:01:26PM +0200, Julia Lawall wrote:
> > > >
> > > >
> > > > On Tue, 3 Oct 2023, Peter Zijlstra wrote:
> > > >
> > > > > On Tue, Oct 03, 2023 at 10:25:08PM +0200, Julia Lawall wrote:
> > > > > > Is it expected that the commit e8f331bcc270 should have an impact on the
> > > > > > frequency of NUMA balancing?
> > > > >
> > > > > Definitely not expected. The only effect of that commit was supposed to
> > > > > be the runqueue order of tasks. I'll go stare at it in the morning --
> > > > > definitely too late for critical thinking atm.
> > > >
> > > > Maybe it's just randomly making a bad situation worse rather than directly
> > > > introducing a problem. There is a high standard deviation in the
> > > > performance. Here are some results with hyperfine. The general trends
> > > > are reproducible.
> > >
> > > OK,. I'm still busy trying to bring a 4 socket machine up-to-date...
> > > gawd I hate the boot times on those machines :/
> > >
> > > But yeah, I was thinking similar things, I really can't spot an obvious
> > > fail in that commit.
> > >
> > > I'll go have a poke once the darn machine is willing to submit :-)
> >
> > I tried a two-socket machine, but in 50 runs the problem doesn't show up.
> >
> > The commit e8f331bcc270 starts with
> >
> > - if (sched_feat(PLACE_LAG) && cfs_rq->nr_running > 1) {
> > + if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
> >
> > This seemed like a big change - cfs_rq->nr_running > 1 should be rarely
> > true in ua, while cfs_rq->nr_running should always be true. Adding back
> > the > 1 and simply replacing the test by 0 both had no effect, though.
>
> BTW., in terms of statistical reliability, one of the biggest ...
> stochastic elements of scheduler balancing is wakeup-preemption - which
> you can turn off via:
>
> echo NO_WAKEUP_PREEMPTION > /debug/sched/features
>
> or:
>
> echo NO_WAKEUP_PREEMPTION > /sys/kernel/debug/sched/features
>
> If you can measure a performance regression with WAKEUP_PREEMPTION turned
> off in *both* kernels, there's likely a material change (regression) in the
> quality of NUMA load-balancing.

This is the one that is the case. In 76cae9dbe185, which is the parent of
e8f331bcc270, there are some runtimes that are a bit slower than with
WAKEUP_PREEMPTION enabled, but e8f331bcc270 now has a lot more large
numbers.

julia

::::::::::::::
ua.C.x_yeti-4_g76cae9dbe185_performance.json
::::::::::::::
{
"results": [
{
"command": "./ua.C.x",
"mean": 31.74652352868501,
"stddev": 7.549670079336034,
"median": 30.825414389609996,
"user": 4026.9165747550005,
"system": 10.66025179,
"min": 21.93520805911,
"max": 60.24540388411,
"times": [
29.249717538109998,
25.16879339411,
27.684410376109998,
25.03525483911,
28.33494802611,
35.398653784109996,
31.886805430109998,
33.34682179411,
35.09637591511,
28.834901557109998,
37.71707762411,
50.09627815011,
29.848774250109997,
33.66924291011,
25.62988106911,
21.93520805911,
37.52640704311,
26.35386307811,
23.05612102511,
25.65851957311,
33.62976770911,
22.55545402511,
35.509719604109996,
47.88084531411,
27.17976105411,
34.56864677911,
34.40073639211,
35.77985792611,
31.57792414811,
60.24540388411,
35.10386024211,
32.36256473411,
31.019663444109998,
25.05048613411,
30.631165335109998,
25.21739748711,
28.57051109611,
29.122454695109997,
31.79110048411,
26.13556522311
]
}
]
}
::::::::::::::
ua.C.x_yeti-4_ge8f331bcc270_performance.json
::::::::::::::
{
"results": [
{
"command": "./ua.C.x",
"mean": 45.55904025022,
"stddev": 24.917491037841696,
"median": 34.83258273512,
"user": 5760.045859355001,
"system": 10.244032225000002,
"min": 21.96719301362,
"max": 105.35666167362,
"times": [
80.22088338362,
34.76734203462,
32.01118466362,
105.35666167362,
28.23154239862,
39.79766051762,
26.89012012362,
21.96719301362,
25.05284109962,
62.19280101062,
84.22492245362,
78.83791121262,
25.67714166762,
76.34033861162,
27.57704435562,
27.83207362162,
30.93298156162,
31.29140204262,
38.02797884462,
23.80228286862,
91.19093656262,
41.32158529962,
36.27444925062,
28.47759006162,
36.42187360462,
26.13298492862,
32.64434456262,
25.03750352662,
42.02328407262,
25.30765174962,
37.82597961162,
34.89782343562,
73.64093796562,
34.05860726262,
78.25896451662,
27.36415754262,
35.27277725262,
27.48229668562,
85.76905357362,
101.92650138361999
]
}
]
}









>
> If it goes away or changes dramatically with WAKEUP_PREEMPTION off, then
> I'd pin this effect to EEVDF causing timing changes that are subtly
> shifting NUMA & SMP balancing decisions past some critical threshold that
> is detrimental to this particular workload.
>
> ( Obviously both are regressions we care about - but doing this test would
> help categorize the nature of the regression. )
>
> Thanks,
>
> Ingo
>

2023-10-09 10:31:08

by Peter Zijlstra

Subject: Re: EEVDF and NUMA balancing

On Wed, Oct 04, 2023 at 08:04:34PM +0200, Julia Lawall wrote:
> > I'll go build the benchmark thing tomorrow, if I can figure out how that
> > works, this NAS stuff looked 'special'. Nothing simple like ./configure;
> > make -j$lots :/
>
> Starting from git clone, I had to do:
>
> cd NPB3.4-OMP
> mkdir bin
> cd config
> cp make.def.template make.def
> cd ..
> make ua CLASS=C
>
> You also need gfortran to be installed.

W00t, that worked like a charm.

The sad news is that I can't seem to reproduce the issue:

So my (freshly re-installed with debian testing) 4 socket Intel(R)
Xeon(R) CPU E7-8890 v3 machine gives me:

root@hsw-ex:/usr/local/src/nas-parallel-benchmarks/NPB3.4-OMP/bin# uname -a
Linux hsw-ex 6.6.0-rc4+ #2 SMP PREEMPT_DYNAMIC Mon Oct 9 11:14:21 CEST 2023 x86_64 GNU/Linux
root@hsw-ex:/usr/local/src/nas-parallel-benchmarks/NPB3.4-OMP/bin# cat /proc/sys/kernel/numa_balancing
1
root@hsw-ex:/usr/local/src/nas-parallel-benchmarks/NPB3.4-OMP/bin# ./ua.C.x | grep "Time in seconds"
Time in seconds = 26.69
root@hsw-ex:/usr/local/src/nas-parallel-benchmarks/NPB3.4-OMP/bin# ./ua.C.x | grep "Time in seconds"
Time in seconds = 26.31
root@hsw-ex:/usr/local/src/nas-parallel-benchmarks/NPB3.4-OMP/bin# ./ua.C.x | grep "Time in seconds"
Time in seconds = 25.43


And this is using a .config very near what Debian ships for 6.5 (make
olddefconfig -CONFIG_DEBUG_INFO_BTF)

I'll try again in a little bit, perhaps I'm suffering PEBKAC :-)


2023-10-09 14:08:01

by Julia Lawall

Subject: Re: EEVDF and NUMA balancing



On Mon, 9 Oct 2023, Peter Zijlstra wrote:

> On Wed, Oct 04, 2023 at 08:04:34PM +0200, Julia Lawall wrote:
> > > I'll go build the benchmark thing tomorrow, if I can figure out how that
> > > works, this NAS stuff looked 'special'. Nothing simple like ./configure;
> > > make -j$lots :/
> >
> > Starting from git clone, I had to do:
> >
> > cd NPB3.4-OMP
> > mkdir bin
> > cd config
> > cp make.def.template make.def
> > cd ..
> > make ua CLASS=C
> >
> > You also need gfortran to be installed.
>
> W00t, that worked like a charm.
>
> The sad news is that I can't seem to reproduce the issue:
>
> So my (freshly re-installed with debian testing) 4 socket Intel(R)
> Xeon(R) CPU E7-8890 v3 machine gives me:
>
> root@hsw-ex:/usr/local/src/nas-parallel-benchmarks/NPB3.4-OMP/bin# uname -a
> Linux hsw-ex 6.6.0-rc4+ #2 SMP PREEMPT_DYNAMIC Mon Oct 9 11:14:21 CEST 2023 x86_64 GNU/Linux
> root@hsw-ex:/usr/local/src/nas-parallel-benchmarks/NPB3.4-OMP/bin# cat /proc/sys/kernel/numa_balancing
> 1
> root@hsw-ex:/usr/local/src/nas-parallel-benchmarks/NPB3.4-OMP/bin# ./ua.C.x | grep "Time in seconds"
> Time in seconds = 26.69
> root@hsw-ex:/usr/local/src/nas-parallel-benchmarks/NPB3.4-OMP/bin# ./ua.C.x | grep "Time in seconds"
> Time in seconds = 26.31
> root@hsw-ex:/usr/local/src/nas-parallel-benchmarks/NPB3.4-OMP/bin# ./ua.C.x | grep "Time in seconds"
> Time in seconds = 25.43

How many runs did you try? I would suggest 50.

25-26 looks like what I get when things go well.

julia

>
>
> And this is using a .config very near what Debian ships for 6.5 (make
> olddefconfig -CONFIG_DEBUG_INFO_BTF)
>
> I'll try again in a little bit, perhaps I'm suffering PEBKAC :-)
>
>
>

2023-12-18 13:59:17

by Julia Lawall

Subject: Re: EEVDF and NUMA balancing

Hello,

I have looked further into the NUMA balancing issue.

The context is that there are 2N threads running on 2N cores, one thread
gets NUMA balanced to the other socket, leaving N+1 threads on one socket
and N-1 threads on the other socket. This condition typically persists
for one or more seconds.

Previously, I reported this on a 4-socket machine, but it can also occur
on a 2-socket machine, with other tests from the NAS benchmark suite
(sp.B, bt.B, etc).

Since there are N+1 threads on one of the sockets, it would seem that load
balancing would quickly kick in to bring some thread back to socket with
only N-1 threads. This doesn't happen, though, because actually most of
the threads have some NUMA effects such that they have a preferred node.
So there is a high chance that an attempt to steal will fail, because both
threads have a preference for the socket.
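
A minimal stand-alone sketch of that refusal (the struct fields and the
can_steal() helper are made up here; the real check in can_migrate_task()
is more involved): a task that prefers the node it is already on can only
be pulled away once the domain has accumulated enough balance failures.

#include <stdbool.h>
#include <stdio.h>

struct task_info   { int cur_node; int preferred_node; };
struct domain_info { int nr_balance_failed; int cache_nice_tries; };

/* pulling a task off its preferred node is refused until the domain has
 * failed to balance more than cache_nice_tries times */
static bool can_steal(const struct task_info *p,
                      const struct domain_info *sd, int dst_node)
{
        bool degrades_locality = (p->preferred_node == p->cur_node) &&
                                 (dst_node != p->preferred_node);

        if (!degrades_locality)
                return true;
        return sd->nr_balance_failed > sd->cache_nice_tries;
}

int main(void)
{
        struct task_info   t  = { .cur_node = 1, .preferred_node = 1 };
        struct domain_info sd = { .nr_balance_failed = 0, .cache_nice_tries = 1 };

        /* both threads on the doubly-loaded CPU look like t, so a pull
         * towards the idle socket (node 0) is refused */
        printf("steal to node 0 allowed: %s\n",
               can_steal(&t, &sd, 0) ? "yes" : "no");
        return 0;
}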

At this point, the only hope is active balancing. However, triggering
active balancing requires the success of the following condition in
imbalanced_active_balance:

if ((env->migration_type == migrate_task) &&
(sd->nr_balance_failed > sd->cache_nice_tries+2))

sd->nr_balance_failed does not increase because the core is idle. When a
core is idle, it comes to the load_balance function from schedule() through
newidle_balance. newidle_balance always sends in the flag CPU_NEWLY_IDLE,
even if the core has been idle for a long time.
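
The consequence can be seen with a toy model (plain userspace C, arbitrary
constants, nothing here is kernel code): as long as the failed attempts
keep arriving as CPU_NEWLY_IDLE, the counter never crosses the
cache_nice_tries+2 threshold that imbalanced_active_balance requires.

#include <stdbool.h>
#include <stdio.h>

enum cpu_idle_type { CPU_IDLE, CPU_NEWLY_IDLE };

struct domain_info { int nr_balance_failed; int cache_nice_tries; };

/* a failed pull only bumps the counter when it was not a newidle balance */
static void failed_attempt(struct domain_info *sd, enum cpu_idle_type idle)
{
        if (idle != CPU_NEWLY_IDLE)
                sd->nr_balance_failed++;
}

static bool active_balance_allowed(const struct domain_info *sd)
{
        return sd->nr_balance_failed > sd->cache_nice_tries + 2;
}

int main(void)
{
        struct domain_info sd = { .nr_balance_failed = 0, .cache_nice_tries = 1 };

        /* mostly newidle attempts, with only an occasional periodic one */
        for (int i = 0; i < 100; i++)
                failed_attempt(&sd, (i % 50 == 0) ? CPU_IDLE : CPU_NEWLY_IDLE);

        printf("nr_balance_failed = %d, active balance allowed: %s\n",
               sd.nr_balance_failed,
               active_balance_allowed(&sd) ? "yes" : "no");
        return 0;
}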

Changing newidle_balance to use CPU_IDLE rather than CPU_NEWLY_IDLE when
the core was already idle before the call to schedule() is not enough
though, because there is also the constraint on the migration type. That
turns out to be (mostly?) migrate_util. Removing the following
code from find_busiest_queue:

/*
* Don't try to pull utilization from a CPU with one
* running task. Whatever its utilization, we will fail
* detach the task.
*/
if (nr_running <= 1)
continue;

and changing the above test to:

if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
(sd->nr_balance_failed > sd->cache_nice_tries+2))

seems to solve the problem.

I will test this on more applications. But let me know if the above
solution seems completely inappropriate. Maybe it violates some other
constraints.

I have no idea why this problem became more visible with EEVDF. It seems
to have to do with the time slices all turning out to be the same. I got
the same behavior in 6.5 by overwriting the timeslice calculation to
always return 1. But I don't see the connection between the timeslice and
the behavior of the idle task.

thanks,
julia

2023-12-18 17:20:55

by Vincent Guittot

Subject: Re: EEVDF and NUMA balancing

On Mon, 18 Dec 2023 at 14:58, Julia Lawall <[email protected]> wrote:
>
> Hello,
>
> I have looked further into the NUMA balancing issue.
>
> The context is that there are 2N threads running on 2N cores, one thread
> gets NUMA balanced to the other socket, leaving N+1 threads on one socket
> and N-1 threads on the other socket. This condition typically persists
> for one or more seconds.
>
> Previously, I reported this on a 4-socket machine, but it can also occur
> on a 2-socket machine, with other tests from the NAS benchmark suite
> (sp.B, bt.B, etc).
>
> Since there are N+1 threads on one of the sockets, it would seem that load
> balancing would quickly kick in to bring some thread back to socket with
> only N-1 threads. This doesn't happen, though, because actually most of
> the threads have some NUMA effects such that they have a preferred node.
> So there is a high chance that an attempt to steal will fail, because both
> threads have a preference for the socket.
>
> At this point, the only hope is active balancing. However, triggering
> active balancing requires the success of the following condition in
> imbalanced_active_balance:
>
> if ((env->migration_type == migrate_task) &&
> (sd->nr_balance_failed > sd->cache_nice_tries+2))
>
> sd->nr_balance_failed does not increase because the core is idle. When a
> core is idle, it comes to the load_balance function from schedule() through
> newidle_balance. newidle_balance always sends in the flag CPU_NEWLY_IDLE,
> even if the core has been idle for a long time.

Do you mean that you never kick a normal idle load balance ?

>
> Changing newidle_balance to use CPU_IDLE rather than CPU_NEWLY_IDLE when
> the core was already idle before the call to schedule() is not enough
> though, because there is also the constraint on the migration type. That
> turns out to be (mostly?) migrate_util. Removing the following
> code from find_busiest_queue:
>
> /*
> * Don't try to pull utilization from a CPU with one
> * running task. Whatever its utilization, we will fail
> * detach the task.
> */
> if (nr_running <= 1)
> continue;

I'm surprised that load_balance wants to "migrate_util" instead of
"migrate_task"

You have N+1 threads on a group of 2N CPUs so you should have at most
1 thread per CPU in your busiest group. In theory you should have the
local "group_has_spare" and the busiest "group_fully_busy" (at most).
This means that no group should be overloaded and load_balance should
not try to migrate util but only tasks.


>
> and changing the above test to:
>
> if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
> (sd->nr_balance_failed > sd->cache_nice_tries+2))
>
> seems to solve the problem.
>
> I will test this on more applications. But let me know if the above
> solution seems completely inappropriate. Maybe it violates some other
> constraints.
>
> I have no idea why this problem became more visible with EEVDF. It seems
> to have to do with the time slices all turning out to be the same. I got
> the same behavior in 6.5 by overwriting the timeslice calculation to
> always return 1. But I don't see the connection between the timeslice and
> the behavior of the idle task.
>
> thanks,
> julia

2023-12-18 22:31:59

by Julia Lawall

Subject: Re: EEVDF and NUMA balancing



On Mon, 18 Dec 2023, Vincent Guittot wrote:

> On Mon, 18 Dec 2023 at 14:58, Julia Lawall <[email protected]> wrote:
> >
> > Hello,
> >
> > I have looked further into the NUMA balancing issue.
> >
> > The context is that there are 2N threads running on 2N cores, one thread
> > gets NUMA balanced to the other socket, leaving N+1 threads on one socket
> > and N-1 threads on the other socket. This condition typically persists
> > for one or more seconds.
> >
> > Previously, I reported this on a 4-socket machine, but it can also occur
> > on a 2-socket machine, with other tests from the NAS benchmark suite
> > (sp.B, bt.B, etc).
> >
> > Since there are N+1 threads on one of the sockets, it would seem that load
> > balancing would quickly kick in to bring some thread back to socket with
> > only N-1 threads. This doesn't happen, though, because actually most of
> > the threads have some NUMA effects such that they have a preferred node.
> > So there is a high chance that an attempt to steal will fail, because both
> > threads have a preference for the socket.
> >
> > At this point, the only hope is active balancing. However, triggering
> > active balancing requires the success of the following condition in
> > imbalanced_active_balance:
> >
> > if ((env->migration_type == migrate_task) &&
> > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> >
> > sd->nr_balance_failed does not increase because the core is idle. When a
> > core is idle, it comes to the load_balance function from schedule() through
> > newidle_balance. newidle_balance always sends in the flag CPU_NEWLY_IDLE,
> > even if the core has been idle for a long time.
>
> Do you mean that you never kick a normal idle load balance ?

OK, it seems that both happen, at different times. But the calls to
trigger_load_balance seem to rarely do more than the SMT level.

I have attached part of a trace in which I print various things that
happen during the idle period.

>
> >
> > Changing newidle_balance to use CPU_IDLE rather than CPU_NEWLY_IDLE when
> > the core was already idle before the call to schedule() is not enough
> > though, because there is also the constraint on the migration type. That
> > turns out to be (mostly?) migrate_util. Removing the following
> > code from find_busiest_queue:
> >
> > /*
> > * Don't try to pull utilization from a CPU with one
> > * running task. Whatever its utilization, we will fail
> > * detach the task.
> > */
> > if (nr_running <= 1)
> > continue;
>
> I'm surprised that load_balance wants to "migrate_util" instead of
> "migrate_task"

In the attached trace, there are 147 occurrences of migrate_util, and 3
occurrences of migrate_task. But even when migrate_task appears, the
counter has gotten knocked back down, due to the calls to newidle_balance.

> You have N+1 threads on a group of 2N CPUs so you should have at most
> 1 thread per CPU in your busiest group.

One CPU has 2 threads, and the others have one. The one with two threads
is returned as the busiest one. But nothing happens, because both of them
prefer the socket that they are on.

> In theory you should have the
> local "group_has_spare" and the busiest "group_fully_busy" (at most).
> This means that no group should be overloaded and load_balance should
> not try to migrate util but only tasks.

I didn't collect information about the groups. I will look into that.

julia

>
>
> >
> > and changing the above test to:
> >
> > if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
> > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> >
> > seems to solve the problem.
> >
> > I will test this on more applications. But let me know if the above
> > solution seems completely inappropriate. Maybe it violates some other
> > constraints.
> >
> > I have no idea why this problem became more visible with EEVDF. It seems
> > to have to do with the time slices all turning out to be the same. I got
> > the same behavior in 6.5 by overwriting the timeslice calculation to
> > always return 1. But I don't see the connection between the timeslice and
> > the behavior of the idle task.
> >
> > thanks,
> > julia
>


Attachments:
tt (305.73 kB)

2023-12-19 17:39:29

by Vincent Guittot

Subject: Re: EEVDF and NUMA balancing

On Mon, 18 Dec 2023 at 23:31, Julia Lawall <[email protected]> wrote:
>
>
>
> On Mon, 18 Dec 2023, Vincent Guittot wrote:
>
> > On Mon, 18 Dec 2023 at 14:58, Julia Lawall <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > > I have looked further into the NUMA balancing issue.
> > >
> > > The context is that there are 2N threads running on 2N cores, one thread
> > > gets NUMA balanced to the other socket, leaving N+1 threads on one socket
> > > and N-1 threads on the other socket. This condition typically persists
> > > for one or more seconds.
> > >
> > > Previously, I reported this on a 4-socket machine, but it can also occur
> > > on a 2-socket machine, with other tests from the NAS benchmark suite
> > > (sp.B, bt.B, etc).
> > >
> > > Since there are N+1 threads on one of the sockets, it would seem that load
> > > balancing would quickly kick in to bring some thread back to socket with
> > > only N-1 threads. This doesn't happen, though, because actually most of
> > > the threads have some NUMA effects such that they have a preferred node.
> > > So there is a high chance that an attempt to steal will fail, because both
> > > threads have a preference for the socket.
> > >
> > > At this point, the only hope is active balancing. However, triggering
> > > active balancing requires the success of the following condition in
> > > imbalanced_active_balance:
> > >
> > > if ((env->migration_type == migrate_task) &&
> > > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > >
> > > sd->nr_balance_failed does not increase because the core is idle. When a
> > > core is idle, it comes to the load_balance function from schedule() through
> > > newidle_balance. newidle_balance always sends in the flag CPU_NEWLY_IDLE,
> > > even if the core has been idle for a long time.
> >
> > Do you mean that you never kick a normal idle load balance ?
>
> OK, it seems that both happen, at different times. But the calls to
> trigger_load_balance seem to rarely do more than the SMT level.

yes, the min period is equal to "cpumask_weight of sched_domain" ms, 2
ms at SMT level and 2N ms at numa level.

>
> I have attached part of a trace in which I print various things that
> happen during the idle period.
>
> >
> > >
> > > Changing newidle_balance to use CPU_IDLE rather than CPU_NEWLY_IDLE when
> > > the core was already idle before the call to schedule() is not enough
> > > though, because there is also the constraint on the migration type. That
> > > turns out to be (mostly?) migrate_util. Removing the following
> > > code from find_busiest_queue:
> > >
> > > /*
> > > * Don't try to pull utilization from a CPU with one
> > > * running task. Whatever its utilization, we will fail
> > > * detach the task.
> > > */
> > > if (nr_running <= 1)
> > > continue;
> >
> > I'm surprised that load_balance wants to "migrate_util" instead of
> > "migrate_task"
>
> In the attached trace, there are 147 occurrences of migrate_util, and 3
> occurrences of migrate_task. But even when migrate_task appears, the
> counter has gotten knocked back down, due to the calls to newidle_balance.
>
> > You have N+1 threads on a group of 2N CPUs so you should have at most
> > 1 thread per CPU in your busiest group.
>
> One CPU has 2 threads, and the others have one. The one with two threads
> is returned as the busiest one. But nothing happens, because both of them
> prefer the socket that they are on.

This explains why load_balance uses migrate_util and not migrate_task.
One CPU with 2 threads can be overloaded.

ok, so it seems that your 1st problem is that you have 2 threads on
the same CPU whereas you should have an idle core in this numa node.
All cores are sharing the same LLC, aren't they ?

You should not have more than 1 thread per CPU when there are N+1
threads on a node with N cores / 2N CPUs. This will enable the
load_balance to try to migrate a task instead of some util(ization)
and you should reach the active load balance.

>
> > In theory you should have the
> > local "group_has_spare" and the busiest "group_fully_busy" (at most).
> > This means that no group should be overloaded and load_balance should
> > not try to migrate util but only tasks.
>
> I didn't collect information about the groups. I will look into that.
>
> julia
>
> >
> >
> > >
> > > and changing the above test to:
> > >
> > > if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
> > > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > >
> > > seems to solve the problem.
> > >
> > > I will test this on more applications. But let me know if the above
> > > solution seems completely inappropriate. Maybe it violates some other
> > > constraints.
> > >
> > > I have no idea why this problem became more visible with EEVDF. It seems
> > > to have to do with the time slices all turning out to be the same. I got
> > > the same behavior in 6.5 by overwriting the timeslice calculation to
> > > always return 1. But I don't see the connection between the timeslice and
> > > the behavior of the idle task.
> > >
> > > thanks,
> > > julia
> >

2023-12-19 17:57:12

by Julia Lawall

Subject: Re: EEVDF and NUMA balancing

> > One CPU has 2 threads, and the others have one. The one with two threads
> > is returned as the busiest one. But nothing happens, because both of them
> > prefer the socket that they are on.
>
> This explains why load_balance uses migrate_util and not migrate_task.
> One CPU with 2 threads can be overloaded
>
> ok, so it seems that your 1st problem is that you have 2 threads on
> the same CPU whereas you should have an idle core in this numa node.
> All cores are sharing the same LLC, aren't they ?

Sorry, not following this.

Socket 1 has N-1 threads, and thus an idle CPU.
Socket 2 has N+1 threads, and thus one CPU with two threads.

Socket 1 tries to steal from that one CPU with two threads, but that
fails, because both threads prefer being on Socket 2.

Since most (or all?) of the threads on Socket 2 prefer being on Socket 2,
the only hope for Socket 1 to fill in its idle core is active balancing.
But active balancing is not triggered because of migrate_util and because
CPU_NEWLY_IDLE prevents the failure counter from being increased.

The part that I currently don't understand is that when I convert
CPU_NEWLY_IDLE to CPU_IDLE, it typically picks a CPU with only one thread
as busiest. I have the impression that the fbq_type intervenes to cause
it to avoid the CPU with two threads that already prefer Socket 2. But I
don't know at the moment why that is the case. In any case, it's fine to
active balance from a CPU with only one thread, because Socket 2 will
even itself out afterwards.

>
> You should not have more than 1 thread per CPU when there are N+1
> threads on a node with N cores / 2N CPUs.

Hmm, I think there is a miscommunication about cores and CPUs. The
machine has two sockets with 16 physical cores each, and thus 32
hyperthreads per socket. There are 64 threads running.

julia

> This will enable the
> load_balance to try to migrate a task instead of some util(ization)
> and you should reach the active load balance.
>
> >
> > > In theory you should have the
> > > local "group_has_spare" and the busiest "group_fully_busy" (at most).
> > > This means that no group should be overloaded and load_balance should
> > > not try to migrate util but only tasks.
> >
> > I didn't collect information about the groups. I will look into that.
> >
> > julia
> >
> > >
> > >
> > > >
> > > > and changing the above test to:
> > > >
> > > > if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
> > > > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > >
> > > > seems to solve the problem.
> > > >
> > > > I will test this on more applications. But let me know if the above
> > > > solution seems completely inappropriate. Maybe it violates some other
> > > > constraints.
> > > >
> > > > I have no idea why this problem became more visible with EEVDF. It seems
> > > > to have to do with the time slices all turning out to be the same. I got
> > > > the same behavior in 6.5 by overwriting the timeslice calculation to
> > > > always return 1. But I don't see the connection between the timeslice and
> > > > the behavior of the idle task.
> > > >
> > > > thanks,
> > > > julia
> > >
>

2023-12-20 16:39:46

by Julia Lawall

Subject: Re: EEVDF and NUMA balancing



On Tue, 19 Dec 2023, Vincent Guittot wrote:

> On Mon, 18 Dec 2023 at 23:31, Julia Lawall <[email protected]> wrote:
> >
> >
> >
> > On Mon, 18 Dec 2023, Vincent Guittot wrote:
> >
> > > On Mon, 18 Dec 2023 at 14:58, Julia Lawall <[email protected]> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I have looked further into the NUMA balancing issue.
> > > >
> > > > The context is that there are 2N threads running on 2N cores, one thread
> > > > gets NUMA balanced to the other socket, leaving N+1 threads on one socket
> > > > and N-1 threads on the other socket. This condition typically persists
> > > > for one or more seconds.
> > > >
> > > > Previously, I reported this on a 4-socket machine, but it can also occur
> > > > on a 2-socket machine, with other tests from the NAS benchmark suite
> > > > (sp.B, bt.B, etc).
> > > >
> > > > Since there are N+1 threads on one of the sockets, it would seem that load
> > > > balancing would quickly kick in to bring some thread back to socket with
> > > > only N-1 threads. This doesn't happen, though, because actually most of
> > > > the threads have some NUMA effects such that they have a preferred node.
> > > > So there is a high chance that an attempt to steal will fail, because both
> > > > threads have a preference for the socket.
> > > >
> > > > At this point, the only hope is active balancing. However, triggering
> > > > active balancing requires the success of the following condition in
> > > > imbalanced_active_balance:
> > > >
> > > > if ((env->migration_type == migrate_task) &&
> > > > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > >
> > > > sd->nr_balance_failed does not increase because the core is idle. When a
> > > > core is idle, it comes to the load_balance function from schedule() through
> > > > newidle_balance. newidle_balance always sends in the flag CPU_NEWLY_IDLE,
> > > > even if the core has been idle for a long time.
> > >
> > > Do you mean that you never kick a normal idle load balance ?
> >
> > OK, it seems that both happen, at different times. But the calls to
> > trigger_load_balance seem to rarely do more than the SMT level.
>
> yes, the min period is equal to "cpumask_weight of sched_domain" ms, 2
> ms at SMT level and 2N ms at numa level.
>
> >
> > I have attached part of a trace in which I print various things that
> > happen during the idle period.
> >
> > >
> > > >
> > > > Changing newidle_balance to use CPU_IDLE rather than CPU_NEWLY_IDLE when
> > > > the core was already idle before the call to schedule() is not enough
> > > > though, because there is also the constraint on the migration type. That
> > > > turns out to be (mostly?) migrate_util. Removing the following
> > > > code from find_busiest_queue:
> > > >
> > > > /*
> > > > * Don't try to pull utilization from a CPU with one
> > > > * running task. Whatever its utilization, we will fail
> > > > * detach the task.
> > > > */
> > > > if (nr_running <= 1)
> > > > continue;
> > >
> > > I'm surprised that load_balance wants to "migrate_util" instead of
> > > "migrate_task"
> >
> > In the attached trace, there are 147 occurrences of migrate_util, and 3
> > occurrences of migrate_task. But even when migrate_task appears, the
> > counter has gotten knocked back down, due to the calls to newidle_balance.
> >
> > > You have N+1 threads on a group of 2N CPUs so you should have at most
> > > 1 thread per CPU in your busiest group.
> >
> > One CPU has 2 threads, and the others have one. The one with two threads
> > is returned as the busiest one. But nothing happens, because both of them
> > prefer the socket that they are on.
>
> This explains why load_balance uses migrate_util and not migrate_task.
> One CPU with 2 threads can be overloaded

The node with N-1 tasks (and thus an empty core) is categorized as
group_has_spare and the one with N+1 tasks (and thus one core with 2
tasks and N-1 cores with 1 task) is categorized as group_overloaded. This
seems reasonable, and based on these values the conditions hold for
migrate_util to be chosen.

I tried just extending the test in imbalanced_active_balance to also
accept migrate_util, but the sd->nr_balance_failed still goes up too
slowly due to the many calls from newidle_balance.

julia

>
> ok, so it seems that your 1st problem is that you have 2 threads on
> the same CPU whereas you should have an idle core in this numa node.
> All cores are sharing the same LLC, aren't they ?
>
> You should not have more than 1 thread per CPU when there are N+1
> threads on a node with N cores / 2N CPUs. This will enable the
> load_balance to try to migrate a task instead of some util(ization)
> and you should reach the active load balance.
>
> >
> > > In theory you should have the
> > > local "group_has_spare" and the busiest "group_fully_busy" (at most).
> > > This means that no group should be overloaded and load_balance should
> > > not try to migrate util but only tasks.
> >
> > I didn't collect information about the groups. I will look into that.
> >
> > julia
> >
> > >
> > >
> > > >
> > > > and changing the above test to:
> > > >
> > > > if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
> > > > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > >
> > > > seems to solve the problem.
> > > >
> > > > I will test this on more applications. But let me know if the above
> > > > solution seems completely inappropriate. Maybe it violates some other
> > > > constraints.
> > > >
> > > > I have no idea why this problem became more visible with EEVDF. It seems
> > > > to have to do with the time slices all turning out to be the same. I got
> > > > the same behavior in 6.5 by overwriting the timeslice calculation to
> > > > always return 1. But I don't see the connection between the timeslice and
> > > > the behavior of the idle task.
> > > >
> > > > thanks,
> > > > julia
> > >
>

2023-12-20 17:09:48

by Vincent Guittot

Subject: Re: EEVDF and NUMA balancing

On Tue, 19 Dec 2023 at 18:51, Julia Lawall <[email protected]> wrote:
>
> > > One CPU has 2 threads, and the others have one. The one with two threads
> > > is returned as the busiest one. But nothing happens, because both of them
> > > prefer the socket that they are on.
> >
> > This explains why load_balance uses migrate_util and not migrate_task.
> > One CPU with 2 threads can be overloaded
> >
> > ok, so it seems that your 1st problem is that you have 2 threads on
> > the same CPU whereas you should have an idle core in this numa node.
> > All cores are sharing the same LLC, aren't they ?
>
> Sorry, not following this.
>
> Socket 1 has N-1 threads, and thus an idle CPU.
> Socket 2 has N+1 threads, and thus one CPU with two threads.
>
> Socket 1 tries to steal from that one CPU with two threads, but that
> fails, because both threads prefer being on Socket 2.
>
> Since most (or all?) of the threads on Socket 2 prefer being on Socket 2,
> the only hope for Socket 1 to fill in its idle core is active balancing.
> But active balancing is not triggered because of migrate_util and because
> CPU_NEWLY_IDLE prevents the failure counter from being increased.

CPU_NEWLY_IDLE load_balance doesn't aim to do active load balance so
you should focus on the CPU_NEWLY_IDLE load_balance

>
> The part that I currently don't understand is that when I convert
> CPU_NEWLY_IDLE to CPU_IDLE, it typically picks a CPU with only one thread
> as busiest. I have the impression that the fbq_type intervenes to cause

find_busiest_queue skips rqs which only have threads preferring being
in there. So it selects another rq with a thread that doesn't prefer
its current node.

do you know what is the value of env->fbq_type ?
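
For what it's worth, a rough self-contained model of that filtering (the
field names are simplified and this only approximates the fbq_type logic
in fair.c): a runqueue whose tasks all prefer the node they are on
classifies as "all" and gets skipped whenever the balance pass is only
interested in "regular" or "remote" tasks.

#include <stdio.h>

enum fbq_type { regular, remote, all };

struct rq_summary {
        int nr_running;            /* runnable tasks on this CPU            */
        int nr_numa_running;       /* of those, tasks with a preferred node */
        int nr_preferred_running;  /* of those, tasks already on it         */
};

static enum fbq_type classify(const struct rq_summary *rq)
{
        if (rq->nr_running > rq->nr_numa_running)
                return regular;    /* has tasks with no NUMA preference    */
        if (rq->nr_running > rq->nr_preferred_running)
                return remote;     /* has tasks preferring another node    */
        return all;                /* every task prefers this node         */
}

int main(void)
{
        /* the CPU with two threads that both prefer the node they are on */
        struct rq_summary busy = { 2, 2, 2 };
        enum fbq_type wanted = regular;   /* what the balance pass asks for */

        if (classify(&busy) > wanted)
                printf("skipped: no task can be taken without breaking its NUMA preference\n");
        return 0;
}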

need_active_balance() probably needs a new condition for the numa case
where the busiest queue can't be selected and we have to trigger an
active load_balance on a rq with only 1 thread but that is not running
on its preferred node. Something like the untested below :

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e5da5eaab6ce..de1474191488 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11150,6 +11150,24 @@ imbalanced_active_balance(struct lb_env *env)
         return 0;
 }

+static inline bool
+numa_active_balance(struct lb_env *env)
+{
+        struct sched_domain *sd = env->sd;
+
+        /*
+         * We tried to migrate only a !numa task or a task on wrong node but
+         * the busiest queue with such task has only 1 running task. Previous
+         * attempt has failed so force the migration of such task.
+         */
+        if ((env->fbq_type < all) &&
+            (env->src_rq->cfs.h_nr_running == 1) &&
+            (sd->nr_balance_failed > 0))
+                return 1;
+
+        return 0;
+}
+
 static int need_active_balance(struct lb_env *env)
 {
         struct sched_domain *sd = env->sd;
@@ -11176,6 +11194,9 @@ static int need_active_balance(struct lb_env *env)
         if (env->migration_type == migrate_misfit)
                 return 1;

+        if (numa_active_balance(env))
+                return 1;
+
         return 0;
 }


> it to avoid the CPU with two threads that already prefer Socket 2. But I
> don't know at the moment why that is the case. In any case, it's fine to
> active balance from a CPU with only one thread, because Socket 2 will
> even itself out afterwards.
>
> >
> > You should not have more than 1 thread per CPU when there are N+1
> > threads on a node with N cores / 2N CPUs.
>
> Hmm, I think there is a miscommunication about cores and CPUs. The
> machine has two sockets with 16 physical cores each, and thus 32
> hyperthreads per socket. There are 64 threads running.

Ok, I have been confused by what you wrote previously:
" The context is that there are 2N threads running on 2N cores, one thread
gets NUMA balanced to the other socket, leaving N+1 threads on one socket
and N-1 threads on the other socket."

I have assumed that there were N cores and 2N CPUs per socket as you
mentioned Intel Xeon 6130 in the commit message . My previous emails
don't apply at all with N CPUs per socket and the group_overloaded is
correct.



>
> julia
>
> > This will enable the
> > load_balance to try to migrate a task instead of some util(ization)
> > and you should reach the active load balance.
> >
> > >
> > > > In theory you should have the
> > > > local "group_has_spare" and the busiest "group_fully_busy" (at most).
> > > > This means that no group should be overloaded and load_balance should
> > > > not try to migrate util but only tasks.
> > >
> > > I didn't collect information about the groups. I will look into that.
> > >
> > > julia
> > >
> > > >
> > > >
> > > > >
> > > > > and changing the above test to:
> > > > >
> > > > > if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
> > > > > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > > >
> > > > > seems to solve the problem.
> > > > >
> > > > > I will test this on more applications. But let me know if the above
> > > > > solution seems completely inappropriate. Maybe it violates some other
> > > > > constraints.
> > > > >
> > > > > I have no idea why this problem became more visible with EEVDF. It seems
> > > > > to have to do with the time slices all turning out to be the same. I got
> > > > > the same behavior in 6.5 by overwriting the timeslice calculation to
> > > > > always return 1. But I don't see the connection between the timeslice and
> > > > > the behavior of the idle task.
> > > > >
> > > > > thanks,
> > > > > julia
> > > >
> >

2023-12-20 17:11:59

by Vincent Guittot

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing

On Wed, 20 Dec 2023 at 17:39, Julia Lawall <[email protected]> wrote:
>
>
>
> On Tue, 19 Dec 2023, Vincent Guittot wrote:
>
> > On Mon, 18 Dec 2023 at 23:31, Julia Lawall <[email protected]> wrote:
> > >
> > >
> > >
> > > On Mon, 18 Dec 2023, Vincent Guittot wrote:
> > >
> > > > On Mon, 18 Dec 2023 at 14:58, Julia Lawall <[email protected]> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > I have looked further into the NUMA balancing issue.
> > > > >
> > > > > The context is that there are 2N threads running on 2N cores, one thread
> > > > > gets NUMA balanced to the other socket, leaving N+1 threads on one socket
> > > > > and N-1 threads on the other socket. This condition typically persists
> > > > > for one or more seconds.
> > > > >
> > > > > Previously, I reported this on a 4-socket machine, but it can also occur
> > > > > on a 2-socket machine, with other tests from the NAS benchmark suite
> > > > > (sp.B, bt.B, etc).
> > > > >
> > > > > Since there are N+1 threads on one of the sockets, it would seem that load
> > > > > balancing would quickly kick in to bring some thread back to socket with
> > > > > only N-1 threads. This doesn't happen, though, because actually most of
> > > > > the threads have some NUMA effects such that they have a preferred node.
> > > > > So there is a high chance that an attempt to steal will fail, because both
> > > > > threads have a preference for the socket.
> > > > >
> > > > > At this point, the only hope is active balancing. However, triggering
> > > > > active balancing requires the success of the following condition in
> > > > > imbalanced_active_balance:
> > > > >
> > > > > if ((env->migration_type == migrate_task) &&
> > > > > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > > >
> > > > > sd->nr_balance_failed does not increase because the core is idle. When a
> > > > > core is idle, it comes to the load_balance function from schedule() though
> > > > > newidle_balance. newidle_balance always sends in the flag CPU_NEWLY_IDLE,
> > > > > even if the core has been idle for a long time.
> > > >
> > > > Do you mean that you never kick a normal idle load balance ?
> > >
> > > OK, it seems that both happen, at different times. But the calls to
> > > trigger_load_balance seem to rarely do more than the SMT level.
> >
> > yes, the min period is equal to "cpumask_weight of sched_domain" ms, 2
> > ms at SMT level and 2N ms at numa level.
> >
> > >
> > > I have attached part of a trace in which I print various things that
> > > happen during the idle period.
> > >
> > > >
> > > > >
> > > > > Changing newidle_balance to use CPU_IDLE rather than CPU_NEWLY_IDLE when
> > > > > the core was already idle before the call to schedule() is not enough
> > > > > though, because there is also the constraint on the migration type. That
> > > > > turns out to be (mostly?) migrate_util. Removing the following
> > > > > code from find_busiest_queue:
> > > > >
> > > > > /*
> > > > > * Don't try to pull utilization from a CPU with one
> > > > > * running task. Whatever its utilization, we will fail
> > > > > * detach the task.
> > > > > */
> > > > > if (nr_running <= 1)
> > > > > continue;
> > > >
> > > > I'm surprised that load_balance wants to "migrate_util" instead of
> > > > "migrate_task"
> > >
> > > In the attached trace, there are 147 occurrences of migrate_util, and 3
> > > occurrences of migrate_task. But even when migrate_task appears, the
> > > counter has gotten knocked back down, due to the calls to newidle_balance.
> > >
> > > > You have N+1 threads on a group of 2N CPUs so you should have at most
> > > > 1 thread per CPUs in your busiest group.
> > >
> > > One CPU has 2 threads, and the others have one. The one with two threads
> > > is returned as the busiest one. But nothing happens, because both of them
> > > prefer the socket that they are on.
> >
> > This explains way load_balance uses migrate_util and not migrate_task.
> > One CPU with 2 threads can be overloaded
>
> The node with N-1 tasks (and thus an empty core) is categorized as
> group_has_spare and the one with N+1 tasks (and thus one core with 2
> tasks and N-1 cores with 1 task) is categorized as group_overloaded. This
> seems reasonable, and based on these values the conditions hold for
> migrate_util to be chosen.
>
> I tried just extending the test in imbalanced_active_balance to also
> accept migrate_util, but the sd->nr_balance_failed still goes up too
> slowly due to the many calls from newidle_balance.

As mentioned in my other reply, your problem comes from fbq_type, which
prevents you from getting the rq that triggered the group_overloaded
state and the use of migrate_util.

>
> julia
>
> >
> > ok, so it seems that your 1st problem is that you have 2 threads on
> > the same CPU whereas you should have an idle core in this numa node.
> > All cores are sharing the same LLC, aren't they ?
> >
> > You should not have more than 1 thread per CPU when there are N+1
> > threads on a node with N cores / 2N CPUs. This will enable the
> > load_balance to try to migrate a task instead of some util(ization)
> > and you should reach the active load balance.
> >
> > >
> > > > In theory you should have the
> > > > local "group_has_spare" and the busiest "group_fully_busy" (at most).
> > > > This means that no group should be overloaded and load_balance should
> > > > not try to migrate utli but only task
> > >
> > > I didn't collect information about the groups. I will look into that.
> > >
> > > julia
> > >
> > > >
> > > >
> > > > >
> > > > > and changing the above test to:
> > > > >
> > > > > if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
> > > > > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > > >
> > > > > seems to solve the problem.
> > > > >
> > > > > I will test this on more applications. But let me know if the above
> > > > > solution seems completely inappropriate. Maybe it violates some other
> > > > > constraints.
> > > > >
> > > > > I have no idea why this problem became more visible with EEVDF. It seems
> > > > > to have to do with the time slices all turning out to be the same. I got
> > > > > the same behavior in 6.5 by overwriting the timeslice calculation to
> > > > > always return 1. But I don't see the connection between the timeslice and
> > > > > the behavior of the idle task.
> > > > >
> > > > > thanks,
> > > > > julia
> > > >
> >

2023-12-21 18:21:18

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing



On Wed, 20 Dec 2023, Vincent Guittot wrote:

> On Tue, 19 Dec 2023 at 18:51, Julia Lawall <[email protected]> wrote:
> >
> > > > One CPU has 2 threads, and the others have one. The one with two threads
> > > > is returned as the busiest one. But nothing happens, because both of them
> > > > prefer the socket that they are on.
> > >
> > > This explains way load_balance uses migrate_util and not migrate_task.
> > > One CPU with 2 threads can be overloaded
> > >
> > > ok, so it seems that your 1st problem is that you have 2 threads on
> > > the same CPU whereas you should have an idle core in this numa node.
> > > All cores are sharing the same LLC, aren't they ?
> >
> > Sorry, not following this.
> >
> > Socket 1 has N-1 threads, and thus an idle CPU.
> > Socket 2 has N+1 threads, and thus one CPU with two threads.
> >
> > Socket 1 tries to steal from that one CPU with two threads, but that
> > fails, because both threads prefer being on Socket 2.
> >
> > Since most (or all?) of the threads on Socket 2 perfer being on Socket 2.
> > the only hope for Socket 1 to fill in its idle core is active balancing.
> > But active balancing is not triggered because of migrate_util and because
> > CPU_NEWLY_IDLE prevents the failure counter from ebing increased.
>
> CPU_NEWLY_IDLE load_balance doesn't aims to do active load balance so
> you should focus on the CPU_NEWLY_IDLE load_balance

I'm still perplexed why a core that has been idle for 1 second or more is
considered to be newly idle.

>
> >
> > The part that I am currently missing to understand is that when I convert
> > CPU_NEWLY_IDLE to CPU_IDLE, it typically picks a CPU with only one thread
> > as busiest. I have the impression that the fbq_type intervenes to cause
>
> find_busiest_queue skips rqs which only have threads preferring being
> in there. So it selects another rq with a thread that doesn't prefer
> its current node.
>
> do you know what is the value of env->fbq_type ?

I have seen one trace in which it is "all". There are 33 tasks on one
socket, and they are all considered to have a preference for that socket.

But I have another trace in which it is "regular". There are 33 tasks on
the socket, but only 32 have a preference.

>
> need_active_balance() probably needs a new condition for the numa case
> where the busiest queue can't be selected and we have to trigger an
> active load_balance on a rq with only 1 thread but that is not running
> on its preferred node. Something like the untested below :
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e5da5eaab6ce..de1474191488 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11150,6 +11150,24 @@ imbalanced_active_balance(struct lb_env *env)
> return 0;
> }
>
> +static inline bool
> +numa_active_balance(struct lb_env *env)
> +{
> + struct sched_domain *sd = env->sd;
> +
> + /*
> + * We tried to migrate only a !numa task or a task on wrong node but
> + * the busiest queue with such task has only 1 running task. Previous
> + * attempt has failed so force the migration of such task.
> + */
> + if ((env->fbq_type < all) &&
> + (env->src_rq->cfs.h_nr_running == 1) &&
> + (sd->nr_balance_failed > 0))

The last condition will still be a problem because of CPU_NEWLY_IDLE. The
nr_balance_failed counter doesn't get incremented very often.

julia

> + return 1;
> +
> + return 0;
> +}
> +
> static int need_active_balance(struct lb_env *env)
> {
> struct sched_domain *sd = env->sd;
> @@ -11176,6 +11194,9 @@ static int need_active_balance(struct lb_env *env)
> if (env->migration_type == migrate_misfit)
> return 1;
>
> + if (numa_active_balance(env))
> + return 1;
> +
> return 0;
> }
>
>
> > it to avoid the CPU with two threads that already prefer Socket 2. But I
> > don't know at the moment why that is the case. In any case, it's fine to
> > active balance from a CPU with only one thread, because Socket 2 will
> > even itself out afterwards.
> >
> > >
> > > You should not have more than 1 thread per CPU when there are N+1
> > > threads on a node with N cores / 2N CPUs.
> >
> > Hmm, I think there is a miscommunication about cores and CPUs. The
> > machine has two sockets with 16 physical cores each, and thus 32
> > hyperthreads. There are 64 threads running.
>
> Ok, I have been confused by what you wrote previously:
> " The context is that there are 2N threads running on 2N cores, one thread
> gets NUMA balanced to the other socket, leaving N+1 threads on one socket
> and N-1 threads on the other socket."
>
> I have assumed that there were N cores and 2N CPUs per socket as you
> mentioned Intel Xeon 6130 in the commit message . My previous emails
> don't apply at all with N CPUs per socket and the group_overloaded is
> correct.
>
>
>
> >
> > julia
> >
> > > This will enable the
> > > load_balance to try to migrate a task instead of some util(ization)
> > > and you should reach the active load balance.
> > >
> > > >
> > > > > In theory you should have the
> > > > > local "group_has_spare" and the busiest "group_fully_busy" (at most).
> > > > > This means that no group should be overloaded and load_balance should
> > > > > not try to migrate utli but only task
> > > >
> > > > I didn't collect information about the groups. I will look into that.
> > > >
> > > > julia
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > > and changing the above test to:
> > > > > >
> > > > > > if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
> > > > > > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > > > >
> > > > > > seems to solve the problem.
> > > > > >
> > > > > > I will test this on more applications. But let me know if the above
> > > > > > solution seems completely inappropriate. Maybe it violates some other
> > > > > > constraints.
> > > > > >
> > > > > > I have no idea why this problem became more visible with EEVDF. It seems
> > > > > > to have to do with the time slices all turning out to be the same. I got
> > > > > > the same behavior in 6.5 by overwriting the timeslice calculation to
> > > > > > always return 1. But I don't see the connection between the timeslice and
> > > > > > the behavior of the idle task.
> > > > > >
> > > > > > thanks,
> > > > > > julia
> > > > >
> > >
>

2023-12-22 14:56:20

by Vincent Guittot

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing

On Thu, 21 Dec 2023 at 19:20, Julia Lawall <[email protected]> wrote:
>
>
>
> On Wed, 20 Dec 2023, Vincent Guittot wrote:
>
> > On Tue, 19 Dec 2023 at 18:51, Julia Lawall <[email protected]> wrote:
> > >
> > > > > One CPU has 2 threads, and the others have one. The one with two threads
> > > > > is returned as the busiest one. But nothing happens, because both of them
> > > > > prefer the socket that they are on.
> > > >
> > > > This explains way load_balance uses migrate_util and not migrate_task.
> > > > One CPU with 2 threads can be overloaded
> > > >
> > > > ok, so it seems that your 1st problem is that you have 2 threads on
> > > > the same CPU whereas you should have an idle core in this numa node.
> > > > All cores are sharing the same LLC, aren't they ?
> > >
> > > Sorry, not following this.
> > >
> > > Socket 1 has N-1 threads, and thus an idle CPU.
> > > Socket 2 has N+1 threads, and thus one CPU with two threads.
> > >
> > > Socket 1 tries to steal from that one CPU with two threads, but that
> > > fails, because both threads prefer being on Socket 2.
> > >
> > > Since most (or all?) of the threads on Socket 2 perfer being on Socket 2.
> > > the only hope for Socket 1 to fill in its idle core is active balancing.
> > > But active balancing is not triggered because of migrate_util and because
> > > CPU_NEWLY_IDLE prevents the failure counter from ebing increased.
> >
> > CPU_NEWLY_IDLE load_balance doesn't aims to do active load balance so
> > you should focus on the CPU_NEWLY_IDLE load_balance
>
> I'm still perplexed why a core that has been idle for 1 second or more is
> considered to be newly idle.

CPU_NEWLY_IDLE load balance is called when the scheduler was scheduling
something that just migrated or went back to sleep and now has nothing
left to schedule, so it tries to pull a task from somewhere else.

But according to your description, where one CPU of the socket remains
idle, you should still get some CPU_IDLE load balance, and those attempts
will increase nr_balance_failed.
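
(To make the distinction concrete, paraphrasing from memory the relevant
bits of load_balance() and its callers in kernel/sched/fair.c:)

	/* schedule() found nothing to run: newidle_balance() pulls with CPU_NEWLY_IDLE */
	load_balance(this_cpu, this_rq, sd, CPU_NEWLY_IDLE, &continue_balancing);

	/* periodic SCHED_SOFTIRQ path (rebalance_domains): an idle CPU uses CPU_IDLE */
	enum cpu_idle_type idle = this_rq->idle_balance ? CPU_IDLE : CPU_NOT_IDLE;
	load_balance(cpu, rq, sd, idle, &continue_balancing);

	/* and in load_balance(), after a failed attempt: */
	if (idle != CPU_NEWLY_IDLE)
		sd->nr_balance_failed++;

so a failed newly-idle attempt never bumps nr_balance_failed; only the
periodic attempts do.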

I'm surprised that you have mainly CPU_NEWLY_IDLE. Do you know the reason?

>
> >
> > >
> > > The part that I am currently missing to understand is that when I convert
> > > CPU_NEWLY_IDLE to CPU_IDLE, it typically picks a CPU with only one thread
> > > as busiest. I have the impression that the fbq_type intervenes to cause
> >
> > find_busiest_queue skips rqs which only have threads preferring being
> > in there. So it selects another rq with a thread that doesn't prefer
> > its current node.
> >
> > do you know what is the value of env->fbq_type ?
>
> I have seen one trace in which it is all. There are 33 tasks on one
> socket, and they are all considered to have a preference for that socket.

With env->fbq_type == all, load_balance and find_busiest_queue should
be able to select the actual busiest queue with 2 threads.

But then I imagine that can_migrate / migrate_degrades_locality
prevents the task from being detached.
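
(Roughly, from memory of can_migrate_task() in kernel/sched/fair.c, this
is the test that refuses to detach a task that prefers its current node:)

	tsk_cache_hot = migrate_degrades_locality(p, env);
	if (tsk_cache_hot == -1)
		tsk_cache_hot = task_hot(p, env);

	/*
	 * Detach only if migration does not degrade NUMA locality, or if
	 * enough balance attempts have already failed on this domain.
	 */
	if (tsk_cache_hot <= 0 ||
	    env->sd->nr_balance_failed > env->sd->cache_nice_tries)
		return 1;

	return 0;

so as long as nr_balance_failed stays low, which it does when almost all
attempts are CPU_NEWLY_IDLE, a task that prefers the overloaded node is
never detached.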

>
> But I have another trace in which it is regular. There are 33 tasks on
> the socket, but only 32 have a preference.
>
> >
> > need_active_balance() probably needs a new condition for the numa case
> > where the busiest queue can't be selected and we have to trigger an
> > active load_balance on a rq with only 1 thread but that is not running
> > on its preferred node. Something like the untested below :
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e5da5eaab6ce..de1474191488 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11150,6 +11150,24 @@ imbalanced_active_balance(struct lb_env *env)
> > return 0;
> > }
> >
> > +static inline bool
> > +numa_active_balance(struct lb_env *env)
> > +{
> > + struct sched_domain *sd = env->sd;
> > +
> > + /*
> > + * We tried to migrate only a !numa task or a task on wrong node but
> > + * the busiest queue with such task has only 1 running task. Previous
> > + * attempt has failed so force the migration of such task.
> > + */
> > + if ((env->fbq_type < all) &&
> > + (env->src_rq->cfs.h_nr_running == 1) &&
> > + (sd->nr_balance_failed > 0))
>
> The last condition will still be a problem because of CPU_NEWLY_IDLE. The
> nr_balance_failed counter doesn't get incremented very often.

It waits for at least 1 failed CPU_IDLE load_balance

>
> julia
>
> > + return 1;
> > +
> > + return 0;
> > +}
> > +
> > static int need_active_balance(struct lb_env *env)
> > {
> > struct sched_domain *sd = env->sd;
> > @@ -11176,6 +11194,9 @@ static int need_active_balance(struct lb_env *env)
> > if (env->migration_type == migrate_misfit)
> > return 1;
> >
> > + if (numa_active_balance(env))
> > + return 1;
> > +
> > return 0;
> > }
> >
> >
> > > it to avoid the CPU with two threads that already prefer Socket 2. But I
> > > don't know at the moment why that is the case. In any case, it's fine to
> > > active balance from a CPU with only one thread, because Socket 2 will
> > > even itself out afterwards.
> > >
> > > >
> > > > You should not have more than 1 thread per CPU when there are N+1
> > > > threads on a node with N cores / 2N CPUs.
> > >
> > > Hmm, I think there is a miscommunication about cores and CPUs. The
> > > machine has two sockets with 16 physical cores each, and thus 32
> > > hyperthreads. There are 64 threads running.
> >
> > Ok, I have been confused by what you wrote previously:
> > " The context is that there are 2N threads running on 2N cores, one thread
> > gets NUMA balanced to the other socket, leaving N+1 threads on one socket
> > and N-1 threads on the other socket."
> >
> > I have assumed that there were N cores and 2N CPUs per socket as you
> > mentioned Intel Xeon 6130 in the commit message . My previous emails
> > don't apply at all with N CPUs per socket and the group_overloaded is
> > correct.
> >
> >
> >
> > >
> > > julia
> > >
> > > > This will enable the
> > > > load_balance to try to migrate a task instead of some util(ization)
> > > > and you should reach the active load balance.
> > > >
> > > > >
> > > > > > In theory you should have the
> > > > > > local "group_has_spare" and the busiest "group_fully_busy" (at most).
> > > > > > This means that no group should be overloaded and load_balance should
> > > > > > not try to migrate utli but only task
> > > > >
> > > > > I didn't collect information about the groups. I will look into that.
> > > > >
> > > > > julia
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > and changing the above test to:
> > > > > > >
> > > > > > > if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
> > > > > > > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > > > > >
> > > > > > > seems to solve the problem.
> > > > > > >
> > > > > > > I will test this on more applications. But let me know if the above
> > > > > > > solution seems completely inappropriate. Maybe it violates some other
> > > > > > > constraints.
> > > > > > >
> > > > > > > I have no idea why this problem became more visible with EEVDF. It seems
> > > > > > > to have to do with the time slices all turning out to be the same. I got
> > > > > > > the same behavior in 6.5 by overwriting the timeslice calculation to
> > > > > > > always return 1. But I don't see the connection between the timeslice and
> > > > > > > the behavior of the idle task.
> > > > > > >
> > > > > > > thanks,
> > > > > > > julia
> > > > > >
> > > >
> >

2023-12-22 15:00:55

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing



On Fri, 22 Dec 2023, Vincent Guittot wrote:

> On Thu, 21 Dec 2023 at 19:20, Julia Lawall <[email protected]> wrote:
> >
> >
> >
> > On Wed, 20 Dec 2023, Vincent Guittot wrote:
> >
> > > On Tue, 19 Dec 2023 at 18:51, Julia Lawall <[email protected]> wrote:
> > > >
> > > > > > One CPU has 2 threads, and the others have one. The one with two threads
> > > > > > is returned as the busiest one. But nothing happens, because both of them
> > > > > > prefer the socket that they are on.
> > > > >
> > > > > This explains way load_balance uses migrate_util and not migrate_task.
> > > > > One CPU with 2 threads can be overloaded
> > > > >
> > > > > ok, so it seems that your 1st problem is that you have 2 threads on
> > > > > the same CPU whereas you should have an idle core in this numa node.
> > > > > All cores are sharing the same LLC, aren't they ?
> > > >
> > > > Sorry, not following this.
> > > >
> > > > Socket 1 has N-1 threads, and thus an idle CPU.
> > > > Socket 2 has N+1 threads, and thus one CPU with two threads.
> > > >
> > > > Socket 1 tries to steal from that one CPU with two threads, but that
> > > > fails, because both threads prefer being on Socket 2.
> > > >
> > > > Since most (or all?) of the threads on Socket 2 perfer being on Socket 2.
> > > > the only hope for Socket 1 to fill in its idle core is active balancing.
> > > > But active balancing is not triggered because of migrate_util and because
> > > > CPU_NEWLY_IDLE prevents the failure counter from ebing increased.
> > >
> > > CPU_NEWLY_IDLE load_balance doesn't aims to do active load balance so
> > > you should focus on the CPU_NEWLY_IDLE load_balance
> >
> > I'm still perplexed why a core that has been idle for 1 second or more is
> > considered to be newly idle.
>
> CPU_NEWLY_IDLE load balance is called when the scheduler was
> scheduling something that just migrated or went back to sleep and
> doesn't have anything to schedule so it tries to pull a task from
> somewhere else.
>
> But you should still have some CPU_IDLE load balance according to your
> description where one CPU of the socket remains idle and those will
> increase the nr_balance_failed

This happens. But not often.

> I'm surprised that you have mainly CPU_NEWLY_IDLE. Do you know the reason ?

No. They come from do_idle calling the scheduler. I will look into why
this happens so often.

>
> >
> > >
> > > >
> > > > The part that I am currently missing to understand is that when I convert
> > > > CPU_NEWLY_IDLE to CPU_IDLE, it typically picks a CPU with only one thread
> > > > as busiest. I have the impression that the fbq_type intervenes to cause
> > >
> > > find_busiest_queue skips rqs which only have threads preferring being
> > > in there. So it selects another rq with a thread that doesn't prefer
> > > its current node.
> > >
> > > do you know what is the value of env->fbq_type ?
> >
> > I have seen one trace in which it is all. There are 33 tasks on one
> > socket, and they are all considered to have a preference for that socket.
>
> With env->fbq_type == all, load_balance and find_busiest_queue should
> be able to select the actual busiest queue with 2 threads.

That's what it does. But nothing can be stolen because there is no active
balancing.

>
> But then I imagine that can_migrate/ migrate_degrades_locality
> prevents to detach the task

Exactly.

julia

> >
> > But I have another trace in which it is regular. There are 33 tasks on
> > the socket, but only 32 have a preference.
> >
> > >
> > > need_active_balance() probably needs a new condition for the numa case
> > > where the busiest queue can't be selected and we have to trigger an
> > > active load_balance on a rq with only 1 thread but that is not running
> > > on its preferred node. Something like the untested below :
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index e5da5eaab6ce..de1474191488 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -11150,6 +11150,24 @@ imbalanced_active_balance(struct lb_env *env)
> > > return 0;
> > > }
> > >
> > > +static inline bool
> > > +numa_active_balance(struct lb_env *env)
> > > +{
> > > + struct sched_domain *sd = env->sd;
> > > +
> > > + /*
> > > + * We tried to migrate only a !numa task or a task on wrong node but
> > > + * the busiest queue with such task has only 1 running task. Previous
> > > + * attempt has failed so force the migration of such task.
> > > + */
> > > + if ((env->fbq_type < all) &&
> > > + (env->src_rq->cfs.h_nr_running == 1) &&
> > > + (sd->nr_balance_failed > 0))
> >
> > The last condition will still be a problem because of CPU_NEWLY_IDLE. The
> > nr_balance_failed counter doesn't get incremented very often.
>
> It waits for at least 1 failed CPU_IDLE load_balance
>
> >
> > julia
> >
> > > + return 1;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > static int need_active_balance(struct lb_env *env)
> > > {
> > > struct sched_domain *sd = env->sd;
> > > @@ -11176,6 +11194,9 @@ static int need_active_balance(struct lb_env *env)
> > > if (env->migration_type == migrate_misfit)
> > > return 1;
> > >
> > > + if (numa_active_balance(env))
> > > + return 1;
> > > +
> > > return 0;
> > > }
> > >
> > >
> > > > it to avoid the CPU with two threads that already prefer Socket 2. But I
> > > > don't know at the moment why that is the case. In any case, it's fine to
> > > > active balance from a CPU with only one thread, because Socket 2 will
> > > > even itself out afterwards.
> > > >
> > > > >
> > > > > You should not have more than 1 thread per CPU when there are N+1
> > > > > threads on a node with N cores / 2N CPUs.
> > > >
> > > > Hmm, I think there is a miscommunication about cores and CPUs. The
> > > > machine has two sockets with 16 physical cores each, and thus 32
> > > > hyperthreads. There are 64 threads running.
> > >
> > > Ok, I have been confused by what you wrote previously:
> > > " The context is that there are 2N threads running on 2N cores, one thread
> > > gets NUMA balanced to the other socket, leaving N+1 threads on one socket
> > > and N-1 threads on the other socket."
> > >
> > > I have assumed that there were N cores and 2N CPUs per socket as you
> > > mentioned Intel Xeon 6130 in the commit message . My previous emails
> > > don't apply at all with N CPUs per socket and the group_overloaded is
> > > correct.
> > >
> > >
> > >
> > > >
> > > > julia
> > > >
> > > > > This will enable the
> > > > > load_balance to try to migrate a task instead of some util(ization)
> > > > > and you should reach the active load balance.
> > > > >
> > > > > >
> > > > > > > In theory you should have the
> > > > > > > local "group_has_spare" and the busiest "group_fully_busy" (at most).
> > > > > > > This means that no group should be overloaded and load_balance should
> > > > > > > not try to migrate utli but only task
> > > > > >
> > > > > > I didn't collect information about the groups. I will look into that.
> > > > > >
> > > > > > julia
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > and changing the above test to:
> > > > > > > >
> > > > > > > > if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
> > > > > > > > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > > > > > >
> > > > > > > > seems to solve the problem.
> > > > > > > >
> > > > > > > > I will test this on more applications. But let me know if the above
> > > > > > > > solution seems completely inappropriate. Maybe it violates some other
> > > > > > > > constraints.
> > > > > > > >
> > > > > > > > I have no idea why this problem became more visible with EEVDF. It seems
> > > > > > > > to have to do with the time slices all turning out to be the same. I got
> > > > > > > > the same behavior in 6.5 by overwriting the timeslice calculation to
> > > > > > > > always return 1. But I don't see the connection between the timeslice and
> > > > > > > > the behavior of the idle task.
> > > > > > > >
> > > > > > > > thanks,
> > > > > > > > julia
> > > > > > >
> > > > >
> > >
>

2023-12-22 15:59:32

by Vincent Guittot

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing

On Fri, 22 Dec 2023 at 16:00, Julia Lawall <[email protected]> wrote:
>
>
>
> On Fri, 22 Dec 2023, Vincent Guittot wrote:
>
> > On Thu, 21 Dec 2023 at 19:20, Julia Lawall <[email protected]> wrote:
> > >
> > >
> > >
> > > On Wed, 20 Dec 2023, Vincent Guittot wrote:
> > >
> > > > On Tue, 19 Dec 2023 at 18:51, Julia Lawall <[email protected]> wrote:
> > > > >
> > > > > > > One CPU has 2 threads, and the others have one. The one with two threads
> > > > > > > is returned as the busiest one. But nothing happens, because both of them
> > > > > > > prefer the socket that they are on.
> > > > > >
> > > > > > This explains way load_balance uses migrate_util and not migrate_task.
> > > > > > One CPU with 2 threads can be overloaded
> > > > > >
> > > > > > ok, so it seems that your 1st problem is that you have 2 threads on
> > > > > > the same CPU whereas you should have an idle core in this numa node.
> > > > > > All cores are sharing the same LLC, aren't they ?
> > > > >
> > > > > Sorry, not following this.
> > > > >
> > > > > Socket 1 has N-1 threads, and thus an idle CPU.
> > > > > Socket 2 has N+1 threads, and thus one CPU with two threads.
> > > > >
> > > > > Socket 1 tries to steal from that one CPU with two threads, but that
> > > > > fails, because both threads prefer being on Socket 2.
> > > > >
> > > > > Since most (or all?) of the threads on Socket 2 perfer being on Socket 2.
> > > > > the only hope for Socket 1 to fill in its idle core is active balancing.
> > > > > But active balancing is not triggered because of migrate_util and because
> > > > > CPU_NEWLY_IDLE prevents the failure counter from ebing increased.
> > > >
> > > > CPU_NEWLY_IDLE load_balance doesn't aims to do active load balance so
> > > > you should focus on the CPU_NEWLY_IDLE load_balance
> > >
> > > I'm still perplexed why a core that has been idle for 1 second or more is
> > > considered to be newly idle.
> >
> > CPU_NEWLY_IDLE load balance is called when the scheduler was
> > scheduling something that just migrated or went back to sleep and
> > doesn't have anything to schedule so it tries to pull a task from
> > somewhere else.
> >
> > But you should still have some CPU_IDLE load balance according to your
> > description where one CPU of the socket remains idle and those will
> > increase the nr_balance_failed
>
> This happens. But not often.
>
> > I'm surprised that you have mainly CPU_NEWLY_IDLE. Do you know the reason ?
>
> No. They come from do_idle calling the scheduler. I will look into why
> this happens so often.

Hmm, the CPU was idle and received a need_resched, which triggered the
scheduler, but there was nothing to schedule, so it goes back to idle
after running a newly-idle load_balance.

>
> >
> > >
> > > >
> > > > >
> > > > > The part that I am currently missing to understand is that when I convert
> > > > > CPU_NEWLY_IDLE to CPU_IDLE, it typically picks a CPU with only one thread
> > > > > as busiest. I have the impression that the fbq_type intervenes to cause
> > > >
> > > > find_busiest_queue skips rqs which only have threads preferring being
> > > > in there. So it selects another rq with a thread that doesn't prefer
> > > > its current node.
> > > >
> > > > do you know what is the value of env->fbq_type ?
> > >
> > > I have seen one trace in which it is all. There are 33 tasks on one
> > > socket, and they are all considered to have a preference for that socket.
> >
> > With env->fbq_type == all, load_balance and find_busiest_queue should
> > be able to select the actual busiest queue with 2 threads.
>
> That's what it does. But nothing can be stolen because there is no active
> balancing.

My patch below should make it possible to pull a task from the 1st idle
load balance that fails.

>
> >
> > But then I imagine that can_migrate/ migrate_degrades_locality
> > prevents to detach the task
>
> Exactly.
>
> julia
>
> > >
> > > But I have another trace in which it is regular. There are 33 tasks on
> > > the socket, but only 32 have a preference.
> > >
> > > >
> > > > need_active_balance() probably needs a new condition for the numa case
> > > > where the busiest queue can't be selected and we have to trigger an
> > > > active load_balance on a rq with only 1 thread but that is not running
> > > > on its preferred node. Something like the untested below :
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index e5da5eaab6ce..de1474191488 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -11150,6 +11150,24 @@ imbalanced_active_balance(struct lb_env *env)
> > > > return 0;
> > > > }
> > > >
> > > > +static inline bool
> > > > +numa_active_balance(struct lb_env *env)
> > > > +{
> > > > + struct sched_domain *sd = env->sd;
> > > > +
> > > > + /*
> > > > + * We tried to migrate only a !numa task or a task on wrong node but
> > > > + * the busiest queue with such task has only 1 running task. Previous
> > > > + * attempt has failed so force the migration of such task.
> > > > + */
> > > > + if ((env->fbq_type < all) &&
> > > > + (env->src_rq->cfs.h_nr_running == 1) &&
> > > > + (sd->nr_balance_failed > 0))
> > >
> > > The last condition will still be a problem because of CPU_NEWLY_IDLE. The
> > > nr_balance_failed counter doesn't get incremented very often.
> >
> > It waits for at least 1 failed CPU_IDLE load_balance
> >
> > >
> > > julia
> > >
> > > > + return 1;
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > static int need_active_balance(struct lb_env *env)
> > > > {
> > > > struct sched_domain *sd = env->sd;
> > > > @@ -11176,6 +11194,9 @@ static int need_active_balance(struct lb_env *env)
> > > > if (env->migration_type == migrate_misfit)
> > > > return 1;
> > > >
> > > > + if (numa_active_balance(env))
> > > > + return 1;
> > > > +
> > > > return 0;
> > > > }
> > > >
> > > >
> > > > > it to avoid the CPU with two threads that already prefer Socket 2. But I
> > > > > don't know at the moment why that is the case. In any case, it's fine to
> > > > > active balance from a CPU with only one thread, because Socket 2 will
> > > > > even itself out afterwards.
> > > > >
> > > > > >
> > > > > > You should not have more than 1 thread per CPU when there are N+1
> > > > > > threads on a node with N cores / 2N CPUs.
> > > > >
> > > > > Hmm, I think there is a miscommunication about cores and CPUs. The
> > > > > machine has two sockets with 16 physical cores each, and thus 32
> > > > > hyperthreads. There are 64 threads running.
> > > >
> > > > Ok, I have been confused by what you wrote previously:
> > > > " The context is that there are 2N threads running on 2N cores, one thread
> > > > gets NUMA balanced to the other socket, leaving N+1 threads on one socket
> > > > and N-1 threads on the other socket."
> > > >
> > > > I have assumed that there were N cores and 2N CPUs per socket as you
> > > > mentioned Intel Xeon 6130 in the commit message . My previous emails
> > > > don't apply at all with N CPUs per socket and the group_overloaded is
> > > > correct.
> > > >
> > > >
> > > >
> > > > >
> > > > > julia
> > > > >
> > > > > > This will enable the
> > > > > > load_balance to try to migrate a task instead of some util(ization)
> > > > > > and you should reach the active load balance.
> > > > > >
> > > > > > >
> > > > > > > > In theory you should have the
> > > > > > > > local "group_has_spare" and the busiest "group_fully_busy" (at most).
> > > > > > > > This means that no group should be overloaded and load_balance should
> > > > > > > > not try to migrate utli but only task
> > > > > > >
> > > > > > > I didn't collect information about the groups. I will look into that.
> > > > > > >
> > > > > > > julia
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > and changing the above test to:
> > > > > > > > >
> > > > > > > > > if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
> > > > > > > > > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > > > > > > >
> > > > > > > > > seems to solve the problem.
> > > > > > > > >
> > > > > > > > > I will test this on more applications. But let me know if the above
> > > > > > > > > solution seems completely inappropriate. Maybe it violates some other
> > > > > > > > > constraints.
> > > > > > > > >
> > > > > > > > > I have no idea why this problem became more visible with EEVDF. It seems
> > > > > > > > > to have to do with the time slices all turning out to be the same. I got
> > > > > > > > > the same behavior in 6.5 by overwriting the timeslice calculation to
> > > > > > > > > always return 1. But I don't see the connection between the timeslice and
> > > > > > > > > the behavior of the idle task.
> > > > > > > > >
> > > > > > > > > thanks,
> > > > > > > > > julia
> > > > > > > >
> > > > > >
> > > >
> >

2023-12-22 16:18:17

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing



On Fri, 22 Dec 2023, Vincent Guittot wrote:

> On Fri, 22 Dec 2023 at 16:00, Julia Lawall <[email protected]> wrote:
> >
> >
> >
> > On Fri, 22 Dec 2023, Vincent Guittot wrote:
> >
> > > On Thu, 21 Dec 2023 at 19:20, Julia Lawall <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > On Wed, 20 Dec 2023, Vincent Guittot wrote:
> > > >
> > > > > On Tue, 19 Dec 2023 at 18:51, Julia Lawall <[email protected]> wrote:
> > > > > >
> > > > > > > > One CPU has 2 threads, and the others have one. The one with two threads
> > > > > > > > is returned as the busiest one. But nothing happens, because both of them
> > > > > > > > prefer the socket that they are on.
> > > > > > >
> > > > > > > This explains way load_balance uses migrate_util and not migrate_task.
> > > > > > > One CPU with 2 threads can be overloaded
> > > > > > >
> > > > > > > ok, so it seems that your 1st problem is that you have 2 threads on
> > > > > > > the same CPU whereas you should have an idle core in this numa node.
> > > > > > > All cores are sharing the same LLC, aren't they ?
> > > > > >
> > > > > > Sorry, not following this.
> > > > > >
> > > > > > Socket 1 has N-1 threads, and thus an idle CPU.
> > > > > > Socket 2 has N+1 threads, and thus one CPU with two threads.
> > > > > >
> > > > > > Socket 1 tries to steal from that one CPU with two threads, but that
> > > > > > fails, because both threads prefer being on Socket 2.
> > > > > >
> > > > > > Since most (or all?) of the threads on Socket 2 perfer being on Socket 2.
> > > > > > the only hope for Socket 1 to fill in its idle core is active balancing.
> > > > > > But active balancing is not triggered because of migrate_util and because
> > > > > > CPU_NEWLY_IDLE prevents the failure counter from ebing increased.
> > > > >
> > > > > CPU_NEWLY_IDLE load_balance doesn't aims to do active load balance so
> > > > > you should focus on the CPU_NEWLY_IDLE load_balance
> > > >
> > > > I'm still perplexed why a core that has been idle for 1 second or more is
> > > > considered to be newly idle.
> > >
> > > CPU_NEWLY_IDLE load balance is called when the scheduler was
> > > scheduling something that just migrated or went back to sleep and
> > > doesn't have anything to schedule so it tries to pull a task from
> > > somewhere else.
> > >
> > > But you should still have some CPU_IDLE load balance according to your
> > > description where one CPU of the socket remains idle and those will
> > > increase the nr_balance_failed
> >
> > This happens. But not often.
> >
> > > I'm surprised that you have mainly CPU_NEWLY_IDLE. Do you know the reason ?
> >
> > No. They come from do_idle calling the scheduler. I will look into why
> > this happens so often.
>
> Hmm, the CPU was idle and received a need resched which triggered the
> scheduler but there was nothing to schedule so it goes back to idle
> after running a newly_idle _load_balance.

I spent quite some time thinking the same until I saw the following code
in do_idle:

preempt_set_need_resched();

So I have the impression that do_idle sets need resched itself.

julia

>
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > The part that I am currently missing to understand is that when I convert
> > > > > > CPU_NEWLY_IDLE to CPU_IDLE, it typically picks a CPU with only one thread
> > > > > > as busiest. I have the impression that the fbq_type intervenes to cause
> > > > >
> > > > > find_busiest_queue skips rqs which only have threads preferring being
> > > > > in there. So it selects another rq with a thread that doesn't prefer
> > > > > its current node.
> > > > >
> > > > > do you know what is the value of env->fbq_type ?
> > > >
> > > > I have seen one trace in which it is all. There are 33 tasks on one
> > > > socket, and they are all considered to have a preference for that socket.
> > >
> > > With env->fbq_type == all, load_balance and find_busiest_queue should
> > > be able to select the actual busiest queue with 2 threads.
> >
> > That's what it does. But nothing can be stolen because there is no active
> > balancing.
>
> My patch below should enable to pull a task from the 1st idle load
> balance that fails
>
> >
> > >
> > > But then I imagine that can_migrate/ migrate_degrades_locality
> > > prevents to detach the task
> >
> > Exactly.
> >
> > julia
> >
> > > >
> > > > But I have another trace in which it is regular. There are 33 tasks on
> > > > the socket, but only 32 have a preference.
> > > >
> > > > >
> > > > > need_active_balance() probably needs a new condition for the numa case
> > > > > where the busiest queue can't be selected and we have to trigger an
> > > > > active load_balance on a rq with only 1 thread but that is not running
> > > > > on its preferred node. Something like the untested below :
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index e5da5eaab6ce..de1474191488 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -11150,6 +11150,24 @@ imbalanced_active_balance(struct lb_env *env)
> > > > > return 0;
> > > > > }
> > > > >
> > > > > +static inline bool
> > > > > +numa_active_balance(struct lb_env *env)
> > > > > +{
> > > > > + struct sched_domain *sd = env->sd;
> > > > > +
> > > > > + /*
> > > > > + * We tried to migrate only a !numa task or a task on wrong node but
> > > > > + * the busiest queue with such task has only 1 running task. Previous
> > > > > + * attempt has failed so force the migration of such task.
> > > > > + */
> > > > > + if ((env->fbq_type < all) &&
> > > > > + (env->src_rq->cfs.h_nr_running == 1) &&
> > > > > + (sd->nr_balance_failed > 0))
> > > >
> > > > The last condition will still be a problem because of CPU_NEWLY_IDLE. The
> > > > nr_balance_failed counter doesn't get incremented very often.
> > >
> > > It waits for at least 1 failed CPU_IDLE load_balance
> > >
> > > >
> > > > julia
> > > >
> > > > > + return 1;
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > static int need_active_balance(struct lb_env *env)
> > > > > {
> > > > > struct sched_domain *sd = env->sd;
> > > > > @@ -11176,6 +11194,9 @@ static int need_active_balance(struct lb_env *env)
> > > > > if (env->migration_type == migrate_misfit)
> > > > > return 1;
> > > > >
> > > > > + if (numa_active_balance(env))
> > > > > + return 1;
> > > > > +
> > > > > return 0;
> > > > > }
> > > > >
> > > > >
> > > > > > it to avoid the CPU with two threads that already prefer Socket 2. But I
> > > > > > don't know at the moment why that is the case. In any case, it's fine to
> > > > > > active balance from a CPU with only one thread, because Socket 2 will
> > > > > > even itself out afterwards.
> > > > > >
> > > > > > >
> > > > > > > You should not have more than 1 thread per CPU when there are N+1
> > > > > > > threads on a node with N cores / 2N CPUs.
> > > > > >
> > > > > > Hmm, I think there is a miscommunication about cores and CPUs. The
> > > > > > machine has two sockets with 16 physical cores each, and thus 32
> > > > > > hyperthreads. There are 64 threads running.
> > > > >
> > > > > Ok, I have been confused by what you wrote previously:
> > > > > " The context is that there are 2N threads running on 2N cores, one thread
> > > > > gets NUMA balanced to the other socket, leaving N+1 threads on one socket
> > > > > and N-1 threads on the other socket."
> > > > >
> > > > > I have assumed that there were N cores and 2N CPUs per socket as you
> > > > > mentioned Intel Xeon 6130 in the commit message . My previous emails
> > > > > don't apply at all with N CPUs per socket and the group_overloaded is
> > > > > correct.
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > julia
> > > > > >
> > > > > > > This will enable the
> > > > > > > load_balance to try to migrate a task instead of some util(ization)
> > > > > > > and you should reach the active load balance.
> > > > > > >
> > > > > > > >
> > > > > > > > > In theory you should have the
> > > > > > > > > local "group_has_spare" and the busiest "group_fully_busy" (at most).
> > > > > > > > > This means that no group should be overloaded and load_balance should
> > > > > > > > > not try to migrate utli but only task
> > > > > > > >
> > > > > > > > I didn't collect information about the groups. I will look into that.
> > > > > > > >
> > > > > > > > julia
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > and changing the above test to:
> > > > > > > > > >
> > > > > > > > > > if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
> > > > > > > > > > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > > > > > > > >
> > > > > > > > > > seems to solve the problem.
> > > > > > > > > >
> > > > > > > > > > I will test this on more applications. But let me know if the above
> > > > > > > > > > solution seems completely inappropriate. Maybe it violates some other
> > > > > > > > > > constraints.
> > > > > > > > > >
> > > > > > > > > > I have no idea why this problem became more visible with EEVDF. It seems
> > > > > > > > > > to have to do with the time slices all turning out to be the same. I got
> > > > > > > > > > the same behavior in 6.5 by overwriting the timeslice calculation to
> > > > > > > > > > always return 1. But I don't see the connection between the timeslice and
> > > > > > > > > > the behavior of the idle task.
> > > > > > > > > >
> > > > > > > > > > thanks,
> > > > > > > > > > julia
> > > > > > > > >
> > > > > > >
> > > > >
> > >
>

2023-12-22 16:34:50

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing



On Fri, 22 Dec 2023, Julia Lawall wrote:

>
>
> On Fri, 22 Dec 2023, Vincent Guittot wrote:
>
> > On Fri, 22 Dec 2023 at 16:00, Julia Lawall <[email protected]> wrote:
> > >
> > >
> > >
> > > On Fri, 22 Dec 2023, Vincent Guittot wrote:
> > >
> > > > On Thu, 21 Dec 2023 at 19:20, Julia Lawall <[email protected]> wrote:
> > > > >
> > > > >
> > > > >
> > > > > On Wed, 20 Dec 2023, Vincent Guittot wrote:
> > > > >
> > > > > > On Tue, 19 Dec 2023 at 18:51, Julia Lawall <[email protected]> wrote:
> > > > > > >
> > > > > > > > > One CPU has 2 threads, and the others have one. The one with two threads
> > > > > > > > > is returned as the busiest one. But nothing happens, because both of them
> > > > > > > > > prefer the socket that they are on.
> > > > > > > >
> > > > > > > > This explains way load_balance uses migrate_util and not migrate_task.
> > > > > > > > One CPU with 2 threads can be overloaded
> > > > > > > >
> > > > > > > > ok, so it seems that your 1st problem is that you have 2 threads on
> > > > > > > > the same CPU whereas you should have an idle core in this numa node.
> > > > > > > > All cores are sharing the same LLC, aren't they ?
> > > > > > >
> > > > > > > Sorry, not following this.
> > > > > > >
> > > > > > > Socket 1 has N-1 threads, and thus an idle CPU.
> > > > > > > Socket 2 has N+1 threads, and thus one CPU with two threads.
> > > > > > >
> > > > > > > Socket 1 tries to steal from that one CPU with two threads, but that
> > > > > > > fails, because both threads prefer being on Socket 2.
> > > > > > >
> > > > > > > Since most (or all?) of the threads on Socket 2 perfer being on Socket 2.
> > > > > > > the only hope for Socket 1 to fill in its idle core is active balancing.
> > > > > > > But active balancing is not triggered because of migrate_util and because
> > > > > > > CPU_NEWLY_IDLE prevents the failure counter from ebing increased.
> > > > > >
> > > > > > CPU_NEWLY_IDLE load_balance doesn't aims to do active load balance so
> > > > > > you should focus on the CPU_NEWLY_IDLE load_balance
> > > > >
> > > > > I'm still perplexed why a core that has been idle for 1 second or more is
> > > > > considered to be newly idle.
> > > >
> > > > CPU_NEWLY_IDLE load balance is called when the scheduler was
> > > > scheduling something that just migrated or went back to sleep and
> > > > doesn't have anything to schedule so it tries to pull a task from
> > > > somewhere else.
> > > >
> > > > But you should still have some CPU_IDLE load balance according to your
> > > > description where one CPU of the socket remains idle and those will
> > > > increase the nr_balance_failed
> > >
> > > This happens. But not often.
> > >
> > > > I'm surprised that you have mainly CPU_NEWLY_IDLE. Do you know the reason ?
> > >
> > > No. They come from do_idle calling the scheduler. I will look into why
> > > this happens so often.
> >
> > Hmm, the CPU was idle and received a need resched which triggered the
> > scheduler but there was nothing to schedule so it goes back to idle
> > after running a newly_idle _load_balance.
>
> I spent quite some time thinking the same until I saw the following code
> in do_idle:
>
> preempt_set_need_resched();
>
> So I have the impression that do_idle sets need resched itself.

But of course that code is only executed if need_resched is true. And I
don't know who would be setting need_resched on each clock tick.
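
(For reference, do_idle() in kernel/sched/idle.c has roughly this shape,
paraphrased from memory; preempt_set_need_resched() is only reached once
the idle loop has already seen need_resched():)

	while (!need_resched()) {
		/* stay in the idle loop, possibly entering cpuidle states */
		cpuidle_idle_call();
	}

	/*
	 * TIF_NEED_RESCHED was set by someone else (tick, wakeup, IPI, ...);
	 * fold it into the preempt count and go pick a task, which may end
	 * up in newidle_balance() if there is still nothing to run.
	 */
	preempt_set_need_resched();
	schedule_idle();

so do_idle() doesn't set the flag by itself, it only reacts to it.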

julia

>
> julia
>
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > The part that I am currently missing to understand is that when I convert
> > > > > > > CPU_NEWLY_IDLE to CPU_IDLE, it typically picks a CPU with only one thread
> > > > > > > as busiest. I have the impression that the fbq_type intervenes to cause
> > > > > >
> > > > > > find_busiest_queue skips rqs which only have threads preferring being
> > > > > > in there. So it selects another rq with a thread that doesn't prefer
> > > > > > its current node.
> > > > > >
> > > > > > do you know what is the value of env->fbq_type ?
> > > > >
> > > > > I have seen one trace in which it is all. There are 33 tasks on one
> > > > > socket, and they are all considered to have a preference for that socket.
> > > >
> > > > With env->fbq_type == all, load_balance and find_busiest_queue should
> > > > be able to select the actual busiest queue with 2 threads.
> > >
> > > That's what it does. But nothing can be stolen because there is no active
> > > balancing.
> >
> > My patch below should enable to pull a task from the 1st idle load
> > balance that fails
> >
> > >
> > > >
> > > > But then I imagine that can_migrate/ migrate_degrades_locality
> > > > prevents to detach the task
> > >
> > > Exactly.
> > >
> > > julia
> > >
> > > > >
> > > > > But I have another trace in which it is regular. There are 33 tasks on
> > > > > the socket, but only 32 have a preference.
> > > > >
> > > > > >
> > > > > > need_active_balance() probably needs a new condition for the numa case
> > > > > > where the busiest queue can't be selected and we have to trigger an
> > > > > > active load_balance on a rq with only 1 thread but that is not running
> > > > > > on its preferred node. Something like the untested below :
> > > > > >
> > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > > index e5da5eaab6ce..de1474191488 100644
> > > > > > --- a/kernel/sched/fair.c
> > > > > > +++ b/kernel/sched/fair.c
> > > > > > @@ -11150,6 +11150,24 @@ imbalanced_active_balance(struct lb_env *env)
> > > > > > return 0;
> > > > > > }
> > > > > >
> > > > > > +static inline bool
> > > > > > +numa_active_balance(struct lb_env *env)
> > > > > > +{
> > > > > > + struct sched_domain *sd = env->sd;
> > > > > > +
> > > > > > + /*
> > > > > > + * We tried to migrate only a !numa task or a task on wrong node but
> > > > > > + * the busiest queue with such task has only 1 running task. Previous
> > > > > > + * attempt has failed so force the migration of such task.
> > > > > > + */
> > > > > > + if ((env->fbq_type < all) &&
> > > > > > + (env->src_rq->cfs.h_nr_running == 1) &&
> > > > > > + (sd->nr_balance_failed > 0))
> > > > >
> > > > > The last condition will still be a problem because of CPU_NEWLY_IDLE. The
> > > > > nr_balance_failed counter doesn't get incremented very often.
> > > >
> > > > It waits for at least 1 failed CPU_IDLE load_balance
> > > >
> > > > >
> > > > > julia
> > > > >
> > > > > > + return 1;
> > > > > > +
> > > > > > + return 0;
> > > > > > +}
> > > > > > +
> > > > > > static int need_active_balance(struct lb_env *env)
> > > > > > {
> > > > > > struct sched_domain *sd = env->sd;
> > > > > > @@ -11176,6 +11194,9 @@ static int need_active_balance(struct lb_env *env)
> > > > > > if (env->migration_type == migrate_misfit)
> > > > > > return 1;
> > > > > >
> > > > > > + if (numa_active_balance(env))
> > > > > > + return 1;
> > > > > > +
> > > > > > return 0;
> > > > > > }
> > > > > >
> > > > > >
> > > > > > > it to avoid the CPU with two threads that already prefer Socket 2. But I
> > > > > > > don't know at the moment why that is the case. In any case, it's fine to
> > > > > > > active balance from a CPU with only one thread, because Socket 2 will
> > > > > > > even itself out afterwards.
> > > > > > >
> > > > > > > >
> > > > > > > > You should not have more than 1 thread per CPU when there are N+1
> > > > > > > > threads on a node with N cores / 2N CPUs.
> > > > > > >
> > > > > > > Hmm, I think there is a miscommunication about cores and CPUs. The
> > > > > > > machine has two sockets with 16 physical cores each, and thus 32
> > > > > > > hyperthreads. There are 64 threads running.
> > > > > >
> > > > > > Ok, I have been confused by what you wrote previously:
> > > > > > " The context is that there are 2N threads running on 2N cores, one thread
> > > > > > gets NUMA balanced to the other socket, leaving N+1 threads on one socket
> > > > > > and N-1 threads on the other socket."
> > > > > >
> > > > > > I have assumed that there were N cores and 2N CPUs per socket as you
> > > > > > mentioned Intel Xeon 6130 in the commit message . My previous emails
> > > > > > don't apply at all with N CPUs per socket and the group_overloaded is
> > > > > > correct.
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > julia
> > > > > > >
> > > > > > > > This will enable the
> > > > > > > > load_balance to try to migrate a task instead of some util(ization)
> > > > > > > > and you should reach the active load balance.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > In theory you should have the
> > > > > > > > > > local "group_has_spare" and the busiest "group_fully_busy" (at most).
> > > > > > > > > > This means that no group should be overloaded and load_balance should
> > > > > > > > > > not try to migrate utli but only task
> > > > > > > > >
> > > > > > > > > I didn't collect information about the groups. I will look into that.
> > > > > > > > >
> > > > > > > > > julia
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > and changing the above test to:
> > > > > > > > > > >
> > > > > > > > > > > if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
> > > > > > > > > > > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > > > > > > > > >
> > > > > > > > > > > seems to solve the problem.
> > > > > > > > > > >
> > > > > > > > > > > I will test this on more applications. But let me know if the above
> > > > > > > > > > > solution seems completely inappropriate. Maybe it violates some other
> > > > > > > > > > > constraints.
> > > > > > > > > > >
> > > > > > > > > > > I have no idea why this problem became more visible with EEVDF. It seems
> > > > > > > > > > > to have to do with the time slices all turning out to be the same. I got
> > > > > > > > > > > the same behavior in 6.5 by overwriting the timeslice calculation to
> > > > > > > > > > > always return 1. But I don't see the connection between the timeslice and
> > > > > > > > > > > the behavior of the idle task.
> > > > > > > > > > >
> > > > > > > > > > > thanks,
> > > > > > > > > > > julia
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> >
>

2023-12-22 16:42:32

by Vincent Guittot

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing

On Fri, 22 Dec 2023 at 17:29, Julia Lawall <[email protected]> wrote:
>
>
>
> On Fri, 22 Dec 2023, Julia Lawall wrote:
>
> >
> >
> > On Fri, 22 Dec 2023, Vincent Guittot wrote:
> >
> > > On Fri, 22 Dec 2023 at 16:00, Julia Lawall <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > On Fri, 22 Dec 2023, Vincent Guittot wrote:
> > > >
> > > > > On Thu, 21 Dec 2023 at 19:20, Julia Lawall <[email protected]> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, 20 Dec 2023, Vincent Guittot wrote:
> > > > > >
> > > > > > > On Tue, 19 Dec 2023 at 18:51, Julia Lawall <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > > One CPU has 2 threads, and the others have one. The one with two threads
> > > > > > > > > > is returned as the busiest one. But nothing happens, because both of them
> > > > > > > > > > prefer the socket that they are on.
> > > > > > > > >
> > > > > > > > > This explains way load_balance uses migrate_util and not migrate_task.
> > > > > > > > > One CPU with 2 threads can be overloaded
> > > > > > > > >
> > > > > > > > > ok, so it seems that your 1st problem is that you have 2 threads on
> > > > > > > > > the same CPU whereas you should have an idle core in this numa node.
> > > > > > > > > All cores are sharing the same LLC, aren't they ?
> > > > > > > >
> > > > > > > > Sorry, not following this.
> > > > > > > >
> > > > > > > > Socket 1 has N-1 threads, and thus an idle CPU.
> > > > > > > > Socket 2 has N+1 threads, and thus one CPU with two threads.
> > > > > > > >
> > > > > > > > Socket 1 tries to steal from that one CPU with two threads, but that
> > > > > > > > fails, because both threads prefer being on Socket 2.
> > > > > > > >
> > > > > > > > Since most (or all?) of the threads on Socket 2 perfer being on Socket 2.
> > > > > > > > the only hope for Socket 1 to fill in its idle core is active balancing.
> > > > > > > > But active balancing is not triggered because of migrate_util and because
> > > > > > > > CPU_NEWLY_IDLE prevents the failure counter from ebing increased.
> > > > > > >
> > > > > > > CPU_NEWLY_IDLE load_balance doesn't aims to do active load balance so
> > > > > > > you should focus on the CPU_NEWLY_IDLE load_balance
> > > > > >
> > > > > > I'm still perplexed why a core that has been idle for 1 second or more is
> > > > > > considered to be newly idle.
> > > > >
> > > > > CPU_NEWLY_IDLE load balance is called when the scheduler was
> > > > > scheduling something that just migrated or went back to sleep and
> > > > > doesn't have anything to schedule so it tries to pull a task from
> > > > > somewhere else.
> > > > >
> > > > > But you should still have some CPU_IDLE load balance according to your
> > > > > description where one CPU of the socket remains idle and those will
> > > > > increase the nr_balance_failed
> > > >
> > > > This happens. But not often.
> > > >
> > > > > I'm surprised that you have mainly CPU_NEWLY_IDLE. Do you know the reason ?
> > > >
> > > > No. They come from do_idle calling the scheduler. I will look into why
> > > > this happens so often.
> > >
> > > Hmm, the CPU was idle and received a need resched which triggered the
> > > scheduler but there was nothing to schedule so it goes back to idle
> > > after running a newly_idle _load_balance.
> >
> > I spent quite some time thinking the same until I saw the following code
> > in do_idle:
> >
> > preempt_set_need_resched();
> >
> > So I have the impression that do_idle sets need resched itself.
>
> But of course that code is only executed if need_resched is true. But I

Yes, that is your root cause. Something, most probably in interrupt
context, wakes up your CPU and expects to wake up a thread.

> don't know who would be setting need resched on each clock tick.

that can be a timer, interrupt, ipi, rcu ...
a trace should give you some hints

>
> julia
>
> >
> > julia
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > The part that I am currently missing to understand is that when I convert
> > > > > > > > CPU_NEWLY_IDLE to CPU_IDLE, it typically picks a CPU with only one thread
> > > > > > > > as busiest. I have the impression that the fbq_type intervenes to cause
> > > > > > >
> > > > > > > find_busiest_queue skips rqs which only have threads preferring being
> > > > > > > in there. So it selects another rq with a thread that doesn't prefer
> > > > > > > its current node.
> > > > > > >
> > > > > > > do you know what is the value of env->fbq_type ?
> > > > > >
> > > > > > I have seen one trace in which it is all. There are 33 tasks on one
> > > > > > socket, and they are all considered to have a preference for that socket.
> > > > >
> > > > > With env->fbq_type == all, load_balance and find_busiest_queue should
> > > > > be able to select the actual busiest queue with 2 threads.
> > > >
> > > > That's what it does. But nothing can be stolen because there is no active
> > > > balancing.
> > >
> > > My patch below should enable to pull a task from the 1st idle load
> > > balance that fails
> > >
> > > >
> > > > >
> > > > > But then I imagine that can_migrate/ migrate_degrades_locality
> > > > > prevents to detach the task
> > > >
> > > > Exactly.
> > > >
> > > > julia
> > > >
> > > > > >
> > > > > > But I have another trace in which it is regular. There are 33 tasks on
> > > > > > the socket, but only 32 have a preference.
> > > > > >
> > > > > > >
> > > > > > > need_active_balance() probably needs a new condition for the numa case
> > > > > > > where the busiest queue can't be selected and we have to trigger an
> > > > > > > active load_balance on a rq with only 1 thread but that is not running
> > > > > > > on its preferred node. Something like the untested below :
> > > > > > >
> > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > > > index e5da5eaab6ce..de1474191488 100644
> > > > > > > --- a/kernel/sched/fair.c
> > > > > > > +++ b/kernel/sched/fair.c
> > > > > > > @@ -11150,6 +11150,24 @@ imbalanced_active_balance(struct lb_env *env)
> > > > > > > return 0;
> > > > > > > }
> > > > > > >
> > > > > > > +static inline bool
> > > > > > > +numa_active_balance(struct lb_env *env)
> > > > > > > +{
> > > > > > > + struct sched_domain *sd = env->sd;
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * We tried to migrate only a !numa task or a task on wrong node but
> > > > > > > + * the busiest queue with such task has only 1 running task. Previous
> > > > > > > + * attempt has failed so force the migration of such task.
> > > > > > > + */
> > > > > > > + if ((env->fbq_type < all) &&
> > > > > > > + (env->src_rq->cfs.h_nr_running == 1) &&
> > > > > > > + (sd->nr_balance_failed > 0))
> > > > > >
> > > > > > The last condition will still be a problem because of CPU_NEWLY_IDLE. The
> > > > > > nr_balance_failed counter doesn't get incremented very often.
> > > > >
> > > > > It waits for at least 1 failed CPU_IDLE load_balance
> > > > >
> > > > > >
> > > > > > julia
> > > > > >
> > > > > > > + return 1;
> > > > > > > +
> > > > > > > + return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > static int need_active_balance(struct lb_env *env)
> > > > > > > {
> > > > > > > struct sched_domain *sd = env->sd;
> > > > > > > @@ -11176,6 +11194,9 @@ static int need_active_balance(struct lb_env *env)
> > > > > > > if (env->migration_type == migrate_misfit)
> > > > > > > return 1;
> > > > > > >
> > > > > > > + if (numa_active_balance(env))
> > > > > > > + return 1;
> > > > > > > +
> > > > > > > return 0;
> > > > > > > }
> > > > > > >
> > > > > > >
> > > > > > > > it to avoid the CPU with two threads that already prefer Socket 2. But I
> > > > > > > > don't know at the moment why that is the case. In any case, it's fine to
> > > > > > > > active balance from a CPU with only one thread, because Socket 2 will
> > > > > > > > even itself out afterwards.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > You should not have more than 1 thread per CPU when there are N+1
> > > > > > > > > threads on a node with N cores / 2N CPUs.
> > > > > > > >
> > > > > > > > Hmm, I think there is a miscommunication about cores and CPUs. The
> > > > > > > > machine has two sockets with 16 physical cores each, and thus 32
> > > > > > > > hyperthreads. There are 64 threads running.
> > > > > > >
> > > > > > > Ok, I have been confused by what you wrote previously:
> > > > > > > " The context is that there are 2N threads running on 2N cores, one thread
> > > > > > > gets NUMA balanced to the other socket, leaving N+1 threads on one socket
> > > > > > > and N-1 threads on the other socket."
> > > > > > >
> > > > > > > I have assumed that there were N cores and 2N CPUs per socket as you
> > > > > > > mentioned Intel Xeon 6130 in the commit message . My previous emails
> > > > > > > don't apply at all with N CPUs per socket and the group_overloaded is
> > > > > > > correct.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > julia
> > > > > > > >
> > > > > > > > > This will enable the
> > > > > > > > > load_balance to try to migrate a task instead of some util(ization)
> > > > > > > > > and you should reach the active load balance.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > In theory you should have the
> > > > > > > > > > > local "group_has_spare" and the busiest "group_fully_busy" (at most).
> > > > > > > > > > > This means that no group should be overloaded and load_balance should
> > > > > > > > > > > not try to migrate utli but only task
> > > > > > > > > >
> > > > > > > > > > I didn't collect information about the groups. I will look into that.
> > > > > > > > > >
> > > > > > > > > > julia
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > and changing the above test to:
> > > > > > > > > > > >
> > > > > > > > > > > > if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
> > > > > > > > > > > > (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > > > > > > > > > > >
> > > > > > > > > > > > seems to solve the problem.
> > > > > > > > > > > >
> > > > > > > > > > > > I will test this on more applications. But let me know if the above
> > > > > > > > > > > > solution seems completely inappropriate. Maybe it violates some other
> > > > > > > > > > > > constraints.
> > > > > > > > > > > >
> > > > > > > > > > > > I have no idea why this problem became more visible with EEVDF. It seems
> > > > > > > > > > > > to have to do with the time slices all turning out to be the same. I got
> > > > > > > > > > > > the same behavior in 6.5 by overwriting the timeslice calculation to
> > > > > > > > > > > > always return 1. But I don't see the connection between the timeslice and
> > > > > > > > > > > > the behavior of the idle task.
> > > > > > > > > > > >
> > > > > > > > > > > > thanks,
> > > > > > > > > > > > julia
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> >

2023-12-28 18:35:16

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing

> > > > > > I'm surprised that you have mainly CPU_NEWLY_IDLE. Do you know the reason ?
> > > > >
> > > > > No. They come from do_idle calling the scheduler. I will look into why
> > > > > this happens so often.
> > > >
> > > > Hmm, the CPU was idle and received a need resched which triggered the
> > > > scheduler but there was nothing to schedule so it goes back to idle
> > > > after running a newly_idle _load_balance.
> > >
> > > I spent quite some time thinking the same until I saw the following code
> > > in do_idle:
> > >
> > > preempt_set_need_resched();
> > >
> > > So I have the impression that do_idle sets need resched itself.
> >
> > But of course that code is only executed if need_resched is true. But I
>
> Yes, that is your root cause. something, most probably in interrupt
> context, wakes up your CPU and expect to wake up a thread
>
> > don't know who would be setting need resched on each clock tick.
>
> that can be a timer, interrupt, ipi, rcu ...
> a trace should give you some hints

I have the impression that the need_resched settings come from the attempt,
on each clock tick, to get nohz_csd_func run on the idle CPU. If the idle
process is polling, call_function_single_prep_ipi just sets need_resched to
get the idle process to stop polling. But there is no actual task that the
idle process should schedule. The need_resched then prevents the idle
process from stealing, due to the CPU_NEWLY_IDLE flag, contradicting the
whole purpose of calling nohz_csd_func in the first place.
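
To make that concrete, here is a rough sketch of the kick side. The helper
name is hypothetical and this is not the kernel's actual set_nr_if_polling()
code, just my understanding of its effect:

/*
 * Hypothetical sketch: when the idle task on the target CPU is polling,
 * the kick only sets TIF_NEED_RESCHED and skips the IPI.  The idle loop
 * then exits polling with need_resched() already true, even though
 * there is no task for it to run.
 */
static bool kick_polling_idle(struct task_struct *idle)
{
        if (test_tsk_thread_flag(idle, TIF_POLLING_NRFLAG)) {
                set_tsk_need_resched(idle);     /* no IPI is sent */
                return true;
        }
        return false;   /* caller falls back to sending a real IPI */
}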

julia

2023-12-29 15:18:45

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing



On Thu, 28 Dec 2023, Julia Lawall wrote:

> > > > > > > I'm surprised that you have mainly CPU_NEWLY_IDLE. Do you know the reason ?
> > > > > >
> > > > > > No. They come from do_idle calling the scheduler. I will look into why
> > > > > > this happens so often.
> > > > >
> > > > > Hmm, the CPU was idle and received a need resched which triggered the
> > > > > scheduler but there was nothing to schedule so it goes back to idle
> > > > > after running a newly_idle _load_balance.
> > > >
> > > > I spent quite some time thinking the same until I saw the following code
> > > > in do_idle:
> > > >
> > > > preempt_set_need_resched();
> > > >
> > > > So I have the impression that do_idle sets need resched itself.
> > >
> > > But of course that code is only executed if need_resched is true. But I
> >
> > Yes, that is your root cause. something, most probably in interrupt
> > context, wakes up your CPU and expect to wake up a thread
> >
> > > don't know who would be setting need resched on each clock tick.
> >
> > that can be a timer, interrupt, ipi, rcu ...
> > a trace should give you some hints
>
> I have the impression that it is the goal of calling nohz_csd_func on each
> clock tick that causes the calls to need_resched. If the idle process is
> polling, call_function_single_prep_ipi just sets need_resched to get the
> idle process to stop polling. But there is no actual task that the idle
> process should schedule. The need_resched then prevents the idle process
> from stealing, due to the CPU_NEWLY_IDLE flag, contradicting the whole
> purpose of calling nohz_csd_func in the first place.

Looking in more detail, do_idle contains the following after exiting the
polling loop:

flush_smp_call_function_queue();
schedule_idle();

flush_smp_call_function_queue() does end up calling nohz_csd_func, but
this has no impact, because it first checks that need_resched() is false,
whereas it is currently true, having caused the exit from the polling loop.
Removing that test results in:

raise_softirq_irqoff(SCHED_SOFTIRQ);

but that causes the load balancing code to be executed from a ksoftirqd
task, which means that the core is no longer idle and there is no longer
any load imbalance to detect.

So the only chance to detect an imbalance does seem to be to have the load
balance call be executed by the idle task, via schedule_idle(), as is
done currently. But that leads to the core being considered to be newly
idle.
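
For reference, the need_resched() gate mentioned above is roughly the
following (a simplified paraphrase of nohz_csd_func from my reading,
omitting the nohz flag bookkeeping, so not a verbatim copy):

static void nohz_csd_func(void *info)
{
        struct rq *rq = info;

        rq->idle_balance = idle_cpu(cpu_of(rq));
        if (rq->idle_balance && !need_resched()) {
                /* only reached while need_resched() is still clear */
                raise_softirq_irqoff(SCHED_SOFTIRQ);
        }
        /* otherwise the kick is silently dropped */
}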

julia



2024-01-04 16:27:26

by Vincent Guittot

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing

On Fri, 29 Dec 2023 at 16:18, Julia Lawall <[email protected]> wrote:
>
>
>
> On Thu, 28 Dec 2023, Julia Lawall wrote:
>
> > > > > > > > I'm surprised that you have mainly CPU_NEWLY_IDLE. Do you know the reason ?
> > > > > > >
> > > > > > > No. They come from do_idle calling the scheduler. I will look into why
> > > > > > > this happens so often.
> > > > > >
> > > > > > Hmm, the CPU was idle and received a need resched which triggered the
> > > > > > scheduler but there was nothing to schedule so it goes back to idle
> > > > > > after running a newly_idle _load_balance.
> > > > >
> > > > > I spent quite some time thinking the same until I saw the following code
> > > > > in do_idle:
> > > > >
> > > > > preempt_set_need_resched();
> > > > >
> > > > > So I have the impression that do_idle sets need resched itself.
> > > >
> > > > But of course that code is only executed if need_resched is true. But I
> > >
> > > Yes, that is your root cause. something, most probably in interrupt
> > > context, wakes up your CPU and expect to wake up a thread
> > >
> > > > don't know who would be setting need resched on each clock tick.
> > >
> > > that can be a timer, interrupt, ipi, rcu ...
> > > a trace should give you some hints
> >
> > I have the impression that it is the goal of calling nohz_csd_func on each
> > clock tick that causes the calls to need_resched. If the idle process is
> > polling, call_function_single_prep_ipi just sets need_resched to get the

Your system is calling the polling mode and not the default
cpuidle_idle_call() ? This could explain why I don't see such a problem
on my system, which doesn't use polling.

Are you forcing the use of polling mode ?
If yes, could you check that this problem disappears without forcing
polling mode ?

> > idle process to stop polling. But there is no actual task that the idle
> > process should schedule. The need_resched then prevents the idle process
> > from stealing, due to the CPU_NEWLY_IDLE flag, contradicting the whole
> > purpose of calling nohz_csd_func in the first place.

Do I understand correctly that your sequence is :
CPU A                                   CPU B
cpu enters idle
do_idle()
...
loop in cpu_idle_poll
...
                                        kick_ilb on CPU A
                                        send_call_function_single_ipi
                                        set_nr_if_polling
                                        set TIF_NEED_RESCHED

exit polling loop
exit while (!need_resched())

call nohz_csd_func but
need_resched is true so it's a nope

pick_next_task_fair
newidle_balance
load_balance(CPU_NEWLY_IDLE)


>
> Looking in more detail, do_idle contains the following after existing the
> polling loop:
>
> flush_smp_call_function_queue();
> schedule_idle();
>
> flush_smp_call_function_queue() does end up calling nohz_csd_func, but
> this has no impact, because it first checks that need_resched() is false,
> whereas it is currently true to cause existing the polling loop. Removing
> that test causes:
>
> raise_softirq_irqoff(SCHED_SOFTIRQ);
>
> but that causes the load balancing code to be executed from a ksoftirqd
> task, which means that there is now no load imbalance.
>
> So the only chance to detect an imbalance does seem to be to have the load
> balance call be executed by the idle task, via schedule_idle(), as is
> done currently. But that leads to the core being considered to be newly
> idle.
>
> julia
>
>

2024-01-04 16:45:20

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing



On Thu, 4 Jan 2024, Vincent Guittot wrote:

> On Fri, 29 Dec 2023 at 16:18, Julia Lawall <[email protected]> wrote:
> >
> >
> >
> > On Thu, 28 Dec 2023, Julia Lawall wrote:
> >
> > > > > > > > > I'm surprised that you have mainly CPU_NEWLY_IDLE. Do you know the reason ?
> > > > > > > >
> > > > > > > > No. They come from do_idle calling the scheduler. I will look into why
> > > > > > > > this happens so often.
> > > > > > >
> > > > > > > Hmm, the CPU was idle and received a need resched which triggered the
> > > > > > > scheduler but there was nothing to schedule so it goes back to idle
> > > > > > > after running a newly_idle _load_balance.
> > > > > >
> > > > > > I spent quite some time thinking the same until I saw the following code
> > > > > > in do_idle:
> > > > > >
> > > > > > preempt_set_need_resched();
> > > > > >
> > > > > > So I have the impression that do_idle sets need resched itself.
> > > > >
> > > > > But of course that code is only executed if need_resched is true. But I
> > > >
> > > > Yes, that is your root cause. something, most probably in interrupt
> > > > context, wakes up your CPU and expect to wake up a thread
> > > >
> > > > > don't know who would be setting need resched on each clock tick.
> > > >
> > > > that can be a timer, interrupt, ipi, rcu ...
> > > > a trace should give you some hints
> > >
> > > I have the impression that it is the goal of calling nohz_csd_func on each
> > > clock tick that causes the calls to need_resched. If the idle process is
> > > polling, call_function_single_prep_ipi just sets need_resched to get the
>
> Your system is calling the polling mode and not the default
> cpuidle_idle_call() ? This could explain why I don't see such problem
> on my system which doesn't have polling
>
> Are you forcing the use of polling mode ?
> If yes, could you check that this problem disappears without forcing
> polling mode ?

I'll check. I didn't explicitly set anything, but I don't really know
what my configuration file does.

>
> > > idle process to stop polling. But there is no actual task that the idle
> > > process should schedule. The need_resched then prevents the idle process
> > > from stealing, due to the CPU_NEWLY_IDLE flag, contradicting the whole
> > > purpose of calling nohz_csd_func in the first place.
>
> Do I understand correctly that your sequence is :
> CPU A CPU B
> cpu enters idle
> do_idle()
> ...
> loop in cpu_idle_poll
> ...
> kick_ilb on CPU A
> send_call_function_single_ipi
> set_nr_if_polling
> set TIF_NEED_RESCHED
>
> exit polling loop
> exit while (!need_resched())
>
> call nohz_csd_func but
> need_resched is true so it's a nope
>
> pick_next_task_fair
> newidle_balance
> load_balance(CPU_NEWLY_IDLE)

Yes, this looks correct.

thanks,
julia

>
> >
> > Looking in more detail, do_idle contains the following after existing the
> > polling loop:
> >
> > flush_smp_call_function_queue();
> > schedule_idle();
> >
> > flush_smp_call_function_queue() does end up calling nohz_csd_func, but
> > this has no impact, because it first checks that need_resched() is false,
> > whereas it is currently true to cause existing the polling loop. Removing
> > that test causes:
> >
> > raise_softirq_irqoff(SCHED_SOFTIRQ);
> >
> > but that causes the load balancing code to be executed from a ksoftirqd
> > task, which means that there is now no load imbalance.
> >
> > So the only chance to detect an imbalance does seem to be to have the load
> > balance call be executed by the idle task, via schedule_idle(), as is
> > done currently. But that leads to the core being considered to be newly
> > idle.
> >
> > julia
> >
> >
>

2024-01-05 14:52:04

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing

> Your system is calling the polling mode and not the default
> cpuidle_idle_call() ? This could explain why I don't see such problem
> on my system which doesn't have polling
>
> Are you forcing the use of polling mode ?
> If yes, could you check that this problem disappears without forcing
> polling mode ?

I expanded the code in do_idle to:

if (cpu_idle_force_poll) { c1++;
        tick_nohz_idle_restart_tick();
        cpu_idle_poll();
} else if (tick_check_broadcast_expired()) { c2++;
        tick_nohz_idle_restart_tick();
        cpu_idle_poll();
} else { c3++;
        cpuidle_idle_call();
}

Later, I have:

trace_printk("force poll: %d: c1: %d, c2: %d, c3: %d\n",cpu_idle_force_poll, c1, c2, c3);
flush_smp_call_function_queue();
schedule_idle();

The force poll value, c1, and c2 are always 0, and c3 is always some non-zero value.
Sometimes small (often 1), and sometimes large (304 or 305).

So I don't think it's calling cpu_idle_poll().

x86 has TIF_POLLING_NRFLAG defined to be a non-zero value, which I think
is sufficient to cause the issue.

julia

2024-01-05 16:01:42

by Vincent Guittot

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing

On Fri, 5 Jan 2024 at 15:51, Julia Lawall <[email protected]> wrote:
>
> > Your system is calling the polling mode and not the default
> > cpuidle_idle_call() ? This could explain why I don't see such problem
> > on my system which doesn't have polling
> >
> > Are you forcing the use of polling mode ?
> > If yes, could you check that this problem disappears without forcing
> > polling mode ?
>
> I expanded the code in do_idle to:
>
> if (cpu_idle_force_poll) { c1++;
> tick_nohz_idle_restart_tick();
> cpu_idle_poll();
> } else if (tick_check_broadcast_expired()) { c2++;
> tick_nohz_idle_restart_tick();
> cpu_idle_poll();
> } else { c3++;
> cpuidle_idle_call();
> }
>
> Later, I have:
>
> trace_printk("force poll: %d: c1: %d, c2: %d, c3: %d\n",cpu_idle_force_poll, c1, c2, c3);
> flush_smp_call_function_queue();
> schedule_idle();
>
> force poll, c1 and c2 are always 0, and c3 is always some non-zero value.
> Sometimes small (often 1), and sometimes large (304 or 305).
>
> So I don't think it's calling cpu_idle_poll().

I agree that something else is going on.

>
> x86 has TIF_POLLING_NRFLAG defined to be a non zero value, which I think
> is sufficient to cause the issue.

Could you trace trace_sched_wake_idle_without_ipi() and the csd traces as well ?
I don't understand what sets need_resched() in your case, bearing in
mind that I don't see the problem on my Arm systems and IIRC Peter
said that he didn't face the problem on his x86 system.

>
> julia

2024-01-05 16:45:52

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing



On Fri, 5 Jan 2024, Vincent Guittot wrote:

> On Fri, 5 Jan 2024 at 15:51, Julia Lawall <[email protected]> wrote:
> >
> > > Your system is calling the polling mode and not the default
> > > cpuidle_idle_call() ? This could explain why I don't see such problem
> > > on my system which doesn't have polling
> > >
> > > Are you forcing the use of polling mode ?
> > > If yes, could you check that this problem disappears without forcing
> > > polling mode ?
> >
> > I expanded the code in do_idle to:
> >
> > if (cpu_idle_force_poll) { c1++;
> > tick_nohz_idle_restart_tick();
> > cpu_idle_poll();
> > } else if (tick_check_broadcast_expired()) { c2++;
> > tick_nohz_idle_restart_tick();
> > cpu_idle_poll();
> > } else { c3++;
> > cpuidle_idle_call();
> > }
> >
> > Later, I have:
> >
> > trace_printk("force poll: %d: c1: %d, c2: %d, c3: %d\n",cpu_idle_force_poll, c1, c2, c3);
> > flush_smp_call_function_queue();
> > schedule_idle();
> >
> > force poll, c1 and c2 are always 0, and c3 is always some non-zero value.
> > Sometimes small (often 1), and sometimes large (304 or 305).
> >
> > So I don't think it's calling cpu_idle_poll().
>
> I agree that something else
>
> >
> > x86 has TIF_POLLING_NRFLAG defined to be a non zero value, which I think
> > is sufficient to cause the issue.
>
> Could you trace trace_sched_wake_idle_without_ipi() ans csd traces as well ?
> I don't understand what set need_resched() in your case; having in
> mind that I don't see the problem on my Arm systems and IIRC Peter
> said that he didn't face the problem on his x86 system.

TIF_POLLING_NRFLAG doesn't seem to be defined on Arm.

Peter said that he didn't see the problem, but perhaps that was just
random. It requires a NUMA move to occur. I make 20 runs to be sure to
see the problem at least once. But another machine might behave
differently.

I believe the call chain is:

scheduler_tick
trigger_load_balance
nohz_balancer_kick
kick_ilb
smp_call_function_single_async
generic_exec_single
__smp_call_single_queue
send_call_function_single_ipi
call_function_single_prep_ipi
set_nr_if_polling <====== sets need_resched

I'll make a trace to reverify that.

julia

2024-01-05 17:27:43

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing



On Fri, 5 Jan 2024, Julia Lawall wrote:

>
>
> On Fri, 5 Jan 2024, Vincent Guittot wrote:
>
> > On Fri, 5 Jan 2024 at 15:51, Julia Lawall <[email protected]> wrote:
> > >
> > > > Your system is calling the polling mode and not the default
> > > > cpuidle_idle_call() ? This could explain why I don't see such problem
> > > > on my system which doesn't have polling
> > > >
> > > > Are you forcing the use of polling mode ?
> > > > If yes, could you check that this problem disappears without forcing
> > > > polling mode ?
> > >
> > > I expanded the code in do_idle to:
> > >
> > > if (cpu_idle_force_poll) { c1++;
> > > tick_nohz_idle_restart_tick();
> > > cpu_idle_poll();
> > > } else if (tick_check_broadcast_expired()) { c2++;
> > > tick_nohz_idle_restart_tick();
> > > cpu_idle_poll();
> > > } else { c3++;
> > > cpuidle_idle_call();
> > > }
> > >
> > > Later, I have:
> > >
> > > trace_printk("force poll: %d: c1: %d, c2: %d, c3: %d\n",cpu_idle_force_poll, c1, c2, c3);
> > > flush_smp_call_function_queue();
> > > schedule_idle();
> > >
> > > force poll, c1 and c2 are always 0, and c3 is always some non-zero value.
> > > Sometimes small (often 1), and sometimes large (304 or 305).
> > >
> > > So I don't think it's calling cpu_idle_poll().
> >
> > I agree that something else
> >
> > >
> > > x86 has TIF_POLLING_NRFLAG defined to be a non zero value, which I think
> > > is sufficient to cause the issue.
> >
> > Could you trace trace_sched_wake_idle_without_ipi() ans csd traces as well ?
> > I don't understand what set need_resched() in your case; having in
> > mind that I don't see the problem on my Arm systems and IIRC Peter
> > said that he didn't face the problem on his x86 system.
>
> TIF_POLLING_NRFLAG doesn't seem to be defined on Arm.
>
> Peter said that he didn't see the problem, but perhaps that was just
> random. It requires a NUMA move to occur. I make 20 runs to be sure to
> see the problem at least once. But another machine might behave
> differently.
>
> I believe the call chain is:
>
> scheduler_tick
> trigger_load_balance
> nohz_balancer_kick
> kick_ilb
> smp_call_function_single_async
> generic_exec_single
> __smp_call_single_queue
> send_call_function_single_ipi
> call_function_single_prep_ipi
> set_nr_if_polling <====== sets need_resched
>
> I'll make a trace to reverify that.

This is what I see at a tick, which corresponds to the call chain shown
above:

bt.B.x-4184 [046] 466.410605: bputs: scheduler_tick: calling trigger_load_balance
bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling nohz_balancer_kick
bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling kick_ilb
bt.B.x-4184 [046] 466.410607: bprint: trigger_load_balance: calling smp_call_function_single_async 22
bt.B.x-4184 [046] 466.410607: bputs: smp_call_function_single_async: calling generic_exec_single
bt.B.x-4184 [046] 466.410607: bputs: generic_exec_single: calling __smp_call_single_queue
bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling send_call_function_single_ipi
bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling call_function_single_prep_ipi
bt.B.x-4184 [046] 466.410608: bputs: call_function_single_prep_ipi: calling set_nr_if_polling
bt.B.x-4184 [046] 466.410609: sched_wake_idle_without_ipi: cpu=22

julia

2024-01-05 20:45:33

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing

I also tried a 2-socket AMD EPYC 7352 (24 cores per socket). There I can also
get prolonged idle cores, but it seems to happen less often.

julia

2024-01-18 16:35:53

by Vincent Guittot

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing

Hi Julia,

Sorry for the delay. I have been involved in other perf regressions

On Fri, 5 Jan 2024 at 18:27, Julia Lawall <[email protected]> wrote:
>
>
>
> On Fri, 5 Jan 2024, Julia Lawall wrote:
>
> >
> >
> > On Fri, 5 Jan 2024, Vincent Guittot wrote:
> >
> > > On Fri, 5 Jan 2024 at 15:51, Julia Lawall <[email protected]> wrote:
> > > >
> > > > > Your system is calling the polling mode and not the default
> > > > > cpuidle_idle_call() ? This could explain why I don't see such problem
> > > > > on my system which doesn't have polling
> > > > >
> > > > > Are you forcing the use of polling mode ?
> > > > > If yes, could you check that this problem disappears without forcing
> > > > > polling mode ?
> > > >
> > > > I expanded the code in do_idle to:
> > > >
> > > > if (cpu_idle_force_poll) { c1++;
> > > > tick_nohz_idle_restart_tick();
> > > > cpu_idle_poll();
> > > > } else if (tick_check_broadcast_expired()) { c2++;
> > > > tick_nohz_idle_restart_tick();
> > > > cpu_idle_poll();
> > > > } else { c3++;
> > > > cpuidle_idle_call();
> > > > }
> > > >
> > > > Later, I have:
> > > >
> > > > trace_printk("force poll: %d: c1: %d, c2: %d, c3: %d\n",cpu_idle_force_poll, c1, c2, c3);
> > > > flush_smp_call_function_queue();
> > > > schedule_idle();
> > > >
> > > > force poll, c1 and c2 are always 0, and c3 is always some non-zero value.
> > > > Sometimes small (often 1), and sometimes large (304 or 305).
> > > >
> > > > So I don't think it's calling cpu_idle_poll().
> > >
> > > I agree that something else
> > >
> > > >
> > > > x86 has TIF_POLLING_NRFLAG defined to be a non zero value, which I think
> > > > is sufficient to cause the issue.
> > >
> > > Could you trace trace_sched_wake_idle_without_ipi() ans csd traces as well ?
> > > I don't understand what set need_resched() in your case; having in
> > > mind that I don't see the problem on my Arm systems and IIRC Peter
> > > said that he didn't face the problem on his x86 system.
> >
> > TIF_POLLING_NRFLAG doesn't seem to be defined on Arm.
> >
> > Peter said that he didn't see the problem, but perhaps that was just
> > random. It requires a NUMA move to occur. I make 20 runs to be sure to
> > see the problem at least once. But another machine might behave
> > differently.
> >
> > I believe the call chain is:
> >
> > scheduler_tick
> > trigger_load_balance
> > nohz_balancer_kick
> > kick_ilb
> > smp_call_function_single_async
> > generic_exec_single
> > __smp_call_single_queue
> > send_call_function_single_ipi
> > call_function_single_prep_ipi
> > set_nr_if_polling <====== sets need_resched
> >
> > I'll make a trace to reverify that.
>
> This is what I see at a tick, which corresponds to the call chain shown
> above:
>
> bt.B.x-4184 [046] 466.410605: bputs: scheduler_tick: calling trigger_load_balance
> bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling nohz_balancer_kick
> bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling kick_ilb
> bt.B.x-4184 [046] 466.410607: bprint: trigger_load_balance: calling smp_call_function_single_async 22
> bt.B.x-4184 [046] 466.410607: bputs: smp_call_function_single_async: calling generic_exec_single
> bt.B.x-4184 [046] 466.410607: bputs: generic_exec_single: calling __smp_call_single_queue
> bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling send_call_function_single_ipi
> bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling call_function_single_prep_ipi
> bt.B.x-4184 [046] 466.410608: bputs: call_function_single_prep_ipi: calling set_nr_if_polling
> bt.B.x-4184 [046] 466.410609: sched_wake_idle_without_ipi: cpu=22

I don't know if you have made progress on this in the meantime.

Regarding the trace above, do you know if anything happens on CPU22
just before the scheduler tried to kick the ILB on it ?

Have you found why TIF_POLLING_NRFLAG seems to be always set when the
kick_ilb happens ? It should be cleared once entering the idle state.

Could you check your cpuidle driver ?

Vincent

>
> julia

2024-01-18 16:52:01

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing



On Thu, 18 Jan 2024, Vincent Guittot wrote:

> Hi Julia,
>
> Sorry for the delay. I have been involved on other perf regression
>
> On Fri, 5 Jan 2024 at 18:27, Julia Lawall <[email protected]> wrote:
> >
> >
> >
> > On Fri, 5 Jan 2024, Julia Lawall wrote:
> >
> > >
> > >
> > > On Fri, 5 Jan 2024, Vincent Guittot wrote:
> > >
> > > > On Fri, 5 Jan 2024 at 15:51, Julia Lawall <[email protected]> wrote:
> > > > >
> > > > > > Your system is calling the polling mode and not the default
> > > > > > cpuidle_idle_call() ? This could explain why I don't see such problem
> > > > > > on my system which doesn't have polling
> > > > > >
> > > > > > Are you forcing the use of polling mode ?
> > > > > > If yes, could you check that this problem disappears without forcing
> > > > > > polling mode ?
> > > > >
> > > > > I expanded the code in do_idle to:
> > > > >
> > > > > if (cpu_idle_force_poll) { c1++;
> > > > > tick_nohz_idle_restart_tick();
> > > > > cpu_idle_poll();
> > > > > } else if (tick_check_broadcast_expired()) { c2++;
> > > > > tick_nohz_idle_restart_tick();
> > > > > cpu_idle_poll();
> > > > > } else { c3++;
> > > > > cpuidle_idle_call();
> > > > > }
> > > > >
> > > > > Later, I have:
> > > > >
> > > > > trace_printk("force poll: %d: c1: %d, c2: %d, c3: %d\n",cpu_idle_force_poll, c1, c2, c3);
> > > > > flush_smp_call_function_queue();
> > > > > schedule_idle();
> > > > >
> > > > > force poll, c1 and c2 are always 0, and c3 is always some non-zero value.
> > > > > Sometimes small (often 1), and sometimes large (304 or 305).
> > > > >
> > > > > So I don't think it's calling cpu_idle_poll().
> > > >
> > > > I agree that something else
> > > >
> > > > >
> > > > > x86 has TIF_POLLING_NRFLAG defined to be a non zero value, which I think
> > > > > is sufficient to cause the issue.
> > > >
> > > > Could you trace trace_sched_wake_idle_without_ipi() ans csd traces as well ?
> > > > I don't understand what set need_resched() in your case; having in
> > > > mind that I don't see the problem on my Arm systems and IIRC Peter
> > > > said that he didn't face the problem on his x86 system.
> > >
> > > TIF_POLLING_NRFLAG doesn't seem to be defined on Arm.
> > >
> > > Peter said that he didn't see the problem, but perhaps that was just
> > > random. It requires a NUMA move to occur. I make 20 runs to be sure to
> > > see the problem at least once. But another machine might behave
> > > differently.
> > >
> > > I believe the call chain is:
> > >
> > > scheduler_tick
> > > trigger_load_balance
> > > nohz_balancer_kick
> > > kick_ilb
> > > smp_call_function_single_async
> > > generic_exec_single
> > > __smp_call_single_queue
> > > send_call_function_single_ipi
> > > call_function_single_prep_ipi
> > > set_nr_if_polling <====== sets need_resched
> > >
> > > I'll make a trace to reverify that.
> >
> > This is what I see at a tick, which corresponds to the call chain shown
> > above:
> >
> > bt.B.x-4184 [046] 466.410605: bputs: scheduler_tick: calling trigger_load_balance
> > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling nohz_balancer_kick
> > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling kick_ilb
> > bt.B.x-4184 [046] 466.410607: bprint: trigger_load_balance: calling smp_call_function_single_async 22
> > bt.B.x-4184 [046] 466.410607: bputs: smp_call_function_single_async: calling generic_exec_single
> > bt.B.x-4184 [046] 466.410607: bputs: generic_exec_single: calling __smp_call_single_queue
> > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling send_call_function_single_ipi
> > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling call_function_single_prep_ipi
> > bt.B.x-4184 [046] 466.410608: bputs: call_function_single_prep_ipi: calling set_nr_if_polling
> > bt.B.x-4184 [046] 466.410609: sched_wake_idle_without_ipi: cpu=22
>
> I don't know if you have made progress on this in the meantime.

Not really. Basically, after the idle loop in do_idle, there is the call to
flush_smp_call_function_queue, which invokes the deposited functions and in
our case at best raises a softirq, followed by the call to schedule_idle.
Raising the softirq doesn't happen because of the check for need_resched.
But even if that test were removed, it would still not be useful, because
ksoftirqd would then run on the idle core, which would eliminate the
imbalance between the sockets. Maybe there could be some way to call
run_rebalance_domains directly from nohz_csd_func, since
run_rebalance_domains doesn't use its argument, but at the moment
run_rebalance_domains is not visible to nohz_csd_func.
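
A very rough, untested sketch of that idea, ignoring the visibility
problem just mentioned and any context constraints, would be to drop the
need_resched() test and run the rebalance inline rather than via
SCHED_SOFTIRQ, so that it is neither skipped nor deferred to ksoftirqd:

static void nohz_csd_func(void *info)
{
        struct rq *rq = info;

        rq->idle_balance = idle_cpu(cpu_of(rq));
        if (rq->idle_balance) {
                /* hypothetical: call the rebalance directly */
                run_rebalance_domains(NULL);    /* the argument is unused */
        }
}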

>
> Regarding the trace above, do you know if anything happens on CPU22
> just before the scheduler tried to kick the ILB on it ?

I don't think so. It's idle.

> Have you found why TIF_POLLING_NRFLAG seems to be always set when the
> kick_ilb happens ? It should be cleared once entering the idle state.

Actually, I don't think it is always set. It switches back and forth
between two cases. I will look for the traces that show that.

> Could you check your cpuidle driver ?

Check what specifically?

thanks,
julia

2024-01-18 17:11:17

by Vincent Guittot

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing

On Thu, 18 Jan 2024 at 17:50, Julia Lawall <[email protected]> wrote:
>
>
>
> On Thu, 18 Jan 2024, Vincent Guittot wrote:
>
> > Hi Julia,
> >
> > Sorry for the delay. I have been involved on other perf regression
> >
> > On Fri, 5 Jan 2024 at 18:27, Julia Lawall <[email protected]> wrote:
> > >
> > >
> > >
> > > On Fri, 5 Jan 2024, Julia Lawall wrote:
> > >
> > > >
> > > >
> > > > On Fri, 5 Jan 2024, Vincent Guittot wrote:
> > > >
> > > > > On Fri, 5 Jan 2024 at 15:51, Julia Lawall <[email protected]> wrote:
> > > > > >
> > > > > > > Your system is calling the polling mode and not the default
> > > > > > > cpuidle_idle_call() ? This could explain why I don't see such problem
> > > > > > > on my system which doesn't have polling
> > > > > > >
> > > > > > > Are you forcing the use of polling mode ?
> > > > > > > If yes, could you check that this problem disappears without forcing
> > > > > > > polling mode ?
> > > > > >
> > > > > > I expanded the code in do_idle to:
> > > > > >
> > > > > > if (cpu_idle_force_poll) { c1++;
> > > > > > tick_nohz_idle_restart_tick();
> > > > > > cpu_idle_poll();
> > > > > > } else if (tick_check_broadcast_expired()) { c2++;
> > > > > > tick_nohz_idle_restart_tick();
> > > > > > cpu_idle_poll();
> > > > > > } else { c3++;
> > > > > > cpuidle_idle_call();
> > > > > > }
> > > > > >
> > > > > > Later, I have:
> > > > > >
> > > > > > trace_printk("force poll: %d: c1: %d, c2: %d, c3: %d\n",cpu_idle_force_poll, c1, c2, c3);
> > > > > > flush_smp_call_function_queue();
> > > > > > schedule_idle();
> > > > > >
> > > > > > force poll, c1 and c2 are always 0, and c3 is always some non-zero value.
> > > > > > Sometimes small (often 1), and sometimes large (304 or 305).
> > > > > >
> > > > > > So I don't think it's calling cpu_idle_poll().
> > > > >
> > > > > I agree that something else
> > > > >
> > > > > >
> > > > > > x86 has TIF_POLLING_NRFLAG defined to be a non zero value, which I think
> > > > > > is sufficient to cause the issue.
> > > > >
> > > > > Could you trace trace_sched_wake_idle_without_ipi() ans csd traces as well ?
> > > > > I don't understand what set need_resched() in your case; having in
> > > > > mind that I don't see the problem on my Arm systems and IIRC Peter
> > > > > said that he didn't face the problem on his x86 system.
> > > >
> > > > TIF_POLLING_NRFLAG doesn't seem to be defined on Arm.
> > > >
> > > > Peter said that he didn't see the problem, but perhaps that was just
> > > > random. It requires a NUMA move to occur. I make 20 runs to be sure to
> > > > see the problem at least once. But another machine might behave
> > > > differently.
> > > >
> > > > I believe the call chain is:
> > > >
> > > > scheduler_tick
> > > > trigger_load_balance
> > > > nohz_balancer_kick
> > > > kick_ilb
> > > > smp_call_function_single_async
> > > > generic_exec_single
> > > > __smp_call_single_queue
> > > > send_call_function_single_ipi
> > > > call_function_single_prep_ipi
> > > > set_nr_if_polling <====== sets need_resched
> > > >
> > > > I'll make a trace to reverify that.
> > >
> > > This is what I see at a tick, which corresponds to the call chain shown
> > > above:
> > >
> > > bt.B.x-4184 [046] 466.410605: bputs: scheduler_tick: calling trigger_load_balance
> > > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling nohz_balancer_kick
> > > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling kick_ilb
> > > bt.B.x-4184 [046] 466.410607: bprint: trigger_load_balance: calling smp_call_function_single_async 22
> > > bt.B.x-4184 [046] 466.410607: bputs: smp_call_function_single_async: calling generic_exec_single
> > > bt.B.x-4184 [046] 466.410607: bputs: generic_exec_single: calling __smp_call_single_queue
> > > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling send_call_function_single_ipi
> > > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling call_function_single_prep_ipi
> > > bt.B.x-4184 [046] 466.410608: bputs: call_function_single_prep_ipi: calling set_nr_if_polling
> > > bt.B.x-4184 [046] 466.410609: sched_wake_idle_without_ipi: cpu=22
> >
> > I don't know if you have made progress on this in the meantime.
>
> Not really. Basically after do_idle, there is the call to
> flush_smp_call_function_queue that invokes the deposited functions, which
> in our case is at best going to raise a softirq, and the call to schedule.
> Raising a softirq doesn't happen because of the check for need_resched.
> But even if that test were removed, it would still not be useful because
> there would be the ksoftirqd running on the idle core that would eliminate
> the imbalance between the sockets. Maybe there could be some way to call
> run_rebalance_domains directly from nohz_csd_func, since
> run_rebalance_domains doesn't use its argument, but at the moment
> run_rebalance_domains is not visible to nohz_csd_func.

All this happens because we don't use an IPI; with an IPI it should not
go through ksoftirqd.

>
> >
> > Regarding the trace above, do you know if anything happens on CPU22
> > just before the scheduler tried to kick the ILB on it ?
>
> I don't think so. It's idle.

Ok, so if it has been idle for a while, I mean nothing happened on it, not
even a spurious irq, it should have cleared its TIF_POLLING_NRFLAG.

It would be good to trace the selected idle state.

>
> > Have you found why TIF_POLLING_NRFLAG seems to be always set when the
> > kick_ilb happens ? It should be cleared once entering the idle state.
>
> Actually, I don't think it is always set. It switches back and forth
> between two cases. I will look for the traces that show that.
>
> > Could you check your cpuidle driver ?
>
> Check what specifically?

$ cat /sys/devices/system/cpu/cpuidle/current_driver
$ cat /sys/devices/system/cpu/cpuidle/current_governor

>
> thanks,
> julia

2024-01-18 17:44:44

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing



On Thu, 18 Jan 2024, Vincent Guittot wrote:

> On Thu, 18 Jan 2024 at 17:50, Julia Lawall <[email protected]> wrote:
> >
> >
> >
> > On Thu, 18 Jan 2024, Vincent Guittot wrote:
> >
> > > Hi Julia,
> > >
> > > Sorry for the delay. I have been involved on other perf regression
> > >
> > > On Fri, 5 Jan 2024 at 18:27, Julia Lawall <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > On Fri, 5 Jan 2024, Julia Lawall wrote:
> > > >
> > > > >
> > > > >
> > > > > On Fri, 5 Jan 2024, Vincent Guittot wrote:
> > > > >
> > > > > > On Fri, 5 Jan 2024 at 15:51, Julia Lawall <[email protected]> wrote:
> > > > > > >
> > > > > > > > Your system is calling the polling mode and not the default
> > > > > > > > cpuidle_idle_call() ? This could explain why I don't see such problem
> > > > > > > > on my system which doesn't have polling
> > > > > > > >
> > > > > > > > Are you forcing the use of polling mode ?
> > > > > > > > If yes, could you check that this problem disappears without forcing
> > > > > > > > polling mode ?
> > > > > > >
> > > > > > > I expanded the code in do_idle to:
> > > > > > >
> > > > > > > if (cpu_idle_force_poll) { c1++;
> > > > > > > tick_nohz_idle_restart_tick();
> > > > > > > cpu_idle_poll();
> > > > > > > } else if (tick_check_broadcast_expired()) { c2++;
> > > > > > > tick_nohz_idle_restart_tick();
> > > > > > > cpu_idle_poll();
> > > > > > > } else { c3++;
> > > > > > > cpuidle_idle_call();
> > > > > > > }
> > > > > > >
> > > > > > > Later, I have:
> > > > > > >
> > > > > > > trace_printk("force poll: %d: c1: %d, c2: %d, c3: %d\n",cpu_idle_force_poll, c1, c2, c3);
> > > > > > > flush_smp_call_function_queue();
> > > > > > > schedule_idle();
> > > > > > >
> > > > > > > force poll, c1 and c2 are always 0, and c3 is always some non-zero value.
> > > > > > > Sometimes small (often 1), and sometimes large (304 or 305).
> > > > > > >
> > > > > > > So I don't think it's calling cpu_idle_poll().
> > > > > >
> > > > > > I agree that something else
> > > > > >
> > > > > > >
> > > > > > > x86 has TIF_POLLING_NRFLAG defined to be a non zero value, which I think
> > > > > > > is sufficient to cause the issue.
> > > > > >
> > > > > > Could you trace trace_sched_wake_idle_without_ipi() ans csd traces as well ?
> > > > > > I don't understand what set need_resched() in your case; having in
> > > > > > mind that I don't see the problem on my Arm systems and IIRC Peter
> > > > > > said that he didn't face the problem on his x86 system.
> > > > >
> > > > > TIF_POLLING_NRFLAG doesn't seem to be defined on Arm.
> > > > >
> > > > > Peter said that he didn't see the problem, but perhaps that was just
> > > > > random. It requires a NUMA move to occur. I make 20 runs to be sure to
> > > > > see the problem at least once. But another machine might behave
> > > > > differently.
> > > > >
> > > > > I believe the call chain is:
> > > > >
> > > > > scheduler_tick
> > > > > trigger_load_balance
> > > > > nohz_balancer_kick
> > > > > kick_ilb
> > > > > smp_call_function_single_async
> > > > > generic_exec_single
> > > > > __smp_call_single_queue
> > > > > send_call_function_single_ipi
> > > > > call_function_single_prep_ipi
> > > > > set_nr_if_polling <====== sets need_resched
> > > > >
> > > > > I'll make a trace to reverify that.
> > > >
> > > > This is what I see at a tick, which corresponds to the call chain shown
> > > > above:
> > > >
> > > > bt.B.x-4184 [046] 466.410605: bputs: scheduler_tick: calling trigger_load_balance
> > > > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling nohz_balancer_kick
> > > > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling kick_ilb
> > > > bt.B.x-4184 [046] 466.410607: bprint: trigger_load_balance: calling smp_call_function_single_async 22
> > > > bt.B.x-4184 [046] 466.410607: bputs: smp_call_function_single_async: calling generic_exec_single
> > > > bt.B.x-4184 [046] 466.410607: bputs: generic_exec_single: calling __smp_call_single_queue
> > > > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling send_call_function_single_ipi
> > > > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling call_function_single_prep_ipi
> > > > bt.B.x-4184 [046] 466.410608: bputs: call_function_single_prep_ipi: calling set_nr_if_polling
> > > > bt.B.x-4184 [046] 466.410609: sched_wake_idle_without_ipi: cpu=22
> > >
> > > I don't know if you have made progress on this in the meantime.
> >
> > Not really. Basically after do_idle, there is the call to
> > flush_smp_call_function_queue that invokes the deposited functions, which
> > in our case is at best going to raise a softirq, and the call to schedule.
> > Raising a softirq doesn't happen because of the check for need_resched.
> > But even if that test were removed, it would still not be useful because
> > there would be the ksoftirqd running on the idle core that would eliminate
> > the imbalance between the sockets. Maybe there could be some way to call
> > run_rebalance_domains directly from nohz_csd_func, since
> > run_rebalance_domains doesn't use its argument, but at the moment
> > run_rebalance_domains is not visible to nohz_csd_func.
>
> All this happen because we don't use an ipi, it should not use
> ksoftirqd with ipi
>
> >
> > >
> > > Regarding the trace above, do you know if anything happens on CPU22
> > > just before the scheduler tried to kick the ILB on it ?
> >
> > I don't think so. It's idle.
>
> Ok, so if it is idle for a while , I mean nothing happened on it, not
> even spurious irq, It should have cleared its TIF_POLLING_NRFLAG
>
> I would be good to trace the selected idle state
>
> >
> > > Have you found why TIF_POLLING_NRFLAG seems to be always set when the
> > > kick_ilb happens ? It should be cleared once entering the idle state.
> >
> > Actually, I don't think it is always set. It switches back and forth
> > between two cases. I will look for the traces that show that.
> >
> > > Could you check your cpuidle driver ?
> >
> > Check what specifically?
>
> $ cat /sys/devices/system/cpu/cpuidle/current_driver
> $ cat /sys/devices/system/cpu/cpuidle/current_governor

intel_idle and menu

julia

2024-01-18 22:14:09

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing



On Thu, 18 Jan 2024, Vincent Guittot wrote:

> Hi Julia,
>
> Sorry for the delay. I have been involved on other perf regression
>
> On Fri, 5 Jan 2024 at 18:27, Julia Lawall <[email protected]> wrote:
> >
> >
> >
> > On Fri, 5 Jan 2024, Julia Lawall wrote:
> >
> > >
> > >
> > > On Fri, 5 Jan 2024, Vincent Guittot wrote:
> > >
> > > > On Fri, 5 Jan 2024 at 15:51, Julia Lawall <[email protected]> wrote:
> > > > >
> > > > > > Your system is calling the polling mode and not the default
> > > > > > cpuidle_idle_call() ? This could explain why I don't see such problem
> > > > > > on my system which doesn't have polling
> > > > > >
> > > > > > Are you forcing the use of polling mode ?
> > > > > > If yes, could you check that this problem disappears without forcing
> > > > > > polling mode ?
> > > > >
> > > > > I expanded the code in do_idle to:
> > > > >
> > > > > if (cpu_idle_force_poll) { c1++;
> > > > > tick_nohz_idle_restart_tick();
> > > > > cpu_idle_poll();
> > > > > } else if (tick_check_broadcast_expired()) { c2++;
> > > > > tick_nohz_idle_restart_tick();
> > > > > cpu_idle_poll();
> > > > > } else { c3++;
> > > > > cpuidle_idle_call();
> > > > > }
> > > > >
> > > > > Later, I have:
> > > > >
> > > > > trace_printk("force poll: %d: c1: %d, c2: %d, c3: %d\n",cpu_idle_force_poll, c1, c2, c3);
> > > > > flush_smp_call_function_queue();
> > > > > schedule_idle();
> > > > >
> > > > > force poll, c1 and c2 are always 0, and c3 is always some non-zero value.
> > > > > Sometimes small (often 1), and sometimes large (304 or 305).
> > > > >
> > > > > So I don't think it's calling cpu_idle_poll().
> > > >
> > > > I agree that something else
> > > >
> > > > >
> > > > > x86 has TIF_POLLING_NRFLAG defined to be a non zero value, which I think
> > > > > is sufficient to cause the issue.
> > > >
> > > > Could you trace trace_sched_wake_idle_without_ipi() ans csd traces as well ?
> > > > I don't understand what set need_resched() in your case; having in
> > > > mind that I don't see the problem on my Arm systems and IIRC Peter
> > > > said that he didn't face the problem on his x86 system.
> > >
> > > TIF_POLLING_NRFLAG doesn't seem to be defined on Arm.
> > >
> > > Peter said that he didn't see the problem, but perhaps that was just
> > > random. It requires a NUMA move to occur. I make 20 runs to be sure to
> > > see the problem at least once. But another machine might behave
> > > differently.
> > >
> > > I believe the call chain is:
> > >
> > > scheduler_tick
> > > trigger_load_balance
> > > nohz_balancer_kick
> > > kick_ilb
> > > smp_call_function_single_async
> > > generic_exec_single
> > > __smp_call_single_queue
> > > send_call_function_single_ipi
> > > call_function_single_prep_ipi
> > > set_nr_if_polling <====== sets need_resched
> > >
> > > I'll make a trace to reverify that.
> >
> > This is what I see at a tick, which corresponds to the call chain shown
> > above:
> >
> > bt.B.x-4184 [046] 466.410605: bputs: scheduler_tick: calling trigger_load_balance
> > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling nohz_balancer_kick
> > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling kick_ilb
> > bt.B.x-4184 [046] 466.410607: bprint: trigger_load_balance: calling smp_call_function_single_async 22
> > bt.B.x-4184 [046] 466.410607: bputs: smp_call_function_single_async: calling generic_exec_single
> > bt.B.x-4184 [046] 466.410607: bputs: generic_exec_single: calling __smp_call_single_queue
> > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling send_call_function_single_ipi
> > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling call_function_single_prep_ipi
> > bt.B.x-4184 [046] 466.410608: bputs: call_function_single_prep_ipi: calling set_nr_if_polling
> > bt.B.x-4184 [046] 466.410609: sched_wake_idle_without_ipi: cpu=22
>
> I don't know if you have made progress on this in the meantime.
>
> Regarding the trace above, do you know if anything happens on CPU22
> just before the scheduler tried to kick the ILB on it ?
>
> Have you found why TIF_POLLING_NRFLAG seems to be always set when the
> kick_ilb happens ? It should be cleared once entering the idle state.

I haven't figured out everything, but the attached graph shows
that TIF_POLLING_NRFLAG is not always set. Sometimes it is and sometimes
it isn't.

In the graph, on core 57, the blue box and the green x are before and
after the call to cpuidle_idle_call(), respectively. One can't see it in
this graph, but the green x comes before the blue box. So almost all of
the time, the core is in cpuidle_idle_call(); only in the tiny gap between
the x and the box is it back in do_idle with TIF_POLLING_NRFLAG set.

Afterwards, there is a diamond for the polling case and a triangle for the
non-polling case. These also occur on clock ticks, and may be
microscopically closer to (polling) or farther from (not polling) the
green x and blue box.

I haven't yet studied what happens afterwards in the non-polling case.
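
For anyone following along, the mechanism at issue is that when the
target core advertises TIF_POLLING_NRFLAG, the sender can just set
TIF_NEED_RESCHED remotely and skip the IPI. Below is only a small
user-space model of that idea (paraphrasing, from memory, the logic of
set_nr_if_polling() in kernel/sched/core.c, not the actual kernel
source; the bit values are invented for the model):

/* User-space model, not kernel code.  Only the control flow mirrors
 * set_nr_if_polling(); the flag bits are made up. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define TIF_NEED_RESCHED   (1u << 0)
#define TIF_POLLING_NRFLAG (1u << 1)

/* Try to set NEED_RESCHED on a remote "idle task"; return true if the
 * IPI can be skipped because the target was polling (or already had
 * NEED_RESCHED set), false if the caller must send a real IPI. */
static bool set_nr_if_polling_model(atomic_uint *flags)
{
	unsigned int old = atomic_load(flags);

	for (;;) {
		if (!(old & TIF_POLLING_NRFLAG))
			return false;
		if (old & TIF_NEED_RESCHED)
			return true;
		if (atomic_compare_exchange_weak(flags, &old,
						 old | TIF_NEED_RESCHED))
			return true;
	}
}

int main(void)
{
	atomic_uint polling_core = TIF_POLLING_NRFLAG; /* in the do_idle window */
	atomic_uint sleeping_core = 0;                 /* already in a c-state */

	printf("polling core:  IPI skipped = %d\n",
	       set_nr_if_polling_model(&polling_core));
	printf("sleeping core: IPI skipped = %d\n",
	       set_nr_if_polling_model(&sleeping_core));
	return 0;
}

The polling case corresponds to the sched_wake_idle_without_ipi event
in the trace above; in the non-polling case a real IPI is sent.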

julia

>
> Could you check your cpuidle driver ?
>
> Vincent
>
> >
> > julia
>


Attachments:
current_graph.pdf (13.14 kB)

2024-01-19 11:26:59

by Vincent Guittot

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing

On Thu, 18 Jan 2024 at 23:13, Julia Lawall <[email protected]> wrote:
>
>
>
> On Thu, 18 Jan 2024, Vincent Guittot wrote:
>
> > Hi Julia,
> >
> > Sorry for the delay. I have been involved on other perf regression
> >
> > On Fri, 5 Jan 2024 at 18:27, Julia Lawall <[email protected]> wrote:
> > >
> > >
> > >
> > > On Fri, 5 Jan 2024, Julia Lawall wrote:
> > >
> > > >
> > > >
> > > > On Fri, 5 Jan 2024, Vincent Guittot wrote:
> > > >
> > > > > On Fri, 5 Jan 2024 at 15:51, Julia Lawall <[email protected]> wrote:
> > > > > >
> > > > > > > Your system is calling the polling mode and not the default
> > > > > > > cpuidle_idle_call() ? This could explain why I don't see such problem
> > > > > > > on my system which doesn't have polling
> > > > > > >
> > > > > > > Are you forcing the use of polling mode ?
> > > > > > > If yes, could you check that this problem disappears without forcing
> > > > > > > polling mode ?
> > > > > >
> > > > > > I expanded the code in do_idle to:
> > > > > >
> > > > > > if (cpu_idle_force_poll) { c1++;
> > > > > > tick_nohz_idle_restart_tick();
> > > > > > cpu_idle_poll();
> > > > > > } else if (tick_check_broadcast_expired()) { c2++;
> > > > > > tick_nohz_idle_restart_tick();
> > > > > > cpu_idle_poll();
> > > > > > } else { c3++;
> > > > > > cpuidle_idle_call();
> > > > > > }
> > > > > >
> > > > > > Later, I have:
> > > > > >
> > > > > > trace_printk("force poll: %d: c1: %d, c2: %d, c3: %d\n",cpu_idle_force_poll, c1, c2, c3);
> > > > > > flush_smp_call_function_queue();
> > > > > > schedule_idle();
> > > > > >
> > > > > > force poll, c1 and c2 are always 0, and c3 is always some non-zero value.
> > > > > > Sometimes small (often 1), and sometimes large (304 or 305).
> > > > > >
> > > > > > So I don't think it's calling cpu_idle_poll().
> > > > >
> > > > > I agree that something else
> > > > >
> > > > > >
> > > > > > x86 has TIF_POLLING_NRFLAG defined to be a non zero value, which I think
> > > > > > is sufficient to cause the issue.
> > > > >
> > > > > Could you trace trace_sched_wake_idle_without_ipi() ans csd traces as well ?
> > > > > I don't understand what set need_resched() in your case; having in
> > > > > mind that I don't see the problem on my Arm systems and IIRC Peter
> > > > > said that he didn't face the problem on his x86 system.
> > > >
> > > > TIF_POLLING_NRFLAG doesn't seem to be defined on Arm.
> > > >
> > > > Peter said that he didn't see the problem, but perhaps that was just
> > > > random. It requires a NUMA move to occur. I make 20 runs to be sure to
> > > > see the problem at least once. But another machine might behave
> > > > differently.
> > > >
> > > > I believe the call chain is:
> > > >
> > > > scheduler_tick
> > > > trigger_load_balance
> > > > nohz_balancer_kick
> > > > kick_ilb
> > > > smp_call_function_single_async
> > > > generic_exec_single
> > > > __smp_call_single_queue
> > > > send_call_function_single_ipi
> > > > call_function_single_prep_ipi
> > > > set_nr_if_polling <====== sets need_resched
> > > >
> > > > I'll make a trace to reverify that.
> > >
> > > This is what I see at a tick, which corresponds to the call chain shown
> > > above:
> > >
> > > bt.B.x-4184 [046] 466.410605: bputs: scheduler_tick: calling trigger_load_balance
> > > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling nohz_balancer_kick
> > > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling kick_ilb
> > > bt.B.x-4184 [046] 466.410607: bprint: trigger_load_balance: calling smp_call_function_single_async 22
> > > bt.B.x-4184 [046] 466.410607: bputs: smp_call_function_single_async: calling generic_exec_single
> > > bt.B.x-4184 [046] 466.410607: bputs: generic_exec_single: calling __smp_call_single_queue
> > > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling send_call_function_single_ipi
> > > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling call_function_single_prep_ipi
> > > bt.B.x-4184 [046] 466.410608: bputs: call_function_single_prep_ipi: calling set_nr_if_polling
> > > bt.B.x-4184 [046] 466.410609: sched_wake_idle_without_ipi: cpu=22
> >
> > I don't know if you have made progress on this in the meantime.
> >
> > Regarding the trace above, do you know if anything happens on CPU22
> > just before the scheduler tried to kick the ILB on it ?
> >
> > Have you found why TIF_POLLING_NRFLAG seems to be always set when the
> > kick_ilb happens ? It should be cleared once entering the idle state.
>
> I haven't figured out everything, but the attached graph shows
> that TIF_POLLING_NRFLAG is not always set. Sometimes it is and sometimes
> it isn't.
>
> In the graph, on core 57, the blue box and the green x are before and
> after the call to cpuidle_idle_call(), resplectively. One can't see it in
> this graph, but the green x comes before the blue box. So almost all of
> the time, it is in cpuidle_idle_call(), only in the tiny gap between the x
> and the box is it back in do_idle with TIF_POLLING_NRFLAG set.
>
> Afterwards, there is a diamond for the polling case and a triangle for the
> non polling case. These also occur on clock ticks, and may be
> microscopically closer to (polling) or further from (not polling) the
> green x and blue box.

Your problem really looks like a weird timing issue.

It would be good to know which idle states are selected. Or, even
better, if possible, disable all but one idle state and see if one
idle state in particular triggers your problem.

Idle states can be disabled here:
echo 1 > /sys/devices/system/cpu/cpu*/cpuidle/state*/disable

One possible sequence:
- the tick is not stopped on the idle cpu
- the tick fires on both the busy and the idle cpu
- the idle cpu wakes up; the wake-up time varies with the wakeup
  latency of the entered c-state
- the busy cpu executes call_function_single_prep_ipi(); the idle cpu
  may or may not already be awake, depending on how long it takes to
  wake up

>
> I haven't yet studied what happens afterwards in the non polling case.

Side point: according to your trace above, you can have 2 consecutive
real idle load balances, so the patch that I proposed should be able to
trigger active migration, because nr_balance_failed will be != 0 on the
2nd idle load balance. Am I missing something?

Vincent
>
> julia
>
> >
> > Could you check your cpuidle driver ?
> >
> > Vincent
> >
> > >
> > > julia
> >

2024-01-19 11:34:09

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing



On Fri, 19 Jan 2024, Vincent Guittot wrote:

> On Thu, 18 Jan 2024 at 23:13, Julia Lawall <[email protected]> wrote:
> >
> >
> >
> > On Thu, 18 Jan 2024, Vincent Guittot wrote:
> >
> > > Hi Julia,
> > >
> > > Sorry for the delay. I have been involved on other perf regression
> > >
> > > On Fri, 5 Jan 2024 at 18:27, Julia Lawall <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > On Fri, 5 Jan 2024, Julia Lawall wrote:
> > > >
> > > > >
> > > > >
> > > > > On Fri, 5 Jan 2024, Vincent Guittot wrote:
> > > > >
> > > > > > On Fri, 5 Jan 2024 at 15:51, Julia Lawall <[email protected]> wrote:
> > > > > > >
> > > > > > > > Your system is calling the polling mode and not the default
> > > > > > > > cpuidle_idle_call() ? This could explain why I don't see such problem
> > > > > > > > on my system which doesn't have polling
> > > > > > > >
> > > > > > > > Are you forcing the use of polling mode ?
> > > > > > > > If yes, could you check that this problem disappears without forcing
> > > > > > > > polling mode ?
> > > > > > >
> > > > > > > I expanded the code in do_idle to:
> > > > > > >
> > > > > > > if (cpu_idle_force_poll) { c1++;
> > > > > > > tick_nohz_idle_restart_tick();
> > > > > > > cpu_idle_poll();
> > > > > > > } else if (tick_check_broadcast_expired()) { c2++;
> > > > > > > tick_nohz_idle_restart_tick();
> > > > > > > cpu_idle_poll();
> > > > > > > } else { c3++;
> > > > > > > cpuidle_idle_call();
> > > > > > > }
> > > > > > >
> > > > > > > Later, I have:
> > > > > > >
> > > > > > > trace_printk("force poll: %d: c1: %d, c2: %d, c3: %d\n",cpu_idle_force_poll, c1, c2, c3);
> > > > > > > flush_smp_call_function_queue();
> > > > > > > schedule_idle();
> > > > > > >
> > > > > > > force poll, c1 and c2 are always 0, and c3 is always some non-zero value.
> > > > > > > Sometimes small (often 1), and sometimes large (304 or 305).
> > > > > > >
> > > > > > > So I don't think it's calling cpu_idle_poll().
> > > > > >
> > > > > > I agree that something else
> > > > > >
> > > > > > >
> > > > > > > x86 has TIF_POLLING_NRFLAG defined to be a non zero value, which I think
> > > > > > > is sufficient to cause the issue.
> > > > > >
> > > > > > Could you trace trace_sched_wake_idle_without_ipi() ans csd traces as well ?
> > > > > > I don't understand what set need_resched() in your case; having in
> > > > > > mind that I don't see the problem on my Arm systems and IIRC Peter
> > > > > > said that he didn't face the problem on his x86 system.
> > > > >
> > > > > TIF_POLLING_NRFLAG doesn't seem to be defined on Arm.
> > > > >
> > > > > Peter said that he didn't see the problem, but perhaps that was just
> > > > > random. It requires a NUMA move to occur. I make 20 runs to be sure to
> > > > > see the problem at least once. But another machine might behave
> > > > > differently.
> > > > >
> > > > > I believe the call chain is:
> > > > >
> > > > > scheduler_tick
> > > > > trigger_load_balance
> > > > > nohz_balancer_kick
> > > > > kick_ilb
> > > > > smp_call_function_single_async
> > > > > generic_exec_single
> > > > > __smp_call_single_queue
> > > > > send_call_function_single_ipi
> > > > > call_function_single_prep_ipi
> > > > > set_nr_if_polling <====== sets need_resched
> > > > >
> > > > > I'll make a trace to reverify that.
> > > >
> > > > This is what I see at a tick, which corresponds to the call chain shown
> > > > above:
> > > >
> > > > bt.B.x-4184 [046] 466.410605: bputs: scheduler_tick: calling trigger_load_balance
> > > > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling nohz_balancer_kick
> > > > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling kick_ilb
> > > > bt.B.x-4184 [046] 466.410607: bprint: trigger_load_balance: calling smp_call_function_single_async 22
> > > > bt.B.x-4184 [046] 466.410607: bputs: smp_call_function_single_async: calling generic_exec_single
> > > > bt.B.x-4184 [046] 466.410607: bputs: generic_exec_single: calling __smp_call_single_queue
> > > > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling send_call_function_single_ipi
> > > > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling call_function_single_prep_ipi
> > > > bt.B.x-4184 [046] 466.410608: bputs: call_function_single_prep_ipi: calling set_nr_if_polling
> > > > bt.B.x-4184 [046] 466.410609: sched_wake_idle_without_ipi: cpu=22
> > >
> > > I don't know if you have made progress on this in the meantime.
> > >
> > > Regarding the trace above, do you know if anything happens on CPU22
> > > just before the scheduler tried to kick the ILB on it ?
> > >
> > > Have you found why TIF_POLLING_NRFLAG seems to be always set when the
> > > kick_ilb happens ? It should be cleared once entering the idle state.
> >
> > I haven't figured out everything, but the attached graph shows
> > that TIF_POLLING_NRFLAG is not always set. Sometimes it is and sometimes
> > it isn't.
> >
> > In the graph, on core 57, the blue box and the green x are before and
> > after the call to cpuidle_idle_call(), resplectively. One can't see it in
> > this graph, but the green x comes before the blue box. So almost all of
> > the time, it is in cpuidle_idle_call(), only in the tiny gap between the x
> > and the box is it back in do_idle with TIF_POLLING_NRFLAG set.
> >
> > Afterwards, there is a diamond for the polling case and a triangle for the
> > non polling case. These also occur on clock ticks, and may be
> > microscopically closer to (polling) or further from (not polling) the
> > green x and blue box.
>
> Your problem really looks like weird timing.
>
> It would be good to know which idle states are selected ? or even
> better if it's possible, disable all but one idle state and see if one
> idle state in particular trigger your problem
>
> idle state can be disable here :
> echo 1 > /sys/devices/system/cpu/cpu*/cpuidle/state*/disable
>
> One possible sequence:
> tick is not stopped on the idle cpu
> tick fires on busy and idle cpus
> idle cpu wakes up and the wake up time varies depending of wakeup
> latency of the entered c-state
> busy cpu executes call_function_single_prep_ipi() and idle cpu could
> be already woken or not depending of the time to wake up
>
> >
> > I haven't yet studied what happens afterwards in the non polling case.
>
> Side point, according to your trace above, you can 2 consecutives real
> idle load balance so the patch that I proposed, should be able to
> trigger active migration because the nr_balance_failed will be != 0
> the 2nd idle load balance. Are I missing something ?

Thanks for the suggestions. I will check both issues.

julia

2024-01-26 21:20:46

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing



On Fri, 19 Jan 2024, Vincent Guittot wrote:

> On Thu, 18 Jan 2024 at 23:13, Julia Lawall <[email protected]> wrote:
> >
> >
> >
> > On Thu, 18 Jan 2024, Vincent Guittot wrote:
> >
> > > Hi Julia,
> > >
> > > Sorry for the delay. I have been involved on other perf regression
> > >
> > > On Fri, 5 Jan 2024 at 18:27, Julia Lawall <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > On Fri, 5 Jan 2024, Julia Lawall wrote:
> > > >
> > > > >
> > > > >
> > > > > On Fri, 5 Jan 2024, Vincent Guittot wrote:
> > > > >
> > > > > > On Fri, 5 Jan 2024 at 15:51, Julia Lawall <[email protected]> wrote:
> > > > > > >
> > > > > > > > Your system is calling the polling mode and not the default
> > > > > > > > cpuidle_idle_call() ? This could explain why I don't see such problem
> > > > > > > > on my system which doesn't have polling
> > > > > > > >
> > > > > > > > Are you forcing the use of polling mode ?
> > > > > > > > If yes, could you check that this problem disappears without forcing
> > > > > > > > polling mode ?
> > > > > > >
> > > > > > > I expanded the code in do_idle to:
> > > > > > >
> > > > > > > if (cpu_idle_force_poll) { c1++;
> > > > > > > tick_nohz_idle_restart_tick();
> > > > > > > cpu_idle_poll();
> > > > > > > } else if (tick_check_broadcast_expired()) { c2++;
> > > > > > > tick_nohz_idle_restart_tick();
> > > > > > > cpu_idle_poll();
> > > > > > > } else { c3++;
> > > > > > > cpuidle_idle_call();
> > > > > > > }
> > > > > > >
> > > > > > > Later, I have:
> > > > > > >
> > > > > > > trace_printk("force poll: %d: c1: %d, c2: %d, c3: %d\n",cpu_idle_force_poll, c1, c2, c3);
> > > > > > > flush_smp_call_function_queue();
> > > > > > > schedule_idle();
> > > > > > >
> > > > > > > force poll, c1 and c2 are always 0, and c3 is always some non-zero value.
> > > > > > > Sometimes small (often 1), and sometimes large (304 or 305).
> > > > > > >
> > > > > > > So I don't think it's calling cpu_idle_poll().
> > > > > >
> > > > > > I agree that something else
> > > > > >
> > > > > > >
> > > > > > > x86 has TIF_POLLING_NRFLAG defined to be a non zero value, which I think
> > > > > > > is sufficient to cause the issue.
> > > > > >
> > > > > > Could you trace trace_sched_wake_idle_without_ipi() ans csd traces as well ?
> > > > > > I don't understand what set need_resched() in your case; having in
> > > > > > mind that I don't see the problem on my Arm systems and IIRC Peter
> > > > > > said that he didn't face the problem on his x86 system.
> > > > >
> > > > > TIF_POLLING_NRFLAG doesn't seem to be defined on Arm.
> > > > >
> > > > > Peter said that he didn't see the problem, but perhaps that was just
> > > > > random. It requires a NUMA move to occur. I make 20 runs to be sure to
> > > > > see the problem at least once. But another machine might behave
> > > > > differently.
> > > > >
> > > > > I believe the call chain is:
> > > > >
> > > > > scheduler_tick
> > > > > trigger_load_balance
> > > > > nohz_balancer_kick
> > > > > kick_ilb
> > > > > smp_call_function_single_async
> > > > > generic_exec_single
> > > > > __smp_call_single_queue
> > > > > send_call_function_single_ipi
> > > > > call_function_single_prep_ipi
> > > > > set_nr_if_polling <====== sets need_resched
> > > > >
> > > > > I'll make a trace to reverify that.
> > > >
> > > > This is what I see at a tick, which corresponds to the call chain shown
> > > > above:
> > > >
> > > > bt.B.x-4184 [046] 466.410605: bputs: scheduler_tick: calling trigger_load_balance
> > > > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling nohz_balancer_kick
> > > > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling kick_ilb
> > > > bt.B.x-4184 [046] 466.410607: bprint: trigger_load_balance: calling smp_call_function_single_async 22
> > > > bt.B.x-4184 [046] 466.410607: bputs: smp_call_function_single_async: calling generic_exec_single
> > > > bt.B.x-4184 [046] 466.410607: bputs: generic_exec_single: calling __smp_call_single_queue
> > > > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling send_call_function_single_ipi
> > > > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling call_function_single_prep_ipi
> > > > bt.B.x-4184 [046] 466.410608: bputs: call_function_single_prep_ipi: calling set_nr_if_polling
> > > > bt.B.x-4184 [046] 466.410609: sched_wake_idle_without_ipi: cpu=22
> > >
> > > I don't know if you have made progress on this in the meantime.
> > >
> > > Regarding the trace above, do you know if anything happens on CPU22
> > > just before the scheduler tried to kick the ILB on it ?
> > >
> > > Have you found why TIF_POLLING_NRFLAG seems to be always set when the
> > > kick_ilb happens ? It should be cleared once entering the idle state.
> >
> > I haven't figured out everything, but the attached graph shows
> > that TIF_POLLING_NRFLAG is not always set. Sometimes it is and sometimes
> > it isn't.
> >
> > In the graph, on core 57, the blue box and the green x are before and
> > after the call to cpuidle_idle_call(), resplectively. One can't see it in
> > this graph, but the green x comes before the blue box. So almost all of
> > the time, it is in cpuidle_idle_call(), only in the tiny gap between the x
> > and the box is it back in do_idle with TIF_POLLING_NRFLAG set.
> >
> > Afterwards, there is a diamond for the polling case and a triangle for the
> > non polling case. These also occur on clock ticks, and may be
> > microscopically closer to (polling) or further from (not polling) the
> > green x and blue box.
>
> Your problem really looks like weird timing.
>
> It would be good to know which idle states are selected ? or even
> better if it's possible, disable all but one idle state and see if one
> idle state in particular trigger your problem
>
> idle state can be disable here :
> echo 1 > /sys/devices/system/cpu/cpu*/cpuidle/state*/disable

I tried all possible options (states 0-3 each disabled or not). If all of
states 1-3 are disabled, the call to cpuidle_idle_call() only lasts a
short time (around 1/4 ms, but not exactly that). In all other cases, that
call lasts for 4 ms. If all of 1-3 are disabled, set_nr_if_polling does
not seem to be called. In the other cases, set_nr_if_polling is called,
and finds that the idle core is sometimes polling and sometimes not, but
polling is more common (typically by 3x).

The problem of large gaps can happen regardless of which idle states are
disabled.
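
For completeness, a sketch (not the exact script I used; the paths are
the sysfs files quoted above, and the state numbers 0-3 are specific to
this machine) of how the state combinations can be toggled:

/* Sketch only: enable/disable cpuidle states 0-3 on all cpus by
 * writing to the per-state sysfs "disable" files.  Needs root. */
#include <glob.h>
#include <stdio.h>
#include <stdlib.h>

static void set_state(int state, int disabled)
{
	char pattern[128];
	glob_t g;

	snprintf(pattern, sizeof(pattern),
		 "/sys/devices/system/cpu/cpu*/cpuidle/state%d/disable", state);
	if (glob(pattern, 0, NULL, &g) != 0)
		return;
	for (size_t i = 0; i < g.gl_pathc; i++) {
		FILE *f = fopen(g.gl_pathv[i], "w");

		if (f) {
			fprintf(f, "%d\n", disabled);
			fclose(f);
		}
	}
	globfree(&g);
}

int main(int argc, char **argv)
{
	/* argument: bitmask of states to keep enabled, e.g. 0x1 for state0 only */
	unsigned int mask = argc > 1 ? (unsigned int)strtoul(argv[1], NULL, 0) : 0xf;

	for (int s = 0; s <= 3; s++)
		set_state(s, !(mask & (1u << s)));
	return 0;
}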


>
> One possible sequence:
> tick is not stopped on the idle cpu
> tick fires on busy and idle cpus
> idle cpu wakes up and the wake up time varies depending of wakeup
> latency of the entered c-state
> busy cpu executes call_function_single_prep_ipi() and idle cpu could
> be already woken or not depending of the time to wake up
>
> >
> > I haven't yet studied what happens afterwards in the non polling case.
>
> Side point, according to your trace above, you can 2 consecutives real
> idle load balance so the patch that I proposed, should be able to
> trigger active migration because the nr_balance_failed will be != 0
> the 2nd idle load balance. Are I missing something ?

Indeed, nr_balance_failed does get increased on the non-polling
iterations. But it is still the case that the fbq type on the overloaded
socket is "all", so nothing happens.
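
For context, my understanding of the classification (a from-memory
paraphrase of fbq_classify_rq() in kernel/sched/fair.c, not the exact
source) is that "all" means every runnable task on the queue is a NUMA
task already on its preferred node, which is why the load balancer
finds nothing it is willing to steal:

/* Paraphrase, from memory, of how a runqueue gets its fbq type in
 * the load balancer; not the actual kernel source. */
#include <stdio.h>

enum fbq_type { regular, remote, all };

struct rq_counts {
	unsigned int nr_running;           /* all runnable tasks */
	unsigned int nr_numa_running;      /* tasks tracked by NUMA balancing */
	unsigned int nr_preferred_running; /* NUMA tasks on their preferred node */
};

static enum fbq_type classify(const struct rq_counts *rq)
{
	if (rq->nr_running > rq->nr_numa_running)
		return regular;   /* some non-NUMA tasks: freely stealable */
	if (rq->nr_running > rq->nr_preferred_running)
		return remote;    /* some NUMA tasks are on the wrong node */
	return all;               /* everything is where NUMA balancing wants it */
}

int main(void)
{
	/* an overloaded queue where both tasks prefer this node */
	struct rq_counts rq = { .nr_running = 2, .nr_numa_running = 2,
				.nr_preferred_running = 2 };

	printf("fbq type = %d (0=regular, 1=remote, 2=all)\n", classify(&rq));
	return 0;
}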

julia

> Vincent
> >
> > julia
> >
> > >
> > > Could you check your cpuidle driver ?
> > >
> > > Vincent
> > >
> > > >
> > > > julia
> > >
>

2024-03-10 09:42:56

by Julia Lawall

[permalink] [raw]
Subject: Re: EEVDF and NUMA balancing



On Fri, 19 Jan 2024, Vincent Guittot wrote:

> On Thu, 18 Jan 2024 at 23:13, Julia Lawall <[email protected]> wrote:
> >
> >
> >
> > On Thu, 18 Jan 2024, Vincent Guittot wrote:
> >
> > > Hi Julia,
> > >
> > > Sorry for the delay. I have been involved on other perf regression
> > >
> > > On Fri, 5 Jan 2024 at 18:27, Julia Lawall <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > On Fri, 5 Jan 2024, Julia Lawall wrote:
> > > >
> > > > >
> > > > >
> > > > > On Fri, 5 Jan 2024, Vincent Guittot wrote:
> > > > >
> > > > > > On Fri, 5 Jan 2024 at 15:51, Julia Lawall <[email protected]> wrote:
> > > > > > >
> > > > > > > > Your system is calling the polling mode and not the default
> > > > > > > > cpuidle_idle_call() ? This could explain why I don't see such problem
> > > > > > > > on my system which doesn't have polling
> > > > > > > >
> > > > > > > > Are you forcing the use of polling mode ?
> > > > > > > > If yes, could you check that this problem disappears without forcing
> > > > > > > > polling mode ?
> > > > > > >
> > > > > > > I expanded the code in do_idle to:
> > > > > > >
> > > > > > > if (cpu_idle_force_poll) { c1++;
> > > > > > > tick_nohz_idle_restart_tick();
> > > > > > > cpu_idle_poll();
> > > > > > > } else if (tick_check_broadcast_expired()) { c2++;
> > > > > > > tick_nohz_idle_restart_tick();
> > > > > > > cpu_idle_poll();
> > > > > > > } else { c3++;
> > > > > > > cpuidle_idle_call();
> > > > > > > }
> > > > > > >
> > > > > > > Later, I have:
> > > > > > >
> > > > > > > trace_printk("force poll: %d: c1: %d, c2: %d, c3: %d\n",cpu_idle_force_poll, c1, c2, c3);
> > > > > > > flush_smp_call_function_queue();
> > > > > > > schedule_idle();
> > > > > > >
> > > > > > > force poll, c1 and c2 are always 0, and c3 is always some non-zero value.
> > > > > > > Sometimes small (often 1), and sometimes large (304 or 305).
> > > > > > >
> > > > > > > So I don't think it's calling cpu_idle_poll().
> > > > > >
> > > > > > I agree that something else
> > > > > >
> > > > > > >
> > > > > > > x86 has TIF_POLLING_NRFLAG defined to be a non zero value, which I think
> > > > > > > is sufficient to cause the issue.
> > > > > >
> > > > > > Could you trace trace_sched_wake_idle_without_ipi() ans csd traces as well ?
> > > > > > I don't understand what set need_resched() in your case; having in
> > > > > > mind that I don't see the problem on my Arm systems and IIRC Peter
> > > > > > said that he didn't face the problem on his x86 system.
> > > > >
> > > > > TIF_POLLING_NRFLAG doesn't seem to be defined on Arm.
> > > > >
> > > > > Peter said that he didn't see the problem, but perhaps that was just
> > > > > random. It requires a NUMA move to occur. I make 20 runs to be sure to
> > > > > see the problem at least once. But another machine might behave
> > > > > differently.
> > > > >
> > > > > I believe the call chain is:
> > > > >
> > > > > scheduler_tick
> > > > > trigger_load_balance
> > > > > nohz_balancer_kick
> > > > > kick_ilb
> > > > > smp_call_function_single_async
> > > > > generic_exec_single
> > > > > __smp_call_single_queue
> > > > > send_call_function_single_ipi
> > > > > call_function_single_prep_ipi
> > > > > set_nr_if_polling <====== sets need_resched
> > > > >
> > > > > I'll make a trace to reverify that.
> > > >
> > > > This is what I see at a tick, which corresponds to the call chain shown
> > > > above:
> > > >
> > > > bt.B.x-4184 [046] 466.410605: bputs: scheduler_tick: calling trigger_load_balance
> > > > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling nohz_balancer_kick
> > > > bt.B.x-4184 [046] 466.410605: bputs: trigger_load_balance: calling kick_ilb
> > > > bt.B.x-4184 [046] 466.410607: bprint: trigger_load_balance: calling smp_call_function_single_async 22
> > > > bt.B.x-4184 [046] 466.410607: bputs: smp_call_function_single_async: calling generic_exec_single
> > > > bt.B.x-4184 [046] 466.410607: bputs: generic_exec_single: calling __smp_call_single_queue
> > > > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling send_call_function_single_ipi
> > > > bt.B.x-4184 [046] 466.410608: bputs: __smp_call_single_queue: calling call_function_single_prep_ipi
> > > > bt.B.x-4184 [046] 466.410608: bputs: call_function_single_prep_ipi: calling set_nr_if_polling
> > > > bt.B.x-4184 [046] 466.410609: sched_wake_idle_without_ipi: cpu=22
> > >
> > > I don't know if you have made progress on this in the meantime.
> > >
> > > Regarding the trace above, do you know if anything happens on CPU22
> > > just before the scheduler tried to kick the ILB on it ?
> > >
> > > Have you found why TIF_POLLING_NRFLAG seems to be always set when the
> > > kick_ilb happens ? It should be cleared once entering the idle state.
> >
> > I haven't figured out everything, but the attached graph shows
> > that TIF_POLLING_NRFLAG is not always set. Sometimes it is and sometimes
> > it isn't.
> >
> > In the graph, on core 57, the blue box and the green x are before and
> > after the call to cpuidle_idle_call(), resplectively. One can't see it in
> > this graph, but the green x comes before the blue box. So almost all of
> > the time, it is in cpuidle_idle_call(), only in the tiny gap between the x
> > and the box is it back in do_idle with TIF_POLLING_NRFLAG set.
> >
> > Afterwards, there is a diamond for the polling case and a triangle for the
> > non polling case. These also occur on clock ticks, and may be
> > microscopically closer to (polling) or further from (not polling) the
> > green x and blue box.
>
> Your problem really looks like weird timing.
>
> It would be good to know which idle states are selected ? or even
> better if it's possible, disable all but one idle state and see if one
> idle state in particular trigger your problem
>
> idle state can be disable here :
> echo 1 > /sys/devices/system/cpu/cpu*/cpuidle/state*/disable
>
> One possible sequence:
> tick is not stopped on the idle cpu
> tick fires on busy and idle cpus
> idle cpu wakes up and the wake up time varies depending of wakeup
> latency of the entered c-state
> busy cpu executes call_function_single_prep_ipi() and idle cpu could
> be already woken or not depending of the time to wake up
>
> >
> > I haven't yet studied what happens afterwards in the non polling case.
>
> Side point, according to your trace above, you can 2 consecutives real
> idle load balance so the patch that I proposed, should be able to
> trigger active migration because the nr_balance_failed will be != 0
> the 2nd idle load balance. Are I missing something ?


I have gotten access to a 2-socket ARM server:

Cavium ThunderX2 99xx (Vulcan), aarch64, 2 CPUs/node, 32 cores/CPU

Actually, I can observe the same behavior as on Intel. The issue is that
when one runs the benchmark with no configuration options, the threads
iteratively do some work and then synchronize on a barrier. While waiting
to synchronize, they spin for 300K iterations and then sleep. On the Intel
6130, the 300K iterations are more than enough time for the other threads
to reach the barrier. On the above ARM machine, more time is required (or
the spinning takes less time), and the threads always end up sleeping.
It seems that this sleeping is sufficient to allow any task placement
problem to be resolved (I haven't looked into the details).

However, one can instruct the program to spin continuously and never sleep
before reaching a barrier, which replicates the Intel behavior on ARM. In
this case, we again get the big gaps on some cores and the highly variable
execution times.

OMP_WAIT_POLICY=active ./bt.B.x
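
For anyone who wants to reproduce the pattern without the full
benchmark, here is a minimal OpenMP sketch (not the NAS code itself) of
the compute/barrier structure; building it with -fopenmp and running it
with and without OMP_WAIT_POLICY=active shows the sleeping and spinning
waiting behaviors:

/* Minimal sketch, not the NAS code: each thread alternates a compute
 * phase with a barrier.  OMP_WAIT_POLICY decides whether threads spin
 * or sleep while waiting at the barrier. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
	double sum = 0.0;

	#pragma omp parallel reduction(+:sum)
	{
		for (int iter = 0; iter < 100; iter++) {
			/* per-iteration work */
			for (int i = 0; i < 1000000; i++)
				sum += i * 1e-9;
			/* all threads wait for each other here */
			#pragma omp barrier
		}
	}
	printf("sum=%f threads=%d\n", sum, omp_get_max_threads());
	return 0;
}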

Two graphs are attached. bt.B.x_pyxis-4_6.7.0_performance_4.pdf is
the worst case out of 20 runs with the default configuration. There are
many short gaps, with some slightly longer ones in the lower right corner
(starting around 8 seconds). bt.B.spin_pyxis-4_6.7.0_performance_15.pdf
is the worst case out of 20 runs with spinning. There are far fewer
events in this graph, so it also includes the arrows for migrations,
small marks for wakeups, etc. In the upper right, there are continuous
gaps on some core from 7 seconds until the end of the execution.

julia


Attachments:
bt.B.spin_pyxis-4_6.7.0_performance_15.pdf (184.53 kB)
bt.B.x_pyxis-4_6.7.0_performance_4.pdf (1.65 MB)