2020-10-22 17:32:49

by Julia Lawall

[permalink] [raw]
Subject: [PATCH v2] sched/fair: check for idle core

In the case of a thread wakeup, wake_affine determines whether a core
will be chosen for the thread on the socket where the thread ran
previously or on the socket of the waker. This is done primarily by
comparing the load of the core where the thread ran previously (prev)
and the load of the waker (this).
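
For reference, the v5.9 logic that makes this choice looks roughly as
follows (a condensed sketch of kernel/sched/fair.c; the schedstat
accounting is omitted and the comments are paraphrased):

static int wake_affine(struct sched_domain *sd, struct task_struct *p,
		       int this_cpu, int prev_cpu, int sync)
{
	int target = nr_cpumask_bits;

	/* First try the cheap idle-based heuristic. */
	if (sched_feat(WA_IDLE))
		target = wake_affine_idle(this_cpu, prev_cpu, sync);

	/* Otherwise compare the loads of prev and this. */
	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

	/* nr_cpumask_bits means "no opinion": stay on prev. */
	if (target == nr_cpumask_bits)
		return prev_cpu;

	return target;
}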

commit 11f10e5420f6 ("sched/fair: Use load instead of runnable load
in wakeup path") changed the load computation from the runnable load
to the load average, where the latter includes the load of threads
that have already blocked on the core.

When a short-running daemon process happens to run on prev, this
change creates a situation where prev can appear to have a greater
load than this, even though prev is actually idle. When prev and this
are on the same socket, the idle prev is detected later, in
select_idle_sibling. But when they are on different sockets, prev is
completely ignored, causing the waking thread to move to the socket of
the waker. In the case of N mostly active threads on N cores, this
triggers other migrations and hurts performance.

In contrast, before commit 11f10e5420f6, the load on an idle core
was 0, and in the case of a non-idle waker core, the effect of
wake_affine was to select prev as the target for searching for a core
for the waking thread.

To avoid unnecessary migrations, extend wake_affine_idle to check
whether the core where the thread previously ran is currently idle,
and if so simply return that core as the target.

This particularly has an impact when using the ondemand cpufreq
governor, where kworkers run every 0.004 seconds on all cores,
increasing the likelihood that an idle core will be considered to
have a load.
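
As a rough worked illustration (based on the default PELT parameters,
not on measurements from this workload): PELT load decays with a
half-life of 32ms, so the blocked load left behind by a kworker that
last ran 4ms ago still retains about 0.5^(4/32) ~= 92% of its
contribution, enough for an otherwise idle prev to compare unfavourably
with this.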

The following numbers were obtained with the benchmarking tool
hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
benchmarks (https://www.nas.nasa.gov/publications/npb.html). The
tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
2.10GHz. Active (intel_pstate) and passive (intel_cpufreq) power
management were used. Times are in seconds. All experiments use all
160 hardware threads.
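
A typical invocation looked like the following (the run count and the
benchmark binary here are illustrative, not the exact command line
used):

  hyperfine --runs 10 './lu.C.x'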

         v5.9/intel-pstate      v5.9+patch/intel-pstate
bt.C.c   24.725724+-0.962340    23.349608+-1.607214
lu.C.x   29.105952+-4.804203    25.249052+-5.561617
sp.C.x   31.220696+-1.831335    30.227760+-2.429792
ua.C.x   26.606118+-1.767384    25.778367+-1.263850

         v5.9/ondemand          v5.9+patch/ondemand
bt.C.c   25.330360+-1.028316    23.544036+-1.020189
lu.C.x   35.872659+-4.872090    23.719295+-3.883848
sp.C.x   32.141310+-2.289541    29.125363+-0.872300
ua.C.x   29.024597+-1.667049    25.728888+-1.539772

On the smaller data sets (A and B) and on the other NAS benchmarks
there is no impact on performance.

This also has a major impact on the splash2x.volrend benchmark of the
parsec benchmark suite, which goes from 1m25 without this patch to
0m45 with it, in active (intel_pstate) mode.

Fixes: 11f10e5420f6 ("sched/fair: Use load instead of runnable load in wakeup path")
Signed-off-by: Julia Lawall <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>

---
v2: rewrite the log message, add volrend information

kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa4c6227cd6d..9b23dad883ee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5804,6 +5804,9 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	if (sync && cpu_rq(this_cpu)->nr_running == 1)
 		return this_cpu;
 
+	if (available_idle_cpu(prev_cpu))
+		return prev_cpu;
+
 	return nr_cpumask_bits;
 }
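
For context, with this change applied, wake_affine_idle reads roughly
as follows (reconstructed from v5.9's kernel/sched/fair.c; the
original comment block is shortened here):

static int
wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
	/*
	 * If this_cpu is idle, the wakeup is likely from interrupt
	 * context; only allow the move if the two CPUs share cache,
	 * and prefer an idle prev_cpu to avoid a migration.
	 */
	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

	/* For a sync wakeup, the waker is about to sleep: stay local. */
	if (sync && cpu_rq(this_cpu)->nr_running == 1)
		return this_cpu;

	/* New in this patch: an idle prev is always a good target. */
	if (available_idle_cpu(prev_cpu))
		return prev_cpu;

	/* No opinion; let wake_affine fall back to comparing loads. */
	return nr_cpumask_bits;
}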



2020-10-23 14:49:42

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v2] sched/fair: check for idle core

On Thu, Oct 22, 2020 at 03:15:50PM +0200, Julia Lawall wrote:
> [cut]
> Fixes: 11f10e5420f6 ("sched/fair: Use load instead of runnable load in wakeup path")
> Signed-off-by: Julia Lawall <[email protected]>
> Reviewed-by: Vincent Guittot <[email protected]>
>

In principle, I think the patch is ok after the recent discussion. I'm
holding off an ack until a battery of tests on loads with different
levels of utilisation and wakeup patterns makes its way through a test
grid. The testing is based on Linus's tree mid-merge window, which
includes what is in the scheduler pipeline.

--
Mel Gorman
SUSE Labs

2020-10-23 16:39:17

by Yu Chen

[permalink] [raw]
Subject: Re: [PATCH v2] sched/fair: check for idle core

On Fri, Oct 23, 2020 at 1:32 AM Julia Lawall <[email protected]> wrote:
>
[cut]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aa4c6227cd6d..9b23dad883ee 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5804,6 +5804,9 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> if (sync && cpu_rq(this_cpu)->nr_running == 1)
> return this_cpu;
>
> + if (available_idle_cpu(prev_cpu))
How about also taking sched_idle_cpu(prev_cpu) into consideration?
if (available_idle_cpu(prev_cpu) || sched_idle_cpu(prev_cpu))
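
(For reference: sched_idle_cpu(cpu) reports whether a CPU is running
only SCHED_IDLE tasks, so such a CPU can also absorb a wakeup
immediately even though it is not strictly idle.)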

Thanks,
Chenyu

2020-10-23 16:52:00

by Julia Lawall

[permalink] [raw]
Subject: Re: [PATCH v2] sched/fair: check for idle core



On Fri, 23 Oct 2020, Mel Gorman wrote:

> On Thu, Oct 22, 2020 at 03:15:50PM +0200, Julia Lawall wrote:
> > [cut]
>
> In principle, I think the patch is ok after the recent discussion. I'm
> holding off an ack until a battery of tests on loads with different
> levels of utilisation and wakeup patterns makes its way through a test
> grid. The testing is based on Linus's tree mid-merge window, which
> includes what is in the scheduler pipeline.

OK, if it doesn't work out, it would be interesting to know what goes
badly.

thanks,
julia

2020-10-23 16:52:46

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v2] sched/fair: check for idle core

On Fri, Oct 23, 2020 at 11:21:50AM +0200, Julia Lawall wrote:
>
> On Fri, 23 Oct 2020, Mel Gorman wrote:
>
> > [cut]
>
> OK, if it doesn't work out, it would be interesting to know what goes
> badly.
>

Yep, if something goes wrong, I'll make the full logs available, details
on reproducing it etc.

--
Mel Gorman
SUSE Labs

2020-10-23 18:56:43

by Julia Lawall

[permalink] [raw]
Subject: Re: [PATCH v2] sched/fair: check for idle core



On Sat, 24 Oct 2020, Chen Yu wrote:

> On Fri, Oct 23, 2020 at 1:32 AM Julia Lawall <[email protected]> wrote:
> >
> [cut]
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index aa4c6227cd6d..9b23dad883ee 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5804,6 +5804,9 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> > if (sync && cpu_rq(this_cpu)->nr_running == 1)
> > return this_cpu;
> >
> > + if (available_idle_cpu(prev_cpu))
> How about also taking sched_idle_cpu(prev_cpu) into consideration?
> if (available_idle_cpu(prev_cpu) || sched_idle_cpu(prev_cpu))

I have no opinion about this, because it wasn't relevant to the
application I was looking at. The pre-existing available_idle_cpu check
on prev in wake_affine_idle isn't accompanied by a sched_idle_cpu
check, so I didn't add one here either.

julia

2020-10-27 14:23:49

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v2] sched/fair: check for idle core

On Thu, Oct 22, 2020 at 03:15:50PM +0200, Julia Lawall wrote:
> Fixes: 11f10e5420f6 ("sched/fair: Use load instead of runnable load in wakeup path")
> Signed-off-by: Julia Lawall <[email protected]>
> Reviewed-by Vincent Guittot <[email protected]>
>

While not a universal win, it was mostly a win or neutral. In the few
cases where there was a problem, one involved a benchmark that I'm a
bit suspicious of generally, as it occasionally generates bad results
for unknown and unpredictable reasons. In another, the regression was
very machine specific and the differences were small in absolute time
rather than relative time. Other tests on the same machine were fine,
so overall:

Acked-by: Mel Gorman <[email protected]>

Thanks.

--
Mel Gorman
SUSE Labs

2020-10-28 06:33:07

by Julia Lawall

[permalink] [raw]
Subject: Re: [PATCH v2] sched/fair: check for idle core



On Tue, 27 Oct 2020, Mel Gorman wrote:

> On Thu, Oct 22, 2020 at 03:15:50PM +0200, Julia Lawall wrote:
> > [cut]
>
> While not a universal win, it was mostly a win or neutral. [cut]

Thanks for testing!

julia

>
> Acked-by: Mel Gorman <[email protected]>
>
> Thanks.
>
> --
> Mel Gorman
> SUSE Labs
>

2020-10-29 10:55:27

by tip-bot2 for Jacob Pan

[permalink] [raw]
Subject: [tip: sched/core] sched/fair: Check for idle core in wake_affine

The following commit has been merged into the sched/core branch of tip:

Commit-ID: d8fcb81f1acf651a0e50eacecca43d0524984f87
Gitweb: https://git.kernel.org/tip/d8fcb81f1acf651a0e50eacecca43d0524984f87
Author: Julia Lawall <[email protected]>
AuthorDate: Thu, 22 Oct 2020 15:15:50 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 29 Oct 2020 11:00:32 +01:00

sched/fair: Check for idle core in wake_affine

In the case of a thread wakeup, wake_affine determines whether a core
will be chosen for the thread on the socket where the thread ran
previously or on the socket of the waker. This is done primarily by
comparing the load of the core where the thread ran previously (prev)
and the load of the waker (this).

commit 11f10e5420f6 ("sched/fair: Use load instead of runnable load
in wakeup path") changed the load computation from the runnable load
to the load average, where the latter includes the load of threads
that have already blocked on the core.

When a short-running daemon process happens to run on prev, this
change creates a situation where prev can appear to have a greater
load than this, even though prev is actually idle. When prev and this
are on the same socket, the idle prev is detected later, in
select_idle_sibling. But when they are on different sockets, prev is
completely ignored, causing the waking thread to move to the socket of
the waker. In the case of N mostly active threads on N cores, this
triggers other migrations and hurts performance.

In contrast, before commit 11f10e5420f6, the load on an idle core
was 0, and in the case of a non-idle waker core, the effect of
wake_affine was to select prev as the target for searching for a core
for the waking thread.

To avoid unnecessary migrations, extend wake_affine_idle to check
whether the core where the thread previously ran is currently idle,
and if so simply return that core as the target.

This particularly has an impact when using the ondemand cpufreq
governor, where kworkers run every 0.004 seconds on all cores,
increasing the likelihood that an idle core will be considered to
have a load.

The following numbers were obtained with the benchmarking tool
hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
benchmarks (https://www.nas.nasa.gov/publications/npb.html). The
tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
2.10GHz. Active (intel_pstate) and passive (intel_cpufreq) power
management were used. Times are in seconds. All experiments use all
160 hardware threads.

         v5.9/intel-pstate      v5.9+patch/intel-pstate
bt.C.c   24.725724+-0.962340    23.349608+-1.607214
lu.C.x   29.105952+-4.804203    25.249052+-5.561617
sp.C.x   31.220696+-1.831335    30.227760+-2.429792
ua.C.x   26.606118+-1.767384    25.778367+-1.263850

         v5.9/ondemand          v5.9+patch/ondemand
bt.C.c   25.330360+-1.028316    23.544036+-1.020189
lu.C.x   35.872659+-4.872090    23.719295+-3.883848
sp.C.x   32.141310+-2.289541    29.125363+-0.872300
ua.C.x   29.024597+-1.667049    25.728888+-1.539772

On the smaller data sets (A and B) and on the other NAS benchmarks
there is no impact on performance.

This also has a major impact on the splash2x.volrend benchmark of the
parsec benchmark suite, which goes from 1m25 without this patch to
0m45 with it, in active (intel_pstate) mode.

Fixes: 11f10e5420f6 ("sched/fair: Use load instead of runnable load in wakeup path")
Signed-off-by: Julia Lawall <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f30d35a..52cacfc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5813,6 +5813,9 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	if (sync && cpu_rq(this_cpu)->nr_running == 1)
 		return this_cpu;
 
+	if (available_idle_cpu(prev_cpu))
+		return prev_cpu;
+
 	return nr_cpumask_bits;
 }

2021-01-24 20:42:09

by Julia Lawall

[permalink] [raw]
Subject: Re: [PATCH v2] sched/fair: check for idle core



On Tue, 27 Oct 2020, Mel Gorman wrote:

> [cut]
>
> While not a universal win, it was mostly a win or neutral. [cut]
>
> Acked-by: Mel Gorman <[email protected]>

Recently, we have been testing the phoronix multicore benchmarks. On
v5.9 with this patch, the preparation time of phoronix slows down from
~23 seconds to ~28 seconds. On v5.11-rc4, we see 29 seconds. It's not
yet clear what causes the problem, but perhaps the patch should be
removed from v5.11 until the problem is understood.

commit d8fcb81f1acf651a0e50eacecca43d0524984f87

julia




2021-01-25 09:38:41

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v2] sched/fair: check for idle core

On Mon, 25 Jan 2021 at 10:20, Julia Lawall <[email protected]> wrote:
>
>
>
> On Mon, 25 Jan 2021, Mel Gorman wrote:
>
> > [cut]
>
> The benchmark is completely downloaded prior to the runs. There seems to
> be some perturbation to the activation of containerd. Normally it is
> even: * * * *

Does it impact the benchmark results too or only the preparation prior
to running the benchmark ?

>
> and with the patch it becomes more like: * ** **
>
> That is every other one is on time, and every other one is late.
>
> But I don't know why this happens.
>
> julia
>
> >
> > I think it would be better to wait until the problem is fully understood
> > to see if it's a timing artifact (e.g. a race between when prev_cpu is
> > observed to be idle and when it is busy).

I agree that a better understanding of what is happening is necessary
before any changes

> >
> > --
> > Mel Gorman
> > SUSE Labs
> >

2021-01-26 05:08:36

by Julia Lawall

[permalink] [raw]
Subject: Re: [PATCH v2] sched/fair: check for idle core



On Mon, 25 Jan 2021, Mel Gorman wrote:

> On Sun, Jan 24, 2021 at 09:38:14PM +0100, Julia Lawall wrote:
> >
> >
> > [cut]
> >
> > Recently, we have been testing the phoronix multicore benchmarks. On v5.9
> > with this patch, the preparation time of phoronix slows down, from ~23
> > seconds to ~28 seconds. In v5.11-rc4, we see 29 seconds. It's not yet
> > clear what causes the problem. But perhaps the patch should be removed
> > from v5.11, until the problem is understood.
> >
> > commit d8fcb81f1acf651a0e50eacecca43d0524984f87
> >
>
> I'm not 100% convinced given that it was a mix of wins and losses. In
> the wakeup path in general, universal wins almost never happen. It's
> not 100% clear from your mail what happens during the preparation
> phase. If it includes time to download and install the benchmarks,
> then it would be inherently variable due to network time (if
> downloading) or cache hotness (if installing/compiling). While
> preparation time can be interesting -- for example, if preparation
> involves reading a lot of files from disk -- it's not universally
> interesting when it's not the critical phase of a benchmark.

The benchmark is completely downloaded prior to the runs. There seems to
be some perturbation to the activation of containerd. Normally it is
even: * * * *

and with the patch it becomes more like: * ** **

That is every other one is on time, and every other one is late.

But I don't know why this happens.

julia

>
> I think it would be better to wait until the problem is fully understood
> to see if it's a timing artifact (e.g. a race between when prev_cpu is
> observed to be idle and when it is busy).

2021-01-26 05:11:43

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH v2] sched/fair: check for idle core

On Sun, Jan 24, 2021 at 09:38:14PM +0100, Julia Lawall wrote:
>
>
> [cut]
>
> Recently, we have been testing the phoronix multicore benchmarks. On v5.9
> with this patch, the preparation time of phoronix slows down, from ~23
> seconds to ~28 seconds. In v5.11-rc4, we see 29 seconds. It's not yet
> clear what causes the problem. But perhaps the patch should be removed
> from v5.11, until the problem is understood.
>
> commit d8fcb81f1acf651a0e50eacecca43d0524984f87
>

I'm not 100% convinced given that it was a mix of wins and losses. In
the wakeup path in general, universal wins almost never happen. It's
not 100% clear from your mail what happens during the preparation
phase. If it includes time to download and install the benchmarks,
then it would be inherently variable due to network time (if
downloading) or cache hotness (if installing/compiling). While
preparation time can be interesting -- for example, if preparation
involves reading a lot of files from disk -- it's not universally
interesting when it's not the critical phase of a benchmark.

I think it would be better to wait until the problem is fully understood
to see if it's a timing artifact (e.g. a race between when prev_cpu is
observed to be idle and when it is busy).

--
Mel Gorman
SUSE Labs

2021-01-26 06:26:54

by Julia Lawall

[permalink] [raw]
Subject: Re: [PATCH v2] sched/fair: check for idle core



On Mon, 25 Jan 2021, Vincent Guittot wrote:

> On Mon, 25 Jan 2021 at 10:20, Julia Lawall <[email protected]> wrote:
> > [cut]
> >
> > The benchmark is completely downloaded prior to the runs. There seems to
> > be some perturbation to the activation of containerd.
>
> Does it impact the benchmark results too or only the preparation prior
> to running the benchmark ?

Looking at a few of the benchmarks, there is no clear pattern as to
which is better. But there is no large degradation like the one from
23 to 28 seconds seen in the preparation time. I will report back when
we figure out more.

julia


2021-02-06 18:56:42

by Julia Lawall

[permalink] [raw]
Subject: Re: [PATCH v2] sched/fair: check for idle core



On Mon, 25 Jan 2021, Vincent Guittot wrote:

> On Mon, 25 Jan 2021 at 10:20, Julia Lawall <[email protected]> wrote:
> > [cut]
>
> I agree that a better understanding of what is happening is necessary
> before any changes

The tests were incorrect. The faster runs without the patch were using
schedutil. If we use powersave, with the patch or without it, we get
the same setup time and comparable values for the metrics of the
actual benchmarks (some of which vary a lot, though).

So there is no evidence of any problem with the patch.

julia