2021-10-21 14:58:53

by Mel Gorman

Subject: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers

This patch mitigates a problem where wake_wide() allows a heavy waker
(e.g. X) to stack an excessive number of wakees on the same CPU. This
is due to the cpu_load check in wake_affine_weight(). As noted by the
original patch author (Mike Galbraith) [1]:

Between load updates, X, or any other waker of many, can stack
wakees to a ludicrous depth. Tracing kbuild vs firefox playing a
youtube clip, I watched X stack 20 of the zillion firefox minions
while their previous CPUs all had 1 lousy task running but a
cpu_load() higher than the cpu_load() of X's CPU. Most of those
prev_cpus were where X had left them when it migrated. Each and
every crazy depth migration was wake_affine_weight() deciding we
should pull.

Paraphrasing Mike's test results from the patch:

With make -j8 running along with firefox with two tabs, one
containing youtube's suggestions of stuff, the other a running
clip: if the idle tab is in focus and the mouse is left alone,
flips decay enough for wake_wide() to lose interest, but just
wiggle the mouse and it starts waking wide. Focus on the running
clip, and it continuously wakes wide.

The end result is that heavy wakers are less likely to stack tasks and,
depending on the workload, migrations are reduced.

From additional tests on various servers, the impact is machine dependent
but generally this patch improves the situation.

hackbench-process-pipes
                          5.15.0-rc3             5.15.0-rc3
                             vanilla  sched-wakeeflips-v1r1
Amean     1        0.3667 (   0.00%)      0.3890 (  -6.09%)
Amean     4        0.5343 (   0.00%)      0.5217 (   2.37%)
Amean     7        0.5300 (   0.00%)      0.5387 (  -1.64%)
Amean     12       0.5737 (   0.00%)      0.5443 (   5.11%)
Amean     21       0.6727 (   0.00%)      0.6487 (   3.57%)
Amean     30       0.8583 (   0.00%)      0.8033 (   6.41%)
Amean     48       1.3977 (   0.00%)      1.2400 *  11.28%*
Amean     79       1.9790 (   0.00%)      1.8200 *   8.03%*
Amean     110      2.8020 (   0.00%)      2.5820 *   7.85%*
Amean     141      3.6683 (   0.00%)      3.2203 *  12.21%*
Amean     172      4.6687 (   0.00%)      3.8200 *  18.18%*
Amean     203      5.2183 (   0.00%)      4.3357 *  16.91%*
Amean     234      6.1077 (   0.00%)      4.8047 *  21.33%*
Amean     265      7.1313 (   0.00%)      5.1243 *  28.14%*
Amean     296      7.7557 (   0.00%)      5.5940 *  27.87%*

While different machines showed different results, in general
there were far fewer CPU migrations of tasks.

tbench4
                           5.15.0-rc3             5.15.0-rc3
                              vanilla  sched-wakeeflips-v1r1
Hmean     1         824.05 (   0.00%)      802.56 *  -2.61%*
Hmean     2        1578.49 (   0.00%)     1645.11 *   4.22%*
Hmean     4        2959.08 (   0.00%)     2984.75 *   0.87%*
Hmean     8        5080.09 (   0.00%)     5173.35 *   1.84%*
Hmean     16       8276.02 (   0.00%)     9327.17 *  12.70%*
Hmean     32      15501.61 (   0.00%)    15925.55 *   2.73%*
Hmean     64      27313.67 (   0.00%)    24107.81 * -11.74%*
Hmean     128     32928.19 (   0.00%)    36261.75 *  10.12%*
Hmean     256     35434.73 (   0.00%)    38670.61 *   9.13%*
Hmean     512     50098.34 (   0.00%)    53243.75 *   6.28%*
Hmean     1024    69503.69 (   0.00%)    67425.26 *  -2.99%*

Bit of a mixed bag but wins more than it loses.

A new workload was added that runs a kernel build in the background
with -jNR_CPUS while NR_CPUS pairs of tasks run Netperf TCP_RR. The
intent is to see if heavy background tasks disrupt lighter tasks.

multi subtest kernbench
                              5.15.0-rc3             5.15.0-rc3
                                 vanilla  sched-wakeeflips-v1r1
Min       elsp-256     20.80 (   0.00%)       14.89 (  28.41%)
Amean     elsp-256     24.08 (   0.00%)       20.94 (  13.05%)
Stddev    elsp-256      3.32 (   0.00%)        4.68 ( -41.16%)
CoeffVar  elsp-256     13.78 (   0.00%)       22.36 ( -62.33%)
Max       elsp-256     29.11 (   0.00%)       26.49 (   9.00%)

multi subtest netperf-tcp-rr
                              5.15.0-rc3             5.15.0-rc3
                                 vanilla  sched-wakeeflips-v1r1
Min       1        48286.26 (   0.00%)    49101.48 (   1.69%)
Hmean     1        62894.82 (   0.00%)    68963.51 *   9.65%*
Stddev    1         7600.56 (   0.00%)     8804.82 ( -15.84%)
Max       1        78975.16 (   0.00%)    87124.67 (  10.32%)

The variability is higher as a result of the patch but both workloads
experienced improved performance.

[1] https://lore.kernel.org/r/[email protected]

Signed-off-by: Mike Galbraith <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..d00af3b97d8f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p)
         }
 
         if (current->last_wakee != p) {
+                int min = __this_cpu_read(sd_llc_size) << 1;
+                /*
+                 * Couple the wakee flips to the waker for the case where it
+                 * doesn't accrue flips, taking care to not push the wakee
+                 * high enough that the wake_wide() heuristic fails.
+                 */
+                if (current->wakee_flips > p->wakee_flips * min)
+                        p->wakee_flips++;
                 current->last_wakee = p;
                 current->wakee_flips++;
         }
@@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p)
 
         if (master < slave)
                 swap(master, slave);
-        if (slave < factor || master < slave * factor)
+        if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor)
                 return 0;
         return 1;
 }
--
2.31.1
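
For illustration only (not part of the patch), a standalone userspace sketch
of the heuristic as modified above. This is not the kernel code: the periodic
decay of wakee_flips and the scheduler plumbing are omitted, the waker is
passed explicitly rather than taken from current, and llc_size stands in for
__this_cpu_read(sd_llc_size).

struct task {
        unsigned int wakee_flips;
        struct task *last_wakee;
};

/* The waker switched wakees: couple the wakee's flips to a heavy waker. */
static void record_wakee(struct task *waker, struct task *wakee,
                         unsigned int llc_size)
{
        if (waker->last_wakee != wakee) {
                unsigned int min = llc_size << 1;

                /* A waker of many drags its wakee's flip count up with it. */
                if (waker->wakee_flips > wakee->wakee_flips * min)
                        wakee->wakee_flips++;
                waker->last_wakee = wakee;
                waker->wakee_flips++;
        }
}

/* Return 1 to wake wide (skip the affine pull), 0 to stay affine. */
static int wake_wide(const struct task *waker, const struct task *wakee,
                     unsigned int llc_size)
{
        unsigned int master = waker->wakee_flips;
        unsigned int slave = wakee->wakee_flips;
        unsigned int factor = llc_size;

        if (master < slave) {
                unsigned int tmp = master;

                master = slave;
                slave = tmp;
        }
        if ((slave < factor && master < (factor >> 1) * factor) ||
            master < slave * factor)
                return 0;
        return 1;
}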


2021-10-22 10:30:52

by Mike Galbraith

Subject: Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers

On Thu, 2021-10-21 at 15:56 +0100, Mel Gorman wrote:
>
> From additional tests on various servers, the impact is machine dependent
> but generally this patch improves the situation.
>
> hackbench-process-pipes
>                           5.15.0-rc3             5.15.0-rc3
>                              vanilla  sched-wakeeflips-v1r1
> Amean     1        0.3667 (   0.00%)      0.3890 (  -6.09%)
> Amean     4        0.5343 (   0.00%)      0.5217 (   2.37%)
> Amean     7        0.5300 (   0.00%)      0.5387 (  -1.64%)
> Amean     12       0.5737 (   0.00%)      0.5443 (   5.11%)
> Amean     21       0.6727 (   0.00%)      0.6487 (   3.57%)
> Amean     30       0.8583 (   0.00%)      0.8033 (   6.41%)
> Amean     48       1.3977 (   0.00%)      1.2400 *  11.28%*
> Amean     79       1.9790 (   0.00%)      1.8200 *   8.03%*
> Amean     110      2.8020 (   0.00%)      2.5820 *   7.85%*
> Amean     141      3.6683 (   0.00%)      3.2203 *  12.21%*
> Amean     172      4.6687 (   0.00%)      3.8200 *  18.18%*
> Amean     203      5.2183 (   0.00%)      4.3357 *  16.91%*
> Amean     234      6.1077 (   0.00%)      4.8047 *  21.33%*
> Amean     265      7.1313 (   0.00%)      5.1243 *  28.14%*
> Amean     296      7.7557 (   0.00%)      5.5940 *  27.87%*
>
> While different machines showed different results, in general
> there were far fewer CPU migrations of tasks.

Patchlet helped hackbench? That's.. unexpected (at least by me).

> tbench4
>                            5.15.0-rc3             5.15.0-rc3
>                               vanilla  sched-wakeeflips-v1r1
> Hmean     1         824.05 (   0.00%)      802.56 *  -2.61%*
> Hmean     2        1578.49 (   0.00%)     1645.11 *   4.22%*
> Hmean     4        2959.08 (   0.00%)     2984.75 *   0.87%*
> Hmean     8        5080.09 (   0.00%)     5173.35 *   1.84%*
> Hmean     16       8276.02 (   0.00%)     9327.17 *  12.70%*
> Hmean     32      15501.61 (   0.00%)    15925.55 *   2.73%*
> Hmean     64      27313.67 (   0.00%)    24107.81 * -11.74%*
> Hmean     128     32928.19 (   0.00%)    36261.75 *  10.12%*
> Hmean     256     35434.73 (   0.00%)    38670.61 *   9.13%*
> Hmean     512     50098.34 (   0.00%)    53243.75 *   6.28%*
> Hmean     1024    69503.69 (   0.00%)    67425.26 *  -2.99%*
>
> Bit of a mixed bag but wins more than it loses.

Hm. If patchlet repeatably impacts buddy pairs one way or the other,
it should probably be tossed out the nearest window.


-Mike

2021-10-22 11:10:40

by Mel Gorman

Subject: Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers

On Fri, Oct 22, 2021 at 12:26:08PM +0200, Mike Galbraith wrote:
> On Thu, 2021-10-21 at 15:56 +0100, Mel Gorman wrote:
> >
> > From additional tests on various servers, the impact is machine dependent
> > but generally this patch improves the situation.
> >
> > hackbench-process-pipes
> >                           5.15.0-rc3             5.15.0-rc3
> >                              vanilla  sched-wakeeflips-v1r1
> > Amean     1        0.3667 (   0.00%)      0.3890 (  -6.09%)
> > Amean     4        0.5343 (   0.00%)      0.5217 (   2.37%)
> > Amean     7        0.5300 (   0.00%)      0.5387 (  -1.64%)
> > Amean     12       0.5737 (   0.00%)      0.5443 (   5.11%)
> > Amean     21       0.6727 (   0.00%)      0.6487 (   3.57%)
> > Amean     30       0.8583 (   0.00%)      0.8033 (   6.41%)
> > Amean     48       1.3977 (   0.00%)      1.2400 *  11.28%*
> > Amean     79       1.9790 (   0.00%)      1.8200 *   8.03%*
> > Amean     110      2.8020 (   0.00%)      2.5820 *   7.85%*
> > Amean     141      3.6683 (   0.00%)      3.2203 *  12.21%*
> > Amean     172      4.6687 (   0.00%)      3.8200 *  18.18%*
> > Amean     203      5.2183 (   0.00%)      4.3357 *  16.91%*
> > Amean     234      6.1077 (   0.00%)      4.8047 *  21.33%*
> > Amean     265      7.1313 (   0.00%)      5.1243 *  28.14%*
> > Amean     296      7.7557 (   0.00%)      5.5940 *  27.87%*
> >
> > While different machines showed different results, in general
> > there were far fewer CPU migrations of tasks.
>
> Patchlet helped hackbench? That's.. unexpected (at least by me).
>

I didn't analyse in depth and other machines do not show as dramatic
a difference but it's likely due to timings of tasks getting wakeup
preempted. On a 2-socket cascadelake machine the difference was -7.4%
to 7.66% depending on group count. The second biggest loss was -0.71%,
and there were more gains than losses. In each case, CPU migrations and system CPU
usage are reduced.

The big difference here is likely because the machine is Zen 3 and has
multiple LLCs per socket, so it suffers more if there are imbalances between
LLCs that wouldn't be visible on most Intel machines with 1 LLC per socket.

> > tbench4
> >                            5.15.0-rc3             5.15.0-rc3
> >                               vanilla  sched-wakeeflips-v1r1
> > Hmean     1         824.05 (   0.00%)      802.56 *  -2.61%*
> > Hmean     2        1578.49 (   0.00%)     1645.11 *   4.22%*
> > Hmean     4        2959.08 (   0.00%)     2984.75 *   0.87%*
> > Hmean     8        5080.09 (   0.00%)     5173.35 *   1.84%*
> > Hmean     16       8276.02 (   0.00%)     9327.17 *  12.70%*
> > Hmean     32      15501.61 (   0.00%)    15925.55 *   2.73%*
> > Hmean     64      27313.67 (   0.00%)    24107.81 * -11.74%*
> > Hmean     128     32928.19 (   0.00%)    36261.75 *  10.12%*
> > Hmean     256     35434.73 (   0.00%)    38670.61 *   9.13%*
> > Hmean     512     50098.34 (   0.00%)    53243.75 *   6.28%*
> > Hmean     1024    69503.69 (   0.00%)    67425.26 *  -2.99%*
> >
> > Bit of a mixed bag but wins more than it loses.
>
> Hm. If patchlet repeatably impacts buddy pairs one way or the other,
> it should probably be tossed out the nearest window.
>

I don't see how buddy pairing would be impacted although there are likely
differences in the degree tasks get preempted due to pulling tasks.

--
Mel Gorman
SUSE Labs

2021-10-22 12:04:22

by Mike Galbraith

Subject: Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers

On Fri, 2021-10-22 at 12:05 +0100, Mel Gorman wrote:
> On Fri, Oct 22, 2021 at 12:26:08PM +0200, Mike Galbraith wrote:
>
> >
> > Hm.  If patchlet repeatably impacts buddy pairs one way or the other,
> > it should probably be tossed out the nearest window.
> >
>
> I don't see how buddy pairing would be impacted although there are likely
> differences in the degree tasks get preempted due to pulling tasks.

Hohum, numbers get to say whatever (weirdness) they want to I suppose.

btw, below are some desktop impact numbers I had collected.

box = i7-4790 quad+smt
desktop vs massive_intr 8 9999 (8 x 8ms run/1ms sleep, for 9999 secs.. effectively forever)
perf sched record -a -- su mikeg -c 'firefox https://www.youtube.com/watch?v=aqz-KE-bpKQ'& sleep 300 && killall perf firefox
runtime runtime sum delay sum delay sum delay switches desktop
patch/features total util massive_intr util total massive_intr desktop total/massive util
virgin/stock 2267347.921 ms 94.4% 1932675.152 ms 80.5% 158611.016 ms 133309.938 ms 25301.078 ms 594780/441157 13.9%
virgin/-wa_weight 2236871.408 ms 93.2% 1881464.401 ms 78.3% 255785.391 ms 243958.616 ms 11826.775 ms 1525470/1424083 14.8%
-1.34% -1.2% -2.2% -13.474 s +0.9%
wake_wide/stock 2254335.961 ms 93.9% 1917834.157 ms 79.9% 164766.194 ms 141974.540 ms 22791.654 ms 720711/599064 14.0%

While patchlet mitigated the stacking somewhat, it wasn't meaningful to
my test load (BigBuckBunny for the 10001st time). PELT clearly stacks
up the desktop pretty damn badly, but gets away with it.. in this case.

OTOH, killing stacking graveyard dead via NO_WA_WEIGHT would have
certainly dinged up cache quite a bit for the compute load had it been
something other than a synthetic CPU burner, so there's a brownie point
for PELT in the mix to go along with its stacking demerit.

Given there was zero perceptible difference, the only thing patchlet
really did was to give me a warm fuzzy knowing it was in there fighting
the good fight against obscene *looking* stacking (with potential).

-Mike

2021-10-25 07:04:06

by Mike Galbraith

Subject: Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers

On Fri, 2021-10-22 at 12:05 +0100, Mel Gorman wrote:
> On Fri, Oct 22, 2021 at 12:26:08PM +0200, Mike Galbraith wrote:
>
> >
> > Patchlet helped hackbench?  That's.. unexpected (at least by me).
> >
>
> I didn't analyse in depth and other machines do not show as dramatic
> a difference but it's likely due to timings of tasks getting wakeup
> preempted.

Wakeup tracing made those hackbench numbers less surprising. There's
tons of wake-many going on. At a glance, it appears to already be bi-
directional though, so patchlet helping seemingly means that there's
just not quite enough to tickle the heuristic without a little help.
Question is, is the potential reward of strengthening that heuristic
yet again, keeping in mind that "heuristic" tends to not play well with
"deterministic", worth the risk?

My desktop trace session said distribution improved a bit, but there
was no meaningful latency or throughput improvement, making for a
pretty clear "nope" to the above question. It benefiting NUMA box
hackbench is a valid indicator, but one that is IMO too disconnected
from the real world to carry much weight.

-Mike

2021-10-26 10:00:26

by Mel Gorman

Subject: Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers

On Mon, Oct 25, 2021 at 08:35:52AM +0200, Mike Galbraith wrote:
> On Fri, 2021-10-22 at 12:05 +0100, Mel Gorman wrote:
> > On Fri, Oct 22, 2021 at 12:26:08PM +0200, Mike Galbraith wrote:
> >
> > >
> > > Patchlet helped hackbench?  That's.. unexpected (at least by me).
> > >
> >
> > I didn't analyse in depth and other machines do not show as dramatic
> > a difference but it's likely due to timings of tasks getting wakeup
> > preempted.
>
> Wakeup tracing made those hackbench numbers less surprising. There's
> tons of wake-many going on. At a glance, it appears to already be bi-
> directional though, so patchlet helping seemingly means that there's
> just not quite enough to tickle the heuristic without a little help.

Another possible explanation is that hackbench overloads a machine to
such an extent that the ratio of bi-directional wakeups is not
sufficient to trigger the wake-wide logic.

> Question is, is the potential reward of strengthening that heuristic
> yet again, keeping in mind that "heuristic" tends to not play well with
> "deterministic", worth the risk?
>
> My desktop trace session said distribution improved a bit, but there
> was no meaningful latency or throughput improvement, making for a
> pretty clear "nope" to the above question.

Another interpretation is that it's simply neutral and does no harm.

> It benefiting NUMA box
> hackbench is a valid indicator, but one that is IMO too disconnected
> from the real world to carry much weight.
>

I think if it's not shown to be harmful to a realistic workload but helps
an overloaded example then it should be ok. While excessive overload is
rare in a realistic workload, it does happen. There are a few workloads
I've seen bugs for that were triggered when an excessive number of worker
threads get spawned and compete for CPU access, which in turn leads to more
worker threads being spawned. There are application workarounds for this
corner case but it still triggers bugs.

--
Mel Gorman
SUSE Labs

2021-10-26 13:36:10

by Mike Galbraith

Subject: Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers

On Tue, 2021-10-26 at 09:18 +0100, Mel Gorman wrote:
> On Mon, Oct 25, 2021 at 08:35:52AM +0200, Mike Galbraith wrote:
>
> >
> > My desktop trace session said distribution improved a bit, but there
> > was no meaningful latency or throughput improvement, making for a
> > pretty clear "nope" to the above question.
>
> Another interpretation is that it's simply neutral and does no harm.

Yes, patchlet not completely countermanding PELT is a good sign. Had it
continuously fired and countermanded PELT entirely, that would mean
either that it was busted, or that the desktop load is so damn over-
threaded as to be continuously waking more damn threads than can
possibly have any benefit whatsoever. Neither of those having happened
is a good thing. While I really dislike PELT's evil side,
its other face is rather attractive... rob

> > It benefiting NUMA box
> > hackbench is a valid indicator, but one that is IMO too disconnected
> > from the real world to carry much weight.
> >
>
> I think if it's not shown to be harmful to a realistic workload but helps
> an overloaded example then it should be ok. While excessive overload is
> rare in a realistic workload, it does happen. There are a few workloads
> I've seen bugs for that were triggered when an excessive number of worker
> threads get spawned and compete for CPU access, which in turn leads to more
> worker threads being spawned. There are application workarounds for this
> corner case but it still triggers bugs.
>

2021-10-26 14:13:57

by Mike Galbraith

Subject: Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers

On Tue, 2021-10-26 at 12:15 +0200, Mike Galbraith wrote:
>
...

Well now, that interruption didn't go as planned. The briefer edit
would have been preferred, but you get the drift, so moving on...

> > > It benefiting NUMA box
> > > hackbench is a valid indicator, but one that is IMO too disconnected
> > > from the real world to carry much weight.
> > >
> >
> > I think if it's not shown to be harmful to a realistic workload but helps
> > an overloaded example then it should be ok. While excessive overload is
> > rare in a realistic workload, it does happen. There are a few workloads
> > I've seen bugs for that were triggered when an excessive number of worker
> > threads get spawned and compete for CPU access, which in turn leads to more
> > worker threads being spawned. There are application workarounds for this
> > corner case but it still triggers bugs.
>

wake_wide()'s proper test environment is NUMA, not a desktop box, so
patchlet has yet to meet a real world load that qualifies as such.
That it could detect the test load doing nutty stuff like waking a
thread pool three times the size of the box is all well and good, but
not the point.

$.02 WRT poor abused hackbench: if it happens to benefit that's fine,
but I don't think it should ever be considered change validation. It's
a useful tool, but the common massive overload use is just nuts IMO.

-Mike

2021-10-26 15:49:33

by Mel Gorman

Subject: Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers

On Tue, Oct 26, 2021 at 12:41:36PM +0200, Mike Galbraith wrote:
> On Tue, 2021-10-26 at 12:15 +0200, Mike Galbraith wrote:
> >
> ...
>
> Well now, that interruption didn't go as planned. The briefer edit
> would have been preferred, but you get the drift, so moving on...
>
> > > > It benefiting NUMA box
> > > > hackbench is a valid indicator, but one that is IMO too disconnected
> > > > from the real world to carry much weight.
> > > >
> > >
> > > I think if it's not shown to be harmful to a realistic workload but helps
> > > an overloaded example then it should be ok. While excessive overload is
> > > rare in a realistic workload, it does happen. There are a few workloads
> > > I've seen bugs for that were triggered when an excessive number of worker
> > > threads get spawned and compete for CPU access, which in turn leads to more
> > > worker threads being spawned. There are application workarounds for this
> > > corner case but it still triggers bugs.
> >
>
> wake_wide()'s proper test environment is NUMA, not a desktop box, so
> patchlet has yet to meet a real world load that qualifies as such.
> That it could detect the test load doing nutty stuff like waking a
> thread pool three times the size of the box is all well and good, but
> not the point.
>
> $.02 WRT poor abused hackbench: if it happens to benefit that's fine,
> but I don't think it should ever be considered change validation. It's
> a useful tool, but the common massive overload use is just nuts IMO.
>

hackbench is nuts and generally only useful for looking at overload
placement, overload balancing and useless search depth for idle CPUs.
I pay it some attention but it's usually not my favourite workload.
Regrettably, it's mostly used because it's popular, not because it
makes sense.

The patch in question was also tested on other workloads on NUMA
machines. For a 2-socket machine (20 cores, HT enabled so 40 CPUs)
running specjbb 2005 with one JVM per NUMA node, the patch also scaled
reasonably well

specjbb
                              5.15.0-rc3             5.15.0-rc3
                                 vanilla  sched-wakeeflips-v1r1
Hmean     tput-1     50044.48 (   0.00%)    53969.00 *   7.84%*
Hmean     tput-2    106050.31 (   0.00%)   113580.78 *   7.10%*
Hmean     tput-3    156701.44 (   0.00%)   164857.00 *   5.20%*
Hmean     tput-4    196538.75 (   0.00%)   218373.42 *  11.11%*
Hmean     tput-5    247566.16 (   0.00%)   267173.09 *   7.92%*
Hmean     tput-6    284981.46 (   0.00%)   311007.14 *   9.13%*
Hmean     tput-7    328882.48 (   0.00%)   359373.89 *   9.27%*
Hmean     tput-8    366941.24 (   0.00%)   393244.37 *   7.17%*
Hmean     tput-9    402386.74 (   0.00%)   433010.43 *   7.61%*
Hmean     tput-10   437551.05 (   0.00%)   475756.08 *   8.73%*
Hmean     tput-11   481349.41 (   0.00%)   519824.54 *   7.99%*
Hmean     tput-12   533148.45 (   0.00%)   565070.21 *   5.99%*
Hmean     tput-13   570563.97 (   0.00%)   609499.06 *   6.82%*
Hmean     tput-14   601117.97 (   0.00%)   647876.05 *   7.78%*
Hmean     tput-15   639096.38 (   0.00%)   690854.46 *   8.10%*
Hmean     tput-16   682644.91 (   0.00%)   722826.06 *   5.89%*
Hmean     tput-17   732248.96 (   0.00%)   758805.17 *   3.63%*
Hmean     tput-18   762771.33 (   0.00%)   791211.66 *   3.73%*
Hmean     tput-19   780582.92 (   0.00%)   819064.19 *   4.93%*
Hmean     tput-20   812183.95 (   0.00%)   836664.87 *   3.01%*
Hmean     tput-21   821415.48 (   0.00%)   833734.23 (   1.50%)
Hmean     tput-22   815457.65 (   0.00%)   844393.98 *   3.55%*
Hmean     tput-23   819263.63 (   0.00%)   846109.07 *   3.28%*
Hmean     tput-24   817962.95 (   0.00%)   839682.92 *   2.66%*
Hmean     tput-25   807814.64 (   0.00%)   841826.52 *   4.21%*
Hmean     tput-26   811755.89 (   0.00%)   838543.08 *   3.30%*
Hmean     tput-27   799341.75 (   0.00%)   833487.26 *   4.27%*
Hmean     tput-28   803434.89 (   0.00%)   829022.50 *   3.18%*
Hmean     tput-29   803233.25 (   0.00%)   826622.37 *   2.91%*
Hmean     tput-30   800465.12 (   0.00%)   824347.42 *   2.98%*
Hmean     tput-31   791284.39 (   0.00%)   791575.67 (   0.04%)
Hmean     tput-32   781930.07 (   0.00%)   805725.80 (   3.04%)
Hmean     tput-33   785194.31 (   0.00%)   804795.44 (   2.50%)
Hmean     tput-34   781325.67 (   0.00%)   800067.53 (   2.40%)
Hmean     tput-35   777715.92 (   0.00%)   753926.32 (  -3.06%)
Hmean     tput-36   770516.85 (   0.00%)   783328.32 (   1.66%)
Hmean     tput-37   758067.26 (   0.00%)   772243.18 *   1.87%*
Hmean     tput-38   764815.45 (   0.00%)   769156.32 (   0.57%)
Hmean     tput-39   757885.41 (   0.00%)   757670.59 (  -0.03%)
Hmean     tput-40   750140.15 (   0.00%)   760739.13 (   1.41%)

The largest regression was within noise. Most results were outside the
noise.

Some HPC workloads showed little difference but they do not communicate
that heavily. redis microbenchmark showed mostly neutral results.
schbench (facebook simulator workload that is latency sensitive) showed a
mix of results, but helped more than it hurt. Even the machine with the
worst results for schbench showed improved wakeup latencies at the 99th
percentile. These were all on NUMA machines.

--
Mel Gorman
SUSE Labs

2021-10-26 16:05:03

by Mike Galbraith

Subject: Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers

On Tue, 2021-10-26 at 12:57 +0100, Mel Gorman wrote:
>
> The patch in question was also tested on other workloads on NUMA
> machines. For a 2-socket machine (20 cores, HT enabled so 40 CPUs)
> running specjbb 2005 with one JVM per NUMA node, the patch also scaled
> reasonably well

That's way more interesting. No idea what this thing does under
the hood, thus whether it should be helped or not, but at least it's a
real deal benchmark vs a kernel hacker tool.

> specjbb
>                               5.15.0-rc3             5.15.0-rc3
>                                  vanilla  sched-wakeeflips-v1r1
> Hmean     tput-1     50044.48 (   0.00%)    53969.00 *   7.84%*
> Hmean     tput-2    106050.31 (   0.00%)   113580.78 *   7.10%*
> Hmean     tput-3    156701.44 (   0.00%)   164857.00 *   5.20%*
> Hmean     tput-4    196538.75 (   0.00%)   218373.42 *  11.11%*
> Hmean     tput-5    247566.16 (   0.00%)   267173.09 *   7.92%*
> Hmean     tput-6    284981.46 (   0.00%)   311007.14 *   9.13%*
> Hmean     tput-7    328882.48 (   0.00%)   359373.89 *   9.27%*
> Hmean     tput-8    366941.24 (   0.00%)   393244.37 *   7.17%*
> Hmean     tput-9    402386.74 (   0.00%)   433010.43 *   7.61%*
> Hmean     tput-10   437551.05 (   0.00%)   475756.08 *   8.73%*
> Hmean     tput-11   481349.41 (   0.00%)   519824.54 *   7.99%*
> Hmean     tput-12   533148.45 (   0.00%)   565070.21 *   5.99%*
> Hmean     tput-13   570563.97 (   0.00%)   609499.06 *   6.82%*
> Hmean     tput-14   601117.97 (   0.00%)   647876.05 *   7.78%*
> Hmean     tput-15   639096.38 (   0.00%)   690854.46 *   8.10%*
> Hmean     tput-16   682644.91 (   0.00%)   722826.06 *   5.89%*
> Hmean     tput-17   732248.96 (   0.00%)   758805.17 *   3.63%*
> Hmean     tput-18   762771.33 (   0.00%)   791211.66 *   3.73%*
> Hmean     tput-19   780582.92 (   0.00%)   819064.19 *   4.93%*
> Hmean     tput-20   812183.95 (   0.00%)   836664.87 *   3.01%*
> Hmean     tput-21   821415.48 (   0.00%)   833734.23 (   1.50%)
> Hmean     tput-22   815457.65 (   0.00%)   844393.98 *   3.55%*
> Hmean     tput-23   819263.63 (   0.00%)   846109.07 *   3.28%*
> Hmean     tput-24   817962.95 (   0.00%)   839682.92 *   2.66%*
> Hmean     tput-25   807814.64 (   0.00%)   841826.52 *   4.21%*
> Hmean     tput-26   811755.89 (   0.00%)   838543.08 *   3.30%*
> Hmean     tput-27   799341.75 (   0.00%)   833487.26 *   4.27%*
> Hmean     tput-28   803434.89 (   0.00%)   829022.50 *   3.18%*
> Hmean     tput-29   803233.25 (   0.00%)   826622.37 *   2.91%*
> Hmean     tput-30   800465.12 (   0.00%)   824347.42 *   2.98%*
> Hmean     tput-31   791284.39 (   0.00%)   791575.67 (   0.04%)
> Hmean     tput-32   781930.07 (   0.00%)   805725.80 (   3.04%)
> Hmean     tput-33   785194.31 (   0.00%)   804795.44 (   2.50%)
> Hmean     tput-34   781325.67 (   0.00%)   800067.53 (   2.40%)
> Hmean     tput-35   777715.92 (   0.00%)   753926.32 (  -3.06%)
> Hmean     tput-36   770516.85 (   0.00%)   783328.32 (   1.66%)
> Hmean     tput-37   758067.26 (   0.00%)   772243.18 *   1.87%*
> Hmean     tput-38   764815.45 (   0.00%)   769156.32 (   0.57%)
> Hmean     tput-39   757885.41 (   0.00%)   757670.59 (  -0.03%)
> Hmean     tput-40   750140.15 (   0.00%)   760739.13 (   1.41%)
>
> The largest regression was within noise. Most results were outside the
> noise.
>
> Some HPC workloads showed little difference but they do not communicate
> that heavily. redis microbenchmark showed mostly neutral results.
> schbench (facebook simulator workload that is latency sensitive) showed a
> mix of results, but helped more than it hurt. Even the machine with the
> worst results for schbench showed improved wakeup latencies at the 99th
> percentile. These were all on NUMA machines.
>

2021-10-27 15:28:54

by Mike Galbraith

Subject: Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers

On Tue, 2021-10-26 at 14:13 +0200, Mike Galbraith wrote:
> On Tue, 2021-10-26 at 12:57 +0100, Mel Gorman wrote:
> >
> > The patch in question was also tested on other workloads on NUMA
> > machines. For a 2-socket machine (20 cores, HT enabled so 40 CPUs)
> > running specjbb 2005 with one JVM per NUMA node, the patch also
> > scaled
> > reasonably well
>
> That's way more interesting.  No idea what this thing does under
> the hood, thus whether it should be helped or not, but at least it's a
> real deal benchmark vs a kernel hacker tool.

...
Installing test specjbb
specjvm-install: Fetching from mirror
http://mcp/mmtests-mirror/spec/SPECjbb2005_kitv1.07.tar.gz
specjvm-install: Fetching from internet
NOT_AVAILABLE/SPECjbb2005_kitv1.07.tar.gz
specjvm-install: Fetching from alt internet
/SPECjbb2005_kitv1.07.tar.gz
FATAL specjvm-install: specjvm-install: Could not download
/SPECjbb2005_kitv1.07.tar.gz
FATAL specjbb-bench: specjbb install script returned error
FATAL: specjbb returned failure, unable to continue
FATAL: Installation step failed for specjbb

Hohum, so much for trying to take a peek.

At any rate, unlike the tbench numbers, these have the look of signal
rather than test jig noise, and pretty strong signal at that, so maybe
patchlet should fly. At the very least, it appears to be saying that
there is significant performance to be had by some means.

Bah, fly or die little patchlet. Either way there will be winners and
losers, that's just the way it works if you're not shaving cycles.

> > specjbb
> >                               5.15.0-rc3             5.15.0-rc3
> >                                  vanilla  sched-wakeeflips-v1r1
> > Hmean     tput-1     50044.48 (   0.00%)    53969.00 *   7.84%*
> > Hmean     tput-2    106050.31 (   0.00%)   113580.78 *   7.10%*
> > Hmean     tput-3    156701.44 (   0.00%)   164857.00 *   5.20%*
> > Hmean     tput-4    196538.75 (   0.00%)   218373.42 *  11.11%*
> > Hmean     tput-5    247566.16 (   0.00%)   267173.09 *   7.92%*
> > Hmean     tput-6    284981.46 (   0.00%)   311007.14 *   9.13%*
> > Hmean     tput-7    328882.48 (   0.00%)   359373.89 *   9.27%*
> > Hmean     tput-8    366941.24 (   0.00%)   393244.37 *   7.17%*
> > Hmean     tput-9    402386.74 (   0.00%)   433010.43 *   7.61%*
> > Hmean     tput-10   437551.05 (   0.00%)   475756.08 *   8.73%*
> > Hmean     tput-11   481349.41 (   0.00%)   519824.54 *   7.99%*
> > Hmean     tput-12   533148.45 (   0.00%)   565070.21 *   5.99%*
> > Hmean     tput-13   570563.97 (   0.00%)   609499.06 *   6.82%*
> > Hmean     tput-14   601117.97 (   0.00%)   647876.05 *   7.78%*
> > Hmean     tput-15   639096.38 (   0.00%)   690854.46 *   8.10%*
> > Hmean     tput-16   682644.91 (   0.00%)   722826.06 *   5.89%*
> > Hmean     tput-17   732248.96 (   0.00%)   758805.17 *   3.63%*
> > Hmean     tput-18   762771.33 (   0.00%)   791211.66 *   3.73%*
> > Hmean     tput-19   780582.92 (   0.00%)   819064.19 *   4.93%*
> > Hmean     tput-20   812183.95 (   0.00%)   836664.87 *   3.01%*
> > Hmean     tput-21   821415.48 (   0.00%)   833734.23 (   1.50%)
> > Hmean     tput-22   815457.65 (   0.00%)   844393.98 *   3.55%*
> > Hmean     tput-23   819263.63 (   0.00%)   846109.07 *   3.28%*
> > Hmean     tput-24   817962.95 (   0.00%)   839682.92 *   2.66%*
> > Hmean     tput-25   807814.64 (   0.00%)   841826.52 *   4.21%*
> > Hmean     tput-26   811755.89 (   0.00%)   838543.08 *   3.30%*
> > Hmean     tput-27   799341.75 (   0.00%)   833487.26 *   4.27%*
> > Hmean     tput-28   803434.89 (   0.00%)   829022.50 *   3.18%*
> > Hmean     tput-29   803233.25 (   0.00%)   826622.37 *   2.91%*
> > Hmean     tput-30   800465.12 (   0.00%)   824347.42 *   2.98%*
> > Hmean     tput-31   791284.39 (   0.00%)   791575.67 (   0.04%)
> > Hmean     tput-32   781930.07 (   0.00%)   805725.80 (   3.04%)
> > Hmean     tput-33   785194.31 (   0.00%)   804795.44 (   2.50%)
> > Hmean     tput-34   781325.67 (   0.00%)   800067.53 (   2.40%)
> > Hmean     tput-35   777715.92 (   0.00%)   753926.32 (  -3.06%)
> > Hmean     tput-36   770516.85 (   0.00%)   783328.32 (   1.66%)
> > Hmean     tput-37   758067.26 (   0.00%)   772243.18 *   1.87%*
> > Hmean     tput-38   764815.45 (   0.00%)   769156.32 (   0.57%)
> > Hmean     tput-39   757885.41 (   0.00%)   757670.59 (  -0.03%)
> > Hmean     tput-40   750140.15 (   0.00%)   760739.13 (   1.41%)
> >
> > The largest regression was within noise. Most results were outside the
> > noise.
> >
> > Some HPC workloads showed little difference but they do not communicate
> > that heavily. redis microbenchmark showed mostly neutral results.
> > schbench (facebook simulator workload that is latency sensitive) showed a
> > mix of results, but helped more than it hurt. Even the machine with the
> > worst results for schbench showed improved wakeup latencies at the 99th
> > percentile. These were all on NUMA machines.
> >
>

2021-10-27 21:20:50

by Mel Gorman

Subject: Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers

On Wed, Oct 27, 2021 at 04:09:12AM +0200, Mike Galbraith wrote:
> On Tue, 2021-10-26 at 14:13 +0200, Mike Galbraith wrote:
> > On Tue, 2021-10-26 at 12:57 +0100, Mel Gorman wrote:
> > >
> > > The patch in question was also tested on other workloads on NUMA
> > > machines. For a 2-socket machine (20 cores, HT enabled so 40 CPUs)
> > > running specjbb 2005 with one JVM per NUMA node, the patch also
> > > scaled
> > > reasonably well
> >
> > That's way more interesting.  No idea what this thing does under
> > the hood, thus whether it should be helped or not, but at least it's a
> > real deal benchmark vs a kernel hacker tool.
>
> ...
> Installing test specjbb
> specjvm-install: Fetching from mirror
> http://mcp/mmtests-mirror/spec/SPECjbb2005_kitv1.07.tar.gz
> specjvm-install: Fetching from internet
> NOT_AVAILABLE/SPECjbb2005_kitv1.07.tar.gz
> specjvm-install: Fetching from alt internet
> /SPECjbb2005_kitv1.07.tar.gz
> FATAL specjvm-install: specjvm-install: Could not download
> /SPECjbb2005_kitv1.07.tar.gz
> FATAL specjbb-bench: specjbb install script returned error
> FATAL: specjbb returned failure, unable to continue
> FATAL: Installation step failed for specjbb
>
> Hohum, so much for trying to take a peek.
>

The benchmark is not available for free unfortunately.

> At any rate, unlike the tbench numbers, these have the look of signal
> rather than test jig noise, and pretty strong signal at that, so maybe
> patchlet should fly. At the very least, it appears to be saying that
> there is significant performance to be had by some means.
>
> Bah, fly or die little patchlet. Either way there will be winners and
> losers, that's just the way it works if you're not shaving cycles.
>

So, I assume you are ok for patch 1 to take flight to either live or
die. I'll handle any bugs that show up in relation to it. How about
patch 2?

--
Mel Gorman
SUSE Labs

2021-10-27 22:07:53

by Mike Galbraith

Subject: Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers

On Wed, 2021-10-27 at 10:00 +0100, Mel Gorman wrote:
>
> How about patch 2?

This version doesn't slam the door, so no worries wrt kthread/control
thread latencies etc. Doing a brief tbench vs hogs mixed load run
didn't even show much of a delta.

-Mike

2021-11-09 22:19:23

by Peter Zijlstra

Subject: Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers

On Thu, Oct 21, 2021 at 03:56:02PM +0100, Mel Gorman wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff69f245b939..d00af3b97d8f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5865,6 +5865,14 @@ static void record_wakee(struct task_struct *p)
>          }
>  
>          if (current->last_wakee != p) {
> +                int min = __this_cpu_read(sd_llc_size) << 1;
> +                /*
> +                 * Couple the wakee flips to the waker for the case where it
> +                 * doesn't accrue flips, taking care to not push the wakee
> +                 * high enough that the wake_wide() heuristic fails.
> +                 */
> +                if (current->wakee_flips > p->wakee_flips * min)
> +                        p->wakee_flips++;
>                  current->last_wakee = p;
>                  current->wakee_flips++;
>          }

It's a bit odd that the above uses min for llc_size, while the below:

> @@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p)
>  
>          if (master < slave)
>                  swap(master, slave);
> -        if (slave < factor || master < slave * factor)
> +        if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor)
>                  return 0;
>          return 1;
>  }

has factor.

Now:

        !(slave < factor || master < slave * factor)

  !(x || y) == !x && !y, gives:

        slave >= factor && master >= slave * factor

  subst lhs in rhs:

        master >= factor * factor


your extra term:

        !((slave < factor && master < (factor*factor)/2) || master < slave * factor)

changes that how? AFAICT it's a nop.
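
As an illustrative aside only (not part of the thread's analysis), the two
"return 0" conditions can be compared directly with a throwaway userspace
harness; factor is fixed to 8 here purely as an example stand-in for
sd_llc_size, and the loop ranges are arbitrary:

#include <stdio.h>

/* Old predicate for staying affine (returning 0 from wake_wide()). */
static int old_affine(unsigned int master, unsigned int slave,
                      unsigned int factor)
{
        return slave < factor || master < slave * factor;
}

/* New predicate with the extra term from the patch. */
static int new_affine(unsigned int master, unsigned int slave,
                      unsigned int factor)
{
        return (slave < factor && master < (factor >> 1) * factor) ||
               master < slave * factor;
}

int main(void)
{
        const unsigned int factor = 8;  /* example LLC size */
        unsigned int master, slave;

        /* Print every (master, slave) pair where the two predicates differ. */
        for (master = 0; master < 256; master++)
                for (slave = 0; slave <= master; slave++)
                        if (old_affine(master, slave, factor) !=
                            new_affine(master, slave, factor))
                                printf("differ: master=%u slave=%u\n",
                                       master, slave);
        return 0;
}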

2021-11-09 22:23:12

by Mike Galbraith

Subject: Re: [PATCH 1/2] sched/fair: Couple wakee flips with heavy wakers

On Tue, 2021-11-09 at 12:56 +0100, Peter Zijlstra wrote:
>
>
> > @@ -5895,7 +5903,7 @@ static int wake_wide(struct task_struct *p)
> >  
> >         if (master < slave)
> >                 swap(master, slave);
> > -       if (slave < factor || master < slave * factor)
> > +       if ((slave < factor && master < (factor>>1)*factor) || master < slave * factor)
> >                 return 0;
> >         return 1;
> >  }
>
> has factor.
>
> Now:
>
>         !(slave < factor || master < slave * factor)
>
>   !(x || y) == !x && !y, gives:
>
>         slave >= factor && master >= slave * factor
>
>   subst lhs in rhs:
>
>         master >= factor * factor
>
>
> your extra term:
>
>         !((slave < factor && master < (factor*factor)/2) || master < slave * factor)
>
> changes that how? AFAICT it's a nop.

That can happen when twiddling. The thought was to let volume on the
right override individual thread decay on the left to a limited extent.

-Mike