2023-09-12 20:18:43

by Mike Galbraith

Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

On Mon, 2023-09-11 at 13:59 +0530, K Prateek Nayak wrote:
>
> Speaking of cache-hot idle CPU, is netperf actually more happy with
> piling on current CPU?

Some tests would be happier, others not at all; some numbers below.

I doubt much in the real world can perform better stacked. To be a win,
the service latency and utilization loss induced by stacked task overlap
have to be less than the cache population cost of an idle CPU, something
that modern CPUs have become darn good at, making for a high bar.

> I ask this because the logic seems to be
> reserving the previous CPU for a task that dislikes migration but I
> do not see anything in the wake_affine_idle() path that would make the
> short sleeper proactively choose the previous CPU when the wakeup is
> marked with the WF_SYNC flag. Let me know if I'm missing something?

If select_idle_sibling() didn't intervene, the wake affine logic would
indeed routinely step all over working sets, and at one time briefly
did so due to a silly bug. (see kernel/sched/fair.c.today:7292)

The sync hint stems from the bad old days of SMP when cross-cpu latency
was horrid, and has lost much of its value, but its bias toward the
waker CPU still helps reduce man-in-the-middle latency in a busy box,
which can do even more damage than that done by stacking of not really
synchronous tasks that can be seen below.

The TCP/UDP_RR tests are very close to synchronous, and the numbers
reflect that: stacking is unbeatable for them [1]. For the other tests,
which are hopefully doing something a bit more realistic than tiny ball
ping-pong, stacking is a demonstrable loser.
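
The script itself is not shown, so what follows is only a sketch of what
the placement labels in the output are assumed to mean: "stacked" puts both
ends of the test on one CPU, "cross-smt" on two SMT siblings of one core,
"cross-core" on CPUs of different cores, and "unbound" leaves placement to
the scheduler. The helper names and the CPU numbering are hypothetical.

```python
def parse_cpulist(text):
    """Parse a kernel cpulist string such as "0,4" or "0-1" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def placements(first_cpu, siblings, online):
    """Map each label to a (cpu_a, cpu_b) pair, or None where nothing is pinned.

    siblings: SMT siblings of first_cpu (including first_cpu itself).
    online:   every CPU id that may be used.
    These meanings are assumptions, not taken from the original script.
    """
    smt_peers = [c for c in siblings if c != first_cpu]
    other_core = [c for c in online if c not in siblings and c != first_cpu]
    return {
        "unbound": None,  # let the scheduler decide
        "stacked": (first_cpu, first_cpu),
        "cross-smt": (first_cpu, smt_peers[0]) if smt_peers else None,
        "cross-core": (first_cpu, other_core[0]) if other_core else None,
    }
```

On Linux the sibling string can be read from
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list, and the chosen
pairs applied with taskset or os.sched_setaffinity().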

Not super carefully run script output:

homer:/root # netperf.sh
TCP_SENDFILE-1  unbound     Avg:   87889  Sum:   87889
TCP_SENDFILE-1  stacked     Avg:   62885  Sum:   62885
TCP_SENDFILE-1  cross-smt   Avg:   58887  Sum:   58887
TCP_SENDFILE-1  cross-core  Avg:   90673  Sum:   90673

TCP_STREAM-1    unbound     Avg:   71858  Sum:   71858
TCP_STREAM-1    stacked     Avg:   58883  Sum:   58883
TCP_STREAM-1    cross-smt   Avg:   49345  Sum:   49345
TCP_STREAM-1    cross-core  Avg:   72346  Sum:   72346

TCP_MAERTS-1    unbound     Avg:   73890  Sum:   73890
TCP_MAERTS-1    stacked     Avg:   60682  Sum:   60682
TCP_MAERTS-1    cross-smt   Avg:   49868  Sum:   49868
TCP_MAERTS-1    cross-core  Avg:   73343  Sum:   73343

UDP_STREAM-1    unbound     Avg:   99442  Sum:   99442
UDP_STREAM-1    stacked     Avg:   85319  Sum:   85319
UDP_STREAM-1    cross-smt   Avg:   63239  Sum:   63239
UDP_STREAM-1    cross-core  Avg:   99102  Sum:   99102

TCP_RR-1        unbound     Avg:  200833  Sum:  200833
TCP_RR-1        stacked     Avg:  243733  Sum:  243733
TCP_RR-1        cross-smt   Avg:  138507  Sum:  138507
TCP_RR-1        cross-core  Avg:  210404  Sum:  210404

UDP_RR-1        unbound     Avg:  252575  Sum:  252575
UDP_RR-1        stacked     Avg:  273081  Sum:  273081
UDP_RR-1        cross-smt   Avg:  168448  Sum:  168448
UDP_RR-1        cross-core  Avg:  264124  Sum:  264124
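
To make the comparison easier to read, the Avg figures above can be turned
into a percentage change relative to the unbound run. This is only a small
arithmetic sketch over the numbers already shown; the dictionary layout and
the function name are made up for illustration.

```python
# Avg figures copied from the netperf.sh output above.
results = {
    "TCP_SENDFILE": {"unbound": 87889, "stacked": 62885, "cross-smt": 58887, "cross-core": 90673},
    "TCP_STREAM": {"unbound": 71858, "stacked": 58883, "cross-smt": 49345, "cross-core": 72346},
    "TCP_MAERTS": {"unbound": 73890, "stacked": 60682, "cross-smt": 49868, "cross-core": 73343},
    "UDP_STREAM": {"unbound": 99442, "stacked": 85319, "cross-smt": 63239, "cross-core": 99102},
    "TCP_RR": {"unbound": 200833, "stacked": 243733, "cross-smt": 138507, "cross-core": 210404},
    "UDP_RR": {"unbound": 252575, "stacked": 273081, "cross-smt": 168448, "cross-core": 264124},
}


def percent_vs_unbound(test, placement):
    """Percentage change of a pinned placement relative to the unbound run."""
    base = results[test]["unbound"]
    return 100.0 * (results[test][placement] - base) / base


for test in results:
    deltas = "  ".join(
        f"{p} {percent_vs_unbound(test, p):+6.1f}%"
        for p in ("stacked", "cross-smt", "cross-core")
    )
    print(f"{test:<13} {deltas}")
```

With these numbers, stacked comes out roughly 21% ahead of unbound for
TCP_RR and 8% ahead for UDP_RR, and roughly 14% to 28% behind for the four
streaming tests.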

[1] Nearly unbeatable: CPUs sharing an L2 can beat stacking by a wee bit.