2020-01-26 20:12:34

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 0/3] sched/fair: Capacity aware wakeup rework

This series is about replacing the current wakeup logic for asymmetric CPU
capacity topologies, i.e. wake_cap().

Details are in patch 1; the TL;DR is that wake_cap() works fine for
"legacy" big.LITTLE systems (e.g. Juno), since the Last Level Cache (LLC)
domain of a CPU only spans CPUs of the same capacity, but is somewhat broken
for newer DynamIQ systems (e.g. Dragonboard 845C), since the LLC domain of
a CPU can span all CPUs in the system. Both example boards are supported in
mainline.
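
For reference, here is roughly what wake_cap() currently looks like
(paraphrased from memory; the exact mainline code is what patch 3 removes).
It only considers the capacities of the waking CPU and of prev_cpu, and
implicitly assumes that the fast path cannot land the task on a CPU of a
different capacity, which only holds when the LLC domain is
capacity-homogeneous:

  static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
  {
          long min_cap, max_cap;

          if (!static_branch_unlikely(&sched_asym_cpucapacity))
                  return 0;

          min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
          max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;

          /* Minimum capacity is close to max, no need to abort wake_affine() */
          if (max_cap - min_cap < max_cap >> 3)
                  return 0;

          /* Bring task utilization in sync with prev_cpu */
          sync_entity_load_avg(&p->se);

          /* Force the slow path if the task doesn't fit the smaller CPU */
          return !task_fits_capacity(p, min_cap);
  }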

A bit of history
================

Due to the old Energy Model (EM) used until Android Common Kernel v4.14,
which grafted itself onto the sched domain hierarchy, mobile topologies
have been represented with "phantom domains"; IOW, we'd make a DynamIQ
topology look like a big.LITTLE one:

actual hardware:

+-------------------+
|        L3         |
+----+----+----+----+
| L2 | L2 | L2 | L2 |
+----+----+----+----+
|CPU0|CPU1|CPU2|CPU3|
+----+----+----+----+
   ^^^^^     ^^^^^
  LITTLEs    bigs

vanilla/mainline topology:

MC  [          ]
      0 1   2 3

phantom domains topology:

DIE [          ]
MC  [    ][    ]
      0 1   2 3

With the newer, mainline EM this is no longer required, and wake_cap() is
the last sticking point to getting rid of this legacy crud. More details
and examples are in patch 1.

Notes
=====

This removes the use of SD_BALANCE_WAKE for asymmetric CPU capacity
topologies (which are the last mainline users of that flag); as such, it
shouldn't be a surprise that this comes with significant improvements to
wake-intensive workloads: wakeups no longer go through the
select_task_rq_fair() slow path.
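
To illustrate, here is a heavily simplified (and paraphrased, not verbatim)
skeleton of that decision in select_task_rq_fair(): when no sched domain
with the relevant SD_BALANCE_* flag is found, the wakeup goes through the
select_idle_sibling() fast path rather than the find_idlest_cpu() slow path,
which is why dropping SD_BALANCE_WAKE short-circuits wakeups onto the cheap
path:

  /* Simplified sketch; wake_affine(), locking and cpumask checks omitted. */
  static int select_task_rq_fair(struct task_struct *p, int prev_cpu,
                                 int sd_flag, int wake_flags)
  {
          struct sched_domain *sd = NULL;
          int new_cpu = prev_cpu;

          /*
           * ... walk up the domain hierarchy, remembering the widest domain
           * that has 'sd_flag' (e.g. SD_BALANCE_WAKE) set ...
           */

          if (sd) {
                  /* Slow path: full scan for the idlest group/CPU */
                  new_cpu = find_idlest_cpu(sd, p, smp_processor_id(),
                                            prev_cpu, sd_flag);
          } else if (sd_flag & SD_BALANCE_WAKE) {
                  /* Fast path: cheap scan of the waker's LLC domain */
                  new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
          }

          return new_cpu;
  }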

Testing
=======

I've picked sysbench --test=threads to mimic Peter's testing mentioned in

commit 182a85f8a119 ("sched: Disable wakeup balancing")

Sysbench results are the number of events handled in a fixed amount of
time, so higher is better. Hackbench results are the usual time taken to
complete the run, so lower is better.

Note: the 'X%' stats are the percentiles, so 50% is the 50th percentile.

Juno r0 ("legacy" big.LITTLE)
+++++++++++++++++++++++++++++

This is 2 bigs and 4 LITTLEs:

+---------------+ +-------+
|      L2       | |  L2   |
+---+---+---+---+ +---+---+
| L | L | L | L | | B | B |
+---+---+---+---+ +---+---+


100 iterations of 'hackbench':

| | -PATCH | +PATCH | DELTA (%) |
|------+----------+----------+-----------|
| mean | 0.631040 | 0.617040 | -2.219 |
| std | 0.025486 | 0.017176 | -32.606 |
| min | 0.582000 | 0.580000 | -0.344 |
| 50% | 0.628500 | 0.613500 | -2.387 |
| 75% | 0.645500 | 0.625000 | -3.176 |
| 99% | 0.697060 | 0.668020 | -4.166 |
| max | 0.703000 | 0.670000 | -4.694 |

100 iterations of 'sysbench --max-time=5 --max-requests=-1 --test=threads --num-threads=6 run':

| | -PATCH | +PATCH | DELTA (%) |
|------+--------------+--------------+-----------|
| mean | 10267.760000 | 15767.580000 | +53.564 |
| std | 3110.439815 | 151.040712 | -95.144 |
| min | 7186.000000 | 15206.000000 | +111.606 |
| 50% | 9019.500000 | 15778.500000 | +74.938 |
| 75% | 12711.000000 | 15873.500000 | +24.880 |
| 99% | 15749.290000 | 16042.100000 | +1.859 |
| max | 15877.000000 | 16052.000000 | +1.102 |

Pixel3 (DynamIQ)
++++++++++++++++

Ideally I would have used a DB845C but had a few issues with mine, so I
went with a mainline-ish Pixel3 instead [1]. It's still the same SoC under
the hood (Snapdragon 845), which has 4 bigs and 4 LITTLEs:

+-------------------------------+
|              L3               |
+---+---+---+---+---+---+---+---+
| L2| L2| L2| L2| L2| L2| L2| L2|
+---+---+---+---+---+---+---+---+
| L | L | L | L | B | B | B | B |
+---+---+---+---+---+---+---+---+

Default topology (single MC domain)
-----------------------------------

100 iterations of 'hackbench -l 200'

| | -PATCH | +PATCH | DELTA (%) |
|------+----------+----------+-----------|
| mean | 1.131360 | 1.127490 | -0.342 |
| std | 0.116322 | 0.084154 | -27.654 |
| min | 0.935000 | 0.984000 | +5.241 |
| 50% | 1.099000 | 1.115500 | +1.501 |
| 75% | 1.211250 | 1.182500 | -2.374 |
| 99% | 1.401020 | 1.343070 | -4.136 |
| max | 1.502000 | 1.350000 | -10.120 |


100 iterations of 'sysbench --max-time=5 --max-requests=-1 --test=threads --num-threads=8 run':

| | -PATCH | +PATCH | DELTA (%) |
|------+-------------+-------------+-----------|
| mean | 7108.310000 | 8724.770000 | +22.740 |
| std | 199.431854 | 185.383188 | -7.044 |
| min | 6655.000000 | 8288.000000 | +24.538 |
| 50% | 7107.500000 | 8712.500000 | +22.582 |
| 75% | 7255.500000 | 8846.250000 | +21.925 |
| 99% | 7539.540000 | 9080.970000 | +20.445 |
| max | 7593.000000 | 9177.000000 | +20.861 |

Phantom domains (MC + DIE)
--------------------------

This is mostly included for the sake of completeness.

100 iterations of 'sysbench --max-time=5 --max-requests=-1 --test=threads --num-threads=8 run':

| | -PATCH | +PATCH | DELTA (%) |
|------+-------------+--------------+-----------|
| mean | 7317.940000 | 9888.270000 | +35.124 |
| std | 460.372682 | 225.019185 | -51.122 |
| min | 5888.000000 | 9277.000000 | +57.558 |
| 50% | 7271.000000 | 9879.500000 | +35.875 |
| 75% | 7497.500000 | 10023.750000 | +33.695 |
| 99% | 8464.390000 | 10318.320000 | +21.903 |
| max | 8602.000000 | 10350.000000 | +20.321 |

Revisions
=========

v2 -> v3
--------
o Added missing sync_entity_load_avg() (Quentin)
o Added fallback CPU selection (maximize capacity)
o Added special case for CPU hogs: task_fits_capacity() will always return 'false'
for tasks that are simply too big, due to the margin.

v1 -> v2
--------
o Removed unrelated select_idle_core() change

[1]: https://git.linaro.org/people/amit.pundir/linux.git/log/?h=blueline-mainline-tracking

Morten Rasmussen (3):
sched/fair: Add asymmetric CPU capacity wakeup scan
sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems
sched/fair: Kill wake_cap()

kernel/sched/fair.c | 89 +++++++++++++++++++++++++++--------------
kernel/sched/topology.c | 15 ++-----
2 files changed, 63 insertions(+), 41 deletions(-)

--
2.24.0


2020-01-28 09:26:41

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] sched/fair: Capacity aware wakeup rework

On 26/01/2020 21:09, Valentin Schneider wrote:

[...]

> v2 -> v3
> --------
> o Added missing sync_entity_load_avg() (Quentin)
> o Added fallback CPU selection (maximize capacity)
> o Added special case for CPU hogs: task_fits_capacity() will always return 'false'
> for tasks that are simply too big, due to the margin.

v3 fixes the Geekbench multicore regression I saw on Pixel4 (Android 10,
Android Common Kernel v4.14 based, Snapdragon 855) running v1.

I changed the Pixel4 kernel a bit (PELT instead of WALT, mainline
select_idle_sibling() instead of the cstate-aware one, mainline
task_fits_capacity()) for base, v1 & v3.

Since it's not a mainline kernel, the results have to be taken with a pinch
of salt, but they probably show that the new condition:

if (rq->cpu_capacity_orig == READ_ONCE(rq->rd->max_cpu_capacity) && ...
return cpu;

has an effect when dealing with tasks with util_est values > 0.8 * 1024.
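
For reference, the margin in question comes from fits_capacity() (paraphrased
here from that era's kernel/sched/fair.c, not quoted verbatim; the mainline
version may also apply uclamp clamping): a task only "fits" a CPU if its
estimated utilization is below roughly 80% of that CPU's capacity, so a task
with util_est > 0.8 * 1024 cannot fit even the biggest CPU and relies on the
fallback above.

  /* ~20% margin: util must be below max * 1024/1280, i.e. ~80% of capacity */
  #define fits_capacity(cap, max)  ((cap) * 1280 < (max) * 1024)

  static inline bool task_fits_capacity(struct task_struct *p, long capacity)
  {
          return fits_capacity(task_util_est(p), capacity);
  }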

2020-01-28 11:35:36

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] sched/fair: Capacity aware wakeup rework

On 28/01/2020 09:24, Dietmar Eggemann wrote:
> On 26/01/2020 21:09, Valentin Schneider wrote:
>
> [...]
>
>> v2 -> v3
>> --------
>> o Added missing sync_entity_load_avg() (Quentin)
>> o Added fallback CPU selection (maximize capacity)
>> o Added special case for CPU hogs: task_fits_capacity() will always return 'false'
>> for tasks that are simply too big, due to the margin.
>
> v3 fixes the Geekbench multicore regression I saw on Pixel4 (Android 10,
> Android Common Kernel v4.14 based, Snapdragon 855) running v1.
>
> I changed the Pixel4 kernel a bit (PELT instead of WALT, mainline
> select_idle_sibling() instead of the cstate-aware one, mainline
> task_fits_capacity()) for base, v1 & v3.
>
> Since it's not a mainline kernel, the results have to be taken with a pinch
> of salt, but they probably show that the new condition:
>
> if (rq->cpu_capacity_orig == READ_ONCE(rq->rd->max_cpu_capacity) && ...
> return cpu;
>
> has an effect when dealing with tasks with util_est values > 0.8 * 1024.
>

Thanks for going through that hassle! I'm quite pleased with the results
of that change - if you also look at e.g. the stddev for sysbench on Juno,
runs are much more consistent now.