2021-01-28 00:00:42

by Mel Gorman

Subject: [PATCH v5 0/4] Scan for an idle sibling in a single pass

Changelog since v4
o Avoid use of intermediate variable during select_idle_cpu

Changelog since v3
o Drop scanning based on cores, SMT4 results showed problems

Changelog since v2
o Remove unnecessary parameters
o Update nr during scan only when scanning for cpus

Changelog since v1
o Move extern declaration to header for coding style
o Remove unnecessary parameter from __select_idle_cpu

This series of 4 patches reposts three patches from Peter entitled
"select_idle_sibling() wreckage". With it applied, the runqueues are
scanned in a single pass when searching for an idle sibling.
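
For illustration, the merged loop at the end of the series looks
roughly like this. It is a simplified sketch rather than the exact
committed code; the SIS_PROP accounting and scan-depth limit are
omitted:

static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd,
			   bool has_idle_core, int target)
{
	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
	int i, cpu, idle_cpu = -1;

	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

	/*
	 * Single pass over the LLC: search for an idle core while the
	 * has_idle_cores hint is set, remembering any idle CPU seen on
	 * the way, otherwise settle for the first idle CPU.
	 */
	for_each_cpu_wrap(cpu, cpus, target) {
		if (has_idle_core) {
			i = select_idle_core(p, cpu, cpus, &idle_cpu);
			if ((unsigned int)i < nr_cpumask_bits)
				return i;
		} else {
			idle_cpu = __select_idle_cpu(cpu);
			if ((unsigned int)idle_cpu < nr_cpumask_bits)
				break;
		}
	}

	/* The scan found no idle core: clear the hint for later wakeups. */
	if (has_idle_core)
		set_idle_cores(target, false);

	return idle_cpu;
}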

Three patches from Peter were dropped. The first patch altered how scan
depth was calculated. Scan depth selection is essentially a random
number generator with two major limitations. First, the avg_idle time
is based on the time between a CPU going idle and being woken up,
clamped approximately by 2*sysctl_sched_migration_cost, which is
difficult to compare in a sensible fashion to avg_scan_cost. Second,
only the avg_scan_cost of scan failures is recorded and it does not
decay. Fixing this requires deeper surgery that would justify a patch
on its own, although Peter notes that
https://lkml.kernel.org/r/[email protected] is
potentially useful for an alternative avg_idle metric.
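
For context, the depth calculation being criticised interacts with the
SIS_PROP logic in select_idle_cpu(), which at the time looked roughly
like this (a sketch of the mainline code, not the dropped patch):

	if (sched_feat(SIS_PROP)) {
		u64 avg_cost, avg_idle, span_avg;

		/*
		 * avg_idle is clamped elsewhere to approximately
		 * 2*sysctl_sched_migration_cost and avg_scan_cost only
		 * accumulates on failed scans without decaying, so the
		 * ratio below is not a like-for-like comparison.
		 */
		avg_idle = this_rq()->avg_idle / 512;
		avg_cost = this_sd->avg_scan_cost + 1;

		span_avg = sd->span_weight * avg_idle;
		if (span_avg > 4 * avg_cost)
			nr = div_u64(span_avg, avg_cost);
		else
			nr = 4;
	}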

The second patch that was dropped scanned based on cores instead of
CPUs, as it rationalised the difference between core scanning and CPU
scanning. Unfortunately, Vincent reported problems with SMT4, so it's
dropped for now until depth searching can be fixed.

The third patch that was dropped converted the idle core scan
throttling mechanism to SIS_PROP. While this would unify the throttling
of core and CPU scanning, it was not free of regressions, and
has_idle_cores is a fairly effective throttling mechanism, with the
caveat that it can have a lot of false positives for workloads like
hackbench.
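
For reference, has_idle_cores is a single per-LLC hint, set when a CPU
notices its whole core has gone idle and cleared when a scan fails to
find an idle core. A sketch of the helpers, close to the mainline code
of the time:

static inline void set_idle_cores(int cpu, int val)
{
	struct sched_domain_shared *sds;

	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
	if (sds)
		WRITE_ONCE(sds->has_idle_cores, val);
}

static inline bool test_idle_cores(int cpu, bool def)
{
	struct sched_domain_shared *sds;

	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
	if (sds)
		return READ_ONCE(sds->has_idle_cores);

	return def;
}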

Peter's series tried to solve three problems at once; this subset
addresses one of them.


Mel Gorman (4):
sched/fair: Remove SIS_AVG_CPU
sched/fair: Move avg_scan_cost calculations under SIS_PROP
sched/fair: Remove select_idle_smt()
sched/fair: Merge select_idle_core/cpu()

kernel/sched/fair.c | 151 +++++++++++++++++++---------------------
kernel/sched/features.h | 1 -
2 files changed, 70 insertions(+), 82 deletions(-)

--
2.26.2


2021-02-01 01:17:25

by Li, Aubrey

Subject: Re: [PATCH v5 0/4] Scan for an idle sibling in a single pass

On 2021/1/27 21:51, Mel Gorman wrote:
> Changelog since v4
> o Avoid use of intermediate variable during select_idle_cpu
>
> Changelog since v3
> o Drop scanning based on cores, SMT4 results showed problems
>
> Changelog since v2
> o Remove unnecessary parameters
> o Update nr during scan only when scanning for cpus
>
> Changelog since v1
> o Move extern declaration to header for coding style
> o Remove unnecessary parameter from __select_idle_cpu
>
> This series of 4 patches reposts three patches from Peter entitled
> "select_idle_sibling() wreckage". With it applied, the runqueues are
> scanned in a single pass when searching for an idle sibling.
>
> Three patches from Peter were dropped. The first patch altered how scan
> depth was calculated. Scan depth selection is essentially a random
> number generator with two major limitations. First, the avg_idle time
> is based on the time between a CPU going idle and being woken up,
> clamped approximately by 2*sysctl_sched_migration_cost, which is
> difficult to compare in a sensible fashion to avg_scan_cost. Second,
> only the avg_scan_cost of scan failures is recorded and it does not
> decay. Fixing this requires deeper surgery that would justify a patch
> on its own, although Peter notes that
> https://lkml.kernel.org/r/[email protected] is
> potentially useful for an alternative avg_idle metric.
>
> The second patch that was dropped scanned based on cores instead of
> CPUs, as it rationalised the difference between core scanning and CPU
> scanning. Unfortunately, Vincent reported problems with SMT4, so it's
> dropped for now until depth searching can be fixed.
>
> The third patch that was dropped converted the idle core scan
> throttling mechanism to SIS_PROP. While this would unify the
> throttling of core and CPU scanning, it was not free of regressions,
> and has_idle_cores is a fairly effective throttling mechanism, with
> the caveat that it can have a lot of false positives for workloads
> like hackbench.
>
> Peter's series tried to solve three problems at once; this subset
> addresses one of them.
>
> kernel/sched/fair.c | 151 +++++++++++++++++++---------------------
> kernel/sched/features.h | 1 -
> 2 files changed, 70 insertions(+), 82 deletions(-)
>

4 benchmarks were measured on an x86 4-socket system with 24 cores per
socket and 2 hyperthreads per core, 192 CPUs in total.

The load levels tested are 25%, 50%, 75% and 100%. In the tables below,
results are normalised against the baseline (base = 1) and %std is the
standard deviation as a percentage of the mean.

- hackbench shows an almost universal win.
- netperf under high load shows notable changes, as does tbench at 50%
  load.

Details below:

hackbench: 10 iterations, 10000 loops, 40 fds per group
======================================================

- pipe process

group   base    %std    v5      %std
3       1       19.18   1.0266  9.06
6       1       9.17    0.987   13.03
9       1       7.11    1.0195  4.61
12      1       1.07    0.9927  1.43

- pipe thread

group   base    %std    v5      %std
3       1       11.14   0.9742  7.27
6       1       9.15    0.9572  7.48
9       1       2.95    0.986   4.05
12      1       1.75    0.9992  1.68

- socket process

group   base    %std    v5      %std
3       1       2.9     0.9586  2.39
6       1       0.68    0.9641  1.3
9       1       0.64    0.9388  0.76
12      1       0.56    0.9375  0.55

- socket thread

group   base    %std    v5      %std
3       1       3.82    0.9686  2.97
6       1       2.06    0.9667  1.91
9       1       0.44    0.9354  1.25
12      1       0.54    0.9362  0.6

netperf: 10 iterations x 100 seconds, transaction rate / sec
============================================================

- tcp request/response performance

thread  base    %std    v5      %std
25%     1       5.34    1.0039  5.13
50%     1       4.97    1.0115  6.3
75%     1       5.09    0.9257  6.75
100%    1       4.53    0.908   4.83



- udp request/response performance

thread  base    %std    v5      %std
25%     1       6.18    0.9896  6.09
50%     1       5.88    1.0198  8.92
75%     1       24.38   0.9236  29.14
100%    1       26.16   0.9063  22.16

tbench: 10 iterations x 100 seconds, throughput / sec
=====================================================

thread  base    %std    v5      %std
25%     1       0.45    1.003   1.48
50%     1       1.71    0.9286  0.82
75%     1       0.84    0.9928  0.94
100%    1       0.76    0.9762  0.59

schbench: 10 iterations x 100 seconds, 99th percentile latency
==============================================================

mthread base    %std    v5      %std
25%     1       2.89    0.9884  7.34
50%     1       40.38   1.0055  38.37
75%     1       4.76    1.0095  4.62
100%    1       10.09   1.0083  8.03

Thanks,
-Aubrey

2021-02-01 13:12:46

by Mel Gorman

Subject: Re: [PATCH v5 0/4] Scan for an idle sibling in a single pass

On Mon, Feb 01, 2021 at 09:13:16AM +0800, Li, Aubrey wrote:
> > Peter's series tried to solve three problems at once, this subset addresses
> > one problem.
> >
> > kernel/sched/fair.c | 151 +++++++++++++++++++---------------------
> > kernel/sched/features.h | 1 -
> > 2 files changed, 70 insertions(+), 82 deletions(-)
> >
>
> 4 benchmarks measured on a x86 4s system with 24 cores per socket and
> 2 HTs per core, total 192 CPUs.
>
> The load level is [25%, 50%, 75%, 100%].
>
> - hackbench almost has a universal win.
> - netperf high load has notable changes, as well as tbench 50% load.
>

Ok, both netperf and tbench results are somewhat expected as those
loads are rapidly idling. Previously I observed that rapidly idling
loads can allow the has_idle_cores test to pass for short durations,
and the double scanning means there is a greater chance of finding an
idle CPU over the two passes. I think it's better overall to avoid
double scanning, even if there are counter-examples, as it's possible
we'll get that back through better depth selection in the future.
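
To make the double-scan point concrete, the pre-series wakeup path did
roughly the following (a simplified sketch of the old tail end of
select_idle_sibling()):

	/* First pass over the LLC looking for a fully idle core. */
	i = select_idle_core(p, sd, target);
	if ((unsigned int)i < nr_cpumask_bits)
		return i;

	/*
	 * Second pass over largely the same CPUs looking for any idle
	 * CPU. With a rapidly idling load, a CPU that was busy during
	 * the first pass may have gone idle by the second, which is
	 * where the old code could get lucky.
	 */
	i = select_idle_cpu(p, sd, target);
	if ((unsigned int)i < nr_cpumask_bits)
		return i;

	/* Finally, check the SMT siblings of the target. */
	i = select_idle_smt(p, target);
	if ((unsigned int)i < nr_cpumask_bits)
		return i;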

Thanks.

--
Mel Gorman
SUSE Labs