v4 at https://lore.kernel.org/lkml/[email protected]/
Changes wrt v4:
- Removing conditional access in the function arch_scale_freq_capacity()
and initialize arch_freq_scale to 1024 to account for when freq
invariance isn't enabled (Ionela V.)
- In case the max frequency can't be read in MSRs, do not enable frequency
invariance at all (Ionela V., Peter Z.).
- Renames:
variables:
arch_cpu_freq -> arch_freq_scale
arch_max_freq -> arch_max_freq_ratio
... and others
functions:
init_scale_freq -> init_counter_refs
set_cpu_max_freq -> init_freq_invariance
{core,skx,knl...}_set_cpu_max_freq -> {core,skx,knl...}_set_max_freq_ratio
... and others
- Use the same function for parsing SKX and GLM registers (Peter Z.)
- Pass a parameter to the function parsing KNL registers (Peter Z.)
- Fix a bug whereby refs to [am]perf were initialized only on cpu #0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Cover Letter from v4:
v3 at https://lore.kernel.org/lkml/[email protected]/
Changes wrt v3:
- Add definition of function set_arch_max_freq if !CONFIG_SMP
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Cover Letter from v3:
v2 at https://lore.kernel.org/lkml/[email protected]/
Changes wrt v2:
- Removing the tick_disable mechanism. Frequency scale-invariance isn't
just about helping schedutil choose better frequencies, but also
providing the scheduler load balancer with better metrics. All users of
PELT signals benefit from this feature. The tick_disable patch disabled
frequency invariant calculation when a specific driver is in use
(intel_pstate in active mode).
- static_branch_enable(&arch_scale_freq_key) is now called earlier, right
after we learn that X86_FEATURE_APERFMPERF is available. Previously Peter
Z. commented "if we can't tell the max_freq we don't want to use the
invariant stuff.". I've decided to do it differently: if we can't tell
the max_freq, then it's because the CPU encodes max_freq in MSRs in a way
this patch doesn't understand, and we assume max_p is the max_freq which
seems like a safe bet. As a reminder, max_freq=max_p is encoded by
setting arch_max_freq=1024 as default value. I'm open to feedback.
- Refactoring the switch case statement in set_cpu_max_freq() as suggested by
Rafael W. Now the first patch doesn't hint at what the following patch will
bring along.
- Handling the case where turbo is disabled at runtime and a _PPC ACPI
notification is issued, as requested by Rafael W. This happens eg. when
some laptop model is disconnected from AC. (Patch #6)
- Handling all Intel x86_64 micro-arches.
- A note for Srinivas P., who expressed concern for Atoms: on Atom CPUs the
max_freq is set to the highest turbo level, as a power-efficiency
oriented measure. In this way the ratio curr_freq/max_freq tends to be
lower, PELT signals are consequently lower, and schedutil doesn't push
too hard on speed. (Patches #4 and #5).
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Cover Letter from v2:
v1 at https://lore.kernel.org/lkml/[email protected]/
Changes wrt v1:
- add x86-specific implementation of arch_scale_freq_invariant() using a
static key that checks for the availability of APERF and MPERF
- refer to GOLDMONT_D instead of GOLDMONT_X, according to recent rename
- set arch_cpu_freq to 1024 from x86_arch_scale_freq_tick_disable() to prevent
PELT from being fed stale data
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Cover Letter from v1:
This is a resend of Peter Zijlstra's patch to support frequency
scale-invariance on x86 from May 2018 [see 1]. I've added some modifications
and included performance test results. If Peter doesn't mind, I'm slapping my
name on it :)
The changes from Peter's original implementation are:
1) normalizing against the 4-cores turbo level instead of 1-core turbo
2) removing the run-time search for when the above value isn't found in the
various Intel MSRs -- the base frequency value is taken in that case.
The section "4. KNOWN LIMITATIONS" in the first patch commit message addresses
the reason why this approach was dropped back in 2018, and explains that the
performance gains outweigh that issue.
The second patch from Srinivas is taken verbatim from the May 2018 submission
as it still applies.
I apologize for the length of patch #1 commit message; I've made a table of
contents with summaries of each section that should make it easier to skim
through the content.
This submission incorporates the feedback and requests for additional tests
received during the presentation made at OSPM 2019 in Pisa three months ago.
[1] https://lore.kernel.org/lkml/[email protected]/
Giovanni Gherdovich (6):
x86,sched: Add support for frequency invariance
x86,sched: Add support for frequency invariance on SKYLAKE_X
x86,sched: Add support for frequency invariance on XEON_PHI_KNL/KNM
x86,sched: Add support for frequency invariance on ATOM_GOLDMONT*
x86,sched: Add support for frequency invariance on ATOM
x86: intel_pstate: handle runtime turbo disablement/enablement in
freq. invariance
arch/x86/include/asm/topology.h | 25 ++++
arch/x86/kernel/smpboot.c | 290 +++++++++++++++++++++++++++++++++++++++-
drivers/cpufreq/intel_pstate.c | 1 +
kernel/sched/core.c | 1 +
kernel/sched/sched.h | 7 +
5 files changed, 323 insertions(+), 1 deletion(-)
--
2.16.4
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On GOLDMONT (aka Apollo Lake), GOLDMONT_D (aka Denverton) and
GOLDMONT_PLUS CPUs (aka Gemini Lake) set freq_max to the highest frequency
reported by the CPU.
The encoding of turbo ratios for GOLDMONT* is identical to the one for
SKYLAKE_X, but we treat the Atom case apart because we want to set freq_max to
a higher value, thus the ratio freq_curr/freq_max to be lower, leading to more
conservative frequency selections (favoring power efficiency).
Signed-off-by: Giovanni Gherdovich <[email protected]>
---
arch/x86/kernel/smpboot.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 8cb3113377a9..3e32d620f1fb 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1795,6 +1795,10 @@ void native_play_dead(void)
* which would ignore the entire turbo range (a conspicuous part, making
* freq_curr/freq_max always maxed out).
*
+ * An exception to the heuristic above is the Atom uarch, where we choose the
+ * highest turbo level for freq_max since Atom's are generally oriented towards
+ * power efficiency.
+ *
* Setting freq_max to anything less than the 1C turbo ratio makes the ratio
* freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
*/
@@ -1937,18 +1941,18 @@ static bool intel_set_max_freq_ratio(void)
/*
* TODO: add support for:
*
- * - Atom Goldmont
* - Atom Silvermont
*/
u64 base_freq = 1, turbo_freq = 1;
- if (x86_match_cpu(has_glm_turbo_ratio_limits))
- return false;
-
if (turbo_disabled())
goto out;
+ if (x86_match_cpu(has_glm_turbo_ratio_limits) &&
+ skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
+ goto out;
+
if (knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
goto out;
--
2.16.4
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On all ATOM CPUs prior to Goldmont, set freq_max to the 1-core
turbo ratio.
We intended to perform tests validating that this patch doesn't regress in
terms of energy efficiency, given that this is the primary concern on Atom
processors. Alas, we found out that turbostat doesn't support reading RAPL
interfaces on our test machine (Airmont), and we don't have external equipment
to measure power consumption; all we have is the performance results of the
benchmarks we ran.
Test machine:
Platform : Dell Wyse 3040 Thin Client[1]
CPU Model : Intel Atom x5-Z8350 (aka Cherry Trail, aka Airmont)
Fam/Mod/Ste : 6:76:4
Topology : 1 socket, 4 cores / 4 threads
Memory : 2G
Storage : onboard flash, XFS filesystem
[1] https://www.dell.com/en-us/work/shop/wyse-endpoints-and-software/wyse-3040-thin-client/spd/wyse-3040-thin-client
Base frequency and available turbo levels (MHz):
Min Operating Freq 266 |***
Low Freq Mode 800 |********
Base Freq 2400 |************************
4 Cores 2800 |****************************
3 Cores 2800 |****************************
2 Cores 3200 |********************************
1 Core 3200 |********************************
Tested kernels:
Baseline : v5.4-rc1, intel_pstate passive, schedutil
Comparison #1 : v5.4-rc1, intel_pstate active , powersave
Comparison #2 : v5.4-rc1, this patch, intel_pstate passive, schedutil
tbench, hackbench and kernbench performed the same under all three kernels;
dbench ran faster with intel_pstate/powersave and the git unit tests were a
lot faster with intel_pstate/powersave and invariant schedutil wrt the
baseline. Not that any of this is terribly interesting anyway, one doesn't buy
an Atom system to go fast. Power consumption regressions aren't expected but
we lack the equipment to make that measurement. Turbostat seems to think that
reading RAPL on this machine isn't a good idea and we're trusting that
decision.
comparison ratio of performance with baseline; 1.00 means neutral,
lower is better:
I_PSTATE FREQ-INV
----------------------------------------
dbench 0.90 ~
kernbench 0.98 0.97
gitsource 0.63 0.43
Signed-off-by: Giovanni Gherdovich <[email protected]>
---
arch/x86/kernel/smpboot.c | 27 +++++++++++++++++++++------
1 file changed, 21 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 3e32d620f1fb..5f04bf8419f9 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1821,6 +1821,24 @@ static bool turbo_disabled(void)
return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
}
+static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+{
+ int err;
+
+ err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, base_freq);
+ if (err)
+ return false;
+
+ err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_freq);
+ if (err)
+ return false;
+
+ *base_freq = (*base_freq >> 16) & 0x3F; /* max P state */
+ *turbo_freq = *turbo_freq & 0x3F; /* 1C turbo */
+
+ return true;
+}
+
#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>
@@ -1938,17 +1956,14 @@ static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
static bool intel_set_max_freq_ratio(void)
{
- /*
- * TODO: add support for:
- *
- * - Atom Silvermont
- */
-
u64 base_freq = 1, turbo_freq = 1;
if (turbo_disabled())
goto out;
+ if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
+ goto out;
+
if (x86_match_cpu(has_glm_turbo_ratio_limits) &&
skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
goto out;
--
2.16.4
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On Xeon Phi CPUs set freq_max to the second-highest frequency
reported by the CPU.
Xeon Phi CPUs such as Knights Landing and Knights Mill typically have either
one or two turbo frequencies; in the former case that's 100 MHz above the base
frequency, in the latter case the two levels are 100 MHz and 200 MHz above
base frequency.
We set freq_max to the second-highest frequency reported by the CPU. This
could be the base frequency (if only one turbo level is available) or the first
turbo level (if two levels are available). The rationale is to compromise
between power efficiency and performance -- going straight to max turbo would
favor efficiency and blindly using base freq would favor performance.
For reference, this is how MSR_TURBO_RATIO_LIMIT must be parsed on a Xeon Phi
to get the available frequencies (taken from a comment in turbostat's sources):
[0] -- Reserved
[7:1] -- Base value of number of active cores of bucket 1.
[15:8] -- Base value of freq ratio of bucket 1.
[20:16] -- +ve delta of number of active cores of bucket 2.
i.e. active cores of bucket 2 =
active cores of bucket 1 + delta
[23:21] -- Negative delta of freq ratio of bucket 2.
i.e. freq ratio of bucket 2 =
freq ratio of bucket 1 - delta
[28:24]-- +ve delta of number of active cores of bucket 3.
[31:29]-- -ve delta of freq ratio of bucket 3.
[36:32]-- +ve delta of number of active cores of bucket 4.
[39:37]-- -ve delta of freq ratio of bucket 4.
[44:40]-- +ve delta of number of active cores of bucket 5.
[47:45]-- -ve delta of freq ratio of bucket 5.
[52:48]-- +ve delta of number of active cores of bucket 6.
[55:53]-- -ve delta of freq ratio of bucket 6.
[60:56]-- +ve delta of number of active cores of bucket 7.
[63:61]-- -ve delta of freq ratio of bucket 7.
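The bit layout above can be turned into a small user-space decoder. What
follows is only a sketch for illustration, not the code from this patch: the
names knl_bucket, knl_decode_buckets() and knl_ratio_after() are hypothetical,
and falling back to the lowest decoded ratio when fewer turbo levels exist is
an assumption made here.

```c
#include <assert.h>
#include <stdint.h>

#define KNL_NR_BUCKETS 7

struct knl_bucket {
	unsigned int cores;	/* max active cores for this bucket */
	unsigned int ratio;	/* frequency ratio for this bucket */
};

/* Decode all seven buckets following the bit layout quoted above. */
static void knl_decode_buckets(uint64_t msr, struct knl_bucket b[KNL_NR_BUCKETS])
{
	int i;

	b[0].cores = (msr >> 1) & 0x7F;		/* bits [7:1]  */
	b[0].ratio = (msr >> 8) & 0xFF;		/* bits [15:8] */

	for (i = 1; i < KNL_NR_BUCKETS; i++) {
		unsigned int shift = 16 + 8 * (i - 1);

		/* 5-bit positive core delta, 3-bit negative ratio delta */
		b[i].cores = b[i - 1].cores + ((msr >> shift) & 0x1F);
		b[i].ratio = b[i - 1].ratio - ((msr >> (shift + 5)) & 0x7);
	}
}

/*
 * Return the ratio found after stepping down `skip` distinct levels from the
 * highest one (skip == 1 is the second-highest frequency). If fewer levels
 * exist, the lowest decoded ratio is returned (assumption of this sketch).
 */
static unsigned int knl_ratio_after(const struct knl_bucket b[KNL_NR_BUCKETS],
				    unsigned int skip)
{
	unsigned int found = 0;
	int i;

	for (i = 1; i < KNL_NR_BUCKETS && found < skip; i++)
		if (b[i].ratio != b[i - 1].ratio)
			found++;

	return b[i - 1].ratio;
}
```

With a synthetic register value encoding bucket 1 as 30 cores at ratio 12 and
bucket 2 as 20 more cores at one ratio step lower, knl_ratio_after(b, 1)
yields ratio 11, i.e. the second-highest level.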
1. PERFORMANCE EVALUATION: TBENCH +5%
2. NEUTRAL BENCHMARKS (ALL OTHERS)
3. TEST SETUP
1. PERFORMANCE EVALUATION: TBENCH +5%
-------------------------------------
A performance evaluation was conducted on a Knights Mill machine (see "Test
Setup" below), where the frequency-invariance patch (on schedutil) is compared
to both non-invariant schedutil and active intel_pstate with powersave: all
three tested kernels behave the same performance-wise and with regard to power
consumption (performance per watt). The only notable difference is tbench:
comparison ratio of performance with baseline; 1.00 means neutral,
higher is better:
I_PSTATE FREQ-INV
----------------------------------------
tbench 1.04 1.05
performance-per-watt ratios with baseline; 1.00 means neutral, higher is better:
I_PSTATE FREQ-INV
----------------------------------------
tbench 1.03 1.04
which essentially means that frequency-invariant schedutil is 5% better than
baseline, the same as intel_pstate+powersave.
As the results above are averaged over the varying parameter, here is the
detailed table.
Varying parameter : number of clients
Unit : MB/sec (higher is better)
5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean 1 49.06 +- 2.12% ( ) 51.66 +- 1.52% ( 5.30%) 52.87 +- 0.88% ( 7.76%)
Hmean 2 93.82 +- 0.45% ( ) 103.24 +- 0.70% ( 10.05%) 105.90 +- 0.70% ( 12.88%)
Hmean 4 192.46 +- 1.15% ( ) 215.95 +- 0.60% ( 12.21%) 215.78 +- 1.43% ( 12.12%)
Hmean 8 406.74 +- 2.58% ( ) 438.58 +- 0.36% ( 7.83%) 437.61 +- 0.97% ( 7.59%)
Hmean 16 857.70 +- 1.22% ( ) 890.26 +- 0.72% ( 3.80%) 889.11 +- 0.73% ( 3.66%)
Hmean 32 1760.10 +- 0.92% ( ) 1791.70 +- 0.44% ( 1.79%) 1787.95 +- 0.44% ( 1.58%)
Hmean 64 3183.50 +- 0.34% ( ) 3183.19 +- 0.36% ( -0.01%) 3187.53 +- 0.36% ( 0.13%)
Hmean 128 4830.96 +- 0.31% ( ) 4846.53 +- 0.30% ( 0.32%) 4855.86 +- 0.30% ( 0.52%)
Hmean 256 5467.98 +- 0.38% ( ) 5793.80 +- 0.28% ( 5.96%) 5821.94 +- 0.17% ( 6.47%)
Hmean 512 5398.10 +- 0.06% ( ) 5745.56 +- 0.08% ( 6.44%) 5503.68 +- 0.07% ( 1.96%)
Hmean 1024 5290.43 +- 0.63% ( ) 5221.07 +- 0.47% ( -1.31%) 5277.22 +- 0.80% ( -0.25%)
Hmean 1088 5139.71 +- 0.57% ( ) 5236.02 +- 0.71% ( 1.87%) 5190.57 +- 0.41% ( 0.99%)
2. NEUTRAL BENCHMARKS (ALL OTHERS)
----------------------------------
* pgbench (both read/write and read-only)
* NAS Parallel Benchmarks (NPB), MPI or OpenMP for message-passing
* hackbench
* netperf
* dbench
* kernbench
* gitsource (git unit test suite)
3. TEST SETUP
-------------
Test machine:
CPU Model : Intel Xeon Phi CPU 7255 @ 1.10GHz (a.k.a. Knights Mill)
Fam/Mod/Ste : 6:133:0
Topology : 1 socket, 68 cores / 272 threads
Memory : 96G
Storage : rotary, XFS filesystem
Max EFFICiency, BASE frequency and available turbo levels (MHz):
EFFIC 1000 |**********
BASE 1100 |***********
68C 1100 |***********
30C 1200 |************
Tested kernels:
Baseline : v5.2, intel_pstate passive, schedutil
Comparison #1 : v5.2, intel_pstate active , powersave
Comparison #2 : v5.2, this patch, intel_pstate passive, schedutil
Signed-off-by: Giovanni Gherdovich <[email protected]>
---
arch/x86/kernel/smpboot.c | 49 ++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 46 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ba9d3bdc191c..8cb3113377a9 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1841,6 +1841,48 @@ static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
{}
};
+static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
+ int num_delta_fratio)
+{
+ int fratio, delta_fratio, found;
+ int err, i;
+ u64 msr;
+
+ if (!x86_match_cpu(has_knl_turbo_ratio_limits))
+ return false;
+
+ err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
+ if (err)
+ return false;
+
+ *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
+
+ err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
+ if (err)
+ return false;
+
+ fratio = (msr >> 8) & 0xFF;
+ i = 16;
+ found = 0;
+ do {
+ if (found >= num_delta_fratio) {
+ *turbo_freq = fratio;
+ return true;
+ }
+
+ delta_fratio = (msr >> (i + 5)) & 0x7;
+
+ if (delta_fratio) {
+ found += 1;
+ fratio -= delta_fratio;
+ }
+
+ i += 8;
+ } while (i < 64);
+
+ return true;
+}
+
static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
{
u64 ratios, counts;
@@ -1895,20 +1937,21 @@ static bool intel_set_max_freq_ratio(void)
/*
* TODO: add support for:
*
- * - Xeon Phi (KNM, KNL)
* - Atom Goldmont
* - Atom Silvermont
*/
u64 base_freq = 1, turbo_freq = 1;
- if (x86_match_cpu(has_knl_turbo_ratio_limits) ||
- x86_match_cpu(has_glm_turbo_ratio_limits))
+ if (x86_match_cpu(has_glm_turbo_ratio_limits))
return false;
if (turbo_disabled())
goto out;
+ if (knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
+ goto out;
+
if (x86_match_cpu(has_skx_turbo_ratio_limits) &&
skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4))
goto out;
--
2.16.4
Implement arch_scale_freq_capacity() for 'modern' x86. This function
is used by the scheduler to correctly account usage in the face of
DVFS.
The present patch addresses Intel processors specifically and has positive
performance and performance-per-watt implications for the schedutil cpufreq
governor, bringing it closer to, if not on-par with, the powersave governor
from the intel_pstate driver/framework.
Large performance gains are obtained when the machine is lightly loaded and
no regressions are observed at saturation. The benchmarks with the largest
gains are kernel compilation, tbench (the networking version of dbench) and
shell-intensive workloads.
1. FREQUENCY INVARIANCE: MOTIVATION
* Without it, a task looks larger if the CPU runs slower
2. PECULIARITIES OF X86
* freq invariance accounting requires knowing the ratio freq_curr/freq_max
2.1 CURRENT FREQUENCY
* Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz")
2.2 MAX FREQUENCY
* It varies with time (turbo). As an approximation, we set it to a
constant, i.e. 4-cores turbo frequency.
3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
* The invariant schedutil's formula has no feedback loop and reacts faster
to utilization changes
4. KNOWN LIMITATIONS
* In some cases tasks can't reach max util despite how hard they try
5. PERFORMANCE TESTING
5.1 MACHINES
* Skylake, Broadwell, Haswell
5.2 SETUP
* baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12
active cores turbo w/ invariant schedutil, and intel_pstate/powersave
5.3 BENCHMARK RESULTS
5.3.1 NEUTRAL BENCHMARKS
* NAS Parallel Benchmark (HPC), hackbench
5.3.2 NON-NEUTRAL BENCHMARKS
* tbench (10-30% better), kernbench (10-15% better),
shell-intensive-scripts (30-50% better)
* no regressions
5.3.3 SELECTION OF DETAILED RESULTS
5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
* dbench (5% worse on one machine), kernbench (3% worse),
tbench (5-10% better), shell-intensive-scripts (10-40% better)
6. MICROARCH'ES ADDRESSED HERE
* Xeon Core before Scalable Performance processors line (Xeon Gold/Platinum
etc have different MSRs semantic for querying turbo levels)
7. REFERENCES
* MMTests performance testing framework, github.com/gormanm/mmtests
+-------------------------------------------------------------------------+
| 1. FREQUENCY INVARIANCE: MOTIVATION
+-------------------------------------------------------------------------+
For example, suppose a CPU has two frequencies: 500 and 1000 MHz. When
running a task that would consume 1/3rd of a CPU at 1000 MHz, it would
appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the
false impression this CPU is almost at capacity, even though it can go
faster [*]. In a nutshell, without frequency scale-invariance tasks look
larger just because the CPU is running slower.
[*] (footnote: this assumes a linear frequency/performance relation; which
everybody knows to be false, but given realities it's the best approximation
we can make.)
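The arithmetic of the example above can be sketched in a few lines of C. This
is an illustration only, not code from this patch: the helper name
util_invariant() and the use of 1024 as the fixed-point unit for utilization
are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define UTIL_SCALE 1024ULL	/* assumed unit: 1024 == 100% of a CPU */

/*
 * Hypothetical helper: rescale a raw (time-based) utilization by
 * freq_curr / freq_max, under the linear frequency/performance assumption
 * stated in the footnote above.
 */
static uint64_t util_invariant(uint64_t util_raw, uint64_t freq_curr,
			       uint64_t freq_max)
{
	assert(freq_max != 0);
	return util_raw * freq_curr / freq_max;
}
```

A task that looks 2/3 busy at 500 MHz (util_raw = 2 * UTIL_SCALE / 3) is
rescaled by util_invariant(util_raw, 500, 1000) to about one third of
UTIL_SCALE, which is what it would consume at 1000 MHz.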
+-------------------------------------------------------------------------+
| 2. PECULIARITIES OF X86
+-------------------------------------------------------------------------+
Accounting for frequency changes in PELT signals requires the computation of
the ratio freq_curr / freq_max. On x86 neither of those terms is readily
available.
2.1 CURRENT FREQUENCY
=====================
Since modern x86 has hardware control over the actual frequency we run
at (because amongst other things, Turbo-Mode), we cannot simply use
the frequency as requested through cpufreq.
Instead we use the APERF/MPERF MSRs to compute the effective frequency
over the recent past. Also, because reading MSRs is expensive, don't
do so every time we need the value, but amortize the cost by doing it
every tick.
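A minimal user-space sketch of the computation described above, assuming the
APERF and MPERF deltas between two ticks are already available. It is not the
kernel code: the function name freq_scale() is hypothetical, and expressing
max_freq_ratio (freq_max / freq_base) in 1024 fixed point is an assumption
made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SHIFT	10
#define CAPACITY_SCALE	(1ULL << CAPACITY_SHIFT)	/* 1024 == ratio of 1.0 */

/*
 * delta_aperf / delta_mperf approximates freq_curr / freq_base over the
 * sampling window. Dividing by max_freq_ratio (freq_max / freq_base, in
 * 1024 fixed point) yields freq_curr / freq_max, also in 1024 fixed point.
 *
 * Assumption: the deltas are small enough that shifting delta_aperf left
 * by 20 bits does not overflow 64 bits.
 */
static uint64_t freq_scale(uint64_t delta_aperf, uint64_t delta_mperf,
			   uint64_t max_freq_ratio)
{
	uint64_t scale;

	/* No usable sample: report a ratio of 1.0, i.e. no scaling. */
	if (delta_mperf == 0 || max_freq_ratio == 0)
		return CAPACITY_SCALE;

	scale = (delta_aperf << (2 * CAPACITY_SHIFT)) /
		(delta_mperf * max_freq_ratio);

	/* Running above freq_max: clip the ratio to 1.0. */
	return scale > CAPACITY_SCALE ? CAPACITY_SCALE : scale;
}
```

With max_freq_ratio = 1024 (freq_max equal to freq_base), equal deltas give
1024 and an APERF delta half the MPERF delta gives 512.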
2.2 MAX FREQUENCY
=================
Obtaining freq_max is also non-trivial because at any time the hardware can
provide a frequency boost to a selected subset of cores if the package has
enough power to spare (eg: Turbo Boost). This means that the maximum frequency
available to a given core changes with time.
The approach taken in this change is to arbitrarily set freq_max to a constant
value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most
microarchitectures, after evaluating the following candidates:
* 1-core (1C) turbo frequency (the fastest turbo state available)
* around base frequency (a.k.a. max P-state)
* something in between, such as 4C turbo
To interpret these options, consider that this is the denominator in
freq_curr/freq_max, and that ratio will be used to scale PELT signals such as
util_avg and load_avg. A large denominator will undershoot (util_avg looks a
bit smaller than it really is); vice versa, with a smaller denominator PELT
signals will tend to overshoot. Given that PELT drives frequency selection
in the schedutil governor, we will have:
freq_max set to | effect on DVFS
--------------------+------------------
1C turbo | power efficiency (lower freq choices)
base freq | performance (higher util_avg, higher freq requests)
4C turbo | a bit of both
4C turbo proves to be a good compromise in a number of benchmarks (see below).
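To make the effect of the denominator concrete, here is a small sketch that
evaluates freq_curr / freq_max for different choices of freq_max. The helper
name scale_for() and the frequencies used in the note below are illustrative
assumptions, not values read from any MSR.

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SCALE 1024ULL	/* 1024 == ratio of 1.0 */

/* freq_curr / freq_max in 1024 fixed point, clipped to 1.0 */
static uint64_t scale_for(uint64_t freq_curr_mhz, uint64_t freq_max_mhz)
{
	uint64_t scale;

	assert(freq_max_mhz != 0);
	scale = freq_curr_mhz * CAPACITY_SCALE / freq_max_mhz;
	return scale > CAPACITY_SCALE ? CAPACITY_SCALE : scale;
}
```

For a core running at 3000 MHz on a hypothetical part with base 2200 MHz, 4C
turbo 3300 MHz and 1C turbo 3600 MHz, scale_for(3000, 3600) is 853 (PELT
signals undershoot), scale_for(3000, 3300) is 930, and scale_for(3000, 2200)
is clipped to 1024 (PELT signals overshoot).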
+-------------------------------------------------------------------------+
| 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
+-------------------------------------------------------------------------+
Once an architecture implements a frequency scale-invariant utilization (the
PELT signal util_avg), schedutil switches its frequency selection formula from
freq_next = 1.25 * freq_curr * util [non-invariant util signal]
to
freq_next = 1.25 * freq_max * util [invariant util signal]
where, in the second formula, freq_max is set to the 1C turbo frequency (max
turbo). The advantage of the second formula, whose usage we unlock with this
patch, is that freq_next doesn't depend on the current frequency in an
iterative fashion, but can jump to any frequency in a single update. This
absence of feedback in the formula makes it quicker to react to utilization
changes and more robust against pathological instabilities.
Compare it to the update formula of intel_pstate/powersave:
freq_next = 1.25 * freq_max * Busy%
where again freq_max is 1C turbo and Busy% is the percentage of time not spent
idling (calculated with delta_MPERF / delta_TSC); essentially the same as
invariant schedutil, and largely responsible for intel_pstate/powersave's good
reputation. The non-invariant schedutil formula is derived from the invariant
one by approximating util_inv with util_raw * freq_curr / freq_max, but this
has limitations.
Testing shows improved performance due to better frequency selections when
the machine is lightly loaded, and essentially no change in behaviour at
saturation / overutilization.
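The two schedutil formulas above differ only in the reference frequency and in
the utilization signal they are fed. A sketch, assuming utilization is
expressed in 1024 fixed point; the helper name next_freq() is hypothetical and
this is not the governor's actual code:

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SCALE 1024ULL	/* util == 1024 means fully busy */

/* freq_next = 1.25 * freq_ref * util, with util in 1024 fixed point */
static uint64_t next_freq(uint64_t freq_ref_mhz, uint64_t util)
{
	uint64_t f = freq_ref_mhz + (freq_ref_mhz >> 2);	/* 1.25 * freq_ref */

	return f * util / CAPACITY_SCALE;
}
```

The invariant form is next_freq(freq_max, util_inv) and the non-invariant form
is next_freq(freq_curr, util_raw). For instance, next_freq(3600, 512)
evaluates the invariant form for a half-utilized CPU with freq_max = 3600 MHz
and requests 2250 MHz, whatever the current frequency is.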
+-------------------------------------------------------------------------+
| 4. KNOWN LIMITATIONS
+-------------------------------------------------------------------------+
It's been shown that it is possible to create pathological scenarios where a
CPU-bound task cannot reach max utilization, if the normalizing factor
freq_max is fixed to a constant value (see [Lelli-2018]).
If freq_max is set to 4C turbo as we do here, one needs to peg at least 5
cores in a package doing some busywork, and observe that none of those tasks
will ever reach max util (1024) because they're all running at less than the
4C turbo frequency.
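The ceiling described above can be computed directly from a table of turbo
levels. This sketch assumes every busy core runs exactly at the turbo
frequency for the given number of active cores; util_ceiling() and the table
layout are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAPACITY_SCALE 1024ULL	/* max util */

/*
 * turbo_mhz[n - 1] is the turbo frequency with n active cores.
 * Return the highest utilization a CPU-bound task can reach when `active`
 * cores are busy and freq_max is fixed to turbo_mhz[norm - 1].
 */
static uint64_t util_ceiling(const uint64_t *turbo_mhz, size_t nr,
			     size_t active, size_t norm)
{
	uint64_t util;

	assert(active >= 1 && active <= nr && norm >= 1 && norm <= nr);
	util = turbo_mhz[active - 1] * CAPACITY_SCALE / turbo_mhz[norm - 1];
	return util > CAPACITY_SCALE ? CAPACITY_SCALE : util;
}
```

With illustrative turbo levels of 3100, 3100, 2900, 2800 and 2700 MHz for 1 to
5 active cores and freq_max set to the 4C level, five busy cores top out at
987 instead of 1024, while four or fewer busy cores can still reach 1024.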
While this concern still applies, we believe the performance benefit of
frequency scale-invariant PELT signals outweighs the cost of this limitation.
[Lelli-2018]
https://lore.kernel.org/lkml/[email protected]/
+-------------------------------------------------------------------------+
| 5. PERFORMANCE TESTING
+-------------------------------------------------------------------------+
5.1 MACHINES
============
We tested the patch on three machines, with Skylake, Broadwell and Haswell
CPUs. The details are below, together with the available turbo ratios as
reported by the appropriate MSRs.
* 8x-SKYLAKE-UMA:
Single socket E3-1240 v5, Skylake 4 cores/8 threads
Max EFFiciency, BASE frequency and available turbo levels (MHz):
EFFIC 800 |********
BASE 3500 |***********************************
4C 3700 |*************************************
3C 3800 |**************************************
2C 3900 |***************************************
1C 3900 |***************************************
* 80x-BROADWELL-NUMA:
Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads
Max EFFiciency, BASE frequency and available turbo levels (MHz):
EFFIC 1200 |************
BASE 2200 |**********************
8C 2900 |*****************************
7C 3000 |******************************
6C 3100 |*******************************
5C 3200 |********************************
4C 3300 |*********************************
3C 3400 |**********************************
2C 3600 |************************************
1C 3600 |************************************
* 48x-HASWELL-NUMA
Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads
Max EFFiciency, BASE frequency and available turbo levels (MHz):
EFFIC 1200 |************
BASE 2300 |***********************
12C 2600 |**************************
11C 2600 |**************************
10C 2600 |**************************
9C 2600 |**************************
8C 2600 |**************************
7C 2600 |**************************
6C 2600 |**************************
5C 2700 |***************************
4C 2800 |****************************
3C 2900 |*****************************
2C 3100 |*******************************
1C 3100 |*******************************
5.2 SETUP
=========
* The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate
driver in passive mode.
* The rationale for choosing the various freq_max values to test has been to
try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical
on all machines), plus one more value closer to base_freq but still in the
turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA).
* In addition we've run all tests with intel_pstate/powersave for comparison.
* The filesystem is always XFS, the userspace is openSUSE Leap 15.1.
* 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs
with active intel_pstate on this machine use that.
This gives, in terms of combinations tested on each machine:
* 8x-SKYLAKE-UMA
* Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive
* intel_pstate active + powersave + HWP
* invariant schedutil, freq_max = 1C turbo
* invariant schedutil, freq_max = 3C turbo
* invariant schedutil, freq_max = 4C turbo
* both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA
* [same as 8x-SKYLAKE-UMA, but not HWP capable]
* invariant schedutil, freq_max = 8C turbo
(which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo")
5.3 BENCHMARK RESULTS
=====================
5.3.1 NEUTRAL BENCHMARKS
------------------------
Tests that didn't show any measurable difference in performance on any of the
test machines between non-invariant schedutil and our patch are:
* NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any
computational kernel
* flexible I/O (FIO)
* hackbench (using threads or processes, and using pipes or sockets)
5.3.2 NON-NEUTRAL BENCHMARKS
----------------------------
What follow are summary tables where each benchmark result is given a score.
* A tilde (~) means a neutral result, i.e. no difference from baseline.
* Scores are computed with the ratio result_new / result_baseline, so a tilde
means a score of 1.00.
* The results in the score ratio are the geometric means of results running
the benchmark with different parameters (eg: for kernbench: using 1, 2, 4,
... number of processes; for pgbench: varying the number of clients, and so
on).
* The first three tables show higher-is-better kind of tests (i.e. measured in
operations/second), the subsequent three show lower-is-better kind of tests
(i.e. the workload is fixed and we measure elapsed time, think kernbench).
* "gitsource" is a name we made up for the test consisting in running the
entire unit test suite of the Git SCM and measuring how long it takes. We
take it as a typical example of shell-intensive serialized workload.
* In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other
columns show invariant schedutil for different values of freq_max. 4C turbo
is circled as it's the value we've chosen for the final implementation.
80x-BROADWELL-NUMA (comparison ratio; higher is better)
+------+
I_PSTATE 1C 3C | 4C | 8C
pgbench-ro 1.14 ~ ~ | 1.11 | 1.14
pgbench-rw ~ ~ ~ | ~ | ~
netperf-udp 1.06 ~ 1.06 | 1.05 | 1.07
netperf-tcp ~ 1.03 ~ | 1.01 | 1.02
tbench4 1.57 1.18 1.22 | 1.30 | 1.56
+------+
8x-SKYLAKE-UMA (comparison ratio; higher is better)
+------+
I_PSTATE/HWP 1C 3C | 4C |
pgbench-ro ~ ~ ~ | ~ |
pgbench-rw ~ ~ ~ | ~ |
netperf-udp ~ ~ ~ | ~ |
netperf-tcp ~ ~ ~ | ~ |
tbench4 1.30 1.14 1.14 | 1.16 |
+------+
48x-HASWELL-NUMA (comparison ratio; higher is better)
+------+
I_PSTATE 1C 3C | 4C | 12C
pgbench-ro 1.15 ~ ~ | 1.06 | 1.16
pgbench-rw ~ ~ ~ | ~ | ~
netperf-udp 1.05 0.97 1.04 | 1.04 | 1.02
netperf-tcp 0.96 1.01 1.01 | 1.01 | 1.01
tbench4 1.50 1.05 1.13 | 1.13 | 1.25
+------+
In the tables above we see that active intel_pstate is slightly better than our
4C-turbo patch (both in reference to the baseline non-invariant schedutil) on
read-only pgbench and much better on tbench. Both cases are notable in that
they show that lowering our freq_max (to 8C-turbo and 12C-turbo on
80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant
schedutil to get closer.
If we ignore active intel_pstate and focus on the comparison with baseline
alone, there are several instances of double-digit performance improvement.
80x-BROADWELL-NUMA (comparison ratio; lower is better)
+------+
I_PSTATE 1C 3C | 4C | 8C
dbench4 1.23 0.95 0.95 | 0.95 | 0.95
kernbench 0.93 0.83 0.83 | 0.83 | 0.82
gitsource 0.98 0.49 0.49 | 0.49 | 0.48
+------+
8x-SKYLAKE-UMA (comparison ratio; lower is better)
+------+
I_PSTATE/HWP 1C 3C | 4C |
dbench4 ~ ~ ~ | ~ |
kernbench ~ ~ ~ | ~ |
gitsource 0.92 0.55 0.55 | 0.55 |
+------+
48x-HASWELL-NUMA (comparison ratio; lower is better)
+------+
I_PSTATE 1C 3C | 4C | 8C
dbench4 ~ ~ ~ | ~ | ~
kernbench 0.94 0.90 0.89 | 0.90 | 0.90
gitsource 0.97 0.69 0.69 | 0.69 | 0.69
+------+
dbench is not very remarkable here, unless we notice how poorly active
intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus
non-invariant schedutil. We repeated that run and got consistent results. This
is out of scope for the patch at hand, but deserves future investigation. Other
than that, we previously ran this campaign with Linux v5.0 and saw the patch
doing better on dbench at the time. We haven't checked closely and can only
speculate at this point.
On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in
the detailed tables that the gains concentrate on low process counts (lightly
loaded machines).
The test we call "gitsource" (running the git unit test suite, a long-running
single-threaded shell script) appears rather spectacular in this table (gains
of 30-50% depending on the machine). It is to be noted, however, that
gitsource has no adjustable parameters (such as the number of jobs in
kernbench, which we average over in order to get a single-number summary
score) and is exactly the kind of low-parallelism workload that benefits the
most from this patch. When looking at the detailed tables of kernbench or
tbench4, at low process or client counts one can see similar numbers.
5.3.3 SELECTION OF DETAILED RESULTS
-----------------------------------
Machine : 48x-HASWELL-NUMA
Benchmark : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter : number of clients
Unit : MB/sec (higher is better)
5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean 1 126.73 +- 0.31% ( ) 315.91 +- 0.66% ( 149.28%) 125.03 +- 0.76% ( -1.34%)
Hmean 2 258.04 +- 0.62% ( ) 614.16 +- 0.51% ( 138.01%) 269.58 +- 1.45% ( 4.47%)
Hmean 4 514.30 +- 0.67% ( ) 1146.58 +- 0.54% ( 122.94%) 533.84 +- 1.99% ( 3.80%)
Hmean 8 1111.38 +- 2.52% ( ) 2159.78 +- 0.38% ( 94.33%) 1359.92 +- 1.56% ( 22.36%)
Hmean 16 2286.47 +- 1.36% ( ) 3338.29 +- 0.21% ( 46.00%) 2720.20 +- 0.52% ( 18.97%)
Hmean 32 4704.84 +- 0.35% ( ) 4759.03 +- 0.43% ( 1.15%) 4774.48 +- 0.30% ( 1.48%)
Hmean 64 7578.04 +- 0.27% ( ) 7533.70 +- 0.43% ( -0.59%) 7462.17 +- 0.65% ( -1.53%)
Hmean 128 6998.52 +- 0.16% ( ) 6987.59 +- 0.12% ( -0.16%) 6909.17 +- 0.14% ( -1.28%)
Hmean 192 6901.35 +- 0.25% ( ) 6913.16 +- 0.10% ( 0.17%) 6855.47 +- 0.21% ( -0.66%)
5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 12C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean 1 128.43 +- 0.28% ( 1.34%) 130.64 +- 3.81% ( 3.09%) 153.71 +- 5.89% ( 21.30%)
Hmean 2 311.70 +- 6.15% ( 20.79%) 281.66 +- 3.40% ( 9.15%) 305.08 +- 5.70% ( 18.23%)
Hmean 4 641.98 +- 2.32% ( 24.83%) 623.88 +- 5.28% ( 21.31%) 906.84 +- 4.65% ( 76.32%)
Hmean 8 1633.31 +- 1.56% ( 46.96%) 1714.16 +- 0.93% ( 54.24%) 2095.74 +- 0.47% ( 88.57%)
Hmean 16 3047.24 +- 0.42% ( 33.27%) 3155.02 +- 0.30% ( 37.99%) 3634.58 +- 0.15% ( 58.96%)
Hmean 32 4734.31 +- 0.60% ( 0.63%) 4804.38 +- 0.23% ( 2.12%) 4674.62 +- 0.27% ( -0.64%)
Hmean 64 7699.74 +- 0.35% ( 1.61%) 7499.72 +- 0.34% ( -1.03%) 7659.03 +- 0.25% ( 1.07%)
Hmean 128 6935.18 +- 0.15% ( -0.91%) 6942.54 +- 0.10% ( -0.80%) 7004.85 +- 0.12% ( 0.09%)
Hmean 192 6901.62 +- 0.12% ( 0.00%) 6856.93 +- 0.10% ( -0.64%) 6978.74 +- 0.10% ( 1.12%)
This is one of the cases where the patch still can't surpass active
intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are
visible up to 16 clients and the saturated scenario is the same as baseline.
The scores in the summary table from the previous sections are ratios of
geometric means of the results over different clients, as seen in this table.
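As a cross-check, a summary score can be recomputed from the detailed rows.
The sketch below is a userspace illustration, not part of the patch. It relies
on the fact that the ratio of two geometric means equals the geometric mean of
the per-row ratios. The helper names (nth_root, geomean_ratio) are made up for
this example, and the n-th root is found by bisection only to keep the example
free of libm.

```c
#include <stddef.h>

/* n-th root of a positive value by bisection; avoids a libm dependency. */
static double nth_root(double value, unsigned int n)
{
	double lo = 0.0, hi = value > 1.0 ? value : 1.0;

	for (int iter = 0; iter < 200; iter++) {
		double mid = (lo + hi) / 2.0, power = 1.0;

		for (unsigned int i = 0; i < n; i++)
			power *= mid;
		if (power > value)
			hi = mid;
		else
			lo = mid;
	}
	return (lo + hi) / 2.0;
}

/*
 * Ratio of the geometric means of two result columns, computed as the
 * geometric mean of the per-row ratios.
 */
static double geomean_ratio(const double *patched, const double *baseline,
			    size_t rows)
{
	double product = 1.0;

	for (size_t i = 0; i < rows; i++)
		product *= patched[i] / baseline[i];
	return nth_root(product, (unsigned int)rows);
}

/* Hmean columns for 48x-HASWELL-NUMA tbench4, copied from the table above. */
static const double tbench_baseline[] = {
	126.73, 258.04, 514.30, 1111.38, 2286.47,
	4704.84, 7578.04, 6998.52, 6901.35,
};
static const double tbench_4c_turbo[] = {
	130.64, 281.66, 623.88, 1714.16, 3155.02,
	4804.38, 7499.72, 6942.54, 6856.93,
};
```

With the nine rows above, geomean_ratio(tbench_4c_turbo, tbench_baseline, 9)
evaluates to about 1.125, in line with the 1.13 listed for tbench4 under 4C in
the 48x-HASWELL-NUMA summary table.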
Machine : 80x-BROADWELL-NUMA
Benchmark : kernbench (kernel compilation)
Varying parameter : number of jobs
Unit : seconds (lower is better)
5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean 2 379.68 +- 0.06% ( ) 330.20 +- 0.43% ( 13.03%) 285.93 +- 0.07% ( 24.69%)
Amean 4 200.15 +- 0.24% ( ) 175.89 +- 0.22% ( 12.12%) 153.78 +- 0.25% ( 23.17%)
Amean 8 106.20 +- 0.31% ( ) 95.54 +- 0.23% ( 10.03%) 86.74 +- 0.10% ( 18.32%)
Amean 16 56.96 +- 1.31% ( ) 53.25 +- 1.22% ( 6.50%) 48.34 +- 1.73% ( 15.13%)
Amean 32 34.80 +- 2.46% ( ) 33.81 +- 0.77% ( 2.83%) 30.28 +- 1.59% ( 12.99%)
Amean 64 26.11 +- 1.63% ( ) 25.04 +- 1.07% ( 4.10%) 22.41 +- 2.37% ( 14.16%)
Amean 128 24.80 +- 1.36% ( ) 23.57 +- 1.23% ( 4.93%) 21.44 +- 1.37% ( 13.55%)
Amean 160 24.85 +- 0.56% ( ) 23.85 +- 1.17% ( 4.06%) 21.25 +- 1.12% ( 14.49%)
5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 8C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean 2 284.08 +- 0.13% ( 25.18%) 283.96 +- 0.51% ( 25.21%) 285.05 +- 0.21% ( 24.92%)
Amean 4 153.18 +- 0.22% ( 23.47%) 154.70 +- 1.64% ( 22.71%) 153.64 +- 0.30% ( 23.24%)
Amean 8 87.06 +- 0.28% ( 18.02%) 86.77 +- 0.46% ( 18.29%) 86.78 +- 0.22% ( 18.28%)
Amean 16 48.03 +- 0.93% ( 15.68%) 47.75 +- 1.99% ( 16.17%) 47.52 +- 1.61% ( 16.57%)
Amean 32 30.23 +- 1.20% ( 13.14%) 30.08 +- 1.67% ( 13.57%) 30.07 +- 1.67% ( 13.60%)
Amean 64 22.59 +- 2.02% ( 13.50%) 22.63 +- 0.81% ( 13.32%) 22.42 +- 0.76% ( 14.12%)
Amean 128 21.37 +- 0.67% ( 13.82%) 21.31 +- 1.15% ( 14.07%) 21.17 +- 1.93% ( 14.63%)
Amean 160 21.68 +- 0.57% ( 12.76%) 21.18 +- 1.74% ( 14.77%) 21.22 +- 1.00% ( 14.61%)
The patch outperforms active intel_pstate (and baseline) by a considerable
margin; the summary table from the previous section says 4C turbo and active
intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is
0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no
noticeable difference with regard to the value of freq_max.
Machine : 8x-SKYLAKE-UMA
Benchmark : gitsource (time to run the git unit test suite)
Varying parameter : none
Unit : seconds (lower is better)
5.2.0 vanilla 5.2.0 intel_pstate/hwp 5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean 858.85 +- 1.16% ( ) 791.94 +- 0.21% ( 7.79%) 474.95 ( 44.70%)
5.2.0 3C-turbo 5.2.0 4C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean 475.26 +- 0.20% ( 44.66%) 474.34 +- 0.13% ( 44.77%)
In this test, which is of interest as representing shell-intensive
(i.e. fork-intensive) serialized workloads, invariant schedutil outperforms
intel_pstate/powersave by a whopping 40% margin.
5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
---------------------------------------------
The following table shows average power consumption in watts for each
benchmark. Data comes from turbostat (package average), which in turn is read
from the RAPL interface on CPUs. We know the patch affects CPU frequencies so
it's reasonable to ignore other power consumers (such as memory or I/O). Also,
we don't have a power meter available in the lab so RAPL is the best we have.
turbostat sampled average power every 10 seconds for the entire duration of
each benchmark. We took all those values and averaged them (i.e. we don't
have detail on a per-parameter granularity, only on whole benchmarks).
80x-BROADWELL-NUMA (power consumption, watts)
+--------+
BASELINE I_PSTATE 1C 3C | 4C | 8C
pgbench-ro 130.01 142.77 131.11 132.45 | 134.65 | 136.84
pgbench-rw 68.30 60.83 71.45 71.70 | 71.65 | 72.54
dbench4 90.25 59.06 101.43 99.89 | 101.10 | 102.94
netperf-udp 65.70 69.81 66.02 68.03 | 68.27 | 68.95
netperf-tcp 88.08 87.96 88.97 88.89 | 88.85 | 88.20
tbench4 142.32 176.73 153.02 163.91 | 165.58 | 176.07
kernbench 92.94 101.95 114.91 115.47 | 115.52 | 115.10
gitsource 40.92 41.87 75.14 75.20 | 75.40 | 75.70
+--------+
8x-SKYLAKE-UMA (power consumption, watts)
+--------+
BASELINE I_PSTATE/HWP 1C 3C | 4C |
pgbench-ro 46.49 46.68 46.56 46.59 | 46.52 |
pgbench-rw 29.34 31.38 30.98 31.00 | 31.00 |
dbench4 27.28 27.37 27.49 27.41 | 27.38 |
netperf-udp 22.33 22.41 22.36 22.35 | 22.36 |
netperf-tcp 27.29 27.29 27.30 27.31 | 27.33 |
tbench4 41.13 45.61 43.10 43.33 | 43.56 |
kernbench 42.56 42.63 43.01 43.01 | 43.01 |
gitsource 13.32 13.69 17.33 17.30 | 17.35 |
+--------+
48x-HASWELL-NUMA (power consumption, watts)
+--------+
BASELINE I_PSTATE 1C 3C | 4C | 12C
pgbench-ro 128.84 136.04 129.87 132.43 | 132.30 | 134.86
pgbench-rw 37.68 37.92 37.17 37.74 | 37.73 | 37.31
dbench4 28.56 28.73 28.60 28.73 | 28.70 | 28.79
netperf-udp 56.70 60.44 56.79 57.42 | 57.54 | 57.52
netperf-tcp 75.49 75.27 75.87 76.02 | 76.01 | 75.95
tbench4 115.44 139.51 119.53 123.07 | 123.97 | 130.22
kernbench 83.23 91.55 95.58 95.69 | 95.72 | 96.04
gitsource 36.79 36.99 39.99 40.34 | 40.35 | 40.23
+--------+
A lower power consumption isn't necessarily better; it depends on what is done
with that energy. Here are tables with the ratio of performance-per-watt on
each machine and benchmark. Higher is always better; a tilde (~) means a
neutral ratio (i.e. 1.00).
80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better)
+------+
I_PSTATE 1C 3C | 4C | 8C
pgbench-ro 1.04 1.06 0.94 | 1.07 | 1.08
pgbench-rw 1.10 0.97 0.96 | 0.96 | 0.97
dbench4 1.24 0.94 0.95 | 0.94 | 0.92
netperf-udp ~ 1.02 1.02 | ~ | 1.02
netperf-tcp ~ 1.02 ~ | ~ | 1.02
tbench4 1.26 1.10 1.06 | 1.12 | 1.26
kernbench 0.98 0.97 0.97 | 0.97 | 0.98
gitsource ~ 1.11 1.11 | 1.11 | 1.13
+------+
8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better)
+------+
I_PSTATE/HWP 1C 3C | 4C |
pgbench-ro ~ ~ ~ | ~ |
pgbench-rw 0.95 0.97 0.96 | 0.96 |
dbench4 ~ ~ ~ | ~ |
netperf-udp ~ ~ ~ | ~ |
netperf-tcp ~ ~ ~ | ~ |
tbench4 1.17 1.09 1.08 | 1.10 |
kernbench ~ ~ ~ | ~ |
gitsource 1.06 1.40 1.40 | 1.40 |
+------+
48x-HASWELL-NUMA (performance-per-watt ratios; higher is better)
+------+
I_PSTATE 1C 3C | 4C | 12C
pgbench-ro 1.09 ~ 1.09 | 1.03 | 1.11
pgbench-rw ~ 0.86 ~ | ~ | 0.86
dbench4 ~ 1.02 1.02 | 1.02 | ~
netperf-udp ~ 0.97 1.03 | 1.02 | ~
netperf-tcp 0.96 ~ ~ | ~ | ~
tbench4 1.24 ~ 1.06 | 1.05 | 1.11
kernbench 0.97 0.97 0.98 | 0.97 | 0.96
gitsource 1.03 1.33 1.32 | 1.32 | 1.33
+------+
These results are overall pleasing: in plenty of cases we observe
performance-per-watt improvements. The few regressions (read/write pgbench and
dbench on the Broadwell machine) are of small magnitude. kernbench loses a few
percentage points (it has a 10-15% performance improvement, but apparently the
increase in power consumption is larger than that). tbench4 and gitsource, which
benefit the most from the patch, keep a positive score in this table which is
a welcome surprise; that suggests that in those particular workloads the
non-invariant schedutil (and active intel_pstate, too) makes some rather
suboptimal frequency selections.
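The exact procedure used to derive the performance-per-watt ratios is not
spelled out here. The sketch below assumes the simplest reading: the
performance ratio against baseline divided by the power ratio against
baseline. It covers the "lower is better" benchmarks, where performance is
taken as the inverse of elapsed time. The function name is made up for this
illustration.

```c
/*
 * Performance-per-watt ratio against baseline for a benchmark measured
 * in elapsed time (lower is better).
 * Assumption: ratio = (performance ratio) / (power ratio), with
 * performance taken as 1/time.
 */
static double perf_per_watt_ratio_time(double time_base, double time_test,
				       double watts_base, double watts_test)
{
	double perf_ratio = time_base / time_test;
	double power_ratio = watts_test / watts_base;

	return perf_ratio / power_ratio;
}
```

For gitsource on 8x-SKYLAKE-UMA with 4C-turbo (858.85 s vs 474.34 s, 13.32 W
vs 17.35 W) this gives about 1.39, close to the 1.40 in the table. The small
gap may come from rounding or from a slightly different averaging procedure.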
+-------------------------------------------------------------------------+
| 6. MICROARCH'ES ADDRESSED HERE
+-------------------------------------------------------------------------+
The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and
MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies
respectively. This excludes the recent Xeon Scalable Performance processors
line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently.
Subsequent patches will address:
* Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus
* Xeon Phi (Knights Landing, Knights Mill)
* Atom Silvermont
+-------------------------------------------------------------------------+
| 7. REFERENCES
+-------------------------------------------------------------------------+
Tests have been run with the help of the MMTests performance testing
framework, see github.com/gormanm/mmtests. The configuration file names for
the benchmarks used are:
db-pgbench-timed-ro-small-xfs
db-pgbench-timed-rw-small-xfs
io-dbench4-async-xfs
network-netperf-unbound
network-tbench
scheduler-unbound
workload-kerndevel-xfs
workload-shellscripts-xfs
hpc-nas-c-class-mpi-full-xfs
hpc-nas-c-class-omp-full
All those benchmarks are generally available on the web:
pgbench: https://www.postgresql.org/docs/10/pgbench.html
netperf: https://hewlettpackard.github.io/netperf/
dbench/tbench: https://dbench.samba.org/
gitsource: git unit test suite, github.com/git/git
NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html
hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Giovanni Gherdovich <[email protected]>
Acked-by: Doug Smythies <[email protected]>
---
arch/x86/include/asm/topology.h | 20 +++++
arch/x86/kernel/smpboot.c | 183 +++++++++++++++++++++++++++++++++++++++-
kernel/sched/core.c | 1 +
kernel/sched/sched.h | 7 ++
4 files changed, 210 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 4b14d2318251..2ebf7b7b2126 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -193,4 +193,24 @@ static inline void sched_clear_itmt_support(void)
}
#endif /* CONFIG_SCHED_MC_PRIO */
+#ifdef CONFIG_SMP
+#include <asm/cpufeature.h>
+
+DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);
+
+#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key)
+
+DECLARE_PER_CPU(unsigned long, arch_freq_scale);
+
+static inline long arch_scale_freq_capacity(int cpu)
+{
+ return per_cpu(arch_freq_scale, cpu);
+}
+#define arch_scale_freq_capacity arch_scale_freq_capacity
+
+extern void arch_scale_freq_tick(void);
+#define arch_scale_freq_tick arch_scale_freq_tick
+
+#endif
+
#endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 69881b2d446c..28696bccf912 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -147,6 +147,8 @@ static inline void smpboot_restore_warm_reset_vector(void)
*((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0;
}
+static void init_freq_invariance(void);
+
/*
* Report back to the Boot Processor during boot time or to the caller processor
* during CPU online.
@@ -183,6 +185,8 @@ static void smp_callin(void)
*/
set_cpu_sibling_map(raw_smp_processor_id());
+ init_freq_invariance();
+
/*
* Get our bogomips.
* Update loops_per_jiffy in cpu_data. Previous call to
@@ -1337,7 +1341,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
set_sched_topology(x86_topology);
set_cpu_sibling_map(0);
-
+ init_freq_invariance();
smp_sanity_check();
switch (apic_intr_mode) {
@@ -1764,3 +1768,180 @@ void native_play_dead(void)
}
#endif
+
+/*
+ * APERF/MPERF frequency ratio computation.
+ *
+ * The scheduler wants to do frequency invariant accounting and needs a <1
+ * ratio to account for the 'current' frequency, corresponding to
+ * freq_curr / freq_max.
+ *
+ * Since the frequency freq_curr on x86 is controlled by micro-controller and
+ * our P-state setting is little more than a request/hint, we need to observe
+ * the effective frequency 'BusyMHz', i.e. the average frequency over a time
+ * interval after discarding idle time. This is given by:
+ *
+ * BusyMHz = delta_APERF / delta_MPERF * freq_base
+ *
+ * where freq_base is the max non-turbo P-state.
+ *
+ * The freq_max term has to be set to a somewhat arbitrary value, because we
+ * can't know which turbo states will be available at a given point in time:
+ * it all depends on the thermal headroom of the entire package. We set it to
+ * the turbo level with 4 cores active.
+ *
+ * Benchmarks show that's a good compromise between the 1C turbo ratio
+ * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
+ * which would ignore the entire turbo range (a conspicuous part, making
+ * freq_curr/freq_max always maxed out).
+ *
+ * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
+ * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
+ */
+
+DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
+
+static DEFINE_PER_CPU(u64, arch_prev_aperf);
+static DEFINE_PER_CPU(u64, arch_prev_mperf);
+static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
+
+static bool turbo_disabled(void)
+{
+ u64 misc_en;
+ int err;
+
+ err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
+ if (err)
+ return false;
+
+ return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
+}
+
+#include <asm/cpu_device_id.h>
+#include <asm/intel-family.h>
+
+#define ICPU(model) \
+ {X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0}
+
+static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
+ ICPU(INTEL_FAM6_XEON_PHI_KNL),
+ ICPU(INTEL_FAM6_XEON_PHI_KNM),
+ {}
+};
+
+static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
+ ICPU(INTEL_FAM6_SKYLAKE_X),
+ {}
+};
+
+static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
+ ICPU(INTEL_FAM6_ATOM_GOLDMONT),
+ ICPU(INTEL_FAM6_ATOM_GOLDMONT_D),
+ ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS),
+ {}
+};
+
+static bool core_set_max_freq_ratio(void)
+{
+ u64 base_freq, turbo_freq;
+ int err;
+
+ err = rdmsrl_safe(MSR_PLATFORM_INFO, &base_freq);
+ if (err)
+ return false;
+
+ err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_freq);
+ if (err)
+ return false;
+
+ base_freq = (base_freq >> 8) & 0xFF; /* max P state */
+ turbo_freq = (turbo_freq >> 24) & 0xFF; /* 4C turbo */
+
+ arch_max_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
+ base_freq);
+ return true;
+}
+
+static bool intel_set_max_freq_ratio(void)
+{
+ /*
+ * TODO: add support for:
+ *
+ * - Xeon Gold/Platinum
+ * - Xeon Phi (KNM, KNL)
+ * - Atom Goldmont
+ * - Atom Silvermont
+ */
+
+ if (x86_match_cpu(has_skx_turbo_ratio_limits) ||
+ x86_match_cpu(has_knl_turbo_ratio_limits) ||
+ x86_match_cpu(has_glm_turbo_ratio_limits))
+ return false;
+
+ if (turbo_disabled() || core_set_max_freq_ratio())
+ return true;
+
+ return false;
+}
+
+static void init_counter_refs(void *arg)
+{
+ u64 aperf, mperf;
+
+ rdmsrl(MSR_IA32_APERF, aperf);
+ rdmsrl(MSR_IA32_MPERF, mperf);
+
+ this_cpu_write(arch_prev_aperf, aperf);
+ this_cpu_write(arch_prev_mperf, mperf);
+}
+
+static void init_freq_invariance(void)
+{
+ bool ret = false;
+
+ if (smp_processor_id() != 0 || !boot_cpu_has(X86_FEATURE_APERFMPERF))
+ return;
+
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+ ret = intel_set_max_freq_ratio();
+
+ if (ret) {
+ on_each_cpu(init_counter_refs, NULL, 1);
+ static_branch_enable(&arch_scale_freq_key);
+ } else {
+ pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
+ }
+}
+
+DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
+
+void arch_scale_freq_tick(void)
+{
+ u64 freq_scale;
+ u64 aperf, mperf;
+ u64 acnt, mcnt;
+
+ if (!arch_scale_freq_invariant())
+ return;
+
+ rdmsrl(MSR_IA32_APERF, aperf);
+ rdmsrl(MSR_IA32_MPERF, mperf);
+
+ acnt = aperf - this_cpu_read(arch_prev_aperf);
+ mcnt = mperf - this_cpu_read(arch_prev_mperf);
+ if (!mcnt)
+ return;
+
+ this_cpu_write(arch_prev_aperf, aperf);
+ this_cpu_write(arch_prev_mperf, mperf);
+
+ acnt <<= 2*SCHED_CAPACITY_SHIFT;
+ mcnt *= arch_max_freq_ratio;
+
+ freq_scale = div64_u64(acnt, mcnt);
+
+ if (freq_scale > SCHED_CAPACITY_SCALE)
+ freq_scale = SCHED_CAPACITY_SCALE;
+
+ this_cpu_write(arch_freq_scale, freq_scale);
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 90e4b00ace89..e0b70b9fb5cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3594,6 +3594,7 @@ void scheduler_tick(void)
struct task_struct *curr = rq->curr;
struct rq_flags rf;
+ arch_scale_freq_tick();
sched_clock_tick();
rq_lock(rq, &rf);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 280a3c735935..0b51575e2e0c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1968,6 +1968,13 @@ static inline int hrtick_enabled(struct rq *rq)
#endif /* CONFIG_SCHED_HRTICK */
+#ifndef arch_scale_freq_tick
+static __always_inline
+void arch_scale_freq_tick(void)
+{
+}
+#endif
+
#ifndef arch_scale_freq_capacity
static __always_inline
unsigned long arch_scale_freq_capacity(int cpu)
--
2.16.4
On some platforms such as the Dell XPS 13 laptop the firmware disables turbo
when the machine is disconnected from AC, and vice versa it enables it again
when it's reconnected. In these cases a _PPC ACPI notification is issued.
The scheduler needs to know freq_max for frequency-invariant calculations.
To account for turbo availability to come and go, record freq_max at boot as
if turbo was available and store it in a helper variable. Use a setter
function to swap between freq_base and freq_max every time turbo goes off or on.
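The swap described above can be modelled in userspace as follows. This is a
minimal sketch, not the kernel code: record_turbo_freq_ratio() is a
hypothetical helper standing in for the boot-time MSR parsing, and the ratios
37 and 24 used in the note below are example values only. The caller is
assumed to pass a non-zero base ratio.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE	1024ULL

/* Ratio recorded once at boot, as if turbo were available. */
static uint64_t turbo_freq_ratio = SCHED_CAPACITY_SCALE;
/* Ratio currently used as freq_max / freq_base, in units of 1/1024. */
static uint64_t max_freq_ratio = SCHED_CAPACITY_SCALE;

/* Boot-time step: remember freq_turbo / freq_base (base must be != 0). */
static void record_turbo_freq_ratio(uint64_t turbo, uint64_t base)
{
	turbo_freq_ratio = turbo * SCHED_CAPACITY_SCALE / base;
}

/* Called every time turbo goes off or on. */
static void set_max_freq_ratio(bool turbo_disabled)
{
	max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE
					: turbo_freq_ratio;
}
```

With a turbo ratio of 37 and a base ratio of 24, max_freq_ratio is 1578 while
turbo is available, drops to 1024 (freq_max equal to freq_base) when turbo is
disabled, and returns to 1578 when it is enabled again.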
Signed-off-by: Giovanni Gherdovich <[email protected]>
---
arch/x86/include/asm/topology.h | 5 +++++
arch/x86/kernel/smpboot.c | 15 ++++++++++-----
drivers/cpufreq/intel_pstate.c | 1 +
3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 2ebf7b7b2126..79d8d5496330 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -211,6 +211,11 @@ static inline long arch_scale_freq_capacity(int cpu)
extern void arch_scale_freq_tick(void);
#define arch_scale_freq_tick arch_scale_freq_tick
+extern void arch_set_max_freq_ratio(bool turbo_disabled);
+#else
+static inline void arch_set_max_freq_ratio(bool turbo_disabled)
+{
+}
#endif
#endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5f04bf8419f9..467191e51196 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1807,8 +1807,15 @@ DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
static DEFINE_PER_CPU(u64, arch_prev_aperf);
static DEFINE_PER_CPU(u64, arch_prev_mperf);
+static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
+void arch_set_max_freq_ratio(bool turbo_disabled)
+{
+ arch_max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE :
+ arch_turbo_freq_ratio;
+}
+
static bool turbo_disabled(void)
{
u64 misc_en;
@@ -1956,10 +1963,7 @@ static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
static bool intel_set_max_freq_ratio(void)
{
- u64 base_freq = 1, turbo_freq = 1;
-
- if (turbo_disabled())
- goto out;
+ u64 base_freq, turbo_freq;
if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
goto out;
@@ -1981,8 +1985,9 @@ static bool intel_set_max_freq_ratio(void)
return false;
out:
- arch_max_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
+ arch_turbo_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
base_freq);
+ arch_set_max_freq_ratio(turbo_disabled());
return true;
}
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index d2fa3e9ccd97..abbeeca8bb3b 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -922,6 +922,7 @@ static void intel_pstate_update_limits(unsigned int cpu)
*/
if (global.turbo_disabled_mf != global.turbo_disabled) {
global.turbo_disabled_mf = global.turbo_disabled;
+ arch_set_max_freq_ratio(global.turbo_disabled);
for_each_possible_cpu(cpu)
intel_pstate_update_max_freq(cpu);
} else {
--
2.16.4
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On SKYLAKE_X CPUs set freq_max to the highest frequency that can
be sustained by a group of at least 4 cores.
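For illustration, the fixed-point computation of freq_curr/freq_max from
APERF/MPERF deltas can be sketched in userspace as below. It mirrors the
arithmetic of arch_scale_freq_tick() but is not the kernel code: the function
name is made up, the fallback for a zero divisor exists only in this sketch
(the kernel function leaves the previous value untouched when the MPERF delta
is zero), and overflow of the shifted APERF delta is not handled.

```c
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1ULL << SCHED_CAPACITY_SHIFT)

/*
 * freq_curr / freq_max in units of 1/1024, from APERF and MPERF deltas.
 *
 *   freq_curr / freq_base = delta_aperf / delta_mperf
 *   freq_max  / freq_base = max_freq_ratio / 1024
 *
 * hence freq_curr / freq_max * 1024 =
 *       (delta_aperf * 1024 * 1024) / (delta_mperf * max_freq_ratio)
 */
static uint64_t freq_scale(uint64_t delta_aperf, uint64_t delta_mperf,
			   uint64_t max_freq_ratio)
{
	uint64_t scale;

	if (!delta_mperf || !max_freq_ratio)
		return SCHED_CAPACITY_SCALE;	/* sketch-only fallback */

	scale = (delta_aperf << (2 * SCHED_CAPACITY_SHIFT)) /
		(delta_mperf * max_freq_ratio);

	/* Running above freq_max is clipped to 1024. */
	return scale > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : scale;
}
```

Taking max_freq_ratio = 1578 (a turbo ratio of 37 over a base ratio of 24, as
an example), a CPU running at freq_base yields 664, one running at half of
freq_base yields 332, and anything at or above freq_max is clipped to 1024.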
From the changelog of commit 31e07522be56 ("tools/power turbostat: fix
decoding for GLM, DNV, SKX turbo-ratio limits"):
> Newer processors do not hard-code the the number of cpus in each bin
> to {1, 2, 3, 4, 5, 6, 7, 8} Rather, they can specify any number
> of CPUS in each of the 8 bins:
>
> eg.
>
> ...
> 37 * 100.0 = 3600.0 MHz max turbo 4 active cores
> 38 * 100.0 = 3700.0 MHz max turbo 3 active cores
> 39 * 100.0 = 3800.0 MHz max turbo 2 active cores
> 39 * 100.0 = 3900.0 MHz max turbo 1 active cores
>
> could now look something like this:
>
> ...
> 37 * 100.0 = 3600.0 MHz max turbo 16 active cores
> 38 * 100.0 = 3700.0 MHz max turbo 8 active cores
> 39 * 100.0 = 3800.0 MHz max turbo 4 active cores
> 39 * 100.0 = 3900.0 MHz max turbo 2 active cores
This encoding of turbo levels applies to both SKYLAKE_X and GOLDMONT/GOLDMONT_D,
but we treat these two classes in separate commits because their freq_max
values need to be different. For SKX we prefer a lower freq_max in the ratio
freq_curr/freq_max, allowing load and utilization to overshoot and the
schedutil governor to be more performance-oriented. Models from the Atom
series (such as GOLDMONT*) are handled in a forthcoming commit as they have to
favor power-efficiency over performance.
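The lookup over the eight bins can be sketched in userspace as follows. The
loop has the same shape as the one in skx_set_max_freq_ratio(), but the
function name and the two example register values are hypothetical: they are
made up for illustration, loosely modelled on the turbo levels of the test
machine below, and were not read from real hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * ratios: raw MSR_TURBO_RATIO_LIMIT, one turbo ratio per byte.
 * counts: raw MSR_TURBO_RATIO_LIMIT1, one core count per byte.
 * Returns true and stores the ratio of the first bin whose group has at
 * least min_cores cores; returns false if no group is large enough.
 */
static bool turbo_ratio_for_group(uint64_t ratios, uint64_t counts,
				  unsigned int min_cores,
				  uint64_t *turbo_ratio)
{
	for (unsigned int shift = 0; shift < 64; shift += 8) {
		unsigned int group_size = (counts >> shift) & 0xFF;

		if (group_size >= min_cores) {
			*turbo_ratio = (ratios >> shift) & 0xFF;
			return true;
		}
	}
	return false;
}

/* Made-up register contents, least significant byte is the first bin. */
/* Group sizes: 2, 4, 8, 12, 16, 20, 24, 28 cores. */
static const uint64_t example_counts = 0x1C1814100C080402ULL;
/* Turbo ratios: 39, 37, 36, 36, 36, 33, 31, 31. */
static const uint64_t example_ratios = 0x1F1F212424242527ULL;
```

With these example values and min_cores = 4, the first bin (2 cores) is
skipped and the second one is selected, giving a turbo ratio of 37.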
Results from a performance evaluation follow.
1. TEST SETUP
2. NEUTRAL BENCHMARKS
3. NON-NEUTRAL BENCHMARKS
4. DETAILED TABLES
1. TEST SETUP
-------------
Test machine:
CPU Model : Intel Xeon Platinum 8260L CPU @ 2.40GHz (a.k.a. Cascade Lake)
Fam/Mod/Ste : 6:85:6
Topology : 2 sockets, 24 cores / 48 threads each socket
Memory : 192G
Storage : SSD, XFS filesystem
Max EFFICiency, BASE frequency and available turbo levels (MHz):
EFFIC 1000 |**********
BASE 2400 |************************
24C 3100 |*******************************
20C 3300 |*********************************
16C 3600 |************************************
12C 3600 |************************************
8C 3600 |************************************
4C 3700 |*************************************
2C 3900 |***************************************
Tested kernels:
Baseline : v5.2, intel_pstate passive, schedutil
Comparison #1 : v5.2, intel_pstate active , powersave+HWP
Comparison #2 : v5.2, this patch, intel_pstate passive, schedutil
2. NEUTRAL BENCHMARKS
---------------------
* pgbench read/write
* NAS Parallel Benchmarks (NPB), MPI or OpenMP for message-passing
* hackbench
* netperf
3. NON-NEUTRAL BENCHMARKS
-------------------------
comparison ratio with baseline; 1.00 means neutral, higher is better:
I_PSTATE FREQ-INV
----------------------------------------
pgbench read-only 1.10 ~
tbench 1.82 1.14
comparison ratio with baseline; 1.00 means neutral, lower is better:
I_PSTATE FREQ-INV
----------------------------------------
dbench ~ 0.97
kernbench 0.88 0.78
gitsource[*] ~ 0.46
[*] "gitsource" consists of running git's unit tests
tilde (~) means 1.00, i.e. result identical to baseline
Performance per watt:
performance-per-watt ratios with baseline; 1.00 means neutral, higher is better:
I_PSTATE FREQ-INV
----------------------------------------
dbench 0.92 0.91
tbench 1.26 1.04
kernbench 0.95 0.96
gitsource 1.03 1.30
Similarly to earlier Xeons, measurable performance gains over non-invariant
schedutil are observed on dbench, tbench, kernel compilation and running the
git unit test suite. The detailed tables show that the patch scores the
largest difference when the machine is lightly loaded. Power efficiency
suffers slightly on kernbench and a bit more on dbench, but largely improves
on gitsource (which also runs considerably faster). For reference, we
also report results using active intel_pstate with powersave and HWP; the
largest gap between non-invariant schedutil and intel_pstate+powersave is
still tbench, which runs 82% better and with 26% improved efficiency on the
latter configuration -- this divide isn't closed yet by frequency-invariant
schedutil.
4. DETAILED TABLES
------------------
Benchmark : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter : number of clients
Unit : MB/sec (higher is better)
5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate/HWP 5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean 1 183.56 +- 0.21% ( ) 516.12 +- 0.57% ( 181.18%) 185.59 +- 0.59% ( 1.11%)
Hmean 2 365.75 +- 0.25% ( ) 1015.14 +- 0.33% ( 177.55%) 402.59 +- 4.48% ( 10.07%)
Hmean 4 720.99 +- 0.44% ( ) 1951.75 +- 0.28% ( 170.70%) 738.39 +- 1.72% ( 2.41%)
Hmean 8 1449.93 +- 0.34% ( ) 3830.56 +- 0.24% ( 164.19%) 1750.36 +- 4.65% ( 20.72%)
Hmean 16 2874.26 +- 0.57% ( ) 7381.62 +- 0.53% ( 156.82%) 4348.35 +- 2.22% ( 51.29%)
Hmean 32 6116.17 +- 5.10% ( ) 13013.05 +- 0.08% ( 112.76%) 8980.35 +- 0.66% ( 46.83%)
Hmean 64 14485.04 +- 3.46% ( ) 17835.12 +- 0.35% ( 23.13%) 16540.73 +- 0.51% ( 14.19%)
Hmean 128 30779.16 +- 3.20% ( ) 32796.94 +- 2.13% ( 6.56%) 31512.58 +- 0.20% ( 2.38%)
Hmean 256 34664.66 +- 0.81% ( ) 34604.67 +- 0.46% ( -0.17%) 34943.70 +- 0.25% ( 0.80%)
Hmean 384 33957.51 +- 0.11% ( ) 34091.50 +- 0.14% ( 0.39%) 33921.41 +- 0.09% ( -0.11%)
Benchmark : kernbench (kernel compilation)
Varying parameter : number of jobs
Unit : seconds (lower is better)
5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate/HWP 5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean 2 332.94 +- 0.40% ( ) 260.16 +- 0.45% ( 21.86%) 233.56 +- 0.21% ( 29.85%)
Amean 4 173.04 +- 0.43% ( ) 138.76 +- 0.03% ( 19.81%) 123.59 +- 0.11% ( 28.58%)
Amean 8 89.65 +- 0.20% ( ) 73.54 +- 0.09% ( 17.97%) 65.69 +- 0.10% ( 26.72%)
Amean 16 48.08 +- 1.41% ( ) 41.64 +- 1.61% ( 13.40%) 36.00 +- 1.80% ( 25.11%)
Amean 32 28.78 +- 0.72% ( ) 26.61 +- 1.99% ( 7.55%) 23.19 +- 1.68% ( 19.43%)
Amean 64 20.46 +- 1.85% ( ) 19.76 +- 0.35% ( 3.42%) 17.38 +- 0.92% ( 15.06%)
Amean 128 18.69 +- 1.70% ( ) 17.59 +- 1.04% ( 5.90%) 15.73 +- 1.40% ( 15.85%)
Amean 192 18.82 +- 1.01% ( ) 17.76 +- 0.77% ( 5.67%) 15.57 +- 1.80% ( 17.28%)
Benchmark : gitsource (time to run the git unit test suite)
Varying parameter : none
Unit : seconds (lower is better)
5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate/HWP 5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean 792.49 +- 0.20% ( ) 779.35 +- 0.24% ( 1.66%) 427.14 +- 0.16% ( 46.10%)
Signed-off-by: Giovanni Gherdovich <[email protected]>
---
arch/x86/kernel/smpboot.c | 66 +++++++++++++++++++++++++++++++++++++----------
1 file changed, 53 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 28696bccf912..ba9d3bdc191c 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1841,24 +1841,52 @@ static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
{}
};
-static bool core_set_max_freq_ratio(void)
+static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
+{
+ u64 ratios, counts;
+ u32 group_size;
+ int err, i;
+
+ err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
+ if (err)
+ return false;
+
+ *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
+
+ err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &ratios);
+ if (err)
+ return false;
+
+ err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT1, &counts);
+ if (err)
+ return false;
+
+ for (i = 0; i < 64; i += 8) {
+ group_size = (counts >> i) & 0xFF;
+ if (group_size >= size) {
+ *turbo_freq = (ratios >> i) & 0xFF;
+ return true;
+ }
+ }
+
+ return false;
+}
+
+static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
{
- u64 base_freq, turbo_freq;
int err;
- err = rdmsrl_safe(MSR_PLATFORM_INFO, &base_freq);
+ err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
if (err)
return false;
- err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_freq);
+ err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, turbo_freq);
if (err)
return false;
- base_freq = (base_freq >> 8) & 0xFF; /* max P state */
- turbo_freq = (turbo_freq >> 24) & 0xFF; /* 4C turbo */
+ *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
+ *turbo_freq = (*turbo_freq >> 24) & 0xFF; /* 4C turbo */
- arch_max_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
- base_freq);
return true;
}
@@ -1867,21 +1895,33 @@ static bool intel_set_max_freq_ratio(void)
/*
* TODO: add support for:
*
- * - Xeon Gold/Platinum
* - Xeon Phi (KNM, KNL)
* - Atom Goldmont
* - Atom Silvermont
*/
- if (x86_match_cpu(has_skx_turbo_ratio_limits) ||
- x86_match_cpu(has_knl_turbo_ratio_limits) ||
+ u64 base_freq = 1, turbo_freq = 1;
+
+ if (x86_match_cpu(has_knl_turbo_ratio_limits) ||
x86_match_cpu(has_glm_turbo_ratio_limits))
return false;
- if (turbo_disabled() || core_set_max_freq_ratio())
- return true;
+ if (turbo_disabled())
+ goto out;
+
+ if (x86_match_cpu(has_skx_turbo_ratio_limits) &&
+ skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4))
+ goto out;
+
+ if (core_set_max_freq_ratio(&base_freq, &turbo_freq))
+ goto out;
return false;
+
+out:
+ arch_max_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
+ base_freq);
+ return true;
}
static void init_counter_refs(void *arg)
--
2.16.4
On Wed, Jan 22, 2020 at 4:10 PM Giovanni Gherdovich <[email protected]> wrote:
>
> v4 at https://lore.kernel.org/lkml/[email protected]/
>
> Changes wrt v4:
>
> - Removing conditional access in the function arch_scale_freq_capacity()
> and initialize arch_freq_scale to 1024 to account for when freq
> invariance isn't enabled (Ionela V.)
> - In case the max frequency can't be read in MSRs, do not enable frequency
> invariance at all (Ionela V., Peter Z.).
> - Renames:
> variables:
> arch_cpu_freq -> arch_freq_scale
> arch_max_freq -> arch_max_freq_ratio
> ... and others
> functions:
> init_scale_freq -> init_counter_refs
> set_cpu_max_freq -> init_freq_invariance
> {core,skx,knl...}_set_cpu_max_freq -> {core,skx,knl...}_set_max_freq_ratio
> ... and others
> - Use the same function for parsing SKX and GLM registers (Peter Z.)
> - Pass a parameter to the function parsing KNL registers (Peter Z.)
> - Fix a bug whereby refs to [am]perf were initialized only on cpu #0
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Cover Letter from v4:
>
> v3 at https://lore.kernel.org/lkml/[email protected]/
>
> Changes wrt v3:
>
> - Add definition of function set_arch_max_freq if !CONFIG_SMP
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Cover Letter from v3:
>
> v2 at https://lore.kernel.org/lkml/[email protected]/
>
> Changes wrt v2:
>
> - Removing the tick_disable mechanism. Frequency scale-invariance isn't
> just about helping schedutil choose better frequencies, but also
> providing the scheduler load balancer with better metrics. All users of
> PELT signals benefit from this feature. The tick_disable patch disabled
> frequency invariant calculation when a specific driver is in use
> (intel_pstate in active mode).
>
> - static_branch_enable(&arch_scale_freq_key) is now called earlier, right
> after we learn that X86_FEATURE_APERFMPERF is available. Previously Peter
> Z. commented "if we can't tell the max_freq we don't want to use the
> invariant stuff.". I've decided to do it differently: if we can't tell
> the max_freq, then it's because the CPU encodes max_freq in MSRs in a way
> this patch doesn't understand, and we assume max_p is the max_freq which
> seems like a safe bet. As a reminder, max_freq=max_p is encoded by
> setting arch_max_freq=1024 as default value. I'm open to feedback.
>
> - Refactoring the switch case statement in set_cpu_max_freq() as Rafael
> W. Now the first patch doesn't hint at what the following patch will
> bring along.
>
> - Handling the case were turbo is disabled at runtime and a _PPC ACPI
> notification is issued, as requested by Rafael W. This happens eg. when
> some laptop model is disconnected from AC. (Patch #6)
>
> - Handling all Intel x86_64 micro-arches.
>
> - A note for Srinivas P., who expressed concern for Atoms: on Atom CPUs the
> max_freq is set to the highest turbo level, as a power-efficiency
> oriented measure. In this way the ratio curr_freq/max_freq tends to be
> lower, PELT signals are consequently lower, and schedutil doesn't push
> too hard on speed. (Patches #4 and #5).
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Cover Letter from v2:
>
> v1 at https://lore.kernel.org/lkml/[email protected]/
>
> Changes wrt v1:
>
> - add x86-specific implementation of arch_scale_freq_invariant() using a
> static key that checks for the availability of APERF and MPERF
> - refer to GOLDMONT_D instead of GOLDMONT_X, according to recent rename
> - set arch_cpu_freq to 1024 from x86_arch_scale_freq_tick_disable() to prevent
> PELT from being fed stale data
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Cover Letter from v1:
>
> This is a resend with of Peter Zijlstra's patch to support frequency
> scale-invariance on x86 from May 2018 [see 1]. I've added some modifications
> and included performance test results. If Peter doesn't mind, I'm slapping my
> name on it :)
>
> The changes from Peter's original implementation are:
>
> 1) normalizing against the 4-cores turbo level instead or 1-core turbo
> 2) removing the run-time search for when the above value isn't found in the
> various Intel MSRs -- the base frequency value is taken in that case.
>
> The section "4. KNOWN LIMITATIONS" in the first patch commit message addresses
> the reason why this approach was dropped back in 2018, and explains that the
> performance gains outweight that issue.
>
> The second patch from Srinivas is taken verbatim from the May 2018 submission
> as it still applies.
>
> I apologies for the length of patch #1 commit message; I've made a table of
> contents with summaries of each section that should make easier to skim
> through the content.
>
> This submission incorporates the feedback and requests for additional tests
> received during the presentation made at OSPM 2019 in Pisa three months ago.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Giovanni Gherdovich (6):
> x86,sched: Add support for frequency invariance
> x86,sched: Add support for frequency invariance on SKYLAKE_X
> x86,sched: Add support for frequency invariance on XEON_PHI_KNL/KNM
> x86,sched: Add support for frequency invariance on ATOM_GOLDMONT*
> x86,sched: Add support for frequency invariance on ATOM
> x86: intel_pstate: handle runtime turbo disablement/enablement in
> freq. invariance
>
> arch/x86/include/asm/topology.h | 25 ++++
> arch/x86/kernel/smpboot.c | 290 +++++++++++++++++++++++++++++++++++++++-
> drivers/cpufreq/intel_pstate.c | 1 +
> kernel/sched/core.c | 1 +
> kernel/sched/sched.h | 7 +
> 5 files changed, 323 insertions(+), 1 deletion(-)
>
All looks good to me, so
Acked-by: Rafael J. Wysocki <[email protected]>
for the whole series (and I'm assuming that it will go in through the tip tree).
Thanks!
On Thu, Jan 23, 2020 at 04:30:36PM +0100, Rafael J. Wysocki wrote:
> All looks good to me, so
>
> Acked-by: Rafael J. Wysocki <[email protected]>
>
> for the whole series (and I'm assuming that it will go in through the tip tree).
Thanks, and yes, I've picked them up and will push them to tip if
nothing falls out.
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 298c6f99bf30ef735e79f7f6d086bdfae505d380
Gitweb: https://git.kernel.org/tip/298c6f99bf30ef735e79f7f6d086bdfae505d380
Author: Giovanni Gherdovich <[email protected]>
AuthorDate: Wed, 22 Jan 2020 16:16:16 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 28 Jan 2020 21:37:05 +01:00
x86, sched: Add support for frequency invariance on ATOM
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On all ATOM CPUs prior to Goldmont, set freq_max to the 1-core
turbo ratio.
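As a rough illustration of the decoding involved, the sketch below extracts the two 6-bit ratio fields and forms turbo/base in 1024-based fixed point. Only the shifts and masks are taken from the diff further down; the helper name slv_max_freq_ratio() and the sample register values are made up for the example, and it runs in userspace on plain integers rather than reading any MSR.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE 1024ULL

/*
 * Hypothetical helper: takes the raw contents of MSR_ATOM_CORE_RATIOS and
 * MSR_ATOM_CORE_TURBO_RATIOS and returns turbo/base scaled by 1024.
 */
static uint64_t slv_max_freq_ratio(uint64_t core_ratios, uint64_t turbo_ratios)
{
	uint64_t base = (core_ratios >> 16) & 0x3F;	/* max P state */
	uint64_t turbo = turbo_ratios & 0x3F;		/* 1C turbo */

	assert(base != 0);	/* a zero base ratio would divide by zero */
	return turbo * SCHED_CAPACITY_SCALE / base;
}
```

With invented ratios 24 (base) and 32 (1C turbo), which have the same proportion as the 2400/3200 MHz levels listed below, slv_max_freq_ratio(24ULL << 16, 32) evaluates to 1365.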
We intended to perform tests validating that this patch doesn't regress in
terms of energy efficiency, given that this is the primary concern on Atom
processors. Alas, we found out that turbostat doesn't support reading RAPL
interfaces on our test machine (Airmont), and we don't have external equipment
to measure power consumption; all we have is the performance results of the
benchmarks we ran.
Test machine:
Platform : Dell Wyse 3040 Thin Client[1]
CPU Model : Intel Atom x5-Z8350 (aka Cherry Trail, aka Airmont)
Fam/Mod/Ste : 6:76:4
Topology : 1 socket, 4 cores / 4 threads
Memory : 2G
Storage : onboard flash, XFS filesystem
[1] https://www.dell.com/en-us/work/shop/wyse-endpoints-and-software/wyse-3040-thin-client/spd/wyse-3040-thin-client
Base frequency and available turbo levels (MHz):
    Min Operating Freq   266 |***
    Low Freq Mode        800 |********
    Base Freq           2400 |************************
    4 Cores             2800 |****************************
    3 Cores             2800 |****************************
    2 Cores             3200 |********************************
    1 Core              3200 |********************************
Tested kernels:
Baseline : v5.4-rc1, intel_pstate passive, schedutil
Comparison #1 : v5.4-rc1, intel_pstate active , powersave
Comparison #2 : v5.4-rc1, this patch, intel_pstate passive, schedutil
tbench, hackbench and kernbench performed the same under all three kernels;
dbench ran faster with intel_pstate/powersave and the git unit tests were a
lot faster with intel_pstate/powersave and invariant schedutil wrt the
baseline. Not that any of this is terribly interesting anyway, one doesn't buy
an Atom system to go fast. Power consumption regressions aren't expected but
we lack the equipment to make that measurement. Turbostat seems to think that
reading RAPL on this machine isn't a good idea and we're trusting that
decision.
comparison ratio of performance with baseline; 1.00 means neutral,
lower is better:
                        I_PSTATE      FREQ-INV
    ----------------------------------------
    dbench              0.90          ~
    kernbench           0.98          0.97
    gitsource           0.63          0.43
Signed-off-by: Giovanni Gherdovich <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/kernel/smpboot.c | 27 +++++++++++++++++++++------
1 file changed, 21 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 3e32d62..5f04bf8 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1821,6 +1821,24 @@ static bool turbo_disabled(void)
return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
}
+static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+{
+ int err;
+
+ err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, base_freq);
+ if (err)
+ return false;
+
+ err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_freq);
+ if (err)
+ return false;
+
+ *base_freq = (*base_freq >> 16) & 0x3F; /* max P state */
+ *turbo_freq = *turbo_freq & 0x3F; /* 1C turbo */
+
+ return true;
+}
+
#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>
@@ -1938,17 +1956,14 @@ static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
static bool intel_set_max_freq_ratio(void)
{
- /*
- * TODO: add support for:
- *
- * - Atom Silvermont
- */
-
u64 base_freq = 1, turbo_freq = 1;
if (turbo_disabled())
goto out;
+ if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
+ goto out;
+
if (x86_match_cpu(has_glm_turbo_ratio_limits) &&
skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
goto out;
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 2a0abc59699896f03bf6f16efb8a3a490511216f
Gitweb: https://git.kernel.org/tip/2a0abc59699896f03bf6f16efb8a3a490511216f
Author: Giovanni Gherdovich <[email protected]>
AuthorDate: Wed, 22 Jan 2020 16:16:13 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 28 Jan 2020 21:37:01 +01:00
x86, sched: Add support for frequency invariance on SKYLAKE_X
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On SKYLAKE_X CPUs set freq_max to the highest frequency that can
be sustained by a group of at least 4 cores.
From the changelog of commit 31e07522be56 ("tools/power turbostat: fix
decoding for GLM, DNV, SKX turbo-ratio limits"):
> Newer processors do not hard-code the number of cpus in each bin
> to {1, 2, 3, 4, 5, 6, 7, 8}. Rather, they can specify any number
> of CPUs in each of the 8 bins:
>
> eg.
>
> ...
> 37 * 100.0 = 3600.0 MHz max turbo 4 active cores
> 38 * 100.0 = 3700.0 MHz max turbo 3 active cores
> 39 * 100.0 = 3800.0 MHz max turbo 2 active cores
> 39 * 100.0 = 3900.0 MHz max turbo 1 active cores
>
> could now look something like this:
>
> ...
> 37 * 100.0 = 3600.0 MHz max turbo 16 active cores
> 38 * 100.0 = 3700.0 MHz max turbo 8 active cores
> 39 * 100.0 = 3800.0 MHz max turbo 4 active cores
> 39 * 100.0 = 3900.0 MHz max turbo 2 active cores
This encoding of turbo levels applies to both SKYLAKE_X and GOLDMONT/GOLDMONT_D,
but we treat these two classes in separate commits because their freq_max
values need to be different. For SKX we prefer a lower freq_max in the ratio
freq_curr/freq_max, allowing load and utilization to overshoot and the
schedutil governor to be more performance-oriented. Models from the Atom
series (such as GOLDMONT*) are handled in a forthcoming commit as they have to
favor power-efficiency over performance.
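The bin lookup described above can be sketched in plain C as follows. This mirrors the loop in the diff below but runs in userspace on integer arguments; the function name pick_turbo_ratio() and the sample register values are invented for illustration and do not come from real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Scan the eight (count, ratio) bins, lowest byte first, and report the
 * ratio of the first bin whose core count is at least `size`.
 * `ratios` and `counts` stand for the raw values of MSR_TURBO_RATIO_LIMIT
 * and MSR_TURBO_RATIO_LIMIT1.
 */
static bool pick_turbo_ratio(uint64_t ratios, uint64_t counts,
			     unsigned int size, uint64_t *turbo_ratio)
{
	assert(turbo_ratio != NULL);

	for (int i = 0; i < 64; i += 8) {
		unsigned int group_size = (counts >> i) & 0xFF;

		if (group_size >= size) {
			*turbo_ratio = (ratios >> i) & 0xFF;
			return true;
		}
	}

	return false;	/* no bin covers that many cores */
}
```

For example, with bins of 2, 4, 8 and 16 cores carrying ratios 39, 38, 37 and 36 (counts = 0x10080402, ratios = 0x24252627), asking for size 4 selects ratio 38 and asking for size 1 selects ratio 39.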
Results from a performance evaluation follow.
1. TEST SETUP
2. NEUTRAL BENCHMARKS
3. NON-NEUTRAL BENCHMARKS
4. DETAILED TABLES
1. TEST SETUP
-------------
Test machine:
CPU Model : Intel Xeon Platinum 8260L CPU @ 2.40GHz (a.k.a. Cascade Lake)
Fam/Mod/Ste : 6:85:6
Topology : 2 sockets, 24 cores / 48 threads each socket
Memory : 192G
Storage : SSD, XFS filesystem
Max EFFICiency, BASE frequency and available turbo levels (MHz):
    EFFIC   1000 |**********
    BASE    2400 |************************
    24C     3100 |*******************************
    20C     3300 |*********************************
    16C     3600 |************************************
    12C     3600 |************************************
    8C      3600 |************************************
    4C      3700 |*************************************
    2C      3900 |***************************************
Tested kernels:
Baseline : v5.2, intel_pstate passive, schedutil
Comparison #1 : v5.2, intel_pstate active , powersave+HWP
Comparison #2 : v5.2, this patch, intel_pstate passive, schedutil
2. NEUTRAL BENCHMARKS
---------------------
* pgbench read/write
* NASA Parallel Benchmarks (NPB), MPI or OpenMP for message-passing
* hackbench
* netperf
3. NON-NEUTRAL BENCHMARKS
-------------------------
comparison ratio with baseline; 1.00 means neutral, higher is better:
                        I_PSTATE      FREQ-INV
    ----------------------------------------
    pgbench read-only   1.10          ~
    tbench              1.82          1.14
comparison ratio with baseline; 1.00 means neutral, lower is better:
                        I_PSTATE      FREQ-INV
    ----------------------------------------
    dbench              ~             0.97
    kernbench           0.88          0.78
    gitsource[*]        ~             0.46
[*] "gitsource" consists of running git's unit tests
tilde (~) means 1.00, i.e. result identical to baseline
Performance per watt:
performance-per-watt ratios with baseline; 1.00 means neutral, higher is better:
                        I_PSTATE      FREQ-INV
    ----------------------------------------
    dbench              0.92          0.91
    tbench              1.26          1.04
    kernbench           0.95          0.96
    gitsource           1.03          1.30
Similarly to earlier Xeons, measurable performance gains over non-invariant
schedutil are observed on dbench, tbench, kernel compilation and running the
git unit tests suite. Looking at the detailed tables shows that the patch
scores the largest difference when the machine is lightly loaded. Power
efficiency suffers slightly on kernbench and a bit more on dbench, but largely
improves on gitsource (which also runs considerably faster). For reference, we
also report results using active intel_pstate with powersave and HWP; the
largest gap between non-invariant schedutil and intel_pstate+powersave is
still tbench, which runs 82% better and with 26% improved efficiency on the
latter configuration -- this divide isn't closed yet by frequency-invariant
schedutil.
4. DETAILED TABLES
------------------
Benchmark : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter : number of clients
Unit : MB/sec (higher is better)
            5.2.0 vanilla (BASELINE)         5.2.0 intel_pstate/HWP           5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean   1      183.56 +- 0.21% (        )       516.12 +- 0.57% ( 181.18%)       185.59 +- 0.59% (   1.11%)
Hmean   2      365.75 +- 0.25% (        )      1015.14 +- 0.33% ( 177.55%)       402.59 +- 4.48% (  10.07%)
Hmean   4      720.99 +- 0.44% (        )      1951.75 +- 0.28% ( 170.70%)       738.39 +- 1.72% (   2.41%)
Hmean   8     1449.93 +- 0.34% (        )      3830.56 +- 0.24% ( 164.19%)      1750.36 +- 4.65% (  20.72%)
Hmean  16     2874.26 +- 0.57% (        )      7381.62 +- 0.53% ( 156.82%)      4348.35 +- 2.22% (  51.29%)
Hmean  32     6116.17 +- 5.10% (        )     13013.05 +- 0.08% ( 112.76%)      8980.35 +- 0.66% (  46.83%)
Hmean  64    14485.04 +- 3.46% (        )     17835.12 +- 0.35% (  23.13%)     16540.73 +- 0.51% (  14.19%)
Hmean 128    30779.16 +- 3.20% (        )     32796.94 +- 2.13% (   6.56%)     31512.58 +- 0.20% (   2.38%)
Hmean 256    34664.66 +- 0.81% (        )     34604.67 +- 0.46% (  -0.17%)     34943.70 +- 0.25% (   0.80%)
Hmean 384    33957.51 +- 0.11% (        )     34091.50 +- 0.14% (   0.39%)     33921.41 +- 0.09% (  -0.11%)
Benchmark : kernbench (kernel compilation)
Varying parameter : number of jobs
Unit : seconds (lower is better)
            5.2.0 vanilla (BASELINE)         5.2.0 intel_pstate/HWP           5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean   2      332.94 +- 0.40% (        )       260.16 +- 0.45% (  21.86%)       233.56 +- 0.21% (  29.85%)
Amean   4      173.04 +- 0.43% (        )       138.76 +- 0.03% (  19.81%)       123.59 +- 0.11% (  28.58%)
Amean   8       89.65 +- 0.20% (        )        73.54 +- 0.09% (  17.97%)        65.69 +- 0.10% (  26.72%)
Amean  16       48.08 +- 1.41% (        )        41.64 +- 1.61% (  13.40%)        36.00 +- 1.80% (  25.11%)
Amean  32       28.78 +- 0.72% (        )        26.61 +- 1.99% (   7.55%)        23.19 +- 1.68% (  19.43%)
Amean  64       20.46 +- 1.85% (        )        19.76 +- 0.35% (   3.42%)        17.38 +- 0.92% (  15.06%)
Amean 128       18.69 +- 1.70% (        )        17.59 +- 1.04% (   5.90%)        15.73 +- 1.40% (  15.85%)
Amean 192       18.82 +- 1.01% (        )        17.76 +- 0.77% (   5.67%)        15.57 +- 1.80% (  17.28%)
Benchmark : gitsource (time to run the git unit test suite)
Varying parameter : none
Unit : seconds (lower is better)
            5.2.0 vanilla (BASELINE)         5.2.0 intel_pstate/HWP           5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean          792.49 +- 0.20% (        )       779.35 +- 0.24% (   1.66%)       427.14 +- 0.16% (  46.10%)
Signed-off-by: Giovanni Gherdovich <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/kernel/smpboot.c | 66 ++++++++++++++++++++++++++++++--------
1 file changed, 53 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 28696bc..ba9d3bd 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1841,24 +1841,52 @@ static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
{}
};
-static bool core_set_max_freq_ratio(void)
+static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
+{
+ u64 ratios, counts;
+ u32 group_size;
+ int err, i;
+
+ err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
+ if (err)
+ return false;
+
+ *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
+
+ err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &ratios);
+ if (err)
+ return false;
+
+ err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT1, &counts);
+ if (err)
+ return false;
+
+ for (i = 0; i < 64; i += 8) {
+ group_size = (counts >> i) & 0xFF;
+ if (group_size >= size) {
+ *turbo_freq = (ratios >> i) & 0xFF;
+ return true;
+ }
+ }
+
+ return false;
+}
+
+static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
{
- u64 base_freq, turbo_freq;
int err;
- err = rdmsrl_safe(MSR_PLATFORM_INFO, &base_freq);
+ err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
if (err)
return false;
- err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_freq);
+ err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, turbo_freq);
if (err)
return false;
- base_freq = (base_freq >> 8) & 0xFF; /* max P state */
- turbo_freq = (turbo_freq >> 24) & 0xFF; /* 4C turbo */
+ *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
+ *turbo_freq = (*turbo_freq >> 24) & 0xFF; /* 4C turbo */
- arch_max_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
- base_freq);
return true;
}
@@ -1867,21 +1895,33 @@ static bool intel_set_max_freq_ratio(void)
/*
* TODO: add support for:
*
- * - Xeon Gold/Platinum
* - Xeon Phi (KNM, KNL)
* - Atom Goldmont
* - Atom Silvermont
*/
- if (x86_match_cpu(has_skx_turbo_ratio_limits) ||
- x86_match_cpu(has_knl_turbo_ratio_limits) ||
+ u64 base_freq = 1, turbo_freq = 1;
+
+ if (x86_match_cpu(has_knl_turbo_ratio_limits) ||
x86_match_cpu(has_glm_turbo_ratio_limits))
return false;
- if (turbo_disabled() || core_set_max_freq_ratio())
- return true;
+ if (turbo_disabled())
+ goto out;
+
+ if (x86_match_cpu(has_skx_turbo_ratio_limits) &&
+ skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4))
+ goto out;
+
+ if (core_set_max_freq_ratio(&base_freq, &turbo_freq))
+ goto out;
return false;
+
+out:
+ arch_max_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
+ base_freq);
+ return true;
}
static void init_counter_refs(void *arg)
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 1567c3e3467cddeb019a7b53ec632f834b6a9239
Gitweb: https://git.kernel.org/tip/1567c3e3467cddeb019a7b53ec632f834b6a9239
Author: Giovanni Gherdovich <[email protected]>
AuthorDate: Wed, 22 Jan 2020 16:16:12 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 28 Jan 2020 21:36:59 +01:00
x86, sched: Add support for frequency invariance
Implement arch_scale_freq_capacity() for 'modern' x86. This function
is used by the scheduler to correctly account usage in the face of
DVFS.
The present patch addresses Intel processors specifically and has positive
performance and performance-per-watt implications for the schedutil cpufreq
governor, bringing it closer to, if not on-par with, the powersave governor
from the intel_pstate driver/framework.
Large performance gains are obtained when the machine is lightly loaded and
no regressions are observed at saturation. The benchmarks with the largest
gains are kernel compilation, tbench (the networking version of dbench) and
shell-intensive workloads.
1. FREQUENCY INVARIANCE: MOTIVATION
* Without it, a task looks larger if the CPU runs slower
2. PECULIARITIES OF X86
* freq invariance accounting requires knowing the ratio freq_curr/freq_max
2.1 CURRENT FREQUENCY
* Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz")
2.2 MAX FREQUENCY
* It varies with time (turbo). As an approximation, we set it to a
constant, i.e. 4-cores turbo frequency.
3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
* The invariant schedutil's formula has no feedback loop and reacts faster
to utilization changes
4. KNOWN LIMITATIONS
* In some cases tasks can't reach max util despite how hard they try
5. PERFORMANCE TESTING
5.1 MACHINES
* Skylake, Broadwell, Haswell
5.2 SETUP
* baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12
active cores turbo w/ invariant schedutil, and intel_pstate/powersave
5.3 BENCHMARK RESULTS
5.3.1 NEUTRAL BENCHMARKS
* NAS Parallel Benchmark (HPC), hackbench
5.3.2 NON-NEUTRAL BENCHMARKS
* tbench (10-30% better), kernbench (10-15% better),
shell-intensive-scripts (30-50% better)
* no regressions
5.3.3 SELECTION OF DETAILED RESULTS
5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
* dbench (5% worse on one machine), kernbench (3% worse),
tbench (5-10% better), shell-intensive-scripts (10-40% better)
6. MICROARCH'ES ADDRESSED HERE
* Xeon Core processors preceding the Scalable Performance line (Xeon Gold/Platinum
etc. have different MSR semantics for querying turbo levels)
7. REFERENCES
* MMTests performance testing framework, github.com/gormanm/mmtests
+-------------------------------------------------------------------------+
| 1. FREQUENCY INVARIANCE: MOTIVATION
+-------------------------------------------------------------------------+
For example, suppose a CPU has two frequencies: 500 and 1000 MHz. When
running a task that would consume 1/3rd of a CPU at 1000 MHz, it would
appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the
false impression this CPU is almost at capacity, even though it can go
faster [*]. In a nutshell, without frequency scale-invariance tasks look
larger just because the CPU is running slower.
[*] (footnote: this assumes a linear frequency/performance relation; which
everybody knows to be false, but given realities it's the best approximation
we can make.)
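The 500/1000 MHz example can be worked through numerically. The following is a minimal sketch, assuming utilization is expressed on a 0..1024 scale and that the correction is a plain multiplication by freq_curr/freq_max; the function name scale_util() is hypothetical and this is not the PELT code itself.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE 1024ULL

/* Scale a raw utilization (0..1024) by freq_curr/freq_max. */
static uint64_t scale_util(uint64_t util_raw, uint64_t freq_curr_mhz,
			   uint64_t freq_max_mhz)
{
	uint64_t freq_scale;

	assert(freq_max_mhz != 0);
	freq_scale = freq_curr_mhz * SCHED_CAPACITY_SCALE / freq_max_mhz;

	return util_raw * freq_scale / SCHED_CAPACITY_SCALE;
}
```

A task that appears to use 2/3 of the CPU at 500 MHz (raw utilization 683 out of 1024) is scaled to 341, which is about 1/3 of 1024 and matches what the same task would show at 1000 MHz.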
+-------------------------------------------------------------------------+
| 2. PECULIARITIES OF X86
+-------------------------------------------------------------------------+
Accounting for frequency changes in PELT signals requires the computation of
the ratio freq_curr / freq_max. On x86 neither of those terms is readily
available.
2.1 CURRENT FREQUENCY
====================
Since modern x86 has hardware control over the actual frequency we run
at (because amongst other things, Turbo-Mode), we cannot simply use
the frequency as requested through cpufreq.
Instead we use the APERF/MPERF MSRs to compute the effective frequency
over the recent past. Also, because reading MSRs is expensive, don't
do so every time we need the value, but amortize the cost by doing it
every tick.
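Roughly, the per-tick computation divides the APERF delta by the MPERF delta and normalizes by freq_max/freq_base. The sketch below shows that arithmetic in 1024-based fixed point on plain integers; the function name, the clamping to 1024 and the sample deltas are choices made for this illustration, and it does not guard against overflow for very large deltas.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1ULL << SCHED_CAPACITY_SHIFT)

/*
 * freq_scale = (delta_APERF / delta_MPERF) / (freq_max / freq_base),
 * expressed on a 0..1024 scale. `max_freq_ratio` is freq_max/freq_base
 * multiplied by 1024.
 */
static uint64_t freq_scale_from_deltas(uint64_t d_aperf, uint64_t d_mperf,
				       uint64_t max_freq_ratio)
{
	uint64_t scale;

	assert(d_mperf != 0 && max_freq_ratio != 0);
	scale = (d_aperf << (2 * SCHED_CAPACITY_SHIFT)) /
		(d_mperf * max_freq_ratio);

	/* Running above freq_max is reported as full scale in this sketch. */
	return scale > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : scale;
}
```

With equal deltas (the CPU ran at base frequency) and max_freq_ratio = 1082, the result is 969; with the APERF delta twice the MPERF delta the value saturates at 1024.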
2.2 MAX FREQUENCY
=================
Obtaining freq_max is also non-trivial because at any time the hardware can
provide a frequency boost to a selected subset of cores if the package has
enough power to spare (eg: Turbo Boost). This means that the maximum frequency
available to a given core changes with time.
The approach taken in this change is to arbitrarily set freq_max to a constant
value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most
microarchitectures, after evaluating the following candidates:
* 1-core (1C) turbo frequency (the fastest turbo state available)
* around base frequency (a.k.a. max P-state)
* something in between, such as 4C turbo
To interpret these options, consider that this is the denominator in
freq_curr/freq_max, and that ratio will be used to scale PELT signals such as
util_avg and load_avg. A large denominator will undershoot (util_avg looks a
bit smaller than it really is); vice versa, with a smaller denominator PELT
signals will tend to overshoot. Given that PELT drives frequency selection
in the schedutil governor, we will have:
    freq_max set to     | effect on DVFS
    --------------------+------------------
    1C turbo            | power efficiency (lower freq choices)
    base freq           | performance (higher util_avg, higher freq requests)
    4C turbo            | a bit of both
4C turbo proves to be a good compromise in a number of benchmarks (see below).
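To put numbers on the three candidates, here is a small sketch that expresses freq_max/freq_base in 1024-based fixed point, the same form as arch_max_freq_ratio. The function name max_freq_ratio() is made up; the frequencies in the usage note are those of the 8x-SKYLAKE-UMA machine described in section 5.1.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE 1024ULL

/* freq_max / freq_base, multiplied by 1024 and truncated. */
static uint64_t max_freq_ratio(uint64_t freq_max_mhz, uint64_t freq_base_mhz)
{
	assert(freq_base_mhz != 0);
	return freq_max_mhz * SCHED_CAPACITY_SCALE / freq_base_mhz;
}
```

With base = 3500 MHz, the candidates give 1141 for 1C turbo (3900 MHz), 1082 for 4C turbo (3700 MHz) and 1024 for base frequency. The larger this ratio, the smaller the resulting freq_curr/freq_max factor for a given current frequency, hence the undershoot with 1C turbo.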
+-------------------------------------------------------------------------+
| 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
+-------------------------------------------------------------------------+
Once an architecture implements a frequency scale-invariant utilization (the
PELT signal util_avg), schedutil switches its frequency selection formula from
freq_next = 1.25 * freq_curr * util [non-invariant util signal]
to
freq_next = 1.25 * freq_max * util [invariant util signal]
where, in the second formula, freq_max is set to the 1C turbo frequency (max
turbo). The advantage of the second formula, whose usage we unlock with this
patch, is that freq_next doesn't depend on the current frequency in an
iterative fashion, but can jump to any frequency in a single update. This
absence of feedback in the formula makes it quicker to react to utilization
changes and more robust against pathological instabilities.
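The difference between the two formulas can be seen with a few numbers. The sketch below implements the shared arithmetic 1.25 * freq * util with util on a 0..1024 scale; the name next_freq() and the sample frequencies are assumptions for the example, and the real governor additionally clamps the result to the frequencies the policy allows.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE 1024ULL

/*
 * freq_next = 1.25 * freq * util / 1024.
 * Pass freq_curr for the non-invariant formula, freq_max for the
 * invariant one.
 */
static uint64_t next_freq(uint64_t freq_mhz, uint64_t util)
{
	uint64_t boosted;

	assert(util <= SCHED_CAPACITY_SCALE);
	boosted = freq_mhz + (freq_mhz >> 2);	/* 1.25 * freq */

	return boosted * util / SCHED_CAPACITY_SCALE;
}
```

Non-invariant case: a fully busy CPU (util = 1024) at 800 MHz gets next_freq(800, 1024) = 1000, then next_freq(1000, 1024) = 1250 on the following update, climbing one step at a time. Invariant case: next_freq(3900, 512) = 2437 is reached in a single update regardless of the current frequency.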
Compare it to the update formula of intel_pstate/powersave:
freq_next = 1.25 * freq_max * Busy%
where again freq_max is 1C turbo and Busy% is the percentage of time not spent
idling (calculated with delta_MPERF / delta_TSC); essentially the same as
invariant schedutil, and largely responsible for intel_pstate/powersave's good
reputation. The non-invariant schedutil formula is derived from the invariant
one by approximating util_inv with util_raw * freq_curr / freq_max, but this
has limitations.
Testing shows improved performance due to better frequency selections when
the machine is lightly loaded, and essentially no change in behaviour at
saturation / overutilization.
+-------------------------------------------------------------------------+
| 4. KNOWN LIMITATIONS
+-------------------------------------------------------------------------+
It's been shown that it is possible to create pathological scenarios where a
CPU-bound task cannot reach max utilization, if the normalizing factor
freq_max is fixed to a constant value (see [Lelli-2018]).
If freq_max is set to 4C turbo as we do here, one needs to peg at least 5
cores in a package doing some busywork, and observe that none of those tasks
will ever reach max util (1024) because they're all running at less than the
4C turbo frequency.
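A back-of-the-envelope version of this scenario, using the turbo levels of the 80x-BROADWELL-NUMA machine from section 5.1: the sketch assumes every busy core runs exactly at the turbo level for that number of active cores, which is a simplification, and the names turbo_mhz[] and util_cap() are invented for the example.

```c
#include <assert.h>

/* Turbo frequency (MHz) by number of active cores; index 0 is 1 core. */
static const unsigned int turbo_mhz[] = {
	3600, 3600, 3400, 3300, 3200, 3100, 3000, 2900
};
#define NR_BINS (sizeof(turbo_mhz) / sizeof(turbo_mhz[0]))

/* Highest utilization (0..1024) a CPU-bound task can report. */
static unsigned int util_cap(unsigned int busy_cores)
{
	unsigned int freq_max, freq_curr, cap;

	assert(busy_cores >= 1 && busy_cores <= NR_BINS);
	freq_max = turbo_mhz[3];		/* 4C turbo, the normalizer */
	freq_curr = turbo_mhz[busy_cores - 1];
	cap = freq_curr * 1024u / freq_max;

	return cap > 1024u ? 1024u : cap;
}
```

Under these assumptions util_cap(4) is 1024, while util_cap(5) is 992 and util_cap(8) is 899: once more than four cores are busy, no task reaches 1024.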
While this concern still applies, we believe the performance benefit of
frequency scale-invariant PELT signals outweighs the cost of this limitation.
[Lelli-2018]
https://lore.kernel.org/lkml/[email protected]/
+-------------------------------------------------------------------------+
| 5. PERFORMANCE TESTING
+-------------------------------------------------------------------------+
5.1 MACHINES
============
We tested the patch on three machines, with Skylake, Broadwell and Haswell
CPUs. The details are below, together with the available turbo ratios as
reported by the appropriate MSRs.
* 8x-SKYLAKE-UMA:
Single socket E3-1240 v5, Skylake 4 cores/8 threads
Max EFFiciency, BASE frequency and available turbo levels (MHz):
EFFIC 800 |********
BASE 3500 |***********************************
4C 3700 |*************************************
3C 3800 |**************************************
2C 3900 |***************************************
1C 3900 |***************************************
* 80x-BROADWELL-NUMA:
Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads
Max EFFiciency, BASE frequency and available turbo levels (MHz):
EFFIC 1200 |************
BASE 2200 |**********************
8C 2900 |*****************************
7C 3000 |******************************
6C 3100 |*******************************
5C 3200 |********************************
4C 3300 |*********************************
3C 3400 |**********************************
2C 3600 |************************************
1C 3600 |************************************
* 48x-HASWELL-NUMA
Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads
Max EFFiciency, BASE frequency and available turbo levels (MHz):
EFFIC 1200 |************
BASE 2300 |***********************
12C 2600 |**************************
11C 2600 |**************************
10C 2600 |**************************
9C 2600 |**************************
8C 2600 |**************************
7C 2600 |**************************
6C 2600 |**************************
5C 2700 |***************************
4C 2800 |****************************
3C 2900 |*****************************
2C 3100 |*******************************
1C 3100 |*******************************
5.2 SETUP
=========
* The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate
driver in passive mode.
* The rationale for choosing the various freq_max values to test has been to
try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical
on all machines), plus one more value closer to base_freq but still in the
turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA).
* In addition we've run all tests with intel_pstate/powersave for comparison.
* The filesystem is always XFS, the userspace is openSUSE Leap 15.1.
* 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs
with active intel_pstate on this machine use that.
This gives, in terms of combinations tested on each machine:
* 8x-SKYLAKE-UMA
* Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive
* intel_pstate active + powersave + HWP
* invariant schedutil, freq_max = 1C turbo
* invariant schedutil, freq_max = 3C turbo
* invariant schedutil, freq_max = 4C turbo
* both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA
* [same as 8x-SKYLAKE-UMA, but not HWP capable]
* invariant schedutil, freq_max = 8C turbo
(which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo")
5.3 BENCHMARK RESULTS
=====================
5.3.1 NEUTRAL BENCHMARKS
------------------------
Tests that didn't show any measurable difference in performance on any of the
test machines between non-invariant schedutil and our patch are:
* NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any
computational kernel
* flexible I/O (FIO)
* hackbench (using threads or processes, and using pipes or sockets)
5.3.2 NON-NEUTRAL BENCHMARKS
----------------------------
What follows are summary tables where each benchmark result is given a score.
* A tilde (~) means a neutral result, i.e. no difference from baseline.
* Scores are computed with the ratio result_new / result_baseline, so a tilde
means a score of 1.00.
* The results in the score ratio are the geometric means of results running
the benchmark with different parameters (e.g. for kernbench: using 1, 2, 4,
... number of processes; for pgbench: varying the number of clients, and so
on).
* The first three tables show higher-is-better kind of tests (i.e. measured in
operations/second), the subsequent three show lower-is-better kind of tests
(i.e. the workload is fixed and we measure elapsed time, think kernbench).
* "gitsource" is a name we made up for the test consisting in running the
entire unit test suite of the Git SCM and measuring how long it takes. We
take it as a typical example of a shell-intensive serialized workload.
* In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other
columns show invariant schedutil for different values of freq_max. 4C turbo
is circled as it's the value we've chosen for the final implementation.
80x-BROADWELL-NUMA (comparison ratio; higher is better)
+------+
I_PSTATE 1C 3C | 4C | 8C
pgbench-ro 1.14 ~ ~ | 1.11 | 1.14
pgbench-rw ~ ~ ~ | ~ | ~
netperf-udp 1.06 ~ 1.06 | 1.05 | 1.07
netperf-tcp ~ 1.03 ~ | 1.01 | 1.02
tbench4 1.57 1.18 1.22 | 1.30 | 1.56
+------+
8x-SKYLAKE-UMA (comparison ratio; higher is better)
+------+
I_PSTATE/HWP 1C 3C | 4C |
pgbench-ro ~ ~ ~ | ~ |
pgbench-rw ~ ~ ~ | ~ |
netperf-udp ~ ~ ~ | ~ |
netperf-tcp ~ ~ ~ | ~ |
tbench4 1.30 1.14 1.14 | 1.16 |
+------+
48x-HASWELL-NUMA (comparison ratio; higher is better)
+------+
I_PSTATE 1C 3C | 4C | 12C
pgbench-ro 1.15 ~ ~ | 1.06 | 1.16
pgbench-rw ~ ~ ~ | ~ | ~
netperf-udp 1.05 0.97 1.04 | 1.04 | 1.02
netperf-tcp 0.96 1.01 1.01 | 1.01 | 1.01
tbench4 1.50 1.05 1.13 | 1.13 | 1.25
+------+
In the tables above we see that active intel_pstate is slightly better than our
4C-turbo patch (both in reference to the baseline non-invariant schedutil) on
read-only pgbench and much better on tbench. Both cases are notable in that
they show that lowering our freq_max (to 8C-turbo and 12C-turbo on
80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant
schedutil get closer to active intel_pstate.
If we ignore active intel_pstate and focus on the comparison with baseline
alone, there are several instances of double-digit performance improvement.
80x-BROADWELL-NUMA (comparison ratio; lower is better)
+------+
I_PSTATE 1C 3C | 4C | 8C
dbench4 1.23 0.95 0.95 | 0.95 | 0.95
kernbench 0.93 0.83 0.83 | 0.83 | 0.82
gitsource 0.98 0.49 0.49 | 0.49 | 0.48
+------+
8x-SKYLAKE-UMA (comparison ratio; lower is better)
+------+
I_PSTATE/HWP 1C 3C | 4C |
dbench4 ~ ~ ~ | ~ |
kernbench ~ ~ ~ | ~ |
gitsource 0.92 0.55 0.55 | 0.55 |
+------+
48x-HASWELL-NUMA (comparison ratio; lower is better)
+------+
I_PSTATE 1C 3C | 4C | 8C
dbench4 ~ ~ ~ | ~ | ~
kernbench 0.94 0.90 0.89 | 0.90 | 0.90
gitsource 0.97 0.69 0.69 | 0.69 | 0.69
+------+
dbench is not very remarkable here, unless we notice how poorly active
intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus
non-invariant schedutil. We repeated that run getting consistent results. Out
of scope for the patch at hand, but deserving future investigation. Other than
that, we previously ran this campaign with Linux v5.0 and saw the patch doing
better on dbench at the time. We haven't checked closely and can only speculate
at this point.
On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in
the detailed tables that the gains concentrate on low process counts (lightly
loaded machines).
The test we call "gitsource" (running the git unit test suite, a long-running
single-threaded shell script) appears rather spectacular in this table (gains
of 30-50% depending on the machine). It is to be noted, however, that
gitsource has no adjustable parameters (such as the number of jobs in
kernbench, which we average over in order to get a single-number summary
score) and is exactly the kind of low-parallelism workload that benefits the
most from this patch. When looking at the detailed tables of kernbench or
tbench4, at low process or client counts one can see similar numbers.
5.3.3 SELECTION OF DETAILED RESULTS
-----------------------------------
Machine : 48x-HASWELL-NUMA
Benchmark : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter : number of clients
Unit : MB/sec (higher is better)
5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean 1 126.73 +- 0.31% ( ) 315.91 +- 0.66% ( 149.28%) 125.03 +- 0.76% ( -1.34%)
Hmean 2 258.04 +- 0.62% ( ) 614.16 +- 0.51% ( 138.01%) 269.58 +- 1.45% ( 4.47%)
Hmean 4 514.30 +- 0.67% ( ) 1146.58 +- 0.54% ( 122.94%) 533.84 +- 1.99% ( 3.80%)
Hmean 8 1111.38 +- 2.52% ( ) 2159.78 +- 0.38% ( 94.33%) 1359.92 +- 1.56% ( 22.36%)
Hmean 16 2286.47 +- 1.36% ( ) 3338.29 +- 0.21% ( 46.00%) 2720.20 +- 0.52% ( 18.97%)
Hmean 32 4704.84 +- 0.35% ( ) 4759.03 +- 0.43% ( 1.15%) 4774.48 +- 0.30% ( 1.48%)
Hmean 64 7578.04 +- 0.27% ( ) 7533.70 +- 0.43% ( -0.59%) 7462.17 +- 0.65% ( -1.53%)
Hmean 128 6998.52 +- 0.16% ( ) 6987.59 +- 0.12% ( -0.16%) 6909.17 +- 0.14% ( -1.28%)
Hmean 192 6901.35 +- 0.25% ( ) 6913.16 +- 0.10% ( 0.17%) 6855.47 +- 0.21% ( -0.66%)
5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 12C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean 1 128.43 +- 0.28% ( 1.34%) 130.64 +- 3.81% ( 3.09%) 153.71 +- 5.89% ( 21.30%)
Hmean 2 311.70 +- 6.15% ( 20.79%) 281.66 +- 3.40% ( 9.15%) 305.08 +- 5.70% ( 18.23%)
Hmean 4 641.98 +- 2.32% ( 24.83%) 623.88 +- 5.28% ( 21.31%) 906.84 +- 4.65% ( 76.32%)
Hmean 8 1633.31 +- 1.56% ( 46.96%) 1714.16 +- 0.93% ( 54.24%) 2095.74 +- 0.47% ( 88.57%)
Hmean 16 3047.24 +- 0.42% ( 33.27%) 3155.02 +- 0.30% ( 37.99%) 3634.58 +- 0.15% ( 58.96%)
Hmean 32 4734.31 +- 0.60% ( 0.63%) 4804.38 +- 0.23% ( 2.12%) 4674.62 +- 0.27% ( -0.64%)
Hmean 64 7699.74 +- 0.35% ( 1.61%) 7499.72 +- 0.34% ( -1.03%) 7659.03 +- 0.25% ( 1.07%)
Hmean 128 6935.18 +- 0.15% ( -0.91%) 6942.54 +- 0.10% ( -0.80%) 7004.85 +- 0.12% ( 0.09%)
Hmean 192 6901.62 +- 0.12% ( 0.00%) 6856.93 +- 0.10% ( -0.64%) 6978.74 +- 0.10% ( 1.12%)
This is one of the cases where the patch still can't surpass active
intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are
visible up to 16 clients and the saturated scenario is the same as baseline.
The scores in the summary table from the previous sections are ratios of
geometric means of the results over different clients, as seen in this table.
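As an illustration of how such a summary score can be derived, here is a
minimal sketch in C. It recomputes the tbench4 scores for 48x-HASWELL-NUMA
from the Hmean columns of the table above. The function and array names are
made up for this example, and the n-th root is computed with a hand-rolled
Newton iteration only to keep the snippet free of a libm dependency; it assumes
n >= 1 and strictly positive results.

```c
#include <stddef.h>

/* tbench4 on 48x-HASWELL-NUMA, Hmean MB/sec for 1..192 clients (table above). */
const double tbench_baseline[] = { 126.73, 258.04, 514.30, 1111.38, 2286.47,
				   4704.84, 7578.04, 6998.52, 6901.35 };
const double tbench_ipstate[]  = { 315.91, 614.16, 1146.58, 2159.78, 3338.29,
				   4759.03, 7533.70, 6987.59, 6913.16 };
const double tbench_4c_turbo[] = { 130.64, 281.66, 623.88, 1714.16, 3155.02,
				   4804.38, 7499.72, 6942.54, 6856.93 };

/* n-th root of x > 0 by Newton's method (illustrative helper, n >= 1). */
double nth_root(double x, size_t n)
{
	double r = x > 1.0 ? x : 1.0;	/* start at or above the root */

	for (int iter = 0; iter < 200; iter++) {
		double p = 1.0;

		for (size_t k = 1; k < n; k++)
			p *= r;		/* p = r^(n-1) */
		r = ((double)(n - 1) * r + x / p) / (double)n;
	}
	return r;
}

/*
 * Ratio of geometric means, computed as the geometric mean of the
 * per-parameter ratios (the two are algebraically identical).
 */
double score_ratio(const double *result_new, const double *result_base,
		   size_t n)
{
	double prod = 1.0;

	for (size_t i = 0; i < n; i++)
		prod *= result_new[i] / result_base[i];
	return nth_root(prod, n);
}
```

With these inputs, score_ratio(tbench_4c_turbo, tbench_baseline, 9) comes out
at roughly 1.13 and score_ratio(tbench_ipstate, tbench_baseline, 9) at roughly
1.50, in line with the tbench4 row of the 48x-HASWELL-NUMA summary table.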
Machine : 80x-BROADWELL-NUMA
Benchmark : kernbench (kernel compilation)
Varying parameter : number of jobs
Unit : seconds (lower is better)
5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean 2 379.68 +- 0.06% ( ) 330.20 +- 0.43% ( 13.03%) 285.93 +- 0.07% ( 24.69%)
Amean 4 200.15 +- 0.24% ( ) 175.89 +- 0.22% ( 12.12%) 153.78 +- 0.25% ( 23.17%)
Amean 8 106.20 +- 0.31% ( ) 95.54 +- 0.23% ( 10.03%) 86.74 +- 0.10% ( 18.32%)
Amean 16 56.96 +- 1.31% ( ) 53.25 +- 1.22% ( 6.50%) 48.34 +- 1.73% ( 15.13%)
Amean 32 34.80 +- 2.46% ( ) 33.81 +- 0.77% ( 2.83%) 30.28 +- 1.59% ( 12.99%)
Amean 64 26.11 +- 1.63% ( ) 25.04 +- 1.07% ( 4.10%) 22.41 +- 2.37% ( 14.16%)
Amean 128 24.80 +- 1.36% ( ) 23.57 +- 1.23% ( 4.93%) 21.44 +- 1.37% ( 13.55%)
Amean 160 24.85 +- 0.56% ( ) 23.85 +- 1.17% ( 4.06%) 21.25 +- 1.12% ( 14.49%)
5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 8C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean 2 284.08 +- 0.13% ( 25.18%) 283.96 +- 0.51% ( 25.21%) 285.05 +- 0.21% ( 24.92%)
Amean 4 153.18 +- 0.22% ( 23.47%) 154.70 +- 1.64% ( 22.71%) 153.64 +- 0.30% ( 23.24%)
Amean 8 87.06 +- 0.28% ( 18.02%) 86.77 +- 0.46% ( 18.29%) 86.78 +- 0.22% ( 18.28%)
Amean 16 48.03 +- 0.93% ( 15.68%) 47.75 +- 1.99% ( 16.17%) 47.52 +- 1.61% ( 16.57%)
Amean 32 30.23 +- 1.20% ( 13.14%) 30.08 +- 1.67% ( 13.57%) 30.07 +- 1.67% ( 13.60%)
Amean 64 22.59 +- 2.02% ( 13.50%) 22.63 +- 0.81% ( 13.32%) 22.42 +- 0.76% ( 14.12%)
Amean 128 21.37 +- 0.67% ( 13.82%) 21.31 +- 1.15% ( 14.07%) 21.17 +- 1.93% ( 14.63%)
Amean 160 21.68 +- 0.57% ( 12.76%) 21.18 +- 1.74% ( 14.77%) 21.22 +- 1.00% ( 14.61%)
The patch outperforms active intel_pstate (and baseline) by a considerable
margin; the summary table from the previous section says 4C turbo and active
intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is
0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no
noticeable difference with regard to the value of freq_max.
Machine : 8x-SKYLAKE-UMA
Benchmark : gitsource (time to run the git unit test suite)
Varying parameter : none
Unit : seconds (lower is better)
5.2.0 vanilla 5.2.0 intel_pstate/hwp 5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean 858.85 +- 1.16% ( ) 791.94 +- 0.21% ( 7.79%) 474.95 ( 44.70%)
5.2.0 3C-turbo 5.2.0 4C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean 475.26 +- 0.20% ( 44.66%) 474.34 +- 0.13% ( 44.77%)
In this test, which is of interest as representing shell-intensive
(i.e. fork-intensive) serialized workloads, invariant schedutil outperforms
intel_pstate/powersave by a whopping 40% margin.
5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
---------------------------------------------
The following tables show average power consumption in watts for each
benchmark. Data comes from turbostat (package average), which in turn is read
from the RAPL interface on CPUs. We know the patch affects CPU frequencies so
it's reasonable to ignore other power consumers (such as memory or I/O). Also,
we don't have a power meter available in the lab so RAPL is the best we have.
turbostat sampled average power every 10 seconds for the entire duration of
each benchmark. We took all those values and averaged them (i.e. we don't
have detail on a per-parameter granularity, only on whole benchmarks).
80x-BROADWELL-NUMA (power consumption, watts)
+--------+
BASELINE I_PSTATE 1C 3C | 4C | 8C
pgbench-ro 130.01 142.77 131.11 132.45 | 134.65 | 136.84
pgbench-rw 68.30 60.83 71.45 71.70 | 71.65 | 72.54
dbench4 90.25 59.06 101.43 99.89 | 101.10 | 102.94
netperf-udp 65.70 69.81 66.02 68.03 | 68.27 | 68.95
netperf-tcp 88.08 87.96 88.97 88.89 | 88.85 | 88.20
tbench4 142.32 176.73 153.02 163.91 | 165.58 | 176.07
kernbench 92.94 101.95 114.91 115.47 | 115.52 | 115.10
gitsource 40.92 41.87 75.14 75.20 | 75.40 | 75.70
+--------+
8x-SKYLAKE-UMA (power consumption, watts)
+--------+
BASELINE I_PSTATE/HWP 1C 3C | 4C |
pgbench-ro 46.49 46.68 46.56 46.59 | 46.52 |
pgbench-rw 29.34 31.38 30.98 31.00 | 31.00 |
dbench4 27.28 27.37 27.49 27.41 | 27.38 |
netperf-udp 22.33 22.41 22.36 22.35 | 22.36 |
netperf-tcp 27.29 27.29 27.30 27.31 | 27.33 |
tbench4 41.13 45.61 43.10 43.33 | 43.56 |
kernbench 42.56 42.63 43.01 43.01 | 43.01 |
gitsource 13.32 13.69 17.33 17.30 | 17.35 |
+--------+
48x-HASWELL-NUMA (power consumption, watts)
+--------+
BASELINE I_PSTATE 1C 3C | 4C | 12C
pgbench-ro 128.84 136.04 129.87 132.43 | 132.30 | 134.86
pgbench-rw 37.68 37.92 37.17 37.74 | 37.73 | 37.31
dbench4 28.56 28.73 28.60 28.73 | 28.70 | 28.79
netperf-udp 56.70 60.44 56.79 57.42 | 57.54 | 57.52
netperf-tcp 75.49 75.27 75.87 76.02 | 76.01 | 75.95
tbench4 115.44 139.51 119.53 123.07 | 123.97 | 130.22
kernbench 83.23 91.55 95.58 95.69 | 95.72 | 96.04
gitsource 36.79 36.99 39.99 40.34 | 40.35 | 40.23
+--------+
A lower power consumption isn't necessarily better; it depends on what is done
with that energy. Here are tables with the ratio of performance-per-watt on
each machine and benchmark. Higher is always better; a tilde (~) means a
neutral ratio (i.e. 1.00).
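The text doesn't spell out how the performance-per-watt ratio is obtained. One
plausible reading, which reproduces several entries of the tables when fed the
rounded numbers printed above, is "performance ratio times the inverse of the
power ratio", with performance taken as the inverse of elapsed time for the
lower-is-better tests. The sketch below assumes exactly that; the function name
and its parameters are hypothetical.

```c
#include <stdbool.h>

/*
 * score:            result_new / result_baseline from the summary tables
 * higher_is_better: true for throughput tests, false for elapsed-time tests
 * watts_base/new:   average package power from the power tables
 *
 * Assumed formula, not taken from the source:
 *   (perf_new / perf_base) * (watts_base / watts_new)
 */
double perf_per_watt_ratio(double score, bool higher_is_better,
			   double watts_base, double watts_new)
{
	/* For elapsed-time tests, performance is the inverse of the time. */
	double perf_ratio = higher_is_better ? score : 1.0 / score;

	return perf_ratio * (watts_base / watts_new);
}
```

For example, gitsource on 8x-SKYLAKE-UMA with 4C turbo (score 0.55, power going
from 13.32 W to 17.35 W) gives about 1.40, and tbench4 on 80x-BROADWELL-NUMA
with 4C turbo (score 1.30, 142.32 W to 165.58 W) gives about 1.12. Since the
inputs are already rounded to two decimals, agreement with the tables can only
be approximate.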
80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better)
+------+
I_PSTATE 1C 3C | 4C | 8C
pgbench-ro 1.04 1.06 0.94 | 1.07 | 1.08
pgbench-rw 1.10 0.97 0.96 | 0.96 | 0.97
dbench4 1.24 0.94 0.95 | 0.94 | 0.92
netperf-udp ~ 1.02 1.02 | ~ | 1.02
netperf-tcp ~ 1.02 ~ | ~ | 1.02
tbench4 1.26 1.10 1.06 | 1.12 | 1.26
kernbench 0.98 0.97 0.97 | 0.97 | 0.98
gitsource ~ 1.11 1.11 | 1.11 | 1.13
+------+
8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better)
+------+
I_PSTATE/HWP 1C 3C | 4C |
pgbench-ro ~ ~ ~ | ~ |
pgbench-rw 0.95 0.97 0.96 | 0.96 |
dbench4 ~ ~ ~ | ~ |
netperf-udp ~ ~ ~ | ~ |
netperf-tcp ~ ~ ~ | ~ |
tbench4 1.17 1.09 1.08 | 1.10 |
kernbench ~ ~ ~ | ~ |
gitsource 1.06 1.40 1.40 | 1.40 |
+------+
48x-HASWELL-NUMA (performance-per-watt ratios; higher is better)
+------+
I_PSTATE 1C 3C | 4C | 12C
pgbench-ro 1.09 ~ 1.09 | 1.03 | 1.11
pgbench-rw ~ 0.86 ~ | ~ | 0.86
dbench4 ~ 1.02 1.02 | 1.02 | ~
netperf-udp ~ 0.97 1.03 | 1.02 | ~
netperf-tcp 0.96 ~ ~ | ~ | ~
tbench4 1.24 ~ 1.06 | 1.05 | 1.11
kernbench 0.97 0.97 0.98 | 0.97 | 0.96
gitsource 1.03 1.33 1.32 | 1.32 | 1.33
+------+
These results are overall pleasing: in plenty of cases we observe
performance-per-watt improvements. The few regressions (read/write pgbench and
dbench on the Broadwell machine) are of small magnitude. kernbench loses a few
percentage points (it has a 10-15% performance improvement, but apparently the
increase in power consumption is larger than that). tbench4 and gitsource, which
benefit the most from the patch, keep a positive score in this table which is
a welcome surprise; that suggests that in those particular workloads the
non-invariant schedutil (and active intel_pstate, too) makes some rather
suboptimal frequency selections.
+-------------------------------------------------------------------------+
| 6. MICROARCH'ES ADDRESSED HERE
+-------------------------------------------------------------------------+
The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and
MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies
respectively. This excludes the recent Xeon Scalable Performance processors
line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently.
Subsequent patches will address:
* Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus
* Xeon Phi (Knights Landing, Knights Mill)
* Atom Silvermont
+-------------------------------------------------------------------------+
| 7. REFERENCES
+-------------------------------------------------------------------------+
Tests have been run with the help of the MMTests performance testing
framework, see github.com/gormanm/mmtests. The configuration file names for
the benchmarks used are:
db-pgbench-timed-ro-small-xfs
db-pgbench-timed-rw-small-xfs
io-dbench4-async-xfs
network-netperf-unbound
network-tbench
scheduler-unbound
workload-kerndevel-xfs
workload-shellscripts-xfs
hpc-nas-c-class-mpi-full-xfs
hpc-nas-c-class-omp-full
All those benchmarks are generally available on the web:
pgbench: https://www.postgresql.org/docs/10/pgbench.html
netperf: https://hewlettpackard.github.io/netperf/
dbench/tbench: https://dbench.samba.org/
gitsource: git unit test suite, github.com/git/git
NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html
hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Giovanni Gherdovich <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Doug Smythies <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/topology.h | 20 +++-
arch/x86/kernel/smpboot.c | 183 ++++++++++++++++++++++++++++++-
kernel/sched/core.c | 1 +-
kernel/sched/sched.h | 7 +-
4 files changed, 210 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 4b14d23..2ebf7b7 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -193,4 +193,24 @@ static inline void sched_clear_itmt_support(void)
}
#endif /* CONFIG_SCHED_MC_PRIO */
+#ifdef CONFIG_SMP
+#include <asm/cpufeature.h>
+
+DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);
+
+#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key)
+
+DECLARE_PER_CPU(unsigned long, arch_freq_scale);
+
+static inline long arch_scale_freq_capacity(int cpu)
+{
+ return per_cpu(arch_freq_scale, cpu);
+}
+#define arch_scale_freq_capacity arch_scale_freq_capacity
+
+extern void arch_scale_freq_tick(void);
+#define arch_scale_freq_tick arch_scale_freq_tick
+
+#endif
+
#endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 69881b2..28696bc 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -147,6 +147,8 @@ static inline void smpboot_restore_warm_reset_vector(void)
*((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0;
}
+static void init_freq_invariance(void);
+
/*
* Report back to the Boot Processor during boot time or to the caller processor
* during CPU online.
@@ -183,6 +185,8 @@ static void smp_callin(void)
*/
set_cpu_sibling_map(raw_smp_processor_id());
+ init_freq_invariance();
+
/*
* Get our bogomips.
* Update loops_per_jiffy in cpu_data. Previous call to
@@ -1337,7 +1341,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
set_sched_topology(x86_topology);
set_cpu_sibling_map(0);
-
+ init_freq_invariance();
smp_sanity_check();
switch (apic_intr_mode) {
@@ -1764,3 +1768,180 @@ void native_play_dead(void)
}
#endif
+
+/*
+ * APERF/MPERF frequency ratio computation.
+ *
+ * The scheduler wants to do frequency invariant accounting and needs a <1
+ * ratio to account for the 'current' frequency, corresponding to
+ * freq_curr / freq_max.
+ *
+ * Since the frequency freq_curr on x86 is controlled by micro-controller and
+ * our P-state setting is little more than a request/hint, we need to observe
+ * the effective frequency 'BusyMHz', i.e. the average frequency over a time
+ * interval after discarding idle time. This is given by:
+ *
+ * BusyMHz = delta_APERF / delta_MPERF * freq_base
+ *
+ * where freq_base is the max non-turbo P-state.
+ *
+ * The freq_max term has to be set to a somewhat arbitrary value, because we
+ * can't know which turbo states will be available at a given point in time:
+ * it all depends on the thermal headroom of the entire package. We set it to
+ * the turbo level with 4 cores active.
+ *
+ * Benchmarks show that's a good compromise between the 1C turbo ratio
+ * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
+ * which would ignore the entire turbo range (a conspicuous part, making
+ * freq_curr/freq_max always maxed out).
+ *
+ * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
+ * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
+ */
+
+DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
+
+static DEFINE_PER_CPU(u64, arch_prev_aperf);
+static DEFINE_PER_CPU(u64, arch_prev_mperf);
+static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
+
+static bool turbo_disabled(void)
+{
+ u64 misc_en;
+ int err;
+
+ err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
+ if (err)
+ return false;
+
+ return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
+}
+
+#include <asm/cpu_device_id.h>
+#include <asm/intel-family.h>
+
+#define ICPU(model) \
+ {X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0}
+
+static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
+ ICPU(INTEL_FAM6_XEON_PHI_KNL),
+ ICPU(INTEL_FAM6_XEON_PHI_KNM),
+ {}
+};
+
+static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
+ ICPU(INTEL_FAM6_SKYLAKE_X),
+ {}
+};
+
+static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
+ ICPU(INTEL_FAM6_ATOM_GOLDMONT),
+ ICPU(INTEL_FAM6_ATOM_GOLDMONT_D),
+ ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS),
+ {}
+};
+
+static bool core_set_max_freq_ratio(void)
+{
+ u64 base_freq, turbo_freq;
+ int err;
+
+ err = rdmsrl_safe(MSR_PLATFORM_INFO, &base_freq);
+ if (err)
+ return false;
+
+ err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_freq);
+ if (err)
+ return false;
+
+ base_freq = (base_freq >> 8) & 0xFF; /* max P state */
+ turbo_freq = (turbo_freq >> 24) & 0xFF; /* 4C turbo */
+
+ arch_max_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
+ base_freq);
+ return true;
+}
+
+static bool intel_set_max_freq_ratio(void)
+{
+ /*
+ * TODO: add support for:
+ *
+ * - Xeon Gold/Platinum
+ * - Xeon Phi (KNM, KNL)
+ * - Atom Goldmont
+ * - Atom Silvermont
+ */
+
+ if (x86_match_cpu(has_skx_turbo_ratio_limits) ||
+ x86_match_cpu(has_knl_turbo_ratio_limits) ||
+ x86_match_cpu(has_glm_turbo_ratio_limits))
+ return false;
+
+ if (turbo_disabled() || core_set_max_freq_ratio())
+ return true;
+
+ return false;
+}
+
+static void init_counter_refs(void *arg)
+{
+ u64 aperf, mperf;
+
+ rdmsrl(MSR_IA32_APERF, aperf);
+ rdmsrl(MSR_IA32_MPERF, mperf);
+
+ this_cpu_write(arch_prev_aperf, aperf);
+ this_cpu_write(arch_prev_mperf, mperf);
+}
+
+static void init_freq_invariance(void)
+{
+ bool ret = false;
+
+ if (smp_processor_id() != 0 || !boot_cpu_has(X86_FEATURE_APERFMPERF))
+ return;
+
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+ ret = intel_set_max_freq_ratio();
+
+ if (ret) {
+ on_each_cpu(init_counter_refs, NULL, 1);
+ static_branch_enable(&arch_scale_freq_key);
+ } else {
+ pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
+ }
+}
+
+DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
+
+void arch_scale_freq_tick(void)
+{
+ u64 freq_scale;
+ u64 aperf, mperf;
+ u64 acnt, mcnt;
+
+ if (!arch_scale_freq_invariant())
+ return;
+
+ rdmsrl(MSR_IA32_APERF, aperf);
+ rdmsrl(MSR_IA32_MPERF, mperf);
+
+ acnt = aperf - this_cpu_read(arch_prev_aperf);
+ mcnt = mperf - this_cpu_read(arch_prev_mperf);
+ if (!mcnt)
+ return;
+
+ this_cpu_write(arch_prev_aperf, aperf);
+ this_cpu_write(arch_prev_mperf, mperf);
+
+ acnt <<= 2*SCHED_CAPACITY_SHIFT;
+ mcnt *= arch_max_freq_ratio;
+
+ freq_scale = div64_u64(acnt, mcnt);
+
+ if (freq_scale > SCHED_CAPACITY_SCALE)
+ freq_scale = SCHED_CAPACITY_SCALE;
+
+ this_cpu_write(arch_freq_scale, freq_scale);
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 89e54f3..45f79bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3600,6 +3600,7 @@ void scheduler_tick(void)
struct task_struct *curr = rq->curr;
struct rq_flags rf;
+ arch_scale_freq_tick();
sched_clock_tick();
rq_lock(rq, &rf);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1a88dc8..0844e81 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1968,6 +1968,13 @@ static inline int hrtick_enabled(struct rq *rq)
#endif /* CONFIG_SCHED_HRTICK */
+#ifndef arch_scale_freq_tick
+static __always_inline
+void arch_scale_freq_tick(void)
+{
+}
+#endif
+
#ifndef arch_scale_freq_capacity
static __always_inline
unsigned long arch_scale_freq_capacity(int cpu)
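
For readers who want to check the fixed-point arithmetic of the patch above
outside the kernel, here is a minimal userspace sketch. It mirrors the bit
fields read in core_set_max_freq_ratio() and the division performed in
arch_scale_freq_tick(), but it is not the kernel code: the function names are
invented for this example, the MSR values are passed in as plain integers
instead of being read with rdmsrl(), and the base/turbo ratios used in the
usage note (16 and 24) are hypothetical.

```c
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1ULL << SCHED_CAPACITY_SHIFT)

/* Bits 15:8 of MSR_PLATFORM_INFO: max non-turbo ratio (base frequency). */
uint64_t base_ratio_from_msr(uint64_t platform_info)
{
	return (platform_info >> 8) & 0xFF;
}

/* Bits 31:24 of MSR_TURBO_RATIO_LIMIT: turbo ratio with 4 cores active. */
uint64_t turbo_4c_ratio_from_msr(uint64_t turbo_ratio_limit)
{
	return (turbo_ratio_limit >> 24) & 0xFF;
}

/* freq_max / freq_base in 1024ths, like arch_max_freq_ratio in the patch. */
uint64_t max_freq_ratio(uint64_t base_ratio, uint64_t turbo_ratio)
{
	return turbo_ratio * SCHED_CAPACITY_SCALE / base_ratio;
}

/*
 * freq_curr / freq_max in 1024ths from APERF/MPERF deltas, clipped to 1024.
 * Returns prev_scale unchanged when delta_mperf is zero.
 *
 *   freq_curr / freq_max = (dA / dM * freq_base) / freq_max
 *                        = (dA * 1024) / (dM * max_ratio)
 * and one more factor of 1024 expresses the result in 1024ths.
 */
uint64_t freq_scale(uint64_t delta_aperf, uint64_t delta_mperf,
		    uint64_t max_ratio, uint64_t prev_scale)
{
	uint64_t scale;

	if (!delta_mperf)
		return prev_scale;

	scale = (delta_aperf << (2 * SCHED_CAPACITY_SHIFT)) /
		(delta_mperf * max_ratio);

	return scale > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : scale;
}
```

With a hypothetical base ratio of 16 and 4C turbo ratio of 24, max_freq_ratio()
returns 1536. A CPU running at base frequency (delta_aperf == delta_mperf) then
gets a scale of 682, one running exactly at the 4C turbo level (deltas in a 3:2
proportion) gets 1024, and anything faster is clipped to 1024.
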
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 8bea0dfb4a820ae063568a87cc2e7d8f587377af
Gitweb: https://git.kernel.org/tip/8bea0dfb4a820ae063568a87cc2e7d8f587377af
Author: Giovanni Gherdovich <[email protected]>
AuthorDate: Wed, 22 Jan 2020 16:16:14 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 28 Jan 2020 21:37:02 +01:00
x86, sched: Add support for frequency invariance on XEON_PHI_KNL/KNM
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On Xeon Phi CPUs set freq_max to the second-highest frequency
reported by the CPU.
Xeon Phi CPUs such as Knights Landing and Knights Mill typically have either
one or two turbo frequencies; in the former case that's 100 MHz above the base
frequency, in the latter case the two levels are 100 MHz and 200 MHz above
base frequency.
We set freq_max to the second-highest frequency reported by the CPU. This
could be the base frequency (if only one turbo level is available) or the first
turbo level (if two levels are available). The rationale is to compromise
between power efficiency and performance -- going straight to max turbo would
favor efficiency and blindly using base freq would favor performance.
For reference, this is how MSR_TURBO_RATIO_LIMIT must be parsed on a Xeon Phi
to get the available frequencies (taken from a comment in turbostat's sources):
[0] -- Reserved
[7:1] -- Base value of number of active cores of bucket 1.
[15:8] -- Base value of freq ratio of bucket 1.
[20:16] -- +ve delta of number of active cores of bucket 2.
i.e. active cores of bucket 2 =
active cores of bucket 1 + delta
[23:21] -- Negative delta of freq ratio of bucket 2.
i.e. freq ratio of bucket 2 =
freq ratio of bucket 1 - delta
[28:24] -- +ve delta of number of active cores of bucket 3.
[31:29] -- -ve delta of freq ratio of bucket 3.
[36:32] -- +ve delta of number of active cores of bucket 4.
[39:37] -- -ve delta of freq ratio of bucket 4.
[44:40] -- +ve delta of number of active cores of bucket 5.
[47:45] -- -ve delta of freq ratio of bucket 5.
[52:48] -- +ve delta of number of active cores of bucket 6.
[55:53] -- -ve delta of freq ratio of bucket 6.
[60:56] -- +ve delta of number of active cores of bucket 7.
[63:61] -- -ve delta of freq ratio of bucket 7.
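
The layout above can be decoded with a few shifts and masks. The following is
a minimal userspace sketch, not the kernel or turbostat code: struct knl_bucket,
knl_decode_buckets() and knl_ratio_after_steps() are names invented for this
example, and the register value is passed in as a plain integer. The second
helper follows the idea of "skip N distinct lower frequency levels" and, as a
choice made for this sketch only, returns 0 when the register doesn't describe
that many levels.

```c
#include <stdint.h>

#define KNL_NR_BUCKETS 7

struct knl_bucket {
	unsigned int cores;	/* number of active cores for this bucket */
	unsigned int ratio;	/* frequency ratio for this bucket */
};

/* Expand the base value and the six deltas into absolute values. */
void knl_decode_buckets(uint64_t msr, struct knl_bucket out[KNL_NR_BUCKETS])
{
	unsigned int cores = (msr >> 1) & 0x7F;	/* bits [7:1]  */
	unsigned int ratio = (msr >> 8) & 0xFF;	/* bits [15:8] */

	out[0].cores = cores;
	out[0].ratio = ratio;

	for (int b = 1; b < KNL_NR_BUCKETS; b++) {
		unsigned int shift = 16 + 8 * (b - 1);

		cores += (msr >> shift) & 0x1F;		/* 5-bit core delta  */
		ratio -= (msr >> (shift + 5)) & 0x7;	/* 3-bit ratio delta */
		out[b].cores = cores;
		out[b].ratio = ratio;
	}
}

/*
 * Ratio reached after stepping down `nr_steps` distinct frequency levels
 * from bucket 1, or 0 if fewer levels are encoded in the register.
 */
unsigned int knl_ratio_after_steps(uint64_t msr, int nr_steps)
{
	unsigned int ratio = (msr >> 8) & 0xFF;
	int found = 0;

	if (nr_steps == 0)
		return ratio;

	for (unsigned int shift = 16; shift < 64; shift += 8) {
		unsigned int delta = (msr >> (shift + 5)) & 0x7;

		if (!delta)
			continue;	/* same frequency as previous bucket */
		ratio -= delta;
		if (++found == nr_steps)
			return ratio;
	}
	return 0;
}
```

As a usage example with a synthetic register value (not read from real
hardware) encoding 30 cores at ratio 12 in bucket 1, +31 cores and -1 ratio in
bucket 2, and +7 cores in bucket 3, the decoder yields 30 cores/ratio 12,
61 cores/ratio 11 and 68 cores/ratio 11, and knl_ratio_after_steps(msr, 1)
returns 11, i.e. the second-highest frequency.
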
1. PERFORMANCE EVALUATION: TBENCH +5%
2. NEUTRAL BENCHMARKS (ALL OTHERS)
3. TEST SETUP
1. PERFORMANCE EVALUATION: TBENCH +5%
-------------------------------------
A performance evaluation was conducted on a Knights Mill machine (see "Test
Setup" below), where the frequency-invariance patch (on schedutil) is compared
to both non-invariant schedutil and active intel_pstate with powersave: all
three tested kernels behave the same performance-wise and with regard to power
consumption (performance per watt). The only notable difference is tbench:
comparison ratio of performance with baseline; 1.00 means neutral,
higher is better:
I_PSTATE FREQ-INV
----------------------------------------
tbench 1.04 1.05
performance-per-watt ratios with baseline; 1.00 means neutral, higher is better:
I_PSTATE FREQ-INV
----------------------------------------
tbench 1.03 1.04
which essentially means that frequency-invariant schedutil is 5% better than
baseline, the same as intel_pstate+powersave.
As the results above are averaged over the varying parameter, here is the
detailed table.
Varying parameter : number of clients
Unit : MB/sec (higher is better)
5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean 1 49.06 +- 2.12% ( ) 51.66 +- 1.52% ( 5.30%) 52.87 +- 0.88% ( 7.76%)
Hmean 2 93.82 +- 0.45% ( ) 103.24 +- 0.70% ( 10.05%) 105.90 +- 0.70% ( 12.88%)
Hmean 4 192.46 +- 1.15% ( ) 215.95 +- 0.60% ( 12.21%) 215.78 +- 1.43% ( 12.12%)
Hmean 8 406.74 +- 2.58% ( ) 438.58 +- 0.36% ( 7.83%) 437.61 +- 0.97% ( 7.59%)
Hmean 16 857.70 +- 1.22% ( ) 890.26 +- 0.72% ( 3.80%) 889.11 +- 0.73% ( 3.66%)
Hmean 32 1760.10 +- 0.92% ( ) 1791.70 +- 0.44% ( 1.79%) 1787.95 +- 0.44% ( 1.58%)
Hmean 64 3183.50 +- 0.34% ( ) 3183.19 +- 0.36% ( -0.01%) 3187.53 +- 0.36% ( 0.13%)
Hmean 128 4830.96 +- 0.31% ( ) 4846.53 +- 0.30% ( 0.32%) 4855.86 +- 0.30% ( 0.52%)
Hmean 256 5467.98 +- 0.38% ( ) 5793.80 +- 0.28% ( 5.96%) 5821.94 +- 0.17% ( 6.47%)
Hmean 512 5398.10 +- 0.06% ( ) 5745.56 +- 0.08% ( 6.44%) 5503.68 +- 0.07% ( 1.96%)
Hmean 1024 5290.43 +- 0.63% ( ) 5221.07 +- 0.47% ( -1.31%) 5277.22 +- 0.80% ( -0.25%)
Hmean 1088 5139.71 +- 0.57% ( ) 5236.02 +- 0.71% ( 1.87%) 5190.57 +- 0.41% ( 0.99%)
2. NEUTRAL BENCHMARKS (ALL OTHERS)
----------------------------------
* pgbench (both read/write and read-only)
* NAS Parallel Benchmarks (NPB), using either MPI or OpenMP
* hackbench
* netperf
* dbench
* kernbench
* gitsource (git unit test suite)
3. TEST SETUP
-------------
Test machine:
CPU Model : Intel Xeon Phi CPU 7255 @ 1.10GHz (a.k.a. Knights Mill)
Fam/Mod/Ste : 6:133:0
Topology : 1 socket, 68 cores / 272 threads
Memory : 96G
Storage : rotary, XFS filesystem
Max EFFICiency, BASE frequency and available turbo levels (MHz):
EFFIC 1000 |**********
BASE 1100 |***********
68C 1100 |***********
30C 1200 |************
Tested kernels:
Baseline : v5.2, intel_pstate passive, schedutil
Comparison #1 : v5.2, intel_pstate active , powersave
Comparison #2 : v5.2, this patch, intel_pstate passive, schedutil
Signed-off-by: Giovanni Gherdovich <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/kernel/smpboot.c | 49 +++++++++++++++++++++++++++++++++++---
1 file changed, 46 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ba9d3bd..8cb3113 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1841,6 +1841,48 @@ static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
{}
};
+static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
+ int num_delta_fratio)
+{
+ int fratio, delta_fratio, found;
+ int err, i;
+ u64 msr;
+
+ if (!x86_match_cpu(has_knl_turbo_ratio_limits))
+ return false;
+
+ err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
+ if (err)
+ return false;
+
+ *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
+
+ err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
+ if (err)
+ return false;
+
+ fratio = (msr >> 8) & 0xFF;
+ i = 16;
+ found = 0;
+ do {
+ if (found >= num_delta_fratio) {
+ *turbo_freq = fratio;
+ return true;
+ }
+
+ delta_fratio = (msr >> (i + 5)) & 0x7;
+
+ if (delta_fratio) {
+ found += 1;
+ fratio -= delta_fratio;
+ }
+
+ i += 8;
+ } while (i < 64);
+
+ return true;
+}
+
static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
{
u64 ratios, counts;
@@ -1895,20 +1937,21 @@ static bool intel_set_max_freq_ratio(void)
/*
* TODO: add support for:
*
- * - Xeon Phi (KNM, KNL)
* - Atom Goldmont
* - Atom Silvermont
*/
u64 base_freq = 1, turbo_freq = 1;
- if (x86_match_cpu(has_knl_turbo_ratio_limits) ||
- x86_match_cpu(has_glm_turbo_ratio_limits))
+ if (x86_match_cpu(has_glm_turbo_ratio_limits))
return false;
if (turbo_disabled())
goto out;
+ if (knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
+ goto out;
+
if (x86_match_cpu(has_skx_turbo_ratio_limits) &&
skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4))
goto out;
The following commit has been merged into the sched/core branch of tip:
Commit-ID: eacf0474aec8bdccdc7f19386319127c67be3588
Gitweb: https://git.kernel.org/tip/eacf0474aec8bdccdc7f19386319127c67be3588
Author: Giovanni Gherdovich <[email protected]>
AuthorDate: Wed, 22 Jan 2020 16:16:15 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 28 Jan 2020 21:37:04 +01:00
x86, sched: Add support for frequency invariance on ATOM_GOLDMONT*
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On GOLDMONT (aka Apollo Lake), GOLDMONT_D (aka Denverton) and
GOLDMONT_PLUS CPUs (aka Gemini Lake) set freq_max to the highest frequency
reported by the CPU.
The encoding of turbo ratios for GOLDMONT* is identical to the one for
SKYLAKE_X, but we treat the Atom case apart because we want to set freq_max to
a higher value, thus the ratio freq_curr/freq_max to be lower, leading to more
conservative frequency selections (favoring power efficiency).
Signed-off-by: Giovanni Gherdovich <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/kernel/smpboot.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 8cb3113..3e32d62 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1795,6 +1795,10 @@ void native_play_dead(void)
* which would ignore the entire turbo range (a conspicuous part, making
* freq_curr/freq_max always maxed out).
*
+ * An exception to the heuristic above is the Atom uarch, where we choose the
+ * highest turbo level for freq_max since Atom's are generally oriented towards
+ * power efficiency.
+ *
* Setting freq_max to anything less than the 1C turbo ratio makes the ratio
* freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
*/
@@ -1937,18 +1941,18 @@ static bool intel_set_max_freq_ratio(void)
/*
* TODO: add support for:
*
- * - Atom Goldmont
* - Atom Silvermont
*/
u64 base_freq = 1, turbo_freq = 1;
- if (x86_match_cpu(has_glm_turbo_ratio_limits))
- return false;
-
if (turbo_disabled())
goto out;
+ if (x86_match_cpu(has_glm_turbo_ratio_limits) &&
+ skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
+ goto out;
+
if (knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
goto out;
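To make the effect of the freq_max choice concrete, here is a small user-space sketch (not kernel code). The function names, the sample ratios (base 20, 4C turbo 30, 1C turbo 35) and the exact fixed-point form of the per-tick scale are assumptions made for illustration. Only the 1024 scale, the turbo/base ratio and the clipping to 1 are taken from the patches in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1ULL << SCHED_CAPACITY_SHIFT)

/* Fixed-point turbo/base ratio; 1024 means "freq_max == freq_base". */
uint64_t model_max_freq_ratio(uint64_t base_ratio, uint64_t turbo_ratio)
{
	assert(base_ratio != 0);
	return turbo_ratio * SCHED_CAPACITY_SCALE / base_ratio;
}

/*
 * Assumed model of the scale factor:
 *   (delta_aperf / delta_mperf) / (freq_max / freq_base)
 * expressed in units of 1/1024 and clipped to 1024 (i.e. to 1.0).
 */
uint64_t model_freq_scale(uint64_t delta_aperf, uint64_t delta_mperf,
			  uint64_t max_ratio)
{
	uint64_t scale;

	assert(delta_mperf != 0 && max_ratio != 0);
	scale = (delta_aperf << (2 * SCHED_CAPACITY_SHIFT)) /
		(delta_mperf * max_ratio);
	return scale > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : scale;
}
```

With these made-up numbers, a CPU running at ratio 30 is reported as fully scaled (1024) when freq_max is the 4C turbo level, but as 877 when freq_max is the 1C turbo level, which is the more conservative reading chosen for Atom.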
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 918229cdd5abb50d8a2edfcd8dc6b6bc53afd765
Gitweb: https://git.kernel.org/tip/918229cdd5abb50d8a2edfcd8dc6b6bc53afd765
Author: Giovanni Gherdovich <[email protected]>
AuthorDate: Wed, 22 Jan 2020 16:16:17 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 28 Jan 2020 21:37:06 +01:00
x86/intel_pstate: Handle runtime turbo disablement/enablement in frequency invariance
On some platforms, such as the Dell XPS 13 laptop, the firmware disables turbo
when the machine is disconnected from AC, and vice versa enables it again
when it's reconnected. In these cases a _PPC ACPI notification is issued.
The scheduler needs to know freq_max for frequency-invariant calculations.
To account for turbo availability coming and going, record freq_max at boot as
if turbo were available and store it in a helper variable. Use a setter
function to swap between freq_base and freq_max every time turbo goes off or on.
Signed-off-by: Giovanni Gherdovich <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/topology.h | 5 +++++
arch/x86/kernel/smpboot.c | 15 ++++++++++-----
drivers/cpufreq/intel_pstate.c | 1 +
3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 2ebf7b7..79d8d54 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -211,6 +211,11 @@ static inline long arch_scale_freq_capacity(int cpu)
extern void arch_scale_freq_tick(void);
#define arch_scale_freq_tick arch_scale_freq_tick
+extern void arch_set_max_freq_ratio(bool turbo_disabled);
+#else
+static inline void arch_set_max_freq_ratio(bool turbo_disabled)
+{
+}
#endif
#endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5f04bf8..467191e 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1807,8 +1807,15 @@ DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
static DEFINE_PER_CPU(u64, arch_prev_aperf);
static DEFINE_PER_CPU(u64, arch_prev_mperf);
+static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
+void arch_set_max_freq_ratio(bool turbo_disabled)
+{
+ arch_max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE :
+ arch_turbo_freq_ratio;
+}
+
static bool turbo_disabled(void)
{
u64 misc_en;
@@ -1956,10 +1963,7 @@ static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
static bool intel_set_max_freq_ratio(void)
{
- u64 base_freq = 1, turbo_freq = 1;
-
- if (turbo_disabled())
- goto out;
+ u64 base_freq, turbo_freq;
if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
goto out;
@@ -1981,8 +1985,9 @@ static bool intel_set_max_freq_ratio(void)
return false;
out:
- arch_max_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
+ arch_turbo_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
base_freq);
+ arch_set_max_freq_ratio(turbo_disabled());
return true;
}
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index d2fa3e9..abbeeca 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -922,6 +922,7 @@ static void intel_pstate_update_limits(unsigned int cpu)
*/
if (global.turbo_disabled_mf != global.turbo_disabled) {
global.turbo_disabled_mf = global.turbo_disabled;
+ arch_set_max_freq_ratio(global.turbo_disabled);
for_each_possible_cpu(cpu)
intel_pstate_update_max_freq(cpu);
} else {
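The boot-time recording and the runtime swap described in this commit can be modeled in user space as follows. This is only a sketch: the model_* names and the sample ratios are hypothetical, and the real code lives in arch/x86/kernel/smpboot.c as shown in the diff above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE 1024ULL

/* Ratio recorded once at boot, as if turbo were available. */
static uint64_t model_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
/* Ratio actually used for the invariance computation. */
static uint64_t model_max_freq_ratio = SCHED_CAPACITY_SCALE;

/* Setter: pick the base ratio (1024) or the recorded turbo ratio. */
void model_set_max_freq_ratio(bool turbo_disabled)
{
	model_max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE
					      : model_turbo_freq_ratio;
}

/* Boot path: always record the turbo ratio, then apply current state. */
void model_record_turbo_ratio(uint64_t base, uint64_t turbo,
			      bool turbo_disabled_now)
{
	assert(base != 0);
	model_turbo_freq_ratio = turbo * SCHED_CAPACITY_SCALE / base;
	model_set_max_freq_ratio(turbo_disabled_now);
}
```

Because the turbo ratio is kept in its own variable, a later notification only needs to call the setter; nothing has to be re-read from the MSRs.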
Quoting tip-bot2 for Giovanni Gherdovich (2020-01-29 11:32:58)
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: 1567c3e3467cddeb019a7b53ec632f834b6a9239
> Gitweb: https://git.kernel.org/tip/1567c3e3467cddeb019a7b53ec632f834b6a9239
> Author: Giovanni Gherdovich <[email protected]>
> AuthorDate: Wed, 22 Jan 2020 16:16:12 +01:00
> Committer: Ingo Molnar <[email protected]>
> CommitterDate: Tue, 28 Jan 2020 21:36:59 +01:00
>
> x86, sched: Add support for frequency invariance
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 69881b2..28696bc 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -147,6 +147,8 @@ static inline void smpboot_restore_warm_reset_vector(void)
> *((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0;
> }
>
> +static void init_freq_invariance(void);
> +
> /*
> * Report back to the Boot Processor during boot time or to the caller processor
> * during CPU online.
> @@ -183,6 +185,8 @@ static void smp_callin(void)
> */
> set_cpu_sibling_map(raw_smp_processor_id());
>
> + init_freq_invariance();
> +
> /*
> * Get our bogomips.
> * Update loops_per_jiffy in cpu_data. Previous call to
> @@ -1337,7 +1341,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
> set_sched_topology(x86_topology);
>
> set_cpu_sibling_map(0);
> -
> + init_freq_invariance();
> smp_sanity_check();
>
> switch (apic_intr_mode) {
Since this has become visible via linux-next [20200326?], we have been
deluged by oops during cpu-hotplug.
<6> [184.949219] [IGT] perf_pmu: starting subtest cpu-hotplug
<4> [185.092279] IRQ 24: no longer affine to CPU0
<4> [185.092285] IRQ 25: no longer affine to CPU0
<6> [185.093709] smpboot: CPU 0 is now offline
<6> [186.107062] smpboot: Booting Node 0 Processor 0 APIC 0x0
<3> [186.107643] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49
<3> [186.107648] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/0
<4> [186.107650] no locks held by swapper/0/0.
<4> [186.107652] irq event stamp: 6424624
<4> [186.107658] hardirqs last enabled at (6424623): [<ffffffff951744bf>] tick_nohz_idle_enter+0x5f/0x90
<4> [186.107664] hardirqs last disabled at (6424624): [<ffffffff950fa1e2>] do_idle+0x82/0x260
<4> [186.107669] softirqs last enabled at (6424590): [<ffffffff95e00395>] __do_softirq+0x395/0x49e
<4> [186.107674] softirqs last disabled at (6424571): [<ffffffff950c195a>] irq_exit+0xba/0xc0
<3> [186.107677] Preemption disabled at:
<4> [186.107681] [<ffffffff9504843b>] start_secondary+0x4b/0x1b0
<4> [186.107685] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G U 5.6.0-rc7-next-20200327-g975f7a88c64d-next-20200327 #1
<4> [186.107687] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
<4> [186.107688] Call Trace:
<4> [186.107695] dump_stack+0x71/0x9b
<4> [186.107702] ___might_sleep+0x178/0x260
<4> [186.107708] cpus_read_lock+0x13/0xd0
<4> [186.107713] static_key_enable+0x9/0x20
<4> [186.107717] init_freq_invariance+0x1f0/0x3a0
<4> [186.107724] start_secondary+0x71/0x1b0
<4> [186.107729] secondary_startup_64+0xb6/0xc0
<3> [186.107756] BUG: scheduling while atomic: swapper/0/0/0x00000002
<4> [186.107763] 1 lock held by swapper/0/0:
<4> [186.107767] #0: ffffffff9643e510 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0x9/0x20
<4> [186.107775] Modules linked in: vgem snd_hda_codec_hdmi mei_hdcp i915 x86_pkg_temp_thermal coretemp snd_hda_codec_realtek snd_hda_codec_generic crct10dif_pclmul snd_hda_intel crc32_pclmul snd_intel_dspcfg ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm mei_me mei realtek prime_numbers
<3> [186.107797] Preemption disabled at:
<4> [186.107800] [<ffffffff9504843b>] start_secondary+0x4b/0x1b0
<4> [186.107803] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G U W 5.6.0-rc7-next-20200327-g975f7a88c64d-next-20200327 #1
<4> [186.107805] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
<4> [186.107807] Call Trace:
<4> [186.107811] dump_stack+0x71/0x9b
<4> [186.107815] ? start_secondary+0x4b/0x1b0
<4> [186.107819] __schedule_bug+0x7b/0xd0
<4> [186.107825] __schedule+0x776/0x810
<4> [186.107832] ? mark_held_locks+0x49/0x70
<4> [186.107839] schedule+0x37/0xe0
<4> [186.107843] ? percpu_rwsem_wait+0x117/0x180
<4> [186.107846] percpu_rwsem_wait+0x117/0x180
<4> [186.107851] ? percpu_down_write+0x140/0x140
<4> [186.107859] __percpu_down_read+0x43/0x60
<4> [186.107864] cpus_read_lock+0xc6/0xd0
<4> [186.107867] static_key_enable+0x9/0x20
<4> [186.107871] init_freq_invariance+0x1f0/0x3a0
<4> [186.107878] start_secondary+0x71/0x1b0
<4> [186.107883] secondary_startup_64+0xb6/0xc0
<4> [186.107900] ------------[ cut here ]------------
<4> [186.107901] releasing a pinned lock
<4> [186.107908] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4640 lock_release+0x2a2/0x2c0
<4> [186.107909] Modules linked in: vgem snd_hda_codec_hdmi mei_hdcp i915 x86_pkg_temp_thermal coretemp snd_hda_codec_realtek snd_hda_codec_generic crct10dif_pclmul snd_hda_intel crc32_pclmul snd_intel_dspcfg ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm mei_me mei realtek prime_numbers
<4> [186.107924] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G U W 5.6.0-rc7-next-20200327-g975f7a88c64d-next-20200327 #1
<4> [186.107926] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
<4> [186.107928] RIP: 0010:lock_release+0x2a2/0x2c0
<4> [186.107931] Code: be 3f 00 00 00 48 c7 c7 51 a7 2b 96 c6 05 68 6d 41 01 01 e8 40 8e ff ff eb ae 48 c7 c7 51 a8 2b 96 48 89 04 24 e8 be 0b f9 ff <0f> 0b 48 8b 04 24 e9 22 fe ff ff e8 4e 0e f9 ff 0f 1f 40 00 66 2e
<4> [186.107933] RSP: 0018:ffffffff96403d88 EFLAGS: 00010086
<4> [186.107936] RAX: 0000000000000000 RBX: ffffffff964188c0 RCX: 0000000000000003
<4> [186.107937] RDX: 0000000080000003 RSI: ffffffff95138419 RDI: 00000000ffffffff
<4> [186.107939] RBP: ffffa1290f83bc58 R08: 0000000000000001 R09: 0000000000000001
<4> [186.107940] R10: ffffffff96403dc0 R11: 0000000000077cc4 R12: 0000000000000046
<4> [186.107944] R13: ffffffff950fa029 R14: 0000000000000002 R15: 00000000d3a98c93
<4> [186.107947] FS: 0000000000000000(0000) GS:ffffa1290f800000(0000) knlGS:0000000000000000
<4> [186.107948] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [186.107950] CR2: 00007f307e92fb24 CR3: 0000000287410001 CR4: 00000000001606f0
<4> [186.107951] Call Trace:
<4> [186.107956] _raw_spin_unlock_irq+0x12/0x40
<4> [186.107959] dequeue_task_idle+0x9/0x30
<4> [186.107962] __schedule+0x3cc/0x810
<4> [186.107965] schedule+0x37/0xe0
<4> [186.107969] ? percpu_rwsem_wait+0x117/0x180
<4> [186.107972] percpu_rwsem_wait+0x117/0x180
<4> [186.107975] ? percpu_down_write+0x140/0x140
<4> [186.107978] __percpu_down_read+0x43/0x60
<4> [186.107981] cpus_read_lock+0xc6/0xd0
<4> [186.107984] static_key_enable+0x9/0x20
<4> [186.107988] init_freq_invariance+0x1f0/0x3a0
<4> [186.107991] start_secondary+0x71/0x1b0
<4> [186.107993] secondary_startup_64+0xb6/0xc0
<4> [186.107997] irq event stamp: 6424720
<4> [186.108001] hardirqs last enabled at (6424719): [<ffffffff95a5f298>] dump_stack+0x93/0x9b
<4> [186.108004] hardirqs last disabled at (6424720): [<ffffffff95a781f4>] __schedule+0xc4/0x810
<4> [186.108007] softirqs last enabled at (6424590): [<ffffffff95e00395>] __do_softirq+0x395/0x49e
<4> [186.108011] softirqs last disabled at (6424571): [<ffffffff950c195a>] irq_exit+0xba/0xc0
<4> [186.108013] ---[ end trace 8ae0a00b9ac91c9b ]---
<3> [186.108015] bad: scheduling from the idle thread!
<4> [186.108018] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G U W 5.6.0-rc7-next-20200327-g975f7a88c64d-next-20200327 #1
<4> [186.108020] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
<4> [186.108022] Call Trace:
<4> [186.108028] dump_stack+0x71/0x9b
<4> [186.108033] dequeue_task_idle+0x1a/0x30
<4> [186.108036] __schedule+0x3cc/0x810
<4> [186.108047] schedule+0x37/0xe0
<4> [186.108050] ? percpu_rwsem_wait+0x117/0x180
<4> [186.108053] percpu_rwsem_wait+0x117/0x180
<4> [186.108059] ? percpu_down_write+0x140/0x140
<4> [186.108067] __percpu_down_read+0x43/0x60
<4> [186.108072] cpus_read_lock+0xc6/0xd0
<4> [186.108077] static_key_enable+0x9/0x20
<4> [186.108081] init_freq_invariance+0x1f0/0x3a0
<4> [186.108089] start_secondary+0x71/0x1b0
<4> [186.108094] secondary_startup_64+0xb6/0xc0
<4> [186.108112] ------------[ cut here ]------------
<4> [186.108112] unpinning an unpinned lock
<4> [186.108117] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4768 lock_unpin_lock+0x11e/0x130
<4> [186.108119] Modules linked in: vgem snd_hda_codec_hdmi mei_hdcp i915 x86_pkg_temp_thermal coretemp snd_hda_codec_realtek snd_hda_codec_generic crct10dif_pclmul snd_hda_intel crc32_pclmul snd_intel_dspcfg ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm mei_me mei realtek prime_numbers
<4> [186.108133] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G U W 5.6.0-rc7-next-20200327-g975f7a88c64d-next-20200327 #1
<4> [186.108134] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
<4> [186.108136] RIP: 0010:lock_unpin_lock+0x11e/0x130
<4> [186.108138] Code: 96 e8 86 01 f9 ff 0f 0b e9 37 ff ff ff 0f 0b c7 82 bc 08 00 00 00 00 00 00 e9 3c ff ff ff 48 c7 c7 90 a8 2b 96 e8 62 01 f9 ff <0f> 0b e9 13 ff ff ff 90 66 2e 0f 1f 84 00 00 00 00 00 44 8b 0d 4d
<4> [186.108139] RSP: 0018:ffffffff96403db0 EFLAGS: 00010086
<4> [186.108140] RAX: 0000000000000000 RBX: ffffffff964191a8 RCX: 0000000000000003
<4> [186.108141] RDX: 0000000080000003 RSI: ffffffff95138419 RDI: 00000000ffffffff
<4> [186.108142] RBP: ffffffff964188c0 R08: 0000000000000001 R09: 0000000000000001
<4> [186.108143] R10: 00000000e6f5d832 R11: 0000000000078b54 R12: ffffa1290f83bc58
<4> [186.108144] R13: ffffffff96419180 R14: 0000000000000046 R15: 0000000000000001
<4> [186.108145] FS: 0000000000000000(0000) GS:ffffa1290f800000(0000) knlGS:0000000000000000
<4> [186.108146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [186.108147] CR2: 00007f307e92fb24 CR3: 0000000287410001 CR4: 00000000001606f0
<4> [186.108147] Call Trace:
<4> [186.108151] __schedule+0x747/0x810
<4> [186.108154] schedule+0x37/0xe0
<4> [186.108156] ? percpu_rwsem_wait+0x117/0x180
<4> [186.108157] percpu_rwsem_wait+0x117/0x180
<4> [186.108160] ? percpu_down_write+0x140/0x140
<4> [186.108162] __percpu_down_read+0x43/0x60
<4> [186.108165] cpus_read_lock+0xc6/0xd0
<4> [186.108167] static_key_enable+0x9/0x20
<4> [186.108171] init_freq_invariance+0x1f0/0x3a0
<4> [186.108173] start_secondary+0x71/0x1b0
<4> [186.108175] secondary_startup_64+0xb6/0xc0
<4> [186.108178] irq event stamp: 6424730
<4> [186.108181] hardirqs last enabled at (6424729): [<ffffffff95a5f298>] dump_stack+0x93/0x9b
<4> [186.108183] hardirqs last disabled at (6424730): [<ffffffff95a7f32a>] _raw_spin_lock_irq+0xa/0x40
<4> [186.108185] softirqs last enabled at (6424590): [<ffffffff95e00395>] __do_softirq+0x395/0x49e
<4> [186.108188] softirqs last disabled at (6424571): [<ffffffff950c195a>] irq_exit+0xba/0xc0
<4> [186.108188] ---[ end trace 8ae0a00b9ac91c9c ]---
<4> [186.108191] ------------[ cut here ]------------
<4> [186.108191] releasing a pinned lock
<4> [186.108196] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4640 lock_release+0x2a2/0x2c0
<4> [186.108196] Modules linked in: vgem snd_hda_codec_hdmi mei_hdcp i915 x86_pkg_temp_thermal coretemp snd_hda_codec_realtek snd_hda_codec_generic crct10dif_pclmul snd_hda_intel crc32_pclmul snd_intel_dspcfg ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm mei_me mei realtek prime_numbers
<4> [186.108204] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G U W 5.6.0-rc7-next-20200327-g975f7a88c64d-next-20200327 #1
<4> [186.108205] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
<4> [186.108207] RIP: 0010:lock_release+0x2a2/0x2c0
<4> [186.108208] Code: be 3f 00 00 00 48 c7 c7 51 a7 2b 96 c6 05 68 6d 41 01 01 e8 40 8e ff ff eb ae 48 c7 c7 51 a8 2b 96 48 89 04 24 e8 be 0b f9 ff <0f> 0b 48 8b 04 24 e9 22 fe ff ff e8 4e 0e f9 ff 0f 1f 40 00 66 2e
<4> [186.108209] RSP: 0018:ffffffff96403d88 EFLAGS: 00010086
<4> [186.108210] RAX: 0000000000000000 RBX: ffffffff964188c0 RCX: 0000000000000003
<4> [186.108211] RDX: 0000000080000003 RSI: ffffffff95138419 RDI: 00000000ffffffff
<4> [186.108212] RBP: ffffa1290f83bc58 R08: 0000000000000001 R09: 0000000000000001
<4> [186.108213] R10: ffffffff96403dc0 R11: 00000000000795ac R12: 0000000000000046
<4> [186.108214] R13: ffffffff950fa029 R14: 0000000000000002 R15: 00000000d3a98c93
<4> [186.108215] FS: 0000000000000000(0000) GS:ffffa1290f800000(0000) knlGS:0000000000000000
<4> [186.108216] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [186.108217] CR2: 00007f307e92fb24 CR3: 0000000287410001 CR4: 00000000001606f0
<4> [186.108218] Call Trace:
<4> [186.108221] _raw_spin_unlock_irq+0x12/0x40
<4> [186.108225] dequeue_task_idle+0x9/0x30
<4> [186.108227] __schedule+0x3cc/0x810
<4> [186.108230] schedule+0x37/0xe0
<4> [186.108232] ? percpu_rwsem_wait+0x117/0x180
<4> [186.108233] percpu_rwsem_wait+0x117/0x180
<4> [186.108235] ? percpu_down_write+0x140/0x140
<4> [186.108238] __percpu_down_read+0x43/0x60
<4> [186.108240] cpus_read_lock+0xc6/0xd0
<4> [186.108242] static_key_enable+0x9/0x20
<4> [186.108245] init_freq_invariance+0x1f0/0x3a0
<4> [186.108247] start_secondary+0x71/0x1b0
<4> [186.108249] secondary_startup_64+0xb6/0xc0
<4> [186.108252] irq event stamp: 6424732
<4> [186.108255] hardirqs last enabled at (6424731): [<ffffffff95a7f57f>] _raw_spin_unlock_irq+0x1f/0x40
<4> [186.108257] hardirqs last disabled at (6424732): [<ffffffff95a781f4>] __schedule+0xc4/0x810
<4> [186.108258] softirqs last enabled at (6424590): [<ffffffff95e00395>] __do_softirq+0x395/0x49e
<4> [186.108261] softirqs last disabled at (6424571): [<ffffffff950c195a>] irq_exit+0xba/0xc0
<4> [186.108262] ---[ end trace 8ae0a00b9ac91c9d ]---
<3> [186.108263] bad: scheduling from the idle thread!
<4> [186.108266] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G U W 5.6.0-rc7-next-20200327-g975f7a88c64d-next-20200327 #1
<4> [186.108268] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
<4> [186.108269] Call Trace:
<4> [186.108273] dump_stack+0x71/0x9b
<4> [186.108278] dequeue_task_idle+0x1a/0x30
<4> [186.108282] __schedule+0x3cc/0x810
<4> [186.108292] schedule+0x37/0xe0
<4> [186.108296] ? percpu_rwsem_wait+0x117/0x180
<4> [186.108298] percpu_rwsem_wait+0x117/0x180
<4> [186.108303] ? percpu_down_write+0x140/0x140
<4> [186.108311] __percpu_down_read+0x43/0x60
<4> [186.108316] cpus_read_lock+0xc6/0xd0
<4> [186.108319] static_key_enable+0x9/0x20
<4> [186.108323] init_freq_invariance+0x1f0/0x3a0
<4> [186.108330] start_secondary+0x71/0x1b0
<4> [186.108335] secondary_startup_64+0xb6/0xc0
<4> [186.108351] ------------[ cut here ]------------
repeating ad nauseam, e.g.
https://intel-gfx-ci.01.org/tree/linux-next/next-20200327/shard-hsw4/dmesg9.txt
Across all our test boxen.
-Chris
On Mon, Mar 30 2020, Chris Wilson wrote:
> <6> [184.949219] [IGT] perf_pmu: starting subtest cpu-hotplug
> <4> [185.092279] IRQ 24: no longer affine to CPU0
> <4> [185.092285] IRQ 25: no longer affine to CPU0
> <6> [185.093709] smpboot: CPU 0 is now offline
> <6> [186.107062] smpboot: Booting Node 0 Processor 0 APIC 0x0
> <3> [186.107643] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49
> <3> [186.107648] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/0
> <4> [186.107650] no locks held by swapper/0/0.
> <4> [186.107652] irq event stamp: 6424624
> <4> [186.107658] hardirqs last enabled at (6424623): [<ffffffff951744bf>] tick_nohz_idle_enter+0x5f/0x90
> <4> [186.107664] hardirqs last disabled at (6424624): [<ffffffff950fa1e2>] do_idle+0x82/0x260
> <4> [186.107669] softirqs last enabled at (6424590): [<ffffffff95e00395>] __do_softirq+0x395/0x49e
> <4> [186.107674] softirqs last disabled at (6424571): [<ffffffff950c195a>] irq_exit+0xba/0xc0
> <3> [186.107677] Preemption disabled at:
> <4> [186.107681] [<ffffffff9504843b>] start_secondary+0x4b/0x1b0
> <4> [186.107685] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G U 5.6.0-rc7-next-20200327-g975f7a88c64d-next-20200327 #1
> <4> [186.107687] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
> <4> [186.107688] Call Trace:
> <4> [186.107695] dump_stack+0x71/0x9b
> <4> [186.107702] ___might_sleep+0x178/0x260
> <4> [186.107708] cpus_read_lock+0x13/0xd0
> <4> [186.107713] static_key_enable+0x9/0x20
> <4> [186.107717] init_freq_invariance+0x1f0/0x3a0
> <4> [186.107724] start_secondary+0x71/0x1b0
> <4> [186.107729] secondary_startup_64+0xb6/0xc0
> <3> [186.107756] BUG: scheduling while atomic: swapper/0/0/0x00000002
> <4> [186.107763] 1 lock held by swapper/0/0:
> <4> [186.107767] #0: ffffffff9643e510 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0x9/0x20
> <4> [186.107775] Modules linked in: vgem snd_hda_codec_hdmi mei_hdcp i915 x86_pkg_temp_thermal coretemp snd_hda_codec_realtek snd_hda_codec_generic crct10dif_pclmul snd_hda_intel crc32_pclmul snd_intel_dspcfg ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich snd_pcm mei_me mei realtek prime_numbers
> <3> [186.107797] Preemption disabled at:
> <4> [186.107800] [<ffffffff9504843b>] start_secondary+0x4b/0x1b0
> <4> [186.107803] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G U W 5.6.0-rc7-next-20200327-g975f7a88c64d-next-20200327 #1
> <4> [186.107805] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
> <4> [186.107807] Call Trace:
> <4> [186.107811] dump_stack+0x71/0x9b
> <4> [186.107815] ? start_secondary+0x4b/0x1b0
> <4> [186.107819] __schedule_bug+0x7b/0xd0
> <4> [186.107825] __schedule+0x776/0x810
> <4> [186.107832] ? mark_held_locks+0x49/0x70
> <4> [186.107839] schedule+0x37/0xe0
> <4> [186.107843] ? percpu_rwsem_wait+0x117/0x180
> <4> [186.107846] percpu_rwsem_wait+0x117/0x180
> <4> [186.107851] ? percpu_down_write+0x140/0x140
> <4> [186.107859] __percpu_down_read+0x43/0x60
> <4> [186.107864] cpus_read_lock+0xc6/0xd0
> <4> [186.107867] static_key_enable+0x9/0x20
> <4> [186.107871] init_freq_invariance+0x1f0/0x3a0
> <4> [186.107878] start_secondary+0x71/0x1b0
> <4> [186.107883] secondary_startup_64+0xb6/0xc0
>
> repeating ad nauseam, e.g.
> https://intel-gfx-ci.01.org/tree/linux-next/next-20200327/shard-hsw4/dmesg9.txt
>
> Across all our test boxen.
> -Chris
AFAICT this should be valid (I'm afraid I can't easily test it,
however); init doesn't take the hp lock (doesn't need it) and post-boot
hotplug will call this via the hotplug state machine with the right lock
held.
---
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 467191e51196..7651b06a1036 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -2014,7 +2014,7 @@ static void init_freq_invariance(void)
if (ret) {
on_each_cpu(init_counter_refs, NULL, 1);
- static_branch_enable(&arch_scale_freq_key);
+ static_branch_enable_cpuslocked(&arch_scale_freq_key);
} else {
pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
}
On Mon, Mar 30, 2020 at 12:05:42PM +0100, Chris Wilson wrote:
> Quoting tip-bot2 for Giovanni Gherdovich (2020-01-29 11:32:58)
> > The following commit has been merged into the sched/core branch of tip:
> >
> > Commit-ID: 1567c3e3467cddeb019a7b53ec632f834b6a9239
> > Gitweb: https://git.kernel.org/tip/1567c3e3467cddeb019a7b53ec632f834b6a9239
> > Author: Giovanni Gherdovich <[email protected]>
> > AuthorDate: Wed, 22 Jan 2020 16:16:12 +01:00
> > Committer: Ingo Molnar <[email protected]>
> > CommitterDate: Tue, 28 Jan 2020 21:36:59 +01:00
> >
> > x86, sched: Add support for frequency invariance
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 69881b2..28696bc 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -147,6 +147,8 @@ static inline void smpboot_restore_warm_reset_vector(void)
> > *((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0;
> > }
> >
> > +static void init_freq_invariance(void);
> > +
> > /*
> > * Report back to the Boot Processor during boot time or to the caller processor
> > * during CPU online.
> > @@ -183,6 +185,8 @@ static void smp_callin(void)
> > */
> > set_cpu_sibling_map(raw_smp_processor_id());
> >
> > + init_freq_invariance();
> > +
> > /*
> > * Get our bogomips.
> > * Update loops_per_jiffy in cpu_data. Previous call to
> > @@ -1337,7 +1341,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
> > set_sched_topology(x86_topology);
> >
> > set_cpu_sibling_map(0);
> > -
> > + init_freq_invariance();
> > smp_sanity_check();
> >
> > switch (apic_intr_mode) {
>
> Since this has become visible via linux-next [20200326?], we have been
> deluged by oops during cpu-hotplug.
Ooh, you're doing CPU-0 hotplug, yuck!
I think something like the below ought to work; let me go see if I can
get that cpu-0 hotplug crud working on my machines.
---
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index fe3ab9632f3b..681f96f05619 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -147,7 +147,7 @@ static inline void smpboot_restore_warm_reset_vector(void)
*((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0;
}
-static void init_freq_invariance(void);
+static void init_freq_invariance(bool secondary);
/*
* Report back to the Boot Processor during boot time or to the caller processor
@@ -185,7 +185,7 @@ static void smp_callin(void)
*/
set_cpu_sibling_map(raw_smp_processor_id());
- init_freq_invariance();
+ init_freq_invariance(true);
/*
* Get our bogomips.
@@ -1341,7 +1341,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
set_sched_topology(x86_topology);
set_cpu_sibling_map(0);
- init_freq_invariance();
+ init_freq_invariance(false);
smp_sanity_check();
switch (apic_intr_mode) {
@@ -2002,13 +2002,20 @@ static void init_counter_refs(void *arg)
this_cpu_write(arch_prev_mperf, mperf);
}
-static void init_freq_invariance(void)
+static void init_freq_invariance(bool secondary)
{
bool ret = false;
- if (smp_processor_id() != 0 || !boot_cpu_has(X86_FEATURE_APERFMPERF))
+ if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
return;
+ if (secondary) {
+ if (static_branch_likely(&arch_scale_freq_key)) {
+ init_counter_refs(NULL);
+ }
+ return;
+ }
+
if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
ret = intel_set_max_freq_ratio();
On Mon, Mar 30 2020, Peter Zijlstra wrote:
> +static void init_freq_invariance(bool secondary)
> {
> bool ret = false;
>
> - if (smp_processor_id() != 0 || !boot_cpu_has(X86_FEATURE_APERFMPERF))
> + if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
> return;
>
> + if (secondary) {
> + if (static_branch_likely(&arch_scale_freq_key)) {
> + init_counter_refs(NULL);
> + }
> + return;
> + }
> +
Oh doh, that's an "enable once and for all" thing. That makes much more
sense; sorry for the noise.
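The "enable once and for all" split in Peter's patch quoted above can be illustrated with a small user-space model. Everything here is a hypothetical stand-in (a plain bool instead of the static key, explicit counter arguments instead of MSR reads); it only shows the control flow, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4

/* Stands in for the static key: set at most once, by the boot path. */
static bool model_freq_key_enabled;

/* Per-CPU reference values, in the spirit of arch_prev_{a,m}perf. */
static uint64_t model_prev_aperf[MODEL_NR_CPUS];
static uint64_t model_prev_mperf[MODEL_NR_CPUS];

/* Record the current counter values for one CPU. */
static void model_init_counter_refs(int cpu, uint64_t aperf, uint64_t mperf)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	model_prev_aperf[cpu] = aperf;
	model_prev_mperf[cpu] = mperf;
}

/* Boot path: runs once and decides whether the feature is on at all. */
void model_init_boot(bool has_aperfmperf, bool max_ratio_known,
		     uint64_t aperf, uint64_t mperf)
{
	if (!has_aperfmperf || !max_ratio_known)
		return;
	model_init_counter_refs(0, aperf, mperf);
	model_freq_key_enabled = true;
}

/*
 * Bring-up path for any CPU that comes online later, CPU 0 included.
 * It never touches the key; it only refreshes this CPU's references.
 * Returns true if the references were refreshed.
 */
bool model_init_secondary(int cpu, uint64_t aperf, uint64_t mperf)
{
	if (!model_freq_key_enabled)
		return false;
	model_init_counter_refs(cpu, aperf, mperf);
	return true;
}
```

In this model a re-onlined CPU 0 goes through the secondary path like any other CPU, so the key is never enabled from the atomic bring-up context.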
On Mon, 2020-03-30 at 12:05 +0100, Chris Wilson wrote:
> Quoting tip-bot2 for Giovanni Gherdovich (2020-01-29 11:32:58)
> > The following commit has been merged into the sched/core branch of tip:
> >
> > Commit-ID: 1567c3e3467cddeb019a7b53ec632f834b6a9239
> > Gitweb: https://git.kernel.org/tip/1567c3e3467cddeb019a7b53ec632f834b6a9239
> > Author: Giovanni Gherdovich <[email protected]>
> > AuthorDate: Wed, 22 Jan 2020 16:16:12 +01:00
> > Committer: Ingo Molnar <[email protected]>
> > CommitterDate: Tue, 28 Jan 2020 21:36:59 +01:00
> > [...]
>
> Since this has become visible via linux-next [20200326?], we have been
> deluged by oops during cpu-hotplug.
>
> <6> [184.949219] [IGT] perf_pmu: starting subtest cpu-hotplug
> <4> [185.092279] IRQ 24: no longer affine to CPU0
> <4> [185.092285] IRQ 25: no longer affine to CPU0
> <6> [185.093709] smpboot: CPU 0 is now offline
> <6> [186.107062] smpboot: Booting Node 0 Processor 0 APIC 0x0
> <3> [186.107643] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49
> <3> [186.107648] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/0
> [...]
>
> repeating ad nauseam, e.g.
> https://intel-gfx-ci.01.org/tree/linux-next/next-20200327/shard-hsw4/dmesg9.txt
>
> Across all our test boxen.
> -Chris
Hello Chris,
thank you for catching this problem and sorry for the mess.
Until your message I wasn't aware that CPU0 can be hotplugged, but now that I
check, the feature has been there since v3.8 :/
The code assumes cpu0 is always there and I need to fix that.
It seems your report comes from executing an automated test suite, can you
give me a link to the test sources and a hint on how to run it? I'd like to
reproduce locally so that I make sure I correctly address this problem.
Thanks,
Giovanni
Quoting Giovanni Gherdovich (2020-03-31 19:11:25)
> Hello Chris,
>
> thank you for catching this problem and sorry for the mess.
>
> Until your message I wasn't aware that CPU0 can be hotplugged, but now that I
> check, the feature has been there since v3.8 :/
>
> The code assumes cpu0 is always there and I need to fix that.
>
> It seems your report comes from executing an automated test suite, can you
> give me a link to the test sources and a hint on how to run it? I'd like to
> reproduce locally so that I make sure I correctly address this problem.
https://gitlab.freedesktop.org/drm/igt-gpu-tools/
It's an i915 test (so expects i915 running and root access to your
machine, with the intent of breaking your machine), but the cpu
hotplugging could be extracted
https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/perf_pmu.c#L1153
since it's basically doing:
i=0
while :; do
  # stop at the first CPU that has no "online" control file
  test -e /sys/devices/system/cpu/cpu$i/online || break
  echo 0 > /sys/devices/system/cpu/cpu$i/online
  sleep .1
  echo 1 > /sys/devices/system/cpu/cpu$i/online
  i=$((i + 1))
done
dmesg
Possibly run that under perf stat to keep a perf event open, or use
something else that hooks up the perf cpu hotplug callbacks.
Hope that helps,
-Chris