v1 (Compared to RFC/RFT v3)
- Addressed minor coding suggestions for intel_pstate
- Added the SKL desktop model used in some Xeons
Tested-by: Giovanni Gherdovich <[email protected]>
This series has an overall positive performance impact on IO, both on xfs and
ext4, and I'd be very happy if it lands in v4.18. You dropped the migration
optimization from v1 to v2 following the reviewers' suggestion; I'm looking
forward to testing that part too, so please add me to CC when you resend it.
I've tested your series on a single socket Xeon E3-1240 v5 (Skylake, 4 cores /
8 threads) with SSD storage. The platform is a Dell PowerEdge R230.
The benchmarks used are a mix of I/O intensive workloads on ext4 and xfs
(dbench4, sqlite, pgbench in read/write and read-only configurations, Flexible
IO aka FIO, etc) and scheduler stressers, just to check that everything is okay
in that department too (hackbench, pipetest, schbench, sockperf on localhost
both in "throughput" and "under-load" mode, netperf on localhost, etc). There
is also some HPC with the NAS Parallel Benchmark: when using openMPI as the IPC
mechanism it ends up being write-intensive, which makes it a good experiment,
even if HPC people aren't exactly the target audience for a frequency
governor.
The large improvements are in areas you already highlighted in your cover
letter (dbench4, sqlite, and pgbench read/write too, very impressive
honestly). Minor wins are also observed in sockperf and running the git unit
tests (gitsource below). The scheduler stressers end up, as expected, in the
"neutral" category, where you'll also find FIO (which, given the other
results, I'd have expected to improve at least a little). Marked "neutral" are
also those results where statistical significance wasn't reached (2 standard
deviations, which is roughly a 0.05 p-value) even if they showed some
difference in one direction or the other. In the "small losses" section I
found hackbench run with processes (not threads) and pipes (not sockets),
which I report for due diligence, but looking at the raw numbers it's more of
a mixed bag than a real loss; there is also the NAS high-performance computing
benchmark when it uses openMP (as opposed to openMPI) for IPC -- but again,
supercomputer people often run their machines at full speed all the time
anyway.
At the bottom of this message you'll find some directions if you want to run
some tests yourself using the same framework I used, MMTests from
https://github.com/gormanm/mmtests (we store a fair amount of benchmark
parametrizations up there).
Large wins:
- dbench4: +20% on ext4,
+14% on xfs (always async IO)
- sqlite (insert): +9% on both ext4 and xfs
- pgbench (read/write): +9% on ext4,
+10% on xfs
Moderate wins:
- sockperf (type: under-load, localhost): +1% with TCP,
+5% with UDP
- gitsource (git unit tests, shell intensive): +3% on ext4
- NAS Parallel Benchmark (HPC, using openMPI, on xfs): +1%
- tbench4 (network part of dbench4, localhost): +1%
Neutral:
- pgbench (read-only) on ext4 and xfs
- siege
- netperf (streaming and round-robin) with TCP and UDP
- hackbench (sockets/process, sockets/thread and pipes/thread)
- pipetest
- Linux kernel build
- schbench
- sockperf (type: throughput) with TCP and UDP
- git unit tests on xfs
- FIO (both random and seq. read, both random and seq. write)
on ext4 and xfs, async IO
Moderate losses:
- hackbench (pipes/process): -10%
- NAS Parallel Benchmark with openMP: -1%
Each benchmark is run with a variety of configuration parameters (e.g. number
of threads, number of clients, etc); to reach a final "score" the geometric
mean is used (with a few exceptions depending on the type of benchmark).
Detailed results follow. Amean, Hmean and Gmean are respectively arithmetic,
harmonic and geometric means.
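As an aside, here is a minimal sketch (illustrative only, not MMTests code;
the framework computes these in its own reporting scripts) of how the three
aggregates are obtained from a set of samples:

#include <math.h>
#include <stddef.h>

/* Illustrative helpers only; not taken from MMTests. */
static double amean(const double *x, size_t n)
{
        double s = 0.0;

        for (size_t i = 0; i < n; i++)
                s += x[i];
        return s / n;
}

static double hmean(const double *x, size_t n)
{
        double s = 0.0;

        for (size_t i = 0; i < n; i++)
                s += 1.0 / x[i];
        return n / s;           /* reciprocal of the mean of reciprocals */
}

static double gmean(const double *x, size_t n)
{
        double s = 0.0;

        for (size_t i = 0; i < n; i++)
                s += log(x[i]); /* summing logs avoids overflowing the product */
        return exp(s / n);
}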
For brevity I won't report all tables but only those for "large wins" and
"moderate losses". Note that I'm not overly worried about the hackbench-pipes
situation, as we've studied it in the past and determined that such a
configuration is particularly weak: time is mostly spent on contention and the
scheduler code path isn't exercised. See the comment in the file
configs/config-global-dhp__scheduler-unbound in MMTests for a brief
description of the issue.
DBENCH4
=======
NOTES: asynchronous IO; varies the number of clients up to NUMCPUS*8.
MMTESTS CONFIG: global-dhp__io-dbench4-async-{ext4, xfs}
MEASURES: latency (millisecs)
LOWER is better
EXT4
4.16.0 4.16.0
vanilla hwp-boost
Amean 1 28.49 ( 0.00%) 19.68 ( 30.92%)
Amean 2 26.70 ( 0.00%) 25.59 ( 4.14%)
Amean 4 54.59 ( 0.00%) 43.56 ( 20.20%)
Amean 8 91.19 ( 0.00%) 77.56 ( 14.96%)
Amean 64 538.09 ( 0.00%) 438.67 ( 18.48%)
Stddev 1 6.70 ( 0.00%) 3.24 ( 51.66%)
Stddev 2 4.35 ( 0.00%) 3.57 ( 17.85%)
Stddev 4 7.99 ( 0.00%) 7.24 ( 9.29%)
Stddev 8 17.51 ( 0.00%) 15.80 ( 9.78%)
Stddev 64 49.54 ( 0.00%) 46.98 ( 5.17%)
XFS
4.16.0 4.16.0
vanilla hwp-boost
Amean 1 21.88 ( 0.00%) 16.03 ( 26.75%)
Amean 2 19.72 ( 0.00%) 19.82 ( -0.50%)
Amean 4 37.55 ( 0.00%) 29.52 ( 21.38%)
Amean 8 56.73 ( 0.00%) 51.83 ( 8.63%)
Amean 64 808.80 ( 0.00%) 698.12 ( 13.68%)
Stddev 1 6.29 ( 0.00%) 2.33 ( 62.99%)
Stddev 2 3.12 ( 0.00%) 2.26 ( 27.73%)
Stddev 4 7.56 ( 0.00%) 5.88 ( 22.28%)
Stddev 8 14.15 ( 0.00%) 12.49 ( 11.71%)
Stddev 64 380.54 ( 0.00%) 367.88 ( 3.33%)
SQLITE
======
NOTES: SQL insert test on a table that will be 2M in size.
MMTESTS CONFIG: global-dhp__db-sqlite-insert-medium-{ext4, xfs}
MEASURES: transactions per second
HIGHER is better
EXT4
4.16.0 4.16.0
vanilla hwp-boost
Hmean Trans 2098.79 ( 0.00%) 2292.16 ( 9.21%)
Stddev Trans 78.79 ( 0.00%) 95.73 ( -21.50%)
XFS
4.16.0 4.16.0
vanilla hwp-boost
Hmean Trans 1890.27 ( 0.00%) 2058.62 ( 8.91%)
Stddev Trans 52.54 ( 0.00%) 29.56 ( 43.73%)
PGBENCH-RW
==========
NOTES: packaged with Postgres. Varies the number of threads up to NUMCPUS. The
workload is scaled so that its approximate size is 80% of the database
shared buffer, which itself is 20% of RAM. The page cache is not flushed
after the database is populated, so the test starts cache-hot.
MMTESTS CONFIG: global-dhp__db-pgbench-timed-rw-small-{ext4, xfs}
MEASURES: transactions per second
HIGHER is better
EXT4
4.16.0 4.16.0
vanilla hwp-boost
Hmean 1 2692.19 ( 0.00%) 2660.98 ( -1.16%)
Hmean 4 5218.93 ( 0.00%) 5610.10 ( 7.50%)
Hmean 7 7332.68 ( 0.00%) 8378.24 ( 14.26%)
Hmean 8 7462.03 ( 0.00%) 8713.36 ( 16.77%)
Stddev 1 231.85 ( 0.00%) 257.49 ( -11.06%)
Stddev 4 681.11 ( 0.00%) 312.64 ( 54.10%)
Stddev 7 1072.07 ( 0.00%) 730.29 ( 31.88%)
Stddev 8 1472.77 ( 0.00%) 1057.34 ( 28.21%)
XFS
4.16.0 4.16.0
vanilla hwp-boost
Hmean 1 2675.02 ( 0.00%) 2661.69 ( -0.50%)
Hmean 4 5049.45 ( 0.00%) 5601.45 ( 10.93%)
Hmean 7 7302.18 ( 0.00%) 8348.16 ( 14.32%)
Hmean 8 7596.83 ( 0.00%) 8693.29 ( 14.43%)
Stddev 1 225.41 ( 0.00%) 246.74 ( -9.46%)
Stddev 4 761.33 ( 0.00%) 334.77 ( 56.03%)
Stddev 7 1093.93 ( 0.00%) 811.30 ( 25.84%)
Stddev 8 1465.06 ( 0.00%) 1118.81 ( 23.63%)
HACKBENCH
=========
NOTES: Varies the number of groups between 1 and NUMCPUS*4
MMTESTS CONFIG: global-dhp__scheduler-unbound
MEASURES: time (seconds)
LOWER is better
4.16.0 4.16.0
vanilla hwp-boost
Amean 1 0.8350 ( 0.00%) 1.1577 ( -38.64%)
Amean 3 2.8367 ( 0.00%) 3.7457 ( -32.04%)
Amean 5 6.7503 ( 0.00%) 5.7977 ( 14.11%)
Amean 7 7.8290 ( 0.00%) 8.0343 ( -2.62%)
Amean 12 11.0560 ( 0.00%) 11.9673 ( -8.24%)
Amean 18 15.2603 ( 0.00%) 15.5247 ( -1.73%)
Amean 24 17.0283 ( 0.00%) 17.9047 ( -5.15%)
Amean 30 19.9193 ( 0.00%) 23.4670 ( -17.81%)
Amean 32 21.4637 ( 0.00%) 23.4097 ( -9.07%)
Stddev 1 0.0636 ( 0.00%) 0.0255 ( 59.93%)
Stddev 3 0.1188 ( 0.00%) 0.0235 ( 80.22%)
Stddev 5 0.0755 ( 0.00%) 0.1398 ( -85.13%)
Stddev 7 0.2778 ( 0.00%) 0.1634 ( 41.17%)
Stddev 12 0.5785 ( 0.00%) 0.1030 ( 82.19%)
Stddev 18 1.2099 ( 0.00%) 0.7986 ( 33.99%)
Stddev 24 0.2057 ( 0.00%) 0.7030 (-241.72%)
Stddev 30 1.1303 ( 0.00%) 0.7654 ( 32.28%)
Stddev 32 0.2032 ( 0.00%) 3.1626 (-1456.69%)
NAS PARALLEL BENCHMARK, C-CLASS (w/ openMP)
===========================================
NOTES: The various computational kernels are run separately; see
https://www.nas.nasa.gov/publications/npb.html for the list of tasks (IS =
Integer Sort, EP = Embarrassingly Parallel, etc)
MMTESTS CONFIG: global-dhp__nas-c-class-omp-full
MEASURES: time (seconds)
LOWER is better
4.16.0 4.16.0
vanilla hwp-boost
Amean bt.C 169.82 ( 0.00%) 170.54 ( -0.42%)
Stddev bt.C 1.07 ( 0.00%) 0.97 ( 9.34%)
Amean cg.C 41.81 ( 0.00%) 42.08 ( -0.65%)
Stddev cg.C 0.06 ( 0.00%) 0.03 ( 48.24%)
Amean ep.C 26.63 ( 0.00%) 26.47 ( 0.61%)
Stddev ep.C 0.37 ( 0.00%) 0.24 ( 35.35%)
Amean ft.C 38.17 ( 0.00%) 38.41 ( -0.64%)
Stddev ft.C 0.33 ( 0.00%) 0.32 ( 3.78%)
Amean is.C 1.49 ( 0.00%) 1.40 ( 6.02%)
Stddev is.C 0.20 ( 0.00%) 0.16 ( 19.40%)
Amean lu.C 217.46 ( 0.00%) 220.21 ( -1.26%)
Stddev lu.C 0.23 ( 0.00%) 0.22 ( 0.74%)
Amean mg.C 18.56 ( 0.00%) 18.80 ( -1.31%)
Stddev mg.C 0.01 ( 0.00%) 0.01 ( 22.54%)
Amean sp.C 293.25 ( 0.00%) 296.73 ( -1.19%)
Stddev sp.C 0.10 ( 0.00%) 0.06 ( 42.67%)
Amean ua.C 170.74 ( 0.00%) 172.02 ( -0.75%)
Stddev ua.C 0.28 ( 0.00%) 0.31 ( -12.89%)
HOW TO REPRODUCE
================
To install MMTests, clone the git repo at
https://github.com/gormanm/mmtests.git
To run a config (i.e. a set of benchmarks, such as
config-global-dhp__nas-c-class-omp-full), use the command
./run-mmtests.sh --config configs/$CONFIG $MNEMONIC-NAME
from the top-level directory; the benchmark source will be downloaded from its
canonical internet location, compiled and run.
To compare results from two runs, use
./bin/compare-mmtests.pl --directory ./work/log \
--benchmark $BENCHMARK-NAME \
--names $MNEMONIC-NAME-1,$MNEMONIC-NAME-2
from the top-level directory.
==================
From RFC Series:
v3
- Removed atomic bit operation as suggested.
- Added description of contention with user space.
- Removed the HWP cache and boost utility function patch and merged it with the
util callback patch. This way any value set is used somewhere.
Waiting for test results from Mel Gorman, who is the original reporter.
v2
This is a much simpler version than the previous one and only considers IO
boost, using the existing mechanism. There is no change in this series
beyond the intel_pstate driver.
Once PeterZ finishes his work on frequency invariance, I will revisit
thread migration optimization in HWP mode.
Other changes:
- Gradual boost instead of single step as suggested by PeterZ.
- Addressed cross CPU synchronization concerns identified by Rafael.
- Split the patch for HWP MSR value caching as suggested by PeterZ.
Not changed as suggested:
There is no architectural way to identify platforms with per-core
P-states, so the feature still has to be enabled based on the CPU model.
-----------
v1
This series tries to address some performance concerns, particularly with IO
workloads (reported by Mel Gorman), when HWP is used with the intel_pstate
powersave policy.
Background
HWP performance can be controlled by user space via the sysfs interface for
max/min frequency limits and energy performance preference settings. Based on
workload characteristics these can be adjusted from user space, but the limits
are not changed dynamically by the kernel based on the workload.
By default HWP uses an energy performance preference (EPP) value of 0x80 on
the majority of platforms (the scale is 0-255, 0 is max performance and 255 is
min). This value offers the best performance/watt, and for the majority of
server workloads performance doesn't suffer. Users also always have the option
to use the performance policy of intel_pstate to get the best performance, but
users tend to run with the out-of-the-box configuration, which is the
powersave policy on most distros.
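For reference, a rough sketch of the relevant IA32_HWP_REQUEST (MSR 0x774)
fields; the macro names below are invented for this illustration (the kernel
has its own equivalents), only the bit layout, which follows the Intel SDM,
matters here:

#include <stdint.h>

/* Illustrative only; names made up for this sketch. */
#define HWPREQ_MIN_PERF(v)      ((uint8_t)((v) & 0xff))          /* bits  7:0  */
#define HWPREQ_MAX_PERF(v)      ((uint8_t)(((v) >> 8) & 0xff))   /* bits 15:8  */
#define HWPREQ_DESIRED_PERF(v)  ((uint8_t)(((v) >> 16) & 0xff))  /* bits 23:16 */
#define HWPREQ_EPP(v)           ((uint8_t)(((v) >> 24) & 0xff))  /* bits 31:24 */

/* EPP scale: 0x00 = max performance bias, 0x80 = default, 0xff = max energy saving */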
In some cases it is possible to dynamically adjust performance, for example
when a CPU is woken up due to IO completion or a thread migrates to a new CPU.
In these cases the HWP algorithm will take some time to build up utilization
and ramp up P-states, which may result in lower performance for some IO
workloads and for workloads that tend to migrate. The idea of this patch
series is to temporarily boost performance dynamically in these cases. This
only applies when the user is using the powersave policy, not the performance
policy.
Results on a Skylake server:
Benchmark Improvement %
----------------------------------------------------------------------
dbench 50.36
thread IO bench (tiobench) 10.35
File IO 9.81
sqlite 15.76
X264 -104 cores 9.75
Spec Power (Negligible impact 7382 Vs. 7378)
Idle Power No change observed
-----------------------------------------------------------------------
HWP brings in the best performance/watt at EPP=0x80. Since we are boosting
EPP here to 0, the performance/watt drops by up to 10%, so there is a power
penalty with these changes.
Also, Mel Gorman provided test results on a prior patchset which show the
benefits of this series.
Srinivas Pandruvada (4):
cpufreq: intel_pstate: Add HWP boost utility and sched util hooks
cpufreq: intel_pstate: HWP boost performance on IO wakeup
cpufreq: intel_pstate: New sysfs entry to control HWP boost
cpufreq: intel_pstate: enable boost for Skylake Xeon
drivers/cpufreq/intel_pstate.c | 179 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 176 insertions(+), 3 deletions(-)
--
2.13.6
This change uses the SCHED_CPUFREQ_IOWAIT flag to boost HWP performance.
Since the SCHED_CPUFREQ_IOWAIT flag is set frequently, we don't start the
boosting steps unless we see two consecutive flags within two ticks. This
avoids boosting triggered by the IO of regular system activity.
To avoid synchronization issues, the actual processing of the flag is
done in the local CPU's callback.
Reported-by: Mel Gorman <[email protected]>
Tested-by: Giovanni Gherdovich <[email protected]>
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 3949e3861f55..5b2b6b6d1ff4 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -223,6 +223,8 @@ struct global_params {
* operation
* @hwp_req_cached: Cached value of the last HWP Request MSR
* @hwp_cap_cached: Cached value of the last HWP Capabilities MSR
+ * @last_io_update: Last time when IO wake flag was set
+ * @sched_flags: Store scheduler flags for possible cross CPU update
* @hwp_boost_min: Last HWP boosted min performance
*
* This structure stores per CPU instance data for all CPUs.
@@ -258,6 +260,8 @@ struct cpudata {
s16 epp_saved;
u64 hwp_req_cached;
u64 hwp_cap_cached;
+ u64 last_io_update;
+ unsigned int sched_flags;
u32 hwp_boost_min;
};
@@ -1460,9 +1464,44 @@ static inline void intel_pstate_hwp_boost_down(struct cpudata *cpu)
cpu->last_update = cpu->sample.time;
}
+static inline void intel_pstate_update_util_hwp_local(struct cpudata *cpu,
+ u64 time)
+{
+ cpu->sample.time = time;
+
+ if (cpu->sched_flags & SCHED_CPUFREQ_IOWAIT) {
+ bool do_io = false;
+
+ cpu->sched_flags = 0;
+ /*
+ * Set iowait_boost flag and update time. Since the IO WAIT flag
+ * is set all the time, we can't just conclude that some IO-bound
+ * activity is scheduled on this CPU from just one occurrence. If
+ * we receive at least two in two consecutive ticks, then we treat
+ * it as a boost candidate.
+ */
+ if (time_before64(time, cpu->last_io_update + 2 * TICK_NSEC))
+ do_io = true;
+
+ cpu->last_io_update = time;
+
+ if (do_io)
+ intel_pstate_hwp_boost_up(cpu);
+
+ } else {
+ intel_pstate_hwp_boost_down(cpu);
+ }
+}
+
static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
u64 time, unsigned int flags)
{
+ struct cpudata *cpu = container_of(data, struct cpudata, update_util);
+
+ cpu->sched_flags |= flags;
+
+ if (smp_processor_id() == cpu->cpu)
+ intel_pstate_update_util_hwp_local(cpu, time);
}
static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu)
--
2.13.6
Enable HWP boost on Skylake server and workstations.
Reported-by: Mel Gorman <[email protected]>
Tested-by: Giovanni Gherdovich <[email protected]>
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 70bf63bb4e0e..01c8da1f99db 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1794,6 +1794,12 @@ static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[] = {
{}
};
+static const struct x86_cpu_id intel_pstate_hwp_boost_ids[] __initconst = {
+ ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
+ ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
+ {}
+};
+
static int intel_pstate_init_cpu(unsigned int cpunum)
{
struct cpudata *cpu;
@@ -1824,6 +1830,10 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
intel_pstate_disable_ee(cpunum);
intel_pstate_hwp_enable(cpu);
+
+ id = x86_match_cpu(intel_pstate_hwp_boost_ids);
+ if (id)
+ hwp_boost = true;
}
intel_pstate_get_cpu_pstates(cpu);
--
2.13.6
Added two utility functions to gradually boost HWP up and to boost it down to
the default cached HWP request values.
Boost up:
Boost up updates the HWP request minimum value in steps. This minimum value
can reach up to the HWP request maximum value, depending on how frequently
this boost up function is called. At most, boost up will take three steps to
reach the maximum, depending on the current HWP request levels and HWP
capabilities. For example, if the current settings are:
If P0 (Turbo max) = P1 (Guaranteed max) = min
No boost at all.
If P0 (Turbo max) > P1 (Guaranteed max) = min
Should result in one level boost only for P0.
If P0 (Turbo max) = P1 (Guaranteed max) > min
Should result in two level boost:
(min + p1)/2 and P1.
If P0 (Turbo max) > P1 (Guaranteed max) > min
Should result in three level boost:
(min + p1)/2, P1 and P0.
We don't set any level between P0 and P1 as there is no guarantee that
they will be honored.
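As an illustration only (this is not driver code and the perf values are made
up), the ladder for a hypothetical configuration with min = 10, P1
(guaranteed) = 35 and P0 (turbo max) = 40 would be:

#include <stdio.h>

/* Standalone sketch of the boost ladder with hypothetical perf values. */
int main(void)
{
        unsigned int min = 10, p1 = 35, p0 = 40;
        unsigned int boost;

        /* 1st boost up call: halfway between the default min and P1 */
        boost = (min + p1) / 2;                 /* 22 */
        printf("boost step 1: %u\n", boost);

        /* 2nd call: the guaranteed level P1 */
        boost = p1;                             /* 35 */
        printf("boost step 2: %u\n", boost);

        /* 3rd call: turbo max P0, only taken when P0 > P1 */
        if (p0 > p1) {
                boost = p0;                     /* 40 */
                printf("boost step 3: %u\n", boost);
        }
        return 0;
}

Boost down (described next) then restores the cached default minimum (10 here)
once the hold time expires.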
Boost down:
After the system has been idle for the hold time of 3 ms, the HWP request is
reset to the default value from HWP init or to the user-modified one set via
sysfs.
Caching of HWP Request and Capabilities
Store the HWP request value last written to MSR_HWP_REQUEST and the value
read from MSR_HWP_CAPABILITIES. This avoids reading the MSRs in the boost
utility functions.
The limits calculated by these boost utility functions are based on the latest
HWP request value, which can be modified by the setpolicy() callback. So if
user space modifies the minimum perf value, that will be accounted for every
time boost up is called. There can be cases of contention with the user
modified minimum perf; in such a case the user value takes precedence. For
example, just before the HWP_REQUEST MSR is updated from the setpolicy()
callback, the boost up function is called via the scheduler tick callback.
Here the cached MSR value is already the latest and the limits are updated
based on the latest user limits, but on return the MSR write issued from the
setpolicy() callback will update the HWP_REQUEST value. That value will then
be used until the next time the boost up function is called.
In addition, add a variable to control HWP dynamic boosting. When HWP
dynamic boost is active, set the HWP-specific update util hook. The
contents of the utility hooks will be filled in by the subsequent patches.
Reported-by: Mel Gorman <[email protected]>
Tested-by: Giovanni Gherdovich <[email protected]>
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 100 +++++++++++++++++++++++++++++++++++++++--
1 file changed, 97 insertions(+), 3 deletions(-)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 08960a55eb27..3949e3861f55 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -221,6 +221,9 @@ struct global_params {
* preference/bias
* @epp_saved: Saved EPP/EPB during system suspend or CPU offline
* operation
+ * @hwp_req_cached: Cached value of the last HWP Request MSR
+ * @hwp_cap_cached: Cached value of the last HWP Capabilities MSR
+ * @hwp_boost_min: Last HWP boosted min performance
*
* This structure stores per CPU instance data for all CPUs.
*/
@@ -253,6 +256,9 @@ struct cpudata {
s16 epp_policy;
s16 epp_default;
s16 epp_saved;
+ u64 hwp_req_cached;
+ u64 hwp_cap_cached;
+ u32 hwp_boost_min;
};
static struct cpudata **all_cpu_data;
@@ -285,6 +291,7 @@ static struct pstate_funcs pstate_funcs __read_mostly;
static int hwp_active __read_mostly;
static bool per_cpu_limits __read_mostly;
+static bool hwp_boost __read_mostly;
static struct cpufreq_driver *intel_pstate_driver __read_mostly;
@@ -689,6 +696,7 @@ static void intel_pstate_get_hwp_max(unsigned int cpu, int *phy_max,
u64 cap;
rdmsrl_on_cpu(cpu, MSR_HWP_CAPABILITIES, &cap);
+ WRITE_ONCE(all_cpu_data[cpu]->hwp_cap_cached, cap);
if (global.no_turbo)
*current_max = HWP_GUARANTEED_PERF(cap);
else
@@ -763,6 +771,7 @@ static void intel_pstate_hwp_set(unsigned int cpu)
intel_pstate_set_epb(cpu, epp);
}
skip_epp:
+ WRITE_ONCE(cpu_data->hwp_req_cached, value);
wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
}
@@ -1381,6 +1390,81 @@ static void intel_pstate_get_cpu_pstates(struct cpudata *cpu)
intel_pstate_set_min_pstate(cpu);
}
+/*
+ * Long hold time will keep high perf limits for long time,
+ * which negatively impacts perf/watt for some workloads,
+ * like specpower. 3ms is based on experiments on some
+ * workloads.
+ */
+static int hwp_boost_hold_time_ns = 3 * NSEC_PER_MSEC;
+
+static inline void intel_pstate_hwp_boost_up(struct cpudata *cpu)
+{
+ u64 hwp_req = READ_ONCE(cpu->hwp_req_cached);
+ u32 max_limit = (hwp_req & 0xff00) >> 8;
+ u32 min_limit = (hwp_req & 0xff);
+ u32 boost_level1;
+
+ /*
+ * Cases to consider (User changes via sysfs or boot time):
+ * If, P0 (Turbo max) = P1 (Guaranteed max) = min:
+ * No boost, return.
+ * If, P0 (Turbo max) > P1 (Guaranteed max) = min:
+ * Should result in one level boost only for P0.
+ * If, P0 (Turbo max) = P1 (Guaranteed max) > min:
+ * Should result in two level boost:
+ * (min + p1)/2 and P1.
+ * If, P0 (Turbo max) > P1 (Guaranteed max) > min:
+ * Should result in three level boost:
+ * (min + p1)/2, P1 and P0.
+ */
+
+ /* If max and min are equal or already at max, nothing to boost */
+ if (max_limit == min_limit || cpu->hwp_boost_min >= max_limit)
+ return;
+
+ if (!cpu->hwp_boost_min)
+ cpu->hwp_boost_min = min_limit;
+
+ /* level at the halfway mark between min and guaranteed */
+ boost_level1 = (HWP_GUARANTEED_PERF(cpu->hwp_cap_cached) + min_limit) >> 1;
+
+ if (cpu->hwp_boost_min < boost_level1)
+ cpu->hwp_boost_min = boost_level1;
+ else if (cpu->hwp_boost_min < HWP_GUARANTEED_PERF(cpu->hwp_cap_cached))
+ cpu->hwp_boost_min = HWP_GUARANTEED_PERF(cpu->hwp_cap_cached);
+ else if (cpu->hwp_boost_min == HWP_GUARANTEED_PERF(cpu->hwp_cap_cached) &&
+ max_limit != HWP_GUARANTEED_PERF(cpu->hwp_cap_cached))
+ cpu->hwp_boost_min = max_limit;
+ else
+ return;
+
+ hwp_req = (hwp_req & ~GENMASK_ULL(7, 0)) | cpu->hwp_boost_min;
+ wrmsrl(MSR_HWP_REQUEST, hwp_req);
+ cpu->last_update = cpu->sample.time;
+}
+
+static inline void intel_pstate_hwp_boost_down(struct cpudata *cpu)
+{
+ if (cpu->hwp_boost_min) {
+ bool expired;
+
+ /* Check if we are idle for hold time to boost down */
+ expired = time_after64(cpu->sample.time, cpu->last_update +
+ hwp_boost_hold_time_ns);
+ if (expired) {
+ wrmsrl(MSR_HWP_REQUEST, cpu->hwp_req_cached);
+ cpu->hwp_boost_min = 0;
+ }
+ }
+ cpu->last_update = cpu->sample.time;
+}
+
+static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
+ u64 time, unsigned int flags)
+{
+}
+
static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu)
{
struct sample *sample = &cpu->sample;
@@ -1684,7 +1768,7 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
{
struct cpudata *cpu = all_cpu_data[cpu_num];
- if (hwp_active)
+ if (hwp_active && !hwp_boost)
return;
if (cpu->update_util_set)
@@ -1693,7 +1777,9 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
/* Prevent intel_pstate_update_util() from using stale data. */
cpu->sample.time = 0;
cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
- intel_pstate_update_util);
+ (hwp_active ?
+ intel_pstate_update_util_hwp :
+ intel_pstate_update_util));
cpu->update_util_set = true;
}
@@ -1805,8 +1891,16 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
intel_pstate_set_update_util_hook(policy->cpu);
}
- if (hwp_active)
+ if (hwp_active) {
+ /*
+ * When hwp_boost was active before and dynamically it
+ * was turned off, in that case we need to clear the
+ * update util hook.
+ */
+ if (!hwp_boost)
+ intel_pstate_clear_update_util_hook(policy->cpu);
intel_pstate_hwp_set(policy->cpu);
+ }
mutex_unlock(&intel_pstate_limits_lock);
--
2.13.6
A new attribute is added to the intel_pstate sysfs interface to
enable/disable HWP dynamic performance boost.
Reported-by: Mel Gorman <[email protected]>
Tested-by: Giovanni Gherdovich <[email protected]>
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 5b2b6b6d1ff4..70bf63bb4e0e 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1033,6 +1033,30 @@ static ssize_t store_min_perf_pct(struct kobject *a, struct attribute *b,
return count;
}
+static ssize_t show_hwp_dynamic_boost(struct kobject *kobj,
+ struct attribute *attr, char *buf)
+{
+ return sprintf(buf, "%u\n", hwp_boost);
+}
+
+static ssize_t store_hwp_dynamic_boost(struct kobject *a, struct attribute *b,
+ const char *buf, size_t count)
+{
+ unsigned int input;
+ int ret;
+
+ ret = kstrtouint(buf, 10, &input);
+ if (ret)
+ return ret;
+
+ mutex_lock(&intel_pstate_driver_lock);
+ hwp_boost = !!input;
+ intel_pstate_update_policies();
+ mutex_unlock(&intel_pstate_driver_lock);
+
+ return count;
+}
+
show_one(max_perf_pct, max_perf_pct);
show_one(min_perf_pct, min_perf_pct);
@@ -1042,6 +1066,7 @@ define_one_global_rw(max_perf_pct);
define_one_global_rw(min_perf_pct);
define_one_global_ro(turbo_pct);
define_one_global_ro(num_pstates);
+define_one_global_rw(hwp_dynamic_boost);
static struct attribute *intel_pstate_attributes[] = {
&status.attr,
@@ -1082,6 +1107,11 @@ static void __init intel_pstate_sysfs_expose_params(void)
rc = sysfs_create_file(intel_pstate_kobject, &min_perf_pct.attr);
WARN_ON(rc);
+ if (hwp_active) {
+ rc = sysfs_create_file(intel_pstate_kobject,
+ &hwp_dynamic_boost.attr);
+ WARN_ON(rc);
+ }
}
/************************** sysfs end ************************/
--
2.13.6
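For completeness, a minimal user-space sketch (not part of the series;
requires root) showing how the new attribute can be toggled; the sysfs path
is the one created by the patch above:

#include <stdio.h>

int main(void)
{
        const char *path =
                "/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost";
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return 1;
        }
        fputs("1\n", f);        /* "0" disables, any non-zero value enables */
        fclose(f);
        return 0;
}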
On Tuesday, June 5, 2018 11:42:38 PM CEST Srinivas Pandruvada wrote:
> [...]
Applied, thanks!
Srinivas Pandruvada <[email protected]> writes:
> Enable HWP boost on Skylake server and workstations.
>
Please revert this series, it led to significant energy usage and
graphics performance regressions [1]. The reasons are roughly the ones
we discussed by e-mail off-list last April: This causes the intel_pstate
driver to decrease the EPP to zero when the workload blocks on IO
frequently enough, which for the regressing benchmarks detailed in [1]
is a symptom of the workload being heavily IO-bound, which means they
won't benefit at all from the EPP boost since they aren't significantly
CPU-bound, and they will suffer a decrease in parallelism due to the
active CPU core using a larger fraction of the TDP in order to achieve
the same work, causing the GPU to have a lower power budget available,
leading to a decrease in system performance.
You may want to give a shot to my previous suggestion of using [2] in
order to detect whether the system is IO-bound, which you can use as an
indicator that the optimization implemented in this series cannot
possibly improve performance and can be expected to hurt energy usage.
Thanks.
[1] https://bugs.freedesktop.org/show_bug.cgi?id=107410
[2] https://patchwork.kernel.org/patch/10312259/
> [...]
On Fri, Jul 27, 2018 at 10:34:03PM -0700, Francisco Jerez wrote:
> Srinivas Pandruvada <[email protected]> writes:
>
> > Enable HWP boost on Skylake server and workstations.
> >
>
> Please revert this series, it led to significant energy usage and
> graphics performance regressions [1]. The reasons are roughly the ones
> we discussed by e-mail off-list last April: This causes the intel_pstate
> driver to decrease the EPP to zero when the workload blocks on IO
> frequently enough, which for the regressing benchmarks detailed in [1]
> is a symptom of the workload being heavily IO-bound, which means they
> won't benefit at all from the EPP boost since they aren't significantly
> CPU-bound, and they will suffer a decrease in parallelism due to the
> active CPU core using a larger fraction of the TDP in order to achieve
> the same work, causing the GPU to have a lower power budget available,
> leading to a decrease in system performance.
It slices both ways. With the series, there are large boosts to
performance on other workloads where a slight increase in power usage is
acceptable in exchange for performance. For example,
Single socket skylake running sqlite
v4.17 41ab43c9
Min Trans 2580.85 ( 0.00%) 5401.58 ( 109.29%)
Hmean Trans 2610.38 ( 0.00%) 5518.36 ( 111.40%)
Stddev Trans 28.08 ( 0.00%) 208.90 (-644.02%)
CoeffVar Trans 1.08 ( 0.00%) 3.78 (-251.57%)
Max Trans 2648.02 ( 0.00%) 5992.74 ( 126.31%)
BHmean-50 Trans 2629.78 ( 0.00%) 5643.81 ( 114.61%)
BHmean-95 Trans 2620.38 ( 0.00%) 5538.32 ( 111.36%)
BHmean-99 Trans 2620.38 ( 0.00%) 5538.32 ( 111.36%)
That's over doubling the transactions per second for that workload.
Two-socket skylake running dbench4
v4.17 41ab43c9
Amean 1 40.85 ( 0.00%) 14.97 ( 63.36%)
Amean 2 42.31 ( 0.00%) 17.33 ( 59.04%)
Amean 4 53.77 ( 0.00%) 27.85 ( 48.20%)
Amean 8 68.86 ( 0.00%) 43.78 ( 36.42%)
Amean 16 82.62 ( 0.00%) 56.51 ( 31.60%)
Amean 32 135.80 ( 0.00%) 116.06 ( 14.54%)
Amean 64 737.51 ( 0.00%) 701.00 ( 4.95%)
Amean 512 14996.60 ( 0.00%) 14755.05 ( 1.61%)
This is reporting the average latency of operations running dbench. The
series over halves the latencies. There are many examples of basic
workloads that benefit heavily from the series and while I accept it may
not be universal, such as the case where the graphics card needs the power
and not the CPU, a straight revert is not the answer. Without the series,
HWP cripples the CPU.
--
Mel Gorman
SUSE Labs
On Fri, 2018-07-27 at 22:34 -0700, Francisco Jerez wrote:
> Srinivas Pandruvada <[email protected]> writes:
>
> > Enable HWP boost on Skylake server and workstations.
> >
>
> Please revert this series, it led to significant energy usage and
> graphics performance regressions [1].
Which SKX platform is targeted to graphics?
> The reasons are roughly the ones
> we discussed by e-mail off-list last April: This causes the
> intel_pstate
> driver to decrease the EPP to zero
No. You didn't check this series. We are not using EPP at all.
The boost mechanism used here is not boost to max.
Thanks,
Srinivas
> [...]
Mel Gorman <[email protected]> writes:
> On Fri, Jul 27, 2018 at 10:34:03PM -0700, Francisco Jerez wrote:
>> Srinivas Pandruvada <[email protected]> writes:
>>
>> > Enable HWP boost on Skylake server and workstations.
>> >
>>
>> Please revert this series, it led to significant energy usage and
>> graphics performance regressions [1]. The reasons are roughly the ones
>> we discussed by e-mail off-list last April: This causes the intel_pstate
>> driver to decrease the EPP to zero when the workload blocks on IO
>> frequently enough, which for the regressing benchmarks detailed in [1]
>> is a symptom of the workload being heavily IO-bound, which means they
>> won't benefit at all from the EPP boost since they aren't significantly
>> CPU-bound, and they will suffer a decrease in parallelism due to the
>> active CPU core using a larger fraction of the TDP in order to achieve
>> the same work, causing the GPU to have a lower power budget available,
>> leading to a decrease in system performance.
>
> It slices both ways.
I don't think it's acceptable to land an optimization that trades the
performance of one use-case for another, especially since one could make
both use-cases happy by avoiding the boost in cases where we know
beforehand that we aren't going to achieve any improvement in performance.
An application waiting frequently on an IO device which is 100% utilized
isn't going to run faster just because we ramp up the CPU frequency, since
the IO device won't be able to process requests from the application any
faster anyway; we will only be pessimizing energy efficiency (and
potentially decreasing performance of the GPU *and* of other CPU cores
living on the same package for no benefit).
> With the series, there are large boosts to performance on other
> workloads where a slight increase in power usage is acceptable in
> exchange for performance. For example,
>
> Single socket skylake running sqlite
> v4.17 41ab43c9
> Min Trans 2580.85 ( 0.00%) 5401.58 ( 109.29%)
> Hmean Trans 2610.38 ( 0.00%) 5518.36 ( 111.40%)
> Stddev Trans 28.08 ( 0.00%) 208.90 (-644.02%)
> CoeffVar Trans 1.08 ( 0.00%) 3.78 (-251.57%)
> Max Trans 2648.02 ( 0.00%) 5992.74 ( 126.31%)
> BHmean-50 Trans 2629.78 ( 0.00%) 5643.81 ( 114.61%)
> BHmean-95 Trans 2620.38 ( 0.00%) 5538.32 ( 111.36%)
> BHmean-99 Trans 2620.38 ( 0.00%) 5538.32 ( 111.36%)
>
> That's over doubling the transactions per second for that workload.
>
> Two-socket skylake running dbench4
> v4.17 41ab43c9
> Amean 1 40.85 ( 0.00%) 14.97 ( 63.36%)
> Amean 2 42.31 ( 0.00%) 17.33 ( 59.04%)
> Amean 4 53.77 ( 0.00%) 27.85 ( 48.20%)
> Amean 8 68.86 ( 0.00%) 43.78 ( 36.42%)
> Amean 16 82.62 ( 0.00%) 56.51 ( 31.60%)
> Amean 32 135.80 ( 0.00%) 116.06 ( 14.54%)
> Amean 64 737.51 ( 0.00%) 701.00 ( 4.95%)
> Amean 512 14996.60 ( 0.00%) 14755.05 ( 1.61%)
>
> This is reporting the average latency of operations running
> dbench. The series over halves the latencies. There are many examples
> of basic workloads that benefit heavily from the series and while I
> accept it may not be universal, such as the case where the graphics
> card needs the power and not the CPU, a straight revert is not the
> answer. Without the series, HWP cripples the CPU.
>
That seems like a huge overstatement. HWP doesn't "cripple" the CPU
without this series. It will certainly set lower clocks than with this
series for workloads like you show above that utilize the CPU very
intermittently (i.e. they underutilize it). But one could argue that
such workloads are inherently misdesigned and will perform suboptimally
regardless of the behavior of the CPUFREQ governor, for two different
reasons: On the one hand because they are unable to fully utilize their
CPU time (otherwise HWP would be giving them a CPU frequency close to
the maximum already), and on the other hand, because in order to achieve
maximum performance they will necessarily have to bounce back and forth
between the maximum P-state and idle at high frequency, which is
inherently energy-inefficient and will effectively *decrease* the
overall number of requests per second that an actual multi-threaded
server can process, even though the request throughput may seem to
increase in a single-threaded benchmark.
> --
> Mel Gorman
> SUSE Labs
"Pandruvada, Srinivas" <[email protected]> writes:
> On Sat, 2018-07-28 at 07:14 -0700, Srinivas Pandruvada wrote:
>> On Fri, 2018-07-27 at 22:34 -0700, Francisco Jerez wrote:
>> > Srinivas Pandruvada <[email protected]> writes:
>> >
>> > > Enable HWP boost on Skylake server and workstations.
>> > >
>> >
>> > Please revert this series, it led to significant energy usage and
>> > graphics performance regressions [1].
>>
>> Which SKX platform is targeted to graphics?
> There are some Xeon E3, which is using SKL desktop CPUID.
> Do you have a SKL desktop with FADT pm profile
> acpi_gbl_FADT.preferred_profile == PM_DESKTOP? You can take acpidump
> and check.
>
> Also what is
> /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
>
Eero is the one that reproduced the regressions -- Can you get Srinivas
this information from your system?
Thank you.
> [...]
Srinivas Pandruvada <[email protected]> writes:
> On Fri, 2018-07-27 at 22:34 -0700, Francisco Jerez wrote:
>> Srinivas Pandruvada <[email protected]> writes:
>>
>> > Enable HWP boost on Skylake server and workstations.
>> >
>>
>> Please revert this series, it led to significant energy usage and
>> graphics performance regressions [1].
> Which SKX platform is targeted to graphics?
>
See the bug report, it's a regular desktop SKL.
>> The reasons are roughly the ones
>> we discussed by e-mail off-list last April: This causes the
>> intel_pstate
>> driver to decrease the EPP to zero
> No. You didn't check this series. We are not using EPP at all.
> The boost mechanism used here is not boost to max.
>
I see you've changed the mechanism to obtain a latency boost since our
last discussion, but that doesn't address my concerns in any way: This
series causes the intel_pstate driver to clamp the CPU frequency above
the optimal frequency to run the workload at, as a response to the
application waiting on IO frequently, even though that's only a sign of
the application being IO-bound and *not* a sign of it being
latency-sensitive, since the application's IO and CPU work are properly
pipelined. This leads to a decrease in parallelism due to the active
CPU core using a larger fraction of the package TDP in order to achieve
the same work, leading to a decrease in system performance.
> Thanks,
> Srinivas
>
>> when the workload blocks on IO
>> frequently enough, which for the regressing benchmarks detailed in
>> [1]
>> is a symptom of the workload being heavily IO-bound, which means they
>> won't benefit at all from the EPP boost since they aren't
>> significantly
>> CPU-bound, and they will suffer a decrease in parallelism due to the
>> active CPU core using a larger fraction of the TDP in order to
>> achieve
>> the same work, causing the GPU to have a lower power budget
>> available,
>> leading to a decrease in system performance.
>>
>> You may want to give a shot to my previous suggestion of using [2] in
>> order to detect whether the system is IO-bound, which you can use as
>> an
>> indicator that the optimization implemented in this series cannot
>> possibly improve performance and can be expected to hurt energy
>> usage.
>>
>> Thanks.
>>
>> [1] https://bugs.freedesktop.org/show_bug.cgi?id=107410
>> [2] https://patchwork.kernel.org/patch/10312259/
>>
>> > Reported-by: Mel Gorman <[email protected]>
>> > Tested-by: Giovanni Gherdovich <[email protected]>
>> > Signed-off-by: Srinivas Pandruvada <[email protected]
>> > .com>
>> > ---
>> > drivers/cpufreq/intel_pstate.c | 10 ++++++++++
>> > 1 file changed, 10 insertions(+)
>> >
>> > diff --git a/drivers/cpufreq/intel_pstate.c
>> > b/drivers/cpufreq/intel_pstate.c
>> > index 70bf63bb4e0e..01c8da1f99db 100644
>> > --- a/drivers/cpufreq/intel_pstate.c
>> > +++ b/drivers/cpufreq/intel_pstate.c
>> > @@ -1794,6 +1794,12 @@ static const struct x86_cpu_id
>> > intel_pstate_cpu_ee_disable_ids[] = {
>> > {}
>> > };
>> >
>> > +static const struct x86_cpu_id intel_pstate_hwp_boost_ids[]
>> > __initconst = {
>> > + ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
>> > + ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
>> > + {}
>> > +};
>> > +
>> > static int intel_pstate_init_cpu(unsigned int cpunum)
>> > {
>> > struct cpudata *cpu;
>> > @@ -1824,6 +1830,10 @@ static int intel_pstate_init_cpu(unsigned
>> > int cpunum)
>> > intel_pstate_disable_ee(cpunum);
>> >
>> > intel_pstate_hwp_enable(cpu);
>> > +
>> > + id = x86_match_cpu(intel_pstate_hwp_boost_ids);
>> > + if (id)
>> > + hwp_boost = true;
>> > }
>> >
>> > intel_pstate_get_cpu_pstates(cpu);
>> > --
>> > 2.13.6
On Sat, 2018-07-28 at 13:23 -0700, Francisco Jerez wrote:
> "Pandruvada, Srinivas" <[email protected]> writes:
>
> > On Sat, 2018-07-28 at 07:14 -0700, Srinivas Pandruvada wrote:
> > > On Fri, 2018-07-27 at 22:34 -0700, Francisco Jerez wrote:
> > > > Srinivas Pandruvada <[email protected]>
> > > > writes:
> > > >
> > > > > Enable HWP boost on Skylake server and workstations.
> > > > >
> > > >
> > > > Please revert this series, it led to significant energy usage
> > > > and
> > > > graphics performance regressions [1].
> > >
> > > Which SKX platform is targeted to graphics?
> >
> > There are some Xeon E3, which is using SKL desktop CPUID.
> > Do you have a SKL desktop with FADT pm profile
> > acpi_gbl_FADT.preferred_profile == PM_DESKTOP? You can take
> > acpidump
> > and check.
> >
> > Also what is
> > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
> >
>
> Eero is the one that reproduced the regressions -- Can you get
> Srinivas
> this information from your system?
I commented in the bug. Let's take it there first.
Thanks,
Srinivas
>
> Thank you.
>
> > Thanks,
> > Srinivas
> >
> > >
> > > > The reasons are roughly the ones
> > > > we discussed by e-mail off-list last April: This causes the
> > > > intel_pstate
> > > > driver to decrease the EPP to zero
> > >
> > > No. You didn't check this series. We are not using EPP at all.
> > > The boost mechanism used here is not boost to max.
> > >
> > > Thanks,
> > > Srinivas
> > >
> > > > when the workload blocks on IO
> > > > frequently enough, which for the regressing benchmarks detailed
> > > > in
> > > > [1]
> > > > is a symptom of the workload being heavily IO-bound, which
> > > > means
> > > > they
> > > > won't benefit at all from the EPP boost since they aren't
> > > > significantly
> > > > CPU-bound, and they will suffer a decrease in parallelism due
> > > > to
> > > > the
> > > > active CPU core using a larger fraction of the TDP in order to
> > > > achieve
> > > > the same work, causing the GPU to have a lower power budget
> > > > available,
> > > > leading to a decrease in system performance.
> > > >
> > > > You may want to give a shot to my previous suggestion of using
> > > > [2]
> > > > in
> > > > order to detect whether the system is IO-bound, which you can
> > > > use
> > > > as
> > > > an
> > > > indicator that the optimization implemented in this series
> > > > cannot
> > > > possibly improve performance and can be expected to hurt energy
> > > > usage.
> > > >
> > > > Thanks.
> > > >
> > > > [1] https://bugs.freedesktop.org/show_bug.cgi?id=107410
> > > > [2] https://patchwork.kernel.org/patch/10312259/
> > > >
> > > > > Reported-by: Mel Gorman <[email protected]>
> > > > > Tested-by: Giovanni Gherdovich <[email protected]>
> > > > > Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux
> > > > > .int
> > > > > el
> > > > > .com>
> > > > > ---
> > > > > drivers/cpufreq/intel_pstate.c | 10 ++++++++++
> > > > > 1 file changed, 10 insertions(+)
> > > > >
> > > > > diff --git a/drivers/cpufreq/intel_pstate.c
> > > > > b/drivers/cpufreq/intel_pstate.c
> > > > > index 70bf63bb4e0e..01c8da1f99db 100644
> > > > > --- a/drivers/cpufreq/intel_pstate.c
> > > > > +++ b/drivers/cpufreq/intel_pstate.c
> > > > > @@ -1794,6 +1794,12 @@ static const struct x86_cpu_id
> > > > > intel_pstate_cpu_ee_disable_ids[] = {
> > > > > {}
> > > > > };
> > > > >
> > > > > +static const struct x86_cpu_id intel_pstate_hwp_boost_ids[]
> > > > > __initconst = {
> > > > > + ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
> > > > > + ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
> > > > > + {}
> > > > > +};
> > > > > +
> > > > > static int intel_pstate_init_cpu(unsigned int cpunum)
> > > > > {
> > > > > struct cpudata *cpu;
> > > > > @@ -1824,6 +1830,10 @@ static int
> > > > > intel_pstate_init_cpu(unsigned
> > > > > int cpunum)
> > > > > intel_pstate_disable_ee(cpunum);
> > > > >
> > > > > intel_pstate_hwp_enable(cpu);
> > > > > +
> > > > > + id =
> > > > > x86_match_cpu(intel_pstate_hwp_boost_ids);
> > > > > + if (id)
> > > > > + hwp_boost = true;
> > > > > }
> > > > >
> > > > > intel_pstate_get_cpu_pstates(cpu);
> > > > > --
> > > > > 2.13.6
Hi,
On 28.07.2018 17:14, Srinivas Pandruvada wrote:
> On Fri, 2018-07-27 at 22:34 -0700, Francisco Jerez wrote:
>> Srinivas Pandruvada <[email protected]> writes:
>>
>>> Enable HWP boost on Skylake server and workstations.
>>>
>>
>> Please revert this series, it led to significant energy usage and
>> graphics performance regressions [1].
> Which SKX platform is targeted to graphics?
The patch that Chris pointed out is this:
+static const struct x86_cpu_id intel_pstate_hwp_boost_ids[] = {
+ ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
+ ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
+ {}
+};
The regressing platforms in our test system were:
- SKL 6600K i5 / GT2
- SKL 6770HQ i7 / GT4e
The SKL-U i5 / GT3e device wasn't impacted, so I assume U devices don't
match INTEL_FAM6_SKYLAKE_DESKTOP.
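
For reference, which FAM6 model a given SKL part reports can be checked from
user space by decoding CPUID leaf 1; the model numbers below are the ones from
arch/x86/include/asm/intel-family.h (0x4e SKYLAKE_MOBILE for U/Y parts, 0x5e
SKYLAKE_DESKTOP, 0x55 SKYLAKE_X). This is only an illustrative sketch, not part
of the series under discussion:

/* Illustrative user-space check of the FAM6 model (not part of the series). */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 1;

	unsigned int family = (eax >> 8) & 0xf;             /* 6 for all SKL parts */
	unsigned int model = ((eax >> 4) & 0xf) |
			     (((eax >> 16) & 0xf) << 4);    /* extended model, family 6 */

	printf("family 0x%x, model 0x%x\n", family, model);
	/* 0x4e = SKYLAKE_MOBILE (U/Y), 0x5e = SKYLAKE_DESKTOP, 0x55 = SKYLAKE_X */
	return 0;
}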
- Eero
>> The reasons are roughly the ones
>> we discussed by e-mail off-list last April: This causes the
>> intel_pstate
>> driver to decrease the EPP to zero
> No. You didn't check this series. We are not using EPP at all.
> The boost mechanism used here is not boost to max.
>
> Thanks,
> Srinivas
>
>> when the workload blocks on IO
>> frequently enough, which for the regressing benchmarks detailed in
>> [1]
>> is a symptom of the workload being heavily IO-bound, which means they
>> won't benefit at all from the EPP boost since they aren't
>> significantly
>> CPU-bound, and they will suffer a decrease in parallelism due to the
>> active CPU core using a larger fraction of the TDP in order to
>> achieve
>> the same work, causing the GPU to have a lower power budget
>> available,
>> leading to a decrease in system performance.
>>
>> You may want to give a shot to my previous suggestion of using [2] in
>> order to detect whether the system is IO-bound, which you can use as
>> an
>> indicator that the optimization implemented in this series cannot
>> possibly improve performance and can be expected to hurt energy
>> usage.
>>
>> Thanks.
>>
>> [1] https://bugs.freedesktop.org/show_bug.cgi?id=107410
>> [2] https://patchwork.kernel.org/patch/10312259/
>>
>>> Reported-by: Mel Gorman <[email protected]>
>>> Tested-by: Giovanni Gherdovich <[email protected]>
>>> Signed-off-by: Srinivas Pandruvada <[email protected]
>>> .com>
>>> ---
>>> drivers/cpufreq/intel_pstate.c | 10 ++++++++++
>>> 1 file changed, 10 insertions(+)
>>>
>>> diff --git a/drivers/cpufreq/intel_pstate.c
>>> b/drivers/cpufreq/intel_pstate.c
>>> index 70bf63bb4e0e..01c8da1f99db 100644
>>> --- a/drivers/cpufreq/intel_pstate.c
>>> +++ b/drivers/cpufreq/intel_pstate.c
>>> @@ -1794,6 +1794,12 @@ static const struct x86_cpu_id
>>> intel_pstate_cpu_ee_disable_ids[] = {
>>> {}
>>> };
>>>
>>> +static const struct x86_cpu_id intel_pstate_hwp_boost_ids[]
>>> __initconst = {
>>> + ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
>>> + ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
>>> + {}
>>> +};
>>> +
>>> static int intel_pstate_init_cpu(unsigned int cpunum)
>>> {
>>> struct cpudata *cpu;
>>> @@ -1824,6 +1830,10 @@ static int intel_pstate_init_cpu(unsigned
>>> int cpunum)
>>> intel_pstate_disable_ee(cpunum);
>>>
>>> intel_pstate_hwp_enable(cpu);
>>> +
>>> + id = x86_match_cpu(intel_pstate_hwp_boost_ids);
>>> + if (id)
>>> + hwp_boost = true;
>>> }
>>>
>>> intel_pstate_get_cpu_pstates(cpu);
>>> --
>>> 2.13.6
Hi Mel,
On 28.07.2018 15:36, Mel Gorman wrote:
> On Fri, Jul 27, 2018 at 10:34:03PM -0700, Francisco Jerez wrote:
>> Srinivas Pandruvada <[email protected]> writes:
>>
>>> Enable HWP boost on Skylake server and workstations.
>>>
>>
>> Please revert this series, it led to significant energy usage and
>> graphics performance regressions [1]. The reasons are roughly the ones
>> we discussed by e-mail off-list last April: This causes the intel_pstate
>> driver to decrease the EPP to zero when the workload blocks on IO
>> frequently enough, which for the regressing benchmarks detailed in [1]
>> is a symptom of the workload being heavily IO-bound, which means they
>> won't benefit at all from the EPP boost since they aren't significantly
>> CPU-bound, and they will suffer a decrease in parallelism due to the
>> active CPU core using a larger fraction of the TDP in order to achieve
>> the same work, causing the GPU to have a lower power budget available,
>> leading to a decrease in system performance.
>
> It slices both ways. With the series, there are large boosts to
> performance on other workloads where a slight increase in power usage is
> acceptable in exchange for performance. For example,
>
> Single socket skylake running sqlite
[...]
> That's over doubling the transactions per second for that workload.
>
> Two-socket skylake running dbench4
[...]
> This is reporting the average latency of operations running dbench. The
> series over halves the latencies. There are many examples of basic
> workloads that benefit heavily from the series and while I accept it may
> not be universal, such as the case where the graphics card needs the power
> and not the CPU, a straight revert is not the answer. Without the series,
> HWP cripples the CPU.
I assume the SQLite IO bottleneck is the disk. The disk doesn't share
the TDP limit with the CPU the way the IGP does.
Constraints / performance considerations for TDP-sharing IO loads
differ from ones that don't share TDP with the CPU cores.
Workloads that can be "IO-bound" and which can be on the same chip
with CPU i.e. share TDP with it are:
- 3D rendering
- Camera / video processing
- Compute
Intel, AMD and ARM manufacturers all have (non-server) chips where these
IP blocks are on the same die as the CPU cores. If the CPU part redundantly
doubles its power consumption, it directly eats TDP budget away from
these devices.
For workloads where IO bottleneck doesn't share TDP budget with CPU,
like (sqlite) databases, you don't lose performance by running CPU
constantly at full tilt, you only use more power [1].
Questions:
* Does currently kernel CPU freq management have any idea which IO
devices share TDP with the CPU cores?
* Do you do performance testing also in conditions that hit TDP limits?
- Eero
[1] For them, power usage is a performance problem only if you start
hitting the TDP limit with CPUs alone, or you hit temperature limits.
For CPUs alone to hit TDP limits, our test-case needs to be utilizing
multiple cores and the device needs to have a lowish TDP compared to the
performance of the chip.
TDP limiting adds test results variance significantly, but that's
a property of the chips themselves so it cannot be avoided. :-/
Temperature limiting might happen on small enclosures like the ones used
for the SKL HQ devices i.e. laptops & NUCs, but not on servers. In our
testing we try to avoid temperature limitations when it's possible (=
extra cooling) as it increases variance so much that results are mostly
useless (same devices are also TDP limited i.e. already have high
variance).
Hi Eero,
On Mon, 2018-07-30 at 11:33 +0300, Eero Tamminen wrote:
> Hi,
>
> On 28.07.2018 17:14, Srinivas Pandruvada wrote:
> > On Fri, 2018-07-27 at 22:34 -0700, Francisco Jerez wrote:
> > > Srinivas Pandruvada <[email protected]> writes:
> > >
> > > > Enable HWP boost on Skylake server and workstations.
> > > >
> > >
> > > Please revert this series, it led to significant energy usage and
> > > graphics performance regressions [1].
> >
> > Which SKX platform is targeted to graphics?
>
> Patch that Chris pointed out is this:
> +static const struct x86_cpu_id intel_pstate_hwp_boost_ids[] = {
> + ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
> + ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
> + {}
> +};
>
> The regressing platforms in our test system were:
> - SKL 6600K i5 / GT2
> - SKL 6770HQ i7 / GT4e
>
> SKL-U i5 / GT3e device wasn't impacted, so I assume U devices don't
> match INTEL_FAM6_SKYLAKE_DESKTOP.
I have updated some steps in the bugzilla. Can you try them?
Also, add one workload which shows this issue immediately, including
any parameters you need to add.
Thanks,
Srinivas
>
>
>
> - Eero
>
> > > The reasons are roughly the ones
> > > we discussed by e-mail off-list last April: This causes the
> > > intel_pstate
> > > driver to decrease the EPP to zero
> >
> > No. You didn't check this series. We are not using EPP at all.
> > The boost mechanism used here is not boost to max.
> >
> > Thanks,
> > Srinivas
> >
> > > when the workload blocks on IO
> > > frequently enough, which for the regressing benchmarks detailed
> > > in
> > > [1]
> > > is a symptom of the workload being heavily IO-bound, which means
> > > they
> > > won't benefit at all from the EPP boost since they aren't
> > > significantly
> > > CPU-bound, and they will suffer a decrease in parallelism due to
> > > the
> > > active CPU core using a larger fraction of the TDP in order to
> > > achieve
> > > the same work, causing the GPU to have a lower power budget
> > > available,
> > > leading to a decrease in system performance.
> > >
> > > You may want to give a shot to my previous suggestion of using
> > > [2] in
> > > order to detect whether the system is IO-bound, which you can use
> > > as
> > > an
> > > indicator that the optimization implemented in this series cannot
> > > possibly improve performance and can be expected to hurt energy
> > > usage.
> > >
> > > Thanks.
> > >
> > > [1] https://bugs.freedesktop.org/show_bug.cgi?id=107410
> > > [2] https://patchwork.kernel.org/patch/10312259/
> > >
> > > > Reported-by: Mel Gorman <[email protected]>
> > > > Tested-by: Giovanni Gherdovich <[email protected]>
> > > > Signed-off-by: Srinivas Pandruvada <[email protected]
> > > > ntel
> > > > .com>
> > > > ---
> > > > drivers/cpufreq/intel_pstate.c | 10 ++++++++++
> > > > 1 file changed, 10 insertions(+)
> > > >
> > > > diff --git a/drivers/cpufreq/intel_pstate.c
> > > > b/drivers/cpufreq/intel_pstate.c
> > > > index 70bf63bb4e0e..01c8da1f99db 100644
> > > > --- a/drivers/cpufreq/intel_pstate.c
> > > > +++ b/drivers/cpufreq/intel_pstate.c
> > > > @@ -1794,6 +1794,12 @@ static const struct x86_cpu_id
> > > > intel_pstate_cpu_ee_disable_ids[] = {
> > > > {}
> > > > };
> > > >
> > > > +static const struct x86_cpu_id intel_pstate_hwp_boost_ids[]
> > > > __initconst = {
> > > > + ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
> > > > + ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
> > > > + {}
> > > > +};
> > > > +
> > > > static int intel_pstate_init_cpu(unsigned int cpunum)
> > > > {
> > > > struct cpudata *cpu;
> > > > @@ -1824,6 +1830,10 @@ static int
> > > > intel_pstate_init_cpu(unsigned
> > > > int cpunum)
> > > > intel_pstate_disable_ee(cpunum);
> > > >
> > > > intel_pstate_hwp_enable(cpu);
> > > > +
> > > > + id =
> > > > x86_match_cpu(intel_pstate_hwp_boost_ids);
> > > > + if (id)
> > > > + hwp_boost = true;
> > > > }
> > > >
> > > > intel_pstate_get_cpu_pstates(cpu);
> > > > --
> > > > 2.13.6
>
>
On Mon, 2018-07-30 at 14:16 +0300, Eero Tamminen wrote:
> Hi Mel,
>
> On 28.07.2018 15:36, Mel Gorman wrote:
> > On Fri, Jul 27, 2018 at 10:34:03PM -0700, Francisco Jerez wrote:
> > > Srinivas Pandruvada <[email protected]> writes:
> > >
> > > > Enable HWP boost on Skylake server and workstations.
> > > >
> > >
> > > Please revert this series, it led to significant energy usage and
> > > graphics performance regressions [1]. The reasons are roughly
> > > the ones
> > > we discussed by e-mail off-list last April: This causes the
> > > intel_pstate
> > > driver to decrease the EPP to zero when the workload blocks on IO
> > > frequently enough, which for the regressing benchmarks detailed
> > > in [1]
> > > is a symptom of the workload being heavily IO-bound, which means
> > > they
> > > won't benefit at all from the EPP boost since they aren't
> > > significantly
> > > CPU-bound, and they will suffer a decrease in parallelism due to
> > > the
> > > active CPU core using a larger fraction of the TDP in order to
> > > achieve
> > > the same work, causing the GPU to have a lower power budget
> > > available,
> > > leading to a decrease in system performance.
>
> >
> > It slices both ways. With the series, there are large boosts to
> > performance on other workloads where a slight increase in power
> > usage is
> > acceptable in exchange for performance. For example,
> >
> > Single socket skylake running sqlite
>
> [...]
> > That's over doubling the transactions per second for that workload.
> >
> > Two-socket skylake running dbench4
>
> [...]
> > This is reporting the average latency of operations running dbench. The
> > series over halves the latencies. There are many examples of basic
> > workloads that benefit heavily from the series and while I accept
> > it may
> > not be universal, such as the case where the graphics card needs
> > the power
> > and not the CPU, a straight revert is not the answer. Without the
> > series,
> > HWP cripplies the CPU.
>
> I assume SQLite IO-bottleneck is for the disk. Disk doesn't share
> the TDP limit with the CPU, like IGP does.
>
> Constraints / performance considerations for TDP sharing IO-loads
> differ from ones that don't share TDP with CPU cores.
>
>
> Workloads that can be "IO-bound" and which can be on the same chip
> with CPU i.e. share TDP with it are:
> - 3D rendering
> - Camera / video processing
> - Compute
>
> Intel, AMD and ARM manufacturers all have (non-server) chips where
> these
> IP blocks are on the same die as CPU cores. If CPU part redundantly
> doubles its power consumption, it's directly eating TDP budget away
> from
> these devices.
>
> For workloads where IO bottleneck doesn't share TDP budget with CPU,
> like (sqlite) databases, you don't lose performance by running CPU
> constantly at full tilt, you only use more power [1].
>
> Questions:
>
> * Does currently kernel CPU freq management have any idea which IO
> devices share TDP with the CPU cores?
No. The requests we make to the hardware are indications only (HW can
ignore them). The HW has a bias register to adjust and distribute power
among users.
We can have several other active devices on servers besides the CPU which,
when running, need extra power. So the HW arbitrates power.
>
> * Do you do performance testing also in conditions that hit TDP
> limits?
Yes, several server benchmarks which also measure perf/watt.
For graphics we run KBL-G, which hits TDP limits; there we have a user
space power manager to distribute power.
Also, one Intel 6600 and one 7600 run the whole suite of Phoronix tests.
Thanks,
Srinivas
>
>
> - Eero
>
> [1] For them power usage is performance problem only if you start
> hitting TDP limit with CPUs alone, or you hit temperature limits.
>
> For CPUs alone to hit TDP limits, our test-case needs to be
> utilizing
> multiple cores and device needs to have lowish TDP compared to the
> performance of the chip.
>
> TDP limiting adds test results variance significantly, but that's
> a property of the chips themselves so it cannot be avoided. :-/
>
> Temperature limiting might happen on small enclosures like the ones
> used
> for the SKL HQ devices i.e. laptops & NUCs, but not on servers. In
> our
> testing we try to avoid temperature limitations when its possible (=
> extra cooling) as it increases variance so much that results are
> mostly
> useless (same devices are also TDP limited i.e. already have high
> variance).
On Sat, Jul 28, 2018 at 01:21:51PM -0700, Francisco Jerez wrote:
> >> Please revert this series, it led to significant energy usage and
> >> graphics performance regressions [1]. The reasons are roughly the ones
> >> we discussed by e-mail off-list last April: This causes the intel_pstate
> >> driver to decrease the EPP to zero when the workload blocks on IO
> >> frequently enough, which for the regressing benchmarks detailed in [1]
> >> is a symptom of the workload being heavily IO-bound, which means they
> >> won't benefit at all from the EPP boost since they aren't significantly
> >> CPU-bound, and they will suffer a decrease in parallelism due to the
> >> active CPU core using a larger fraction of the TDP in order to achieve
> >> the same work, causing the GPU to have a lower power budget available,
> >> leading to a decrease in system performance.
> >
> > It slices both ways.
>
> I don't think it's acceptable to land an optimization that trades
> performance of one use-case for another,
The same logic applies to a revert but that aside, I see that there is
at least one patch floating around to disable HWP Boost for desktops and
laptops. Maybe that'll be sufficient for the cases where IGP is a major
component.
> especially since one could make
> both use-cases happy by avoiding the boost in cases where we know
> beforehand that we aren't going to achieve any improvement in
> performance, because an application waiting frequently on an IO device
> which is 100% utilized isn't going to run faster just because we ramp up
> the CPU frequency, since the IO device won't be able to process requests
> from the application faster anyway, so we will only be pessimizing
> energy efficiency (and potentially decreasing performance of the GPU
> *and* of other CPU cores living on the same package for no benefit).
>
The benchmarks in question are not necessarily utilising IO at 100% or
IO-bound. One pattern is a small fsync which ends up context switching
between the process and a journalling thread (may be dedicated thread, may be
workqueue depending on filesystem) and the process waking again in the very
near future on IO completion. While the workload may be single threaded,
more than one core is in use because of how the short sleeps migrate the
task to other cores. HWP does not necessarily notice that the task is
quite CPU-intensive due to the migrations and so the performance suffers.
Some effort is made to minimise the number of cores used with this sort
of waker/wakee relationship but it's not necessarily enough for HWP to
boost the frequency. Minimally, the journalling thread woken up will
not wake on the same CPU as the IO issuer except under extremely heavy
utilisation, and this is not likely to change (stacking tasks too often
increases wakeup latency).
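
To make the pattern concrete, the workload being described is roughly a loop
of small buffered writes each followed by an fsync(), so the task keeps
blocking for a very short time on the journal commit and is a prime candidate
for being migrated on wakeup. A minimal, purely illustrative sketch (not one
of the MMTests benchmarks):

/* Minimal sketch of a small-write + fsync loop (illustrative only). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[256];
	int i, fd = open("fsync-test.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 'x', sizeof(buf));

	for (i = 0; i < 100000; i++) {
		/* Small write followed by a synchronous commit: the task sleeps
		 * briefly on IO completion and may wake on another CPU. */
		if (write(fd, buf, sizeof(buf)) < 0 || fsync(fd) < 0) {
			perror("write/fsync");
			break;
		}
	}
	close(fd);
	return 0;
}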
> > With the series, there are large boosts to performance on other
> > workloads where a slight increase in power usage is acceptable in
> > exchange for performance. For example,
> >
> > Single socket skylake running sqlite
> > v4.17 41ab43c9
> > Min Trans 2580.85 ( 0.00%) 5401.58 ( 109.29%)
> > Hmean Trans 2610.38 ( 0.00%) 5518.36 ( 111.40%)
> > Stddev Trans 28.08 ( 0.00%) 208.90 (-644.02%)
> > CoeffVar Trans 1.08 ( 0.00%) 3.78 (-251.57%)
> > Max Trans 2648.02 ( 0.00%) 5992.74 ( 126.31%)
> > BHmean-50 Trans 2629.78 ( 0.00%) 5643.81 ( 114.61%)
> > BHmean-95 Trans 2620.38 ( 0.00%) 5538.32 ( 111.36%)
> > BHmean-99 Trans 2620.38 ( 0.00%) 5538.32 ( 111.36%)
> >
> > That's over doubling the transactions per second for that workload.
> >
> > Two-socket skylake running dbench4
> > v4.17 41ab43c9
> > Amean 1 40.85 ( 0.00%) 14.97 ( 63.36%)
> > Amean 2 42.31 ( 0.00%) 17.33 ( 59.04%)
> > Amean 4 53.77 ( 0.00%) 27.85 ( 48.20%)
> > Amean 8 68.86 ( 0.00%) 43.78 ( 36.42%)
> > Amean 16 82.62 ( 0.00%) 56.51 ( 31.60%)
> > Amean 32 135.80 ( 0.00%) 116.06 ( 14.54%)
> > Amean 64 737.51 ( 0.00%) 701.00 ( 4.95%)
> > Amean 512 14996.60 ( 0.00%) 14755.05 ( 1.61%)
> >
> > This is reporting the average latency of operations running
> > dbench. The series over halves the latencies. There are many examples
> > of basic workloads that benefit heavily from the series and while I
> > accept it may not be universal, such as the case where the graphics
> > card needs the power and not the CPU, a straight revert is not the
> > answer. Without the series, HWP cripples the CPU.
> >
>
> That seems like a huge overstatement. HWP doesn't "cripple" the CPU
> without this series. It will certainly set lower clocks than with this
> series for workloads like you show above that utilize the CPU very
> intermittently (i.e. they underutilize it).
Dbench for example can be quite CPU intensive. When bound to a single
core, it shows up to 80% utilisation of a single core. When unbound,
the usage of individual cores appears low due to the migrations. It may
be intermittent usage as it context switches to worker threads but it's
not low utilisation either.
intel_pstate also had logic for IO-boosting before HWP so the user-visible
impact for some workloads is that upgrading a machine's CPU can result
in regressions due to HWP. Similarly it has been observed prior to the
series that specifying no_hwp often performed better. So one could argue
that HWP isn't "crippled" but it did have surprising behaviour.
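
For reference, that pre-HWP behaviour can still be reproduced on an
HWP-capable machine by disabling HWP with the documented intel_pstate boot
parameter, which makes the driver fall back to its own frequency-selection
algorithms:

    intel_pstate=no_hwp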
--
Mel Gorman
SUSE Labs
On Mon, 2018-07-30 at 16:43 +0100, Mel Gorman wrote:
> On Sat, Jul 28, 2018 at 01:21:51PM -0700, Francisco Jerez wrote:
> > > > Please revert this series, it led to significant energy usage
> > > > and
> > > > graphics performance regressions [1]. The reasons are roughly
> > > > the ones
> > > > we discussed by e-mail off-list last April: This causes the
> > > > intel_pstate
> > > > driver to decrease the EPP to zero when the workload blocks on
> > > > IO
> > > > frequently enough, which for the regressing benchmarks detailed
> > > > in [1]
> > > > is a symptom of the workload being heavily IO-bound, which
> > > > means they
> > > > won't benefit at all from the EPP boost since they aren't
> > > > significantly
> > > > CPU-bound, and they will suffer a decrease in parallelism due
> > > > to the
> > > > active CPU core using a larger fraction of the TDP in order to
> > > > achieve
> > > > the same work, causing the GPU to have a lower power budget
> > > > available,
> > > > leading to a decrease in system performance.
> > >
> > > It slices both ways.
> >
> > I don't think it's acceptable to land an optimization that trades
> > performance of one use-case for another,
>
> The same logic applies to a revert but that aside, I see that there
> is
> at least one patch floating around to disable HWP Boost for desktops
> and
> laptops. Maybe that'll be sufficient for the cases where IGP is a
> major
> component.
We don't have to revert the series. The only contention is the desktop CPU
model, which I didn't have in this series before the last version.
Some entry-level servers use the same CPU model as desktops, which users
like Giovanni suggested adding. But we know from ACPI that those machines
are servers, so we can differentiate.
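
One possible shape for that differentiation, assuming the FADT preferred PM
profile is used (acpi_gbl_FADT.preferred_profile and the PM_* values come from
the ACPI headers; the helper name and exact placement here are only a sketch,
not the actual patch):

/* Sketch only: gate hwp_boost on the ACPI FADT preferred PM profile. */
#ifdef CONFIG_ACPI
static bool intel_pstate_acpi_pm_profile_server(void)
{
	if (acpi_gbl_FADT.preferred_profile == PM_ENTERPRISE_SERVER ||
	    acpi_gbl_FADT.preferred_profile == PM_PERFORMANCE_SERVER)
		return true;

	return false;
}
#else
static bool intel_pstate_acpi_pm_profile_server(void)
{
	return false;
}
#endif

/* ...and in intel_pstate_init_cpu(), only enable the boost on servers: */
	id = x86_match_cpu(intel_pstate_hwp_boost_ids);
	if (id && intel_pstate_acpi_pm_profile_server())
		hwp_boost = true;

With something along those lines, Xeon E3 parts that report the desktop CPUID
would still get the boost when the firmware declares a server profile, while
regular desktops would not.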
Thanks,
Srinivas
>
> > especially since one could make
> > both use-cases happy by avoiding the boost in cases where we know
> > beforehand that we aren't going to achieve any improvement in
> > performance, because an application waiting frequently on an IO
> > device
> > which is 100% utilized isn't going to run faster just because we
> > ramp up
> > the CPU frequency, since the IO device won't be able to process
> > requests
> > from the application faster anyway, so we will only be pessimizing
> > energy efficiency (and potentially decreasing performance of the
> > GPU
> > *and* of other CPU cores living on the same package for no
> > benefit).
> >
>
> The benchmarks in question are not necessarily utilising IO at 100%
> or
> IO-bound. One pattern is a small fsync which ends up context
> switching
> between the process and a journalling thread (may be dedicated
> thread, may be
> workqueue depending on filesystem) and the process waking again in
> the very
> near future on IO completion. While the workload may be single
> threaded,
> more than one core is in use because of how the short sleeps migrate
> the
> task to other cores. HWP does not necessarily notice that the task
> is
> quite CPU-intensive due to the migrations and so the performance
> suffers.
>
> Some effort is made to minimise the number of cores used with this
> sort
> of waker/wakee relationship but it's not necessarily enough for HWP
> to
> boost the frequency. Minimally, the journalling thread woken up will
> not wake on the same CPU as the IO issuer except under extremely
> heavily
> utilisation and this is not likely to change (stacking stacks too
> often
> increases wakeup latency).
>
> > > With the series, there are large boosts to performance on other
> > > workloads where a slight increase in power usage is acceptable in
> > > exchange for performance. For example,
> > >
> > > Single socket skylake running sqlite
> > > v4.17 41ab43c9
> > > Min Trans 2580.85 ( 0.00%) 5401.58 ( 109.29%)
> > > Hmean Trans 2610.38 ( 0.00%) 5518.36 ( 111.40%)
> > > Stddev Trans 28.08 ( 0.00%) 208.90 (-644.02%)
> > > CoeffVar Trans 1.08 ( 0.00%) 3.78 (-251.57%)
> > > Max Trans 2648.02 ( 0.00%) 5992.74 ( 126.31%)
> > > BHmean-50 Trans 2629.78 ( 0.00%) 5643.81 ( 114.61%)
> > > BHmean-95 Trans 2620.38 ( 0.00%) 5538.32 ( 111.36%)
> > > BHmean-99 Trans 2620.38 ( 0.00%) 5538.32 ( 111.36%)
> > >
> > > That's over doubling the transactions per second for that
> > > workload.
> > >
> > > Two-socket skylake running dbench4
> > > v4.17 41ab43c9
> > > Amean 1 40.85 ( 0.00%) 14.97 ( 63.36%)
> > > Amean 2 42.31 ( 0.00%) 17.33 ( 59.04%)
> > > Amean 4 53.77 ( 0.00%) 27.85 ( 48.20%)
> > > Amean 8 68.86 ( 0.00%) 43.78 ( 36.42%)
> > > Amean 16 82.62 ( 0.00%) 56.51 ( 31.60%)
> > > Amean 32 135.80 ( 0.00%) 116.06 ( 14.54%)
> > > Amean 64 737.51 ( 0.00%) 701.00 ( 4.95%)
> > > Amean 512 14996.60 ( 0.00%) 14755.05 ( 1.61%)
> > >
> > > This is reporting the average latency of operations running
> > > dbench. The series over halves the latencies. There are many
> > > examples
> > > of basic workloads that benefit heavily from the series and while
> > > I
> > > accept it may not be universal, such as the case where the
> > > graphics
> > > card needs the power and not the CPU, a straight revert is not
> > > the
> > > answer. Without the series, HWP cripples the CPU.
> > >
> >
> > That seems like a huge overstatement. HWP doesn't "cripple" the
> > CPU
> > without this series. It will certainly set lower clocks than with
> > this
> > series for workloads like you show above that utilize the CPU very
> > intermittently (i.e. they underutilize it).
>
> Dbench for example can be quite CPU intensive. When bound to a single
> core, it shows up to 80% utilisation of a single core. When unbound,
> the usage of individual cores appears low due to the migrations. It
> may
> be intermittent usage as it context switches to worker threads but
> it's
> not low utilisation either.
>
> intel_pstate also had logic for IO-boosting before HWP so the user-
> visible
> impact for some workloads is that upgrading a machine's CPU can
> result
> in regressions due to HWP. Similarly it has been observed prior to
> the
> series that specifying no_hwp often performed better. So one could
> argue
> that HWP isn't "crippled" but it did have surprising behaviour.
>
Mel Gorman <[email protected]> writes:
> On Sat, Jul 28, 2018 at 01:21:51PM -0700, Francisco Jerez wrote:
>> >> Please revert this series, it led to significant energy usage and
>> >> graphics performance regressions [1]. The reasons are roughly the ones
>> >> we discussed by e-mail off-list last April: This causes the intel_pstate
>> >> driver to decrease the EPP to zero when the workload blocks on IO
>> >> frequently enough, which for the regressing benchmarks detailed in [1]
>> >> is a symptom of the workload being heavily IO-bound, which means they
>> >> won't benefit at all from the EPP boost since they aren't significantly
>> >> CPU-bound, and they will suffer a decrease in parallelism due to the
>> >> active CPU core using a larger fraction of the TDP in order to achieve
>> >> the same work, causing the GPU to have a lower power budget available,
>> >> leading to a decrease in system performance.
>> >
>> > It slices both ways.
>>
>> I don't think it's acceptable to land an optimization that trades
>> performance of one use-case for another,
>
> The same logic applies to a revert
No, it doesn't, the responsibility of addressing the fallout from a
change that happens to hurt performance even though it was supposed to
improve it lies on the author of the change, not on the reporter of the
regression.
> but that aside, I see that there is at least one patch floating around
> to disable HWP Boost for desktops and laptops. Maybe that'll be
> sufficient for the cases where IGP is a major component.
>
>> especially since one could make
>> both use-cases happy by avoiding the boost in cases where we know
>> beforehand that we aren't going to achieve any improvement in
>> performance, because an application waiting frequently on an IO device
>> which is 100% utilized isn't going to run faster just because we ramp up
>> the CPU frequency, since the IO device won't be able to process requests
>> from the application faster anyway, so we will only be pessimizing
>> energy efficiency (and potentially decreasing performance of the GPU
>> *and* of other CPU cores living on the same package for no benefit).
>>
>
> The benchmarks in question are not necessarily utilising IO at 100% or
> IO-bound.
Exactly. That's the only reason why they are able to take advantage of
HWP boost, while the regressing graphics benchmarks are not, since they
are utilizing an IO device at 100%. Both categories of use-cases sleep
on IO-wait frequently, but only the former are authentically CPU-bound.
> One pattern is a small fsync which ends up context switching between
> the process and a journalling thread (may be dedicated thread, may be
> workqueue depending on filesystem) and the process waking again in the
> very near future on IO completion. While the workload may be single
> threaded, more than one core is in use because of how the short sleeps
> migrate the task to other cores. HWP does not necessarily notice that
> the task is quite CPU-intensive due to the migrations and so the
> performance suffers.
>
> Some effort is made to minimise the number of cores used with this sort
> of waker/wakee relationship but it's not necessarily enough for HWP to
> boost the frequency. Minimally, the journalling thread woken up will
> not wake on the same CPU as the IO issuer except under extremely heavily
> utilisation and this is not likely to change (stacking stacks too often
> increases wakeup latency).
>
The task scheduler does go through the effort of attempting to re-use
the most frequently active CPU when a task wakes up, at least last time
I checked. But yes some migration patterns can exacerbate the downward
bias of the response of the HWP to an intermittent workload, primarily
in cases where the application is unable to take advantage of the
parallelism between CPU and the IO device involved, like you're
describing above.
>> > With the series, there are large boosts to performance on other
>> > workloads where a slight increase in power usage is acceptable in
>> > exchange for performance. For example,
>> >
>> > Single socket skylake running sqlite
>> > v4.17 41ab43c9
>> > Min Trans 2580.85 ( 0.00%) 5401.58 ( 109.29%)
>> > Hmean Trans 2610.38 ( 0.00%) 5518.36 ( 111.40%)
>> > Stddev Trans 28.08 ( 0.00%) 208.90 (-644.02%)
>> > CoeffVar Trans 1.08 ( 0.00%) 3.78 (-251.57%)
>> > Max Trans 2648.02 ( 0.00%) 5992.74 ( 126.31%)
>> > BHmean-50 Trans 2629.78 ( 0.00%) 5643.81 ( 114.61%)
>> > BHmean-95 Trans 2620.38 ( 0.00%) 5538.32 ( 111.36%)
>> > BHmean-99 Trans 2620.38 ( 0.00%) 5538.32 ( 111.36%)
>> >
>> > That's over doubling the transactions per second for that workload.
>> >
>> > Two-socket skylake running dbench4
>> > v4.17 41ab43c9
>> > Amean 1 40.85 ( 0.00%) 14.97 ( 63.36%)
>> > Amean 2 42.31 ( 0.00%) 17.33 ( 59.04%)
>> > Amean 4 53.77 ( 0.00%) 27.85 ( 48.20%)
>> > Amean 8 68.86 ( 0.00%) 43.78 ( 36.42%)
>> > Amean 16 82.62 ( 0.00%) 56.51 ( 31.60%)
>> > Amean 32 135.80 ( 0.00%) 116.06 ( 14.54%)
>> > Amean 64 737.51 ( 0.00%) 701.00 ( 4.95%)
>> > Amean 512 14996.60 ( 0.00%) 14755.05 ( 1.61%)
>> >
>> > This is reporting the average latency of operations running
>> > dbench. The series over halves the latencies. There are many examples
>> > of basic workloads that benefit heavily from the series and while I
>> > accept it may not be universal, such as the case where the graphics
>> > card needs the power and not the CPU, a straight revert is not the
>> > answer. Without the series, HWP cripples the CPU.
>> >
>>
>> That seems like a huge overstatement. HWP doesn't "cripple" the CPU
>> without this series. It will certainly set lower clocks than with this
>> series for workloads like you show above that utilize the CPU very
>> intermittently (i.e. they underutilize it).
>
> Dbench for example can be quite CPU intensive. When bound to a single
> core, it shows up to 80% utilisation of a single core.
So even with an oracle cpufreq governor able to guess that the
application relies on the CPU being locked to the maximum frequency
despite it utilizing less than 80% of the CPU cycles, the application
will still perform 20% worse than an alternative application handling
its IO work asynchronously.
> When unbound, the usage of individual cores appears low due to the
> migrations. It may be intermittent usage as it context switches to
> worker threads but it's not low utilisation either.
>
> intel_pstate also had logic for IO-boosting before HWP
The IO-boosting logic of the intel_pstate governor has the same flaw as
this unfortunately.
> so the user-visible impact for some workloads is that upgrading a
> machine's CPU can result in regressions due to HWP. Similarly it has
> been observed prior to the series that specifying no_hwp often
> performed better. So one could argue that HWP isn't "crippled" but it
> did have surprising behaviour.
>
> --
> Mel Gorman
> SUSE Labs
On Mon, 2018-07-30 at 11:32 -0700, Francisco Jerez wrote:
> Mel Gorman <[email protected]> writes:
>
> > On Sat, Jul 28, 2018 at 01:21:51PM -0700, Francisco Jerez wrote:
> > > > > Please revert this series, it led to significant energy usage and
> > > > > graphics performance regressions [1]. The reasons are roughly the ones
> > > > > we discussed by e-mail off-list last April: This causes the intel_pstate
> > > > > driver to decrease the EPP to zero when the workload blocks on IO
> > > > > frequently enough, which for the regressing benchmarks detailed in [1]
> > > > > is a symptom of the workload being heavily IO-bound, which means they
> > > > > won't benefit at all from the EPP boost since they aren't significantly
> > > > > CPU-bound, and they will suffer a decrease in parallelism due to the
> > > > > active CPU core using a larger fraction of the TDP in order to achieve
> > > > > the same work, causing the GPU to have a lower power budget available,
> > > > > leading to a decrease in system performance.
> > > >
> > > > It slices both ways.
> > >
> > > I don't think it's acceptable to land an optimization that trades
> > > performance of one use-case for another,
> >
> > The same logic applies to a revert
>
> No, it doesn't, the responsibility of addressing the fallout from a
> change that happens to hurt performance even though it was supposed to
> improve it lies on the author of the change, not on the reporter of the
> regression.
The server and desktop worlds have different characteristics and needs, which
in this particular case appear to be conflicting. Luckily we can
differentiate the two scenarios (as in the bugfix patch by Srinivas a few
hours ago).
> The task scheduler does go through the effort of attempting to re-use
> the most frequently active CPU when a task wakes up, at least last time
> I checked.
Unfortunately that doesn't happen in practice; the load balancer in the
scheduler is in a constant tension between spreading tasks evenly across all
cores (necessary when the machine is under heavy load) and packing on just a
few (which makes more sense when only a few threads are working and the box is
almost idle otherwise). Recent evolutions favour spreading. We often observe
tasks helplessly bounce from core to core losing all their accrued utilisation
score, and intel_pstate (with or without HWP) penalizes that.
On Mon, 2018-07-30 at 11:32 -0700, Francisco Jerez wrote:
> Mel Gorman <[email protected]> writes:
>
> [...]
> > One pattern is a small fsync which ends up context switching between
> > the process and a journalling thread (may be dedicated thread, may be
> > workqueue depending on filesystem) and the process waking again in the
> > very near future on IO completion. While the workload may be single
> > threaded, more than one core is in use because of how the short sleeps
> > migrate the task to other cores. HWP does not necessarily notice that
> > the task is quite CPU-intensive due to the migrations and so the
> > performance suffers.
> >
> > Some effort is made to minimise the number of cores used with this sort
> > of waker/wakee relationship but it's not necessarily enough for HWP to
> > boost the frequency. Minimally, the journalling thread woken up will
> > not wake on the same CPU as the IO issuer except under extremely heavily
> > utilisation and this is not likely to change (stacking stacks too often
> > increases wakeup latency).
> >
>
> The task scheduler does go through the effort of attempting to re-use
> the most frequently active CPU when a task wakes up, at least last time
> I checked. But yes some migration patterns can exacerbate the downward
> bias of the response of the HWP to an intermittent workload, primarily
> in cases where the application is unable to take advantage of the
> parallelism between CPU and the IO device involved, like you're
> describing above.
Unfortunately that doesn't happen in practice; the load balancer in the
scheduler is in a constant tension between spreading tasks evenly across all
cores (necessary when the machine is under heavy load) and packing on just a
few (which makes more sense when only a few threads are working and the box is
idle otherwise). Recent evolutions favour spreading. We often observe tasks
helplessly bounce from core to core losing all their accrued utilization
score, and intel_pstate (with or without HWP) penalizes that.
That's why in our distro SLES-15 (which is based on 4.12.14) we're sporting a
patch like this:
https://kernel.opensuse.org/cgit/kernel/commit/?h=SLE15&id=3a287868cb7a9 which
boosts tasks that have been placed on a previously idle CPU. We haven't even
proposed this patch upstream as we hope to solve those problems at a more
fundamental level, but when you're supporting power management (freq scaling)
in the server world you get compared to the performance governor, so your
policy needs to be aggressive.
>
> > > > With the series, there are large boosts to performance on other
> > > > workloads where a slight increase in power usage is acceptable in
> > > > exchange for performance. For example,
> > > >
> > > > Single socket skylake running sqlite
> > > > v4.17 41ab43c9
> > > > Min Trans 2580.85 ( 0.00%) 5401.58 ( 109.29%)
> > > > Hmean Trans 2610.38 ( 0.00%) 5518.36 ( 111.40%)
> > > > Stddev Trans 28.08 ( 0.00%) 208.90 (-644.02%)
> > > > CoeffVar Trans 1.08 ( 0.00%) 3.78 (-251.57%)
> > > > Max Trans 2648.02 ( 0.00%) 5992.74 ( 126.31%)
> > > > BHmean-50 Trans 2629.78 ( 0.00%) 5643.81 ( 114.61%)
> > > > BHmean-95 Trans 2620.38 ( 0.00%) 5538.32 ( 111.36%)
> > > > BHmean-99 Trans 2620.38 ( 0.00%) 5538.32 ( 111.36%)
> > > >
> > > > That's over doubling the transactions per second for that workload.
> > > >
> > > > Two-socket skylake running dbench4
> > > > v4.17 41ab43c9
> > > > Amean 1 40.85 ( 0.00%) 14.97 ( 63.36%)
> > > > Amean 2 42.31 ( 0.00%) 17.33 ( 59.04%)
> > > > Amean 4 53.77 ( 0.00%) 27.85 ( 48.20%)
> > > > Amean 8 68.86 ( 0.00%) 43.78 ( 36.42%)
> > > > Amean 16 82.62 ( 0.00%) 56.51 ( 31.60%)
> > > > Amean 32 135.80 ( 0.00%) 116.06 ( 14.54%)
> > > > Amean 64 737.51 ( 0.00%) 701.00 ( 4.95%)
> > > > Amean 512 14996.60 ( 0.00%) 14755.05 ( 1.61%)
> > > >
> > > > This is reporting the average latency of operations running
> > > > dbench. The series over halves the latencies. There are many examples
> > > > of basic workloads that benefit heavily from the series and while I
> > > > accept it may not be universal, such as the case where the graphics
> > > > card needs the power and not the CPU, a straight revert is not the
> > > > answer. Without the series, HWP cripples the CPU.
> > > >
> > >
> > > That seems like a huge overstatement. HWP doesn't "cripple" the CPU
> > > without this series. It will certainly set lower clocks than with this
> > > series for workloads like you show above that utilize the CPU very
> > > intermittently (i.e. they underutilize it).
> >
> > Dbench for example can be quite CPU intensive. When bound to a single
> > core, it shows up to 80% utilisation of a single core.
>
> So even with an oracle cpufreq governor able to guess that the
> application relies on the CPU being locked to the maximum frequency
> despite it utilizing less than 80% of the CPU cycles, the application
> will still perform 20% worse than an alternative application handling
> its IO work asynchronously.
It's a matter of being pragmatic. You're saying that a given application is
badly designed and should be rewritten to leverage parallelism between CPU and
IO. But in the field you *have* applications that behave that way, and the OS
is in a position to do something to mitigate the damage.
>
> > When unbound, the usage of individual cores appears low due to the
> > migrations. It may be intermittent usage as it context switches to
> > worker threads but it's not low utilisation either.
> >
> > intel_pstate also had logic for IO-boosting before HWP
>
> The IO-boosting logic of the intel_pstate governor has the same flaw as
> this unfortunately.
>
Again it's a matter of pragmatism. You'll find that another governor uses
IO-boosting: schedutil. And while intel_pstate needs it because it gets
otherwise killed by migrating tasks, schedutil is based on the PELT
utilization signal and doesn't have that problem at all. The principle there
is plain and simple: if I've been "wasting time" waiting on "slow" IO (disk),
that probably means I'm getting late and there is soon some compute to do:
better catch up on the lost time and speed up. IO-wait boosting on schedutil
was discussed at
https://lore.kernel.org/lkml/[email protected]/
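
For reference, the ramp-up/decay idea behind schedutil's IO-wait boosting
looks roughly like the following. This is a deliberately simplified sketch in
plain C; the real code lives in kernel/sched/cpufreq_schedutil.c and operates
on per-CPU state in capacity units, so the names and units here are
illustrative only:

/* Simplified sketch of an IO-wait boost that ramps up and decays. */
struct iowait_boost {
	unsigned int boost;	/* current boost value */
	unsigned int min;	/* initial boost on the first IO-wait wakeup */
	unsigned int max;	/* cap, e.g. the full capacity of the CPU */
};

/* Called from the frequency-update path; in_iowait tells whether the task
 * that just woke up had been sleeping on IO. */
unsigned int iowait_boost_apply(struct iowait_boost *b, int in_iowait,
				unsigned int util)
{
	if (in_iowait) {
		/* Double the boost on consecutive IO-wait wakeups, up to max. */
		b->boost = b->boost ? 2 * b->boost : b->min;
		if (b->boost > b->max)
			b->boost = b->max;
	} else if (b->boost) {
		/* Decay by half once the CPU stops seeing IO-wait wakeups. */
		b->boost /= 2;
		if (b->boost < b->min)
			b->boost = 0;
	}

	/* Frequency selection then uses the larger of utilization and boost. */
	return util > b->boost ? util : b->boost;
}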
Giovanni Gherdovich
SUSE Labs
On Mon, Jul 30, 2018 at 07:06:21AM -0700, Srinivas Pandruvada wrote:
> On Mon, 2018-07-30 at 14:16 +0300, Eero Tamminen wrote:
> > Questions:
> >
> > * Does currently kernel CPU freq management have any idea which IO
> > devices share TDP with the CPU cores?
> No. The requests we do to hardware is just indication only (HW can
> ignore it). The HW has a bias register to adjust and distribute power
> among users.
> We can have several other active device on servers beside CPU which
> when runs need extra power. So the HW arbitrates power.
That's not entirely accurate AFAIK. "No" is accurate for Intel, but the
ARM people have their IPA thing (not a beer):
https://developer.arm.com/open-source/intelligent-power-allocation
drivers/thermal/power_allocator.c
which IIUC interacts with their cpufreq driver to disallow certain OPP
states.
And note that I discourage intel_pstate active mode.
On Tue, 2018-07-31 at 17:04 +0200, Peter Zijlstra wrote:
> On Mon, Jul 30, 2018 at 07:06:21AM -0700, Srinivas Pandruvada wrote:
> > On Mon, 2018-07-30 at 14:16 +0300, Eero Tamminen wrote:
> > > Questions:
> > >
> > > * Does currently kernel CPU freq management have any idea which
> > > IO
> > > devices share TDP with the CPU cores?
> > No. The requests we do to hardware is just indication only (HW can
> > ignore it). The HW has a bias register to adjust and distribute
> > power
> > among users.
> > We can have several other active device on servers beside CPU which
> > when runs need extra power. So the HW arbitrates power.
>
> That's not entirely accurate AFAIK. "No" is accurate for Intel, but
> the
> ARM people have their IPA thing (not a beer):
We also have that for user space programs (e.g. KBL-G).
Thanks,
Srinivas
>
> https://developer.arm.com/open-source/intelligent-power-allocation
>
> drivers/thermal/power_allocator.c
>
> which IIUC interacts with their cpufreq driver to disallow certain
> OPP
> states.
>
> And note that I discourage intel_pstate active mode.
Giovanni Gherdovich <[email protected]> writes:
> On Mon, 2018-07-30 at 11:32 -0700, Francisco Jerez wrote:
>> Mel Gorman <[email protected]> writes:
>>
>> > On Sat, Jul 28, 2018 at 01:21:51PM -0700, Francisco Jerez wrote:
>> > > > > Please revert this series, it led to significant energy usage and
>> > > > > graphics performance regressions [1]. The reasons are roughly the ones
>> > > > > we discussed by e-mail off-list last April: This causes the intel_pstate
>> > > > > driver to decrease the EPP to zero when the workload blocks on IO
>> > > > > frequently enough, which for the regressing benchmarks detailed in [1]
>> > > > > is a symptom of the workload being heavily IO-bound, which means they
>> > > > > won't benefit at all from the EPP boost since they aren't significantly
>> > > > > CPU-bound, and they will suffer a decrease in parallelism due to the
>> > > > > active CPU core using a larger fraction of the TDP in order to achieve
>> > > > > the same work, causing the GPU to have a lower power budget available,
>> > > > > leading to a decrease in system performance.
>> > > >
>> > > > It slices both ways.
>> > >
>> > > I don't think it's acceptable to land an optimization that trades
>> > > performance of one use-case for another,
>> >
>> > The same logic applies to a revert
>>
>> No, it doesn't, the responsibility of addressing the fallout from a
>> change that happens to hurt performance even though it was supposed to
>> improve it lies on the author of the change, not on the reporter of the
>> regression.
>
> The server and desktop worlds have different characteristics and needs, which
> in this particular case appear to be conflicting. Luckily we can can
> differentiate the two scenarios (as in the bugfix patch by Srinivas a few
> hours ago).
>
I'm skeptical that the needs of the server and desktop world are really
as different and conflicting as this discussion may make them seem. In
a server environment it can be as much (if not more) of a factor how
many requests the system can process per second rather than the latency
of any individual request (since the latency of the network can easily
be an order of magnitude higher than the latency reduction that can
possibly be achieved by tricking the HWP into reacting faster). I'm not
convinced about the usefulness of trading the former for the latter in a
server environment, particularly since we could achieve both goals
simultaneously.
>> The task scheduler does go through the effort of attempting to re-use
>> the most frequently active CPU when a task wakes up, at least last time
>> I checked.
>
> Unfortunately that doesn't happen in practice; the load balancer in the
> scheduler is in a constant tension between spreading tasks evenly across all
> cores (necessary when the machine is under heavy load) and packing on just a
> few (which makes more sense when only a few threads are working and the box is
> almost idle otherwise). Recent evolutions favour spreading. We often observe
> tasks helplessly bounce from core to core losing all their accrued utilisation
> score, and intel_pstate (with or without HWP) penalizes that.
>
That's unfortunate. Luckily it's easy enough for the cpufreq governor
to differentiate those cases from the applications that have enough
parallelism to utilize at least one system resource close to its maximum
throughput and become non-CPU-bound.
> On Mon, 2018-07-30 at 11:32 -0700, Francisco Jerez wrote:
>> Mel Gorman <[email protected]> writes:
>>
>> [...]
>> > One pattern is a small fsync which ends up context switching between
>> > the process and a journalling thread (may be dedicated thread, may be
>> > workqueue depending on filesystem) and the process waking again in the
>> > very near future on IO completion. While the workload may be single
>> > threaded, more than one core is in use because of how the short sleeps
>> > migrate the task to other cores. HWP does not necessarily notice that
>> > the task is quite CPU-intensive due to the migrations and so the
>> > performance suffers.
>> >
>> > Some effort is made to minimise the number of cores used with this sort
>> > of waker/wakee relationship but it's not necessarily enough for HWP to
>> > boost the frequency. Minimally, the journalling thread woken up will
>> > not wake on the same CPU as the IO issuer except under extremely heavily
>> > utilisation and this is not likely to change (stacking stacks too often
>> > increases wakeup latency).
>> >
>>
>> The task scheduler does go through the effort of attempting to re-use
>> the most frequently active CPU when a task wakes up, at least last time
>> I checked. But yes some migration patterns can exacerbate the downward
>> bias of the response of the HWP to an intermittent workload, primarily
>> in cases where the application is unable to take advantage of the
>> parallelism between CPU and the IO device involved, like you're
>> describing above.
>
> Unfortunately that doesn't happen in practice; the load balancer in the
> scheduler is in a constant tension between spreading tasks evenly across all
> cores (necessary when the machine is under heavy load) and packing on just a
> few (which makes more sense when only a few threads are working and the box is
> idle otherwise). Recent evolutions favour spreading. We often observe tasks
> helplessly bounce from core to core losing all their accrued utilization
> score, and intel_pstate (with or without HWP) penalizes that.
>
> That's why in our distro SLES-15 (which is based on 4.12.14) we're sporting a
> patch like this:
> https://kernel.opensuse.org/cgit/kernel/commit/?h=SLE15&id=3a287868cb7a9 which
> boosts tasks that have been placed on a previously idle CPU. We haven't even
> proposed this patch upstream as we hope to solve those problems at a more
> fundamental level, but when you're supporting power management (freq scaling)
> in the server world you get compared to the performance governor, so your
> policy needs to be aggressive.
>
>>
>> > > > With the series, there are large boosts to performance on other
>> > > > workloads where a slight increase in power usage is acceptable in
>> > > > exchange for performance. For example,
>> > > >
>> > > > Single socket skylake running sqlite
>> > > > v4.17 41ab43c9
>> > > > Min Trans 2580.85 ( 0.00%) 5401.58 ( 109.29%)
>> > > > Hmean Trans 2610.38 ( 0.00%) 5518.36 ( 111.40%)
>> > > > Stddev Trans 28.08 ( 0.00%) 208.90 (-644.02%)
>> > > > CoeffVar Trans 1.08 ( 0.00%) 3.78 (-251.57%)
>> > > > Max Trans 2648.02 ( 0.00%) 5992.74 ( 126.31%)
>> > > > BHmean-50 Trans 2629.78 ( 0.00%) 5643.81 ( 114.61%)
>> > > > BHmean-95 Trans 2620.38 ( 0.00%) 5538.32 ( 111.36%)
>> > > > BHmean-99 Trans 2620.38 ( 0.00%) 5538.32 ( 111.36%)
>> > > >
>> > > > That's over doubling the transactions per second for that workload.
>> > > >
>> > > > Two-socket skylake running dbench4
>> > > > v4.17 41ab43c9
>> > > > Amean 1 40.85 ( 0.00%) 14.97 ( 63.36%)
>> > > > Amean 2 42.31 ( 0.00%) 17.33 ( 59.04%)
>> > > > Amean 4 53.77 ( 0.00%) 27.85 ( 48.20%)
>> > > > Amean 8 68.86 ( 0.00%) 43.78 ( 36.42%)
>> > > > Amean 16 82.62 ( 0.00%) 56.51 ( 31.60%)
>> > > > Amean 32 135.80 ( 0.00%) 116.06 ( 14.54%)
>> > > > Amean 64 737.51 ( 0.00%) 701.00 ( 4.95%)
>> > > > Amean 512 14996.60 ( 0.00%) 14755.05 ( 1.61%)
>> > > >
>> > > > This is reporting the average latency of operations running
>> > > > dbench. The series more than halves the latencies. There are many
>> > > > examples of basic workloads that benefit heavily from the series
>> > > > and, while I accept it may not be universal (such as the case where
>> > > > the graphics card needs the power and not the CPU), a straight
>> > > > revert is not the answer. Without the series, HWP cripples the CPU.
>> > > >
>> > >
>> > > That seems like a huge overstatement. HWP doesn't "cripple" the CPU
>> > > without this series. It will certainly set lower clocks than with this
>> > > series for workloads like you show above that utilize the CPU very
>> > > intermittently (i.e. they underutilize it).
>> >
>> > Dbench for example can be quite CPU intensive. When bound to a single
>> > core, it shows up to 80% utilisation of a single core.
>>
>> So even with an oracle cpufreq governor able to guess that the
>> application relies on the CPU being locked to the maximum frequency
>> despite it utilizing less than 80% of the CPU cycles, the application
>> will still perform 20% worse than an alternative application handling
>> its IO work asynchronously.
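The 20% figure follows from simple arithmetic, taking the 80% single-core utilisation mentioned above and comparing against an idealized application whose IO is fully overlapped with computation (the fractions are illustrative only):

#include <stdio.h>

int main(void)
{
	double busy_fraction = 0.80;        /* CPU busy share at max frequency */

	double serialized = busy_fraction;  /* compute waits for IO            */
	double overlapped = 1.0;            /* IO fully hidden behind compute  */

	printf("relative loss: %.0f%%\n",
	       100.0 * (overlapped - serialized) / overlapped); /* 20% */
	return 0;
}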
>
> It's a matter of being pragmatic. You're saying that a given application is
> badly designed and should be rewritten to leverage parallelism between CPU and
> IO.
Maybe some of them should be rewritten, but that wasn't what I was trying
to say. The point was that the kind of applications that benefit from
boosting on IO wait are necessarily within this category of workloads
that aren't able to take full advantage of any system resource anyway.
It's not like HWP would be "crippling" the CPU for a well-behaved
application.
> But in the field you *have* applications that behave that way, and the OS
> is in a position to do something to mitigate the damage.
>
Even when such a mitigation may actually reduce the performance of the
*same* applications when they are TDP-bound and the parallelism of the
system is limited by its energy usage? I'm not objecting to optimizing
for latency-sensitive applications a priori, but such optimizations
shouldn't be applied unless we have an indication that the performance
of the system can possibly improve as a result (e.g. because the
application doesn't already have a bottleneck on some IO device).
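A toy model of the TDP-bound case (the budget, core count and the cubic power/frequency approximation are all illustrative, nothing below is measured data): with a fixed package power budget, boosting one core necessarily slows down the others.

#include <math.h>
#include <stdio.h>

int main(void)
{
	double budget = 4.0;    /* arbitrary power units shared by 4 cores  */
	double exponent = 3.0;  /* P ~ f^3 is a common rough approximation  */

	/* All cores treated equally: each gets 1.0 unit -> relative f 1.0 */
	double f_equal = pow(budget / 4.0, 1.0 / exponent);

	/* Boost one core to consume 2.0 units, split the rest three ways. */
	double f_boosted = pow(2.0, 1.0 / exponent);
	double f_others  = pow((budget - 2.0) / 3.0, 1.0 / exponent);

	printf("equal:   all cores at relative frequency %.2f\n", f_equal);
	printf("boosted: one core at %.2f, the other three at %.2f\n",
	       f_boosted, f_others);
	return 0;
}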
>>
>> > When unbound, the usage of individual cores appears low due to the
>> > migrations. It may be intermittent usage as it context switches to
>> > worker threads but it's not low utilisation either.
>> >
>> > intel_pstate also had logic for IO-boosting before HWP
>>
>> The IO-boosting logic of the intel_pstate governor has the same flaw as
>> this unfortunately.
>>
>
> Again, it's a matter of pragmatism. You'll find that another governor uses
> IO-boosting: schedutil. And while intel_pstate needs it because it otherwise
> gets killed by migrating tasks,
Right, I've been working on an alternative to that.
> schedutil is based on the PELT utilization signal and doesn't have
> that problem at all.
Yes, I agree that the reaction time of PELT can be superior to HWP at
times.
> The principle there is plain and simple: if I've been "wasting time"
> waiting on "slow" IO (disk), that probably means I'm getting late and
> there is soon some compute to do: better catch up on the lost time and
> speed up. IO-wait boosting on schedutil was discussed at
> https://lore.kernel.org/lkml/[email protected]/
>
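For readers following along, the IO-wait boosting idea can be sketched roughly like this; it is a simplification, not the actual schedutil code, and the constants and names are made up:

#include <stdbool.h>
#include <stdio.h>

#define BOOST_MIN	128
#define BOOST_MAX	1024	/* full scale, i.e. "behave as if 100% utilized" */

static unsigned int iowait_boost;

static void update_iowait_boost(bool io_wakeup)
{
	if (io_wakeup) {
		if (!iowait_boost)
			iowait_boost = BOOST_MIN;
		else if (iowait_boost < BOOST_MAX)
			iowait_boost *= 2;
	} else if (iowait_boost) {
		iowait_boost /= 2;
		if (iowait_boost < BOOST_MIN)
			iowait_boost = 0;
	}
	/* The governor then uses max(util, iowait_boost) to pick a frequency. */
}

int main(void)
{
	/* A few IO wakeups in a row ramp the boost up to full scale... */
	for (int i = 0; i < 4; i++) {
		update_iowait_boost(true);
		printf("after IO wakeup:   boost = %u\n", iowait_boost);
	}
	/* ...and it decays once the IO wakeups stop. */
	for (int i = 0; i < 3; i++) {
		update_iowait_boost(false);
		printf("after idle update: boost = %u\n", iowait_boost);
	}
	return 0;
}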
That principle is nowhere close to a universal rule. Waiting on an IO
device repeatedly is often a sign that the IO device is itself
overloaded and the CPU is running at an unnecessarily high frequency
(which means lower parallelism than optimal while TDP-bound), since
otherwise the CPU wouldn't need to wait on the IO device as frequently.
In such cases IOWAIT boosting gives you the opposite of what the
application needs.
>
> Giovanni Gherdovich
> SUSE Labs