This series addresses some performance concerns, particularly with IO
workloads (reported by Mel Gorman), when HWP is used with the intel_pstate
powersave policy.
Background
HWP performance can be controlled by user space via the sysfs interface for
max/min frequency limits and energy performance preference (EPP) settings.
These can be adjusted from user space based on workload characteristics, but
the kernel does not change these limits dynamically based on the workload.
By default HWP uses an energy performance preference value of 0x80 on the
majority of platforms (the scale is 0-255; 0 is max performance, 255 is min).
This value offers the best performance/watt, and for the majority of server
workloads performance doesn't suffer. Users also always have the option to use
the performance policy of intel_pstate to get the best performance. But users
tend to run with the out-of-the-box configuration, which is the powersave
policy on most distros.
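For reference, the HWP Request MSR fields manipulated throughout this series
are min perf (bits 7:0), max perf (bits 15:8), desired perf (bits 23:16) and
EPP (bits 31:24). A minimal illustrative sketch of composing such a value;
compose_hwp_request() is a hypothetical helper, not part of the series:
/* Illustration only: build an IA32_HWP_REQUEST value from its fields. */
static u64 compose_hwp_request(u8 min_perf, u8 max_perf, u8 desired, u8 epp)
{
	return ((u64)epp << 24) | ((u64)desired << 16) |
	       ((u64)max_perf << 8) | min_perf;
}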
In some cases it is possible to adjust performance dynamically, for example
when a CPU is woken up due to IO completion or a thread migrates to a new CPU.
In these cases the HWP algorithm takes some time to build up utilization and
ramp up P-states, which may result in lower performance for some IO workloads
and for workloads that tend to migrate. The idea of this patch series is to
temporarily boost performance dynamically in these cases. This is only
applicable when the user is using the powersave policy, not the performance
policy.
Results on a Skylake server:
Benchmark Improvement %
----------------------------------------------------------------------
dbench 50.36
thread IO bench (tiobench) 10.35
File IO 9.81
sqlite 15.76
X264 -104 cores 9.75
Spec Power (Negligible impact 7382 Vs. 7378)
Idle Power No change observed
-----------------------------------------------------------------------
HWP delivers the best performance/watt at EPP=0x80. Since we boost EPP here
to 0, performance/watt drops by up to 10%, so there is a power penalty to
these changes.
Mel Gorman also provided test results on a prior patchset, which show the
benefits of this series.
Peter Zijlstra (1):
x86,sched: Add support for frequency invariance
Srinivas Pandruvada (9):
cpufreq: intel_pstate: Conditional frequency invariant accounting
cpufreq: intel_pstate: Utility functions to boost HWP performance
limits
cpufreq: intel_pstate: Add update_util_hook for HWP
cpufreq: intel_pstate: HWP boost performance on IO Wake
cpufreq / sched: Add interface to get utilization values
cpufreq: intel_pstate: HWP boost performance on busy task migrate
cpufreq: intel_pstate: Dynamically update busy pct
cpufreq: intel_pstate: New sysfs entry to control HWP boost
cpufreq: intel_pstate: enable boost for SKX
arch/x86/include/asm/topology.h | 29 +++++
arch/x86/kernel/smpboot.c | 196 +++++++++++++++++++++++++++++-
drivers/cpufreq/intel_pstate.c | 260 +++++++++++++++++++++++++++++++++++++++-
include/linux/sched/cpufreq.h | 2 +
kernel/sched/core.c | 1 +
kernel/sched/cpufreq.c | 23 ++++
kernel/sched/sched.h | 7 ++
7 files changed, 513 insertions(+), 5 deletions(-)
--
2.9.5
When HWP dynamic boost is active, set the HWP-specific update_util hook.
Also start and stop frequency invariant accounting processing based on the
HWP dynamic boost setting.
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 26 ++++++++++++++++++++++----
1 file changed, 22 insertions(+), 4 deletions(-)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index dc7dfa9..e200887 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -290,6 +290,7 @@ static struct pstate_funcs pstate_funcs __read_mostly;
static int hwp_active __read_mostly;
static bool per_cpu_limits __read_mostly;
+static bool hwp_boost __read_mostly;
static struct cpufreq_driver *intel_pstate_driver __read_mostly;
@@ -1420,6 +1421,12 @@ static void csd_init(struct cpudata *cpu)
cpu->csd.info = cpu;
}
+static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
+ u64 time, unsigned int flags)
+{
+
+}
+
static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu)
{
struct sample *sample = &cpu->sample;
@@ -1723,7 +1730,7 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
{
struct cpudata *cpu = all_cpu_data[cpu_num];
- if (hwp_active)
+ if (hwp_active && !hwp_boost)
return;
if (cpu->update_util_set)
@@ -1731,8 +1738,12 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
/* Prevent intel_pstate_update_util() from using stale data. */
cpu->sample.time = 0;
- cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
- intel_pstate_update_util);
+ if (hwp_active)
+ cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
+ intel_pstate_update_util_hwp);
+ else
+ cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
+ intel_pstate_update_util);
cpu->update_util_set = true;
}
@@ -1844,8 +1855,15 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
intel_pstate_set_update_util_hook(policy->cpu);
}
- if (hwp_active)
+ if (hwp_active) {
+ if (hwp_boost) {
+ x86_arch_scale_freq_tick_enable();
+ } else {
+ intel_pstate_clear_update_util_hook(policy->cpu);
+ x86_arch_scale_freq_tick_disable();
+ }
intel_pstate_hwp_set(policy->cpu);
+ }
mutex_unlock(&intel_pstate_limits_lock);
--
2.9.5
Calculate hwp_boost_threshold_busy_pct (the task busy percentage that is
worth boosting for) and hwp_boost_pstate_threshold (don't boost if the CPU
already delivers some performance) based on the platform's min, max and
turbo frequencies.
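To make the formulas concrete, here is a worked example with hypothetical
frequencies (illustrative numbers, not taken from a specific part):
/*
 * Hypothetical part: min 1.0 GHz, P1 (max non-turbo) 2.7 GHz,
 * 1-core turbo 3.7 GHz; the driver works in kHz.
 *
 *   hwp_boost_threshold_busy_pct = 1000000 * 100 / 3700000 + 20
 *                                = 27 + 20 = 47
 *   hwp_boost_pstate_threshold   = 2700000 * 1024 / 3700000 = 747
 *
 * So boosting is skipped below ~47% busy, or when the CPU is already
 * running above ~73% of the turbo range (roughly P1).
 */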
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 40 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 39 insertions(+), 1 deletion(-)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index ec455af..c43edce 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1463,6 +1463,42 @@ static inline int intel_pstate_get_sched_util(struct cpudata *cpu)
return util * 100 / max;
}
+
+static inline void intel_pstate_update_busy_threshold(struct cpudata *cpu)
+{
+ if (!hwp_boost_threshold_busy_pct) {
+ int min_freq, max_freq;
+
+ min_freq = cpu->pstate.min_pstate * cpu->pstate.scaling;
+ update_turbo_state();
+ max_freq = global.turbo_disabled || global.no_turbo ?
+ cpu->pstate.max_freq : cpu->pstate.turbo_freq;
+
+ /*
+ * We are guranteed to get atleast min P-state. If we assume
+ * P-state is proportional to load (such that 10% load
+ * increase will result in 10% P-state increase), we will
+ * get at least min P-state till we have atleast
+ * (min * 100/max) percent cpu load. So any load less than
+ * than this this we shouldn't do any boost. Then boosting
+ * is not free, we will add atleast 20% offset.
+ */
+ hwp_boost_threshold_busy_pct = min_freq * 100 / max_freq;
+ hwp_boost_threshold_busy_pct += 20;
+ pr_debug("hwp_boost_threshold_busy_pct = %d\n",
+ hwp_boost_threshold_busy_pct);
+ }
+
+ /* P1 percent out of total range of P-states */
+ if (cpu->pstate.max_freq != cpu->pstate.turbo_freq) {
+ hwp_boost_pstate_threshold =
+ cpu->pstate.max_freq * SCHED_CAPACITY_SCALE / cpu->pstate.turbo_freq;
+ pr_debug("hwp_boost_pstate_threshold = %d\n",
+ hwp_boost_pstate_threshold);
+ }
+
+}
+
static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
u64 time, unsigned int flags)
{
@@ -2061,8 +2097,10 @@ static int __intel_pstate_cpu_init(struct cpufreq_policy *policy)
policy->fast_switch_possible = true;
- if (hwp_active)
+ if (hwp_active) {
csd_init(cpu);
+ intel_pstate_update_busy_threshold(cpu);
+ }
return 0;
}
--
2.9.5
A new attribute, hwp_dynamic_boost, is added to the intel_pstate sysfs
directory to enable/disable HWP dynamic performance boost.
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index c43edce..65d11d2 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1034,6 +1034,30 @@ static ssize_t store_min_perf_pct(struct kobject *a, struct attribute *b,
return count;
}
+static ssize_t show_hwp_dynamic_boost(struct kobject *kobj,
+ struct attribute *attr, char *buf)
+{
+ return sprintf(buf, "%u\n", hwp_boost);
+}
+
+static ssize_t store_hwp_dynamic_boost(struct kobject *a, struct attribute *b,
+ const char *buf, size_t count)
+{
+ unsigned int input;
+ int ret;
+
+ ret = kstrtouint(buf, 10, &input);
+ if (ret)
+ return ret;
+
+ mutex_lock(&intel_pstate_driver_lock);
+ hwp_boost = !!input;
+ intel_pstate_update_policies();
+ mutex_unlock(&intel_pstate_driver_lock);
+
+ return count;
+}
+
show_one(max_perf_pct, max_perf_pct);
show_one(min_perf_pct, min_perf_pct);
@@ -1043,6 +1067,7 @@ define_one_global_rw(max_perf_pct);
define_one_global_rw(min_perf_pct);
define_one_global_ro(turbo_pct);
define_one_global_ro(num_pstates);
+define_one_global_rw(hwp_dynamic_boost);
static struct attribute *intel_pstate_attributes[] = {
&status.attr,
@@ -1083,6 +1108,11 @@ static void __init intel_pstate_sysfs_expose_params(void)
rc = sysfs_create_file(intel_pstate_kobject, &min_perf_pct.attr);
WARN_ON(rc);
+ if (hwp_active) {
+ rc = sysfs_create_file(intel_pstate_kobject,
+ &hwp_dynamic_boost.attr);
+ WARN_ON(rc);
+ }
}
/************************** sysfs end ************************/
--
2.9.5
Set up the necessary infrastructure to be able to boost HWP performance on a
remote CPU. First, initialize the data structure needed to use
smp_call_function_single_async(). The boost-up function simply sets HWP min
to the HWP max value and EPP to 0. The boost-down function simply restores
the last cached HWP Request value.
To avoid reading the HWP Request MSR during a dynamic update, the HWP Request
MSR value is cached in local memory. This caching is done whenever the HWP
Request MSR is modified, during driver init or in the setpolicy() callback
path.
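As a worked example with a hypothetical cached register value:
/*
 * Hypothetical cached value: EPP = 0x80, desired = 0, max perf = 0x2F,
 * min perf = 0x08, i.e. hwp_req_cached = 0x80002F08.
 *
 * Boost up clears EPP (bits 31:24) and copies max perf into min perf
 * (bits 7:0), so the value written is 0x00002F2F.
 *
 * Boost down simply writes back the cached 0x80002F08.
 */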
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 42 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index f686bbe..dc7dfa9 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -221,6 +221,9 @@ struct global_params {
* preference/bias
* @epp_saved: Saved EPP/EPB during system suspend or CPU offline
* operation
+ * @hwp_req_cached: Cached value of the last HWP request MSR
+ * @csd: A structure used to issue SMP async call, which
+ * defines callback and arguments
*
* This structure stores per CPU instance data for all CPUs.
*/
@@ -253,6 +256,8 @@ struct cpudata {
s16 epp_policy;
s16 epp_default;
s16 epp_saved;
+ u64 hwp_req_cached;
+ call_single_data_t csd;
};
static struct cpudata **all_cpu_data;
@@ -763,6 +768,7 @@ static void intel_pstate_hwp_set(unsigned int cpu)
intel_pstate_set_epb(cpu, epp);
}
skip_epp:
+ cpu_data->hwp_req_cached = value;
wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
}
@@ -1381,6 +1387,39 @@ static void intel_pstate_get_cpu_pstates(struct cpudata *cpu)
intel_pstate_set_min_pstate(cpu);
}
+
+static inline void intel_pstate_hwp_boost_up(struct cpudata *cpu)
+{
+ u64 hwp_req;
+ u8 max;
+
+ max = (u8) (cpu->hwp_req_cached >> 8);
+
+ hwp_req = cpu->hwp_req_cached & ~GENMASK_ULL(31, 24);
+ hwp_req = (hwp_req & ~GENMASK_ULL(7, 0)) | max;
+
+ wrmsrl(MSR_HWP_REQUEST, hwp_req);
+}
+
+static inline void intel_pstate_hwp_boost_down(struct cpudata *cpu)
+{
+ wrmsrl(MSR_HWP_REQUEST, cpu->hwp_req_cached);
+}
+
+static void intel_pstate_hwp_boost_up_local(void *arg)
+{
+ struct cpudata *cpu = arg;
+
+ intel_pstate_hwp_boost_up(cpu);
+}
+
+static void csd_init(struct cpudata *cpu)
+{
+ cpu->csd.flags = 0;
+ cpu->csd.func = intel_pstate_hwp_boost_up_local;
+ cpu->csd.info = cpu;
+}
+
static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu)
{
struct sample *sample = &cpu->sample;
@@ -1894,6 +1933,9 @@ static int __intel_pstate_cpu_init(struct cpufreq_policy *policy)
policy->fast_switch_possible = true;
+ if (hwp_active)
+ csd_init(cpu);
+
return 0;
}
--
2.9.5
When a task is woken up from IO wait, boost HWP performance to max. This
helps IO workloads on servers with per-core P-states. But changing limits has
the extra overhead of issuing a new HWP Request MSR write, which takes 1000+
cycles, so this change rate-limits setting the HWP Request MSR. The request
can also be for a remote CPU.
Rate control in setting HWP Requests (an example timeline is sketched below):
- If the current performance is already around P1, simply ignore the IO flag.
- Once boost is set, wait until the hold time expires before removing it. If
  another IO flag is notified while the boost is on, it prolongs the boost.
- If the IO flags are notified multiple ticks apart, this may not be an
  IO-bound task; otherwise an idle system would get periodic boosts for a
  single IO wake.
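Example timeline, assuming HZ=1000 (TICK_NSEC = 1 ms) and the default 3 ms
hold time used later in the patch:
/*
 * t = 0.0 ms  IOWAIT flag seen; last_io_update recorded, no boost yet
 * t = 0.8 ms  second IOWAIT flag within two ticks -> iowait_boost set,
 *             HWP limits boosted (locally or via the async call)
 * t = 1.5 ms  another IOWAIT flag -> last_update refreshed, boost held
 * t > 4.5 ms  no further IOWAIT flags; the first local update after the
 *             3 ms hold time restores the cached HWP Request value
 */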
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 75 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 75 insertions(+)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index e200887..d418265 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -20,6 +20,7 @@
#include <linux/tick.h>
#include <linux/slab.h>
#include <linux/sched/cpufreq.h>
+#include <linux/sched/topology.h>
#include <linux/list.h>
#include <linux/cpu.h>
#include <linux/cpufreq.h>
@@ -224,6 +225,8 @@ struct global_params {
* @hwp_req_cached: Cached value of the last HWP request MSR
* @csd: A structure used to issue SMP async call, which
* defines callback and arguments
+ * @hwp_boost_active: HWP performance is boosted on this CPU
+ * @last_io_update: Last time when IO wake flag was set
*
* This structure stores per CPU instance data for all CPUs.
*/
@@ -258,6 +261,8 @@ struct cpudata {
s16 epp_saved;
u64 hwp_req_cached;
call_single_data_t csd;
+ bool hwp_boost_active;
+ u64 last_io_update;
};
static struct cpudata **all_cpu_data;
@@ -1421,10 +1426,80 @@ static void csd_init(struct cpudata *cpu)
cpu->csd.info = cpu;
}
+/*
+ * Long hold time will keep high perf limits for long time,
+ * which negatively impacts perf/watt for some workloads,
+ * like specpower. 3ms is based on experiements on some
+ * workoads.
+ */
+static int hwp_boost_hold_time_ms = 3;
+
+/* Default: This will roughly around P1 on SKX */
+#define BOOST_PSTATE_THRESHOLD (SCHED_CAPACITY_SCALE / 2)
+static int hwp_boost_pstate_threshold = BOOST_PSTATE_THRESHOLD;
+
+static inline bool intel_pstate_check_boost_threhold(struct cpudata *cpu)
+{
+ /*
+ * If the last performance is above threshold, then return false,
+ * so that caller can ignore boosting.
+ */
+ if (arch_scale_freq_capacity(cpu->cpu) > hwp_boost_pstate_threshold)
+ return false;
+
+ return true;
+}
+
static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
u64 time, unsigned int flags)
{
+ struct cpudata *cpu = container_of(data, struct cpudata, update_util);
+
+ if (flags & SCHED_CPUFREQ_IOWAIT) {
+ /*
+ * Set iowait_boost flag and update time. Since IO WAIT flag
+ * is set all the time, we can't just conclude that there is
+ * some IO bound activity is scheduled on this CPU with just
+ * one occurrence. If we receive at least two in two
+ * consecutive ticks, then we start treating as IO. So
+ * there will be one tick latency.
+ */
+ if (time_before64(time, cpu->last_io_update + 2 * TICK_NSEC) &&
+ intel_pstate_check_boost_threhold(cpu))
+ cpu->iowait_boost = true;
+
+ cpu->last_io_update = time;
+ cpu->last_update = time;
+ }
+ /*
+ * If the boost is active, we will remove it after timeout on local
+ * CPU only.
+ */
+ if (cpu->hwp_boost_active) {
+ if (smp_processor_id() == cpu->cpu) {
+ bool expired;
+
+ expired = time_after64(time, cpu->last_update +
+ (hwp_boost_hold_time_ms * NSEC_PER_MSEC));
+ if (expired) {
+ intel_pstate_hwp_boost_down(cpu);
+ cpu->hwp_boost_active = false;
+ cpu->iowait_boost = false;
+ }
+ }
+ return;
+ }
+
+ cpu->last_update = time;
+
+ if (cpu->iowait_boost) {
+ cpu->hwp_boost_active = true;
+ if (smp_processor_id() == cpu->cpu)
+ intel_pstate_hwp_boost_up(cpu);
+ else
+ smp_call_function_single_async(cpu->cpu, &cpu->csd);
+ }
}
static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu)
--
2.9.5
Add cpufreq_get_sched_util() to get the CFS, DL and max utilization values
for a CPU. This is required so that cpufreq drivers outside of the
kernel/sched directory can obtain utilization values.
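A minimal usage sketch; example_get_busy_pct() is just an illustrative name,
and a later patch in this series does essentially this in intel_pstate:
static int example_get_busy_pct(int cpu)
{
	unsigned long util_cfs, util_dl, max, util;
	cpufreq_get_sched_util(cpu, &util_cfs, &util_dl, &max);
	/* Combine CFS and DL utilization, capped at the CPU capacity */
	util = util_cfs + util_dl;
	if (util > max)
		util = max;
	return util * 100 / max;
}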
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
include/linux/sched/cpufreq.h | 2 ++
kernel/sched/cpufreq.c | 23 +++++++++++++++++++++++
2 files changed, 25 insertions(+)
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index 5966744..a366600 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -20,6 +20,8 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
void (*func)(struct update_util_data *data, u64 time,
unsigned int flags));
void cpufreq_remove_update_util_hook(int cpu);
+void cpufreq_get_sched_util(int cpu, unsigned long *util_cfs,
+ unsigned long *util_dl, unsigned long *max);
#endif /* CONFIG_CPU_FREQ */
#endif /* _LINUX_SCHED_CPUFREQ_H */
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
index 5e54cbc..36e2839 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -60,3 +60,26 @@ void cpufreq_remove_update_util_hook(int cpu)
rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), NULL);
}
EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook);
+
+/**
+ * cpufreq_get_sched_util - Get utilization values.
+ * @cpu: The targeted CPU.
+ *
+ * Get the CFS, DL and max utilization.
+ * This function allows cpufreq driver outside the kernel/sched to access
+ * utilization value for a CPUs run queue.
+ */
+void cpufreq_get_sched_util(int cpu, unsigned long *util_cfs,
+ unsigned long *util_dl, unsigned long *max)
+{
+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+ struct rq *rq = cpu_rq(cpu);
+
+ *max = arch_scale_cpu_capacity(NULL, cpu);
+ *util_cfs = cpu_util_cfs(rq);
+ *util_dl = cpu_util_dl(rq);
+#else
+ *util_cfs = *util_dl = 1;
+#endif
+}
+EXPORT_SYMBOL_GPL(cpufreq_get_sched_util);
--
2.9.5
Enable HWP boost on Skylake server platform.
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 65d11d2..827c003 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1863,6 +1863,11 @@ static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[] = {
{}
};
+static const struct x86_cpu_id intel_pstate_hwp_boost_ids[] __initconst = {
+ ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
+ {}
+};
+
static int intel_pstate_init_cpu(unsigned int cpunum)
{
struct cpudata *cpu;
@@ -1893,6 +1898,10 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
intel_pstate_disable_ee(cpunum);
intel_pstate_hwp_enable(cpu);
+
+ id = x86_match_cpu(intel_pstate_hwp_boost_ids);
+ if (id)
+ hwp_boost = true;
}
intel_pstate_get_cpu_pstates(cpu);
--
2.9.5
When a busy task migrates to a new CPU, boost HWP performance to max. This
helps workloads on servers with per-core P-states which saturate all CPUs and
then migrate frequently. But changing limits has the extra overhead of issuing
a new HWP Request MSR write, which takes 1000+ cycles, so this change
rate-limits setting the HWP Request MSR.
Rate control in setting HWP Requests:
- If the current performance is already around P1, simply ignore the hint.
- Once boost is set, wait until the hold time expires before removing it. If
  another flag is notified while the boost is on, it prolongs the boost.
- The migrating task needs to have utilization above the threshold
  utilization, i.e. enough to trigger a P-state above the minimum.
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 37 ++++++++++++++++++++++++++++++++++++-
1 file changed, 36 insertions(+), 1 deletion(-)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index d418265..ec455af 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -227,6 +227,7 @@ struct global_params {
* defines callback and arguments
* @hwp_boost_active: HWP performance is boosted on this CPU
* @last_io_update: Last time when IO wake flag was set
+ * @migrate_hint: Set when scheduler indicates thread migration
*
* This structure stores per CPU instance data for all CPUs.
*/
@@ -263,6 +264,7 @@ struct cpudata {
call_single_data_t csd;
bool hwp_boost_active;
u64 last_io_update;
+ bool migrate_hint;
};
static struct cpudata **all_cpu_data;
@@ -1438,6 +1440,8 @@ static int hwp_boost_hold_time_ms = 3;
#define BOOST_PSTATE_THRESHOLD (SCHED_CAPACITY_SCALE / 2)
static int hwp_boost_pstate_threshold = BOOST_PSTATE_THRESHOLD;
+static int hwp_boost_threshold_busy_pct;
+
static inline bool intel_pstate_check_boost_threhold(struct cpudata *cpu)
{
/*
@@ -1450,12 +1454,32 @@ static inline bool intel_pstate_check_boost_threhold(struct cpudata *cpu)
return true;
}
+static inline int intel_pstate_get_sched_util(struct cpudata *cpu)
+{
+ unsigned long util_cfs, util_dl, max, util;
+
+ cpufreq_get_sched_util(cpu->cpu, &util_cfs, &util_dl, &max);
+ util = min(util_cfs + util_dl, max);
+ return util * 100 / max;
+}
+
static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
u64 time, unsigned int flags)
{
struct cpudata *cpu = container_of(data, struct cpudata, update_util);
- if (flags & SCHED_CPUFREQ_IOWAIT) {
+ if (flags & SCHED_CPUFREQ_MIGRATION) {
+ if (intel_pstate_check_boost_threhold(cpu))
+ cpu->migrate_hint = true;
+
+ cpu->last_update = time;
+ /*
+ * The rq utilization data is not migrated yet to the new CPU
+ * rq, so wait for call on local CPU to boost.
+ */
+ if (smp_processor_id() != cpu->cpu)
+ return;
+ } else if (flags & SCHED_CPUFREQ_IOWAIT) {
/*
* Set iowait_boost flag and update time. Since IO WAIT flag
* is set all the time, we can't just conclude that there is
@@ -1499,6 +1523,17 @@ static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
intel_pstate_hwp_boost_up(cpu);
else
smp_call_function_single_async(cpu->cpu, &cpu->csd);
+ return;
+ }
+
+ /* Ignore if the migrated thread has low utilization */
+ if (cpu->migrate_hint && smp_processor_id() == cpu->cpu) {
+ int util = intel_pstate_get_sched_util(cpu);
+
+ if (util >= hwp_boost_threshold_busy_pct) {
+ cpu->hwp_boost_active = true;
+ intel_pstate_hwp_boost_up(cpu);
+ }
}
}
--
2.9.5
intel_pstate has two operating modes: active and passive. In "active"
mode, the built-in scaling governor is used, and in "passive" mode the driver
can be used with any governor, such as "schedutil". In "active" mode the
utilization values from schedutil are not used, and there is a requirement
from high performance computing use cases not to read any APERF/MPERF MSRs;
in that case there is no need to spend CPU cycles on frequency invariant
accounting by reading the APERF/MPERF MSRs.
With this change, frequency invariant accounting is only enabled in
"passive" mode.
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
[Note: The tick will be enabled later in the series when hwp dynamic
boost is enabled]
drivers/cpufreq/intel_pstate.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 17e566af..f686bbe 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2040,6 +2040,8 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)
{
int ret;
+ x86_arch_scale_freq_tick_disable();
+
memset(&global, 0, sizeof(global));
global.max_perf_pct = 100;
@@ -2052,6 +2054,9 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)
global.min_perf_pct = min_perf_pct_min();
+ if (driver == &intel_cpufreq)
+ x86_arch_scale_freq_tick_enable();
+
return 0;
}
--
2.9.5
From: Peter Zijlstra <[email protected]>
Implement arch_scale_freq_capacity() for 'modern' x86. This function
is used by the scheduler to correctly account usage in the face of
DVFS.
For example; suppose a CPU has two frequencies: 500 and 1000 MHz. When
running a task that would consume 1/3rd of a CPU at 1000 MHz, it would
appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the
false impression this CPU is almost at capacity, even though it can go
faster [*].
Since modern x86 has hardware control over the actual frequency we run
at (because amongst other things, Turbo-Mode), we cannot simply use
the frequency as requested through cpufreq.
Instead we use the APERF/MPERF MSRs to compute the effective frequency
over the recent past. Also, because reading MSRs is expensive, don't
do so every time we need the value, but amortize the cost by doing it
every tick.
[*] this assumes a linear frequency/performance relation; which
everybody knows to be false, but given realities it's the best
approximation we can make.
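A rough worked example with made-up ratios (not from any specific part):
/*
 * Made-up ratios: base 20 (2.0 GHz), 1C turbo 37 (3.7 GHz).
 *
 *   arch_prev_max_freq = 37 * 1024 / 20 = 1894
 *
 * Running pinned at the sustained base frequency, dAPERF == dMPERF, so:
 *
 *   freq = (dAPERF << 20) / (dMPERF * 1894) = 1048576 / 1894 ~= 553
 *
 * i.e. about 20/37 of SCHED_CAPACITY_SCALE; only turbo residency pushes
 * the reported value towards 1024.
 */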
Cc: Thomas Gleixner <[email protected]>
Cc: Suravee Suthikulpanit <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Srinivas Pandruvada <[email protected]>
---
Changes on top of Peter's patch:
- Ported to the latest 4.17-rc4
- Added KNL/KNM related changes
- Account for Turbo boost being disabled in the BIOS
- New interface to disable tick processing when we don't want it
arch/x86/include/asm/topology.h | 29 ++++++
arch/x86/kernel/smpboot.c | 196 +++++++++++++++++++++++++++++++++++++++-
kernel/sched/core.c | 1 +
kernel/sched/sched.h | 7 ++
4 files changed, 232 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index c1d2a98..3fb5346 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -172,4 +172,33 @@ static inline void sched_clear_itmt_support(void)
}
#endif /* CONFIG_SCHED_MC_PRIO */
+#ifdef CONFIG_SMP
+#include <asm/cpufeature.h>
+
+#define arch_scale_freq_tick arch_scale_freq_tick
+#define arch_scale_freq_capacity arch_scale_freq_capacity
+
+DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
+
+static inline long arch_scale_freq_capacity(int cpu)
+{
+ if (static_cpu_has(X86_FEATURE_APERFMPERF))
+ return per_cpu(arch_cpu_freq, cpu);
+
+ return 1024 /* SCHED_CAPACITY_SCALE */;
+}
+
+extern void arch_scale_freq_tick(void);
+extern void x86_arch_scale_freq_tick_enable(void);
+extern void x86_arch_scale_freq_tick_disable(void);
+#else
+static inline void x86_arch_scale_freq_tick_enable(void)
+{
+}
+
+static inline void x86_arch_scale_freq_tick_disable(void)
+{
+}
+#endif
+
#endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 0f1cbb0..9e2cb82 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -148,6 +148,8 @@ static inline void smpboot_restore_warm_reset_vector(void)
*((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0;
}
+static void set_cpu_max_freq(void);
+
/*
* Report back to the Boot Processor during boot time or to the caller processor
* during CPU online.
@@ -189,6 +191,8 @@ static void smp_callin(void)
*/
set_cpu_sibling_map(raw_smp_processor_id());
+ set_cpu_max_freq();
+
/*
* Get our bogomips.
* Update loops_per_jiffy in cpu_data. Previous call to
@@ -1259,7 +1263,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
set_sched_topology(x86_topology);
set_cpu_sibling_map(0);
-
+ set_cpu_max_freq();
smp_sanity_check();
switch (apic_intr_mode) {
@@ -1676,3 +1680,193 @@ void native_play_dead(void)
}
#endif
+
+/*
+ * APERF/MPERF frequency ratio computation.
+ *
+ * The scheduler wants to do frequency invariant accounting and needs a <1
+ * ratio to account for the 'current' frequency.
+ *
+ * Since the frequency on x86 is controlled by micro-controller and our P-state
+ * setting is little more than a request/hint, we need to observe the effective
+ * frequency. We do this with APERF/MPERF.
+ *
+ * One complication is that the APERF/MPERF ratio can be >1, specifically
+ * APERF/MPERF gives the ratio relative to the max non-turbo P-state. Therefore
+ * we need to re-normalize the ratio.
+ *
+ * We do this by tracking the max APERF/MPERF ratio previously observed and
+ * scaling our MPERF delta with that. Every time our ratio goes over 1, we
+ * proportionally scale up our old max.
+ *
+ * The down-side to this runtime max search is that you have to trigger the
+ * actual max frequency before your scale is right. Therefore allow
+ * architectures to initialize the max ratio on CPU bringup.
+ */
+
+static DEFINE_PER_CPU(u64, arch_prev_aperf);
+static DEFINE_PER_CPU(u64, arch_prev_mperf);
+static DEFINE_PER_CPU(u64, arch_prev_max_freq) = SCHED_CAPACITY_SCALE;
+
+static bool turbo_disabled(void)
+{
+ u64 misc_en;
+ int err;
+
+ err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
+ if (err)
+ return false;
+
+ return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
+}
+
+static bool atom_set_cpu_max_freq(void)
+{
+ u64 ratio, turbo_ratio;
+ int err;
+
+ if (turbo_disabled()) {
+ this_cpu_write(arch_prev_max_freq, SCHED_CAPACITY_SCALE);
+ return true;
+ }
+
+ err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, &ratio);
+ if (err)
+ return false;
+
+ err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, &turbo_ratio);
+ if (err)
+ return false;
+
+ ratio = (ratio >> 16) & 0x7F; /* max P state ratio */
+ turbo_ratio = turbo_ratio & 0x7F; /* 1C turbo ratio */
+
+ this_cpu_write(arch_prev_max_freq,
+ div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio));
+ return true;
+}
+
+#include <asm/cpu_device_id.h>
+#include <asm/intel-family.h>
+
+#define ICPU(model) \
+ { X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0}
+
+static const struct x86_cpu_id intel_turbo_ratio_adjust[] __initconst = {
+ ICPU(INTEL_FAM6_XEON_PHI_KNL),
+ ICPU(INTEL_FAM6_XEON_PHI_KNM),
+ {}
+};
+
+static bool core_set_cpu_max_freq(void)
+{
+ const struct x86_cpu_id *id;
+ u64 ratio, turbo_ratio;
+ int err;
+
+ if (turbo_disabled()) {
+ this_cpu_write(arch_prev_max_freq, SCHED_CAPACITY_SCALE);
+ return true;
+ }
+
+ err = rdmsrl_safe(MSR_PLATFORM_INFO, &ratio);
+ if (err)
+ return false;
+
+ err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_ratio);
+ if (err)
+ return false;
+
+ ratio = (ratio >> 8) & 0xFF; /* base ratio */
+ id = x86_match_cpu(intel_turbo_ratio_adjust);
+ if (id)
+ turbo_ratio = (turbo_ratio >> 8) & 0xFF; /* 1C turbo ratio */
+ else
+ turbo_ratio = turbo_ratio & 0xFF; /* 1C turbo ratio */
+
+ this_cpu_write(arch_prev_max_freq,
+ div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio));
+ return true;
+}
+
+static void intel_set_cpu_max_freq(void)
+{
+ if (atom_set_cpu_max_freq())
+ return;
+
+ if (core_set_cpu_max_freq())
+ return;
+}
+
+static void set_cpu_max_freq(void)
+{
+ u64 aperf, mperf;
+
+ if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
+ return;
+
+ switch (boot_cpu_data.x86_vendor) {
+ case X86_VENDOR_INTEL:
+ intel_set_cpu_max_freq();
+ break;
+ default:
+ break;
+ }
+
+ rdmsrl(MSR_IA32_APERF, aperf);
+ rdmsrl(MSR_IA32_MPERF, mperf);
+
+ this_cpu_write(arch_prev_aperf, aperf);
+ this_cpu_write(arch_prev_mperf, mperf);
+}
+
+DEFINE_PER_CPU(unsigned long, arch_cpu_freq);
+
+static bool tick_disable;
+
+void arch_scale_freq_tick(void)
+{
+ u64 freq, max_freq = this_cpu_read(arch_prev_max_freq);
+ u64 aperf, mperf;
+ u64 acnt, mcnt;
+
+ if (!static_cpu_has(X86_FEATURE_APERFMPERF) || tick_disable)
+ return;
+
+ rdmsrl(MSR_IA32_APERF, aperf);
+ rdmsrl(MSR_IA32_MPERF, mperf);
+
+ acnt = aperf - this_cpu_read(arch_prev_aperf);
+ mcnt = mperf - this_cpu_read(arch_prev_mperf);
+ if (!mcnt)
+ return;
+
+ this_cpu_write(arch_prev_aperf, aperf);
+ this_cpu_write(arch_prev_mperf, mperf);
+
+ acnt <<= 2*SCHED_CAPACITY_SHIFT;
+ mcnt *= max_freq;
+
+ freq = div64_u64(acnt, mcnt);
+
+ if (unlikely(freq > SCHED_CAPACITY_SCALE)) {
+ max_freq *= freq;
+ max_freq >>= SCHED_CAPACITY_SHIFT;
+
+ this_cpu_write(arch_prev_max_freq, max_freq);
+
+ freq = SCHED_CAPACITY_SCALE;
+ }
+
+ this_cpu_write(arch_cpu_freq, freq);
+}
+
+void x86_arch_scale_freq_tick_enable(void)
+{
+ tick_disable = false;
+}
+
+void x86_arch_scale_freq_tick_disable(void)
+{
+ tick_disable = true;
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 092f7c4..2bdef36 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3076,6 +3076,7 @@ void scheduler_tick(void)
struct task_struct *curr = rq->curr;
struct rq_flags rf;
+ arch_scale_freq_tick();
sched_clock_tick();
rq_lock(rq, &rf);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15750c2..71851af 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1713,6 +1713,13 @@ static inline int hrtick_enabled(struct rq *rq)
#endif /* CONFIG_SCHED_HRTICK */
+#ifndef arch_scale_freq_tick
+static __always_inline
+void arch_scale_freq_tick(void)
+{
+}
+#endif
+
#ifndef arch_scale_freq_capacity
static __always_inline
unsigned long arch_scale_freq_capacity(int cpu)
--
2.9.5
On 15-05-18, 21:49, Srinivas Pandruvada wrote:
> Added cpufreq_get_sched_util() to get the CFS, DL and max utilization
> values for a CPU. This is required for getting utilization values
> for cpufreq drivers outside of kernel/sched folder.
>
> Signed-off-by: Srinivas Pandruvada <[email protected]>
> ---
> include/linux/sched/cpufreq.h | 2 ++
> kernel/sched/cpufreq.c | 23 +++++++++++++++++++++++
> 2 files changed, 25 insertions(+)
>
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index 5966744..a366600 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -20,6 +20,8 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
> void (*func)(struct update_util_data *data, u64 time,
> unsigned int flags));
> void cpufreq_remove_update_util_hook(int cpu);
> +void cpufreq_get_sched_util(int cpu, unsigned long *util_cfs,
> + unsigned long *util_dl, unsigned long *max);
> #endif /* CONFIG_CPU_FREQ */
>
> #endif /* _LINUX_SCHED_CPUFREQ_H */
> diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
> index 5e54cbc..36e2839 100644
> --- a/kernel/sched/cpufreq.c
> +++ b/kernel/sched/cpufreq.c
> @@ -60,3 +60,26 @@ void cpufreq_remove_update_util_hook(int cpu)
> rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), NULL);
> }
> EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook);
> +
> +/**
> + * cpufreq_get_sched_util - Get utilization values.
> + * @cpu: The targeted CPU.
> + *
> + * Get the CFS, DL and max utilization.
> + * This function allows cpufreq driver outside the kernel/sched to access
> + * utilization value for a CPUs run queue.
> + */
> +void cpufreq_get_sched_util(int cpu, unsigned long *util_cfs,
> + unsigned long *util_dl, unsigned long *max)
> +{
> +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
What will happen when schedutil is compiled into the kernel but ondemand
is the governor currently in use?
> + struct rq *rq = cpu_rq(cpu);
> +
> + *max = arch_scale_cpu_capacity(NULL, cpu);
> + *util_cfs = cpu_util_cfs(rq);
> + *util_dl = cpu_util_dl(rq);
> +#else
> + *util_cfs = *util_dl = 1;
> +#endif
> +}
> +EXPORT_SYMBOL_GPL(cpufreq_get_sched_util);
> --
> 2.9.5
--
viresh
Hi Srinivas,
On 15/05/18 21:49, Srinivas Pandruvada wrote:
[...]
>
> Peter Zijlstra (1):
> x86,sched: Add support for frequency invariance
Cool! I was going to ask Peter about this patch. You beat me to it. :)
I'll have a look at the set. BTW, just noticed that you Cc-ed me using
my old address. Please use [email protected]. ;)
Best,
- Juri
On Tue, May 15, 2018 at 09:49:03PM -0700, Srinivas Pandruvada wrote:
> intel_pstate has two operating modes: active and passive. In "active"
> mode, the in-built scaling governor is used and in "passive" mode,
> the driver can be used with any governor like "schedutil". In "active"
> mode the utilization values from schedutil is not used and there is
> a requirement from high performance computing use cases, not to read
> any APERF/MPERF MSRs. In this case no need to use CPU cycles for
> frequency invariant accounting by reading APERF/MPERF MSRs.
> With this change frequency invariant account is only enabled in
> "passive" mode.
WTH is active/passive? Is passive when we select performance governor?
Also; you have to explain why using APERF/MPERF is bad in that case. Why
do they care if we read those MSRs during the tick?
On Tue, May 15, 2018 at 09:49:04PM -0700, Srinivas Pandruvada wrote:
> Setup necessary infrastructure to be able to boost HWP performance on a
> remote CPU. First initialize data structure to be able to use
> smp_call_function_single_async(). The boost up function simply set HWP
> min to HWP max value and EPP to 0. The boost down function simply restores
> to last cached HWP Request value.
>
> To avoid reading HWP Request MSR during dynamic update, the HWP Request
> MSR value is cached in the local memory. This caching is done whenever
> HWP request MSR is modified during driver init on in setpolicy() callback
> path.
The patch does two independent things; best to split that.
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index f686bbe..dc7dfa9 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -221,6 +221,9 @@ struct global_params {
> * preference/bias
> * @epp_saved: Saved EPP/EPB during system suspend or CPU offline
> * operation
> + * @hwp_req_cached: Cached value of the last HWP request MSR
That's simply not true given the code below.
> @@ -763,6 +768,7 @@ static void intel_pstate_hwp_set(unsigned int cpu)
> intel_pstate_set_epb(cpu, epp);
> }
> skip_epp:
> + cpu_data->hwp_req_cached = value;
> wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
> }
>
> @@ -1381,6 +1387,39 @@ static void intel_pstate_get_cpu_pstates(struct cpudata *cpu)
> intel_pstate_set_min_pstate(cpu);
> }
>
> +
> +static inline void intel_pstate_hwp_boost_up(struct cpudata *cpu)
> +{
> + u64 hwp_req;
> + u8 max;
> +
> + max = (u8) (cpu->hwp_req_cached >> 8);
> +
> + hwp_req = cpu->hwp_req_cached & ~GENMASK_ULL(31, 24);
> + hwp_req = (hwp_req & ~GENMASK_ULL(7, 0)) | max;
> +
> + wrmsrl(MSR_HWP_REQUEST, hwp_req);
> +}
> +
> +static inline void intel_pstate_hwp_boost_down(struct cpudata *cpu)
> +{
> + wrmsrl(MSR_HWP_REQUEST, cpu->hwp_req_cached);
> +}
This is not a traditional msr shadow; that would be updated on every
wrmsr. There is not a comment in sight explaining why this one is
different.
On Wed, May 16, 2018 at 09:16:40AM +0200, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 09:49:03PM -0700, Srinivas Pandruvada wrote:
> > intel_pstate has two operating modes: active and passive. In "active"
> > mode, the in-built scaling governor is used and in "passive" mode,
> > the driver can be used with any governor like "schedutil". In "active"
> > mode the utilization values from schedutil is not used and there is
> > a requirement from high performance computing use cases, not to read
> > any APERF/MPERF MSRs. In this case no need to use CPU cycles for
> > frequency invariant accounting by reading APERF/MPERF MSRs.
> > With this change frequency invariant account is only enabled in
> > "passive" mode.
>
> WTH is active/passive? Is passive when we select performance governor?
Bah, I cannot read it seems. active is when we use the intel_pstate
governor and passive is when we use schedutil and only use intel_pstate
as a driver.
> Also; you have to explain why using APERF/MPERF is bad in that case. Why
> do they care if we read those MSRs during the tick?
That still stands.. this needs to be properly explained.
On Tue, May 15, 2018 at 09:49:06PM -0700, Srinivas Pandruvada wrote:
> When a task is woken up from IO wait, boost HWP prformance to max. This
> helps IO workloads on servers with per core P-states. But changing limits
> has extra over head of issuing new HWP Request MSR, which takes 1000+
> cycles. So this change limits setting HWP Request MSR. Also request can
> be for a remote CPU.
> Rate control in setting HWP Requests:
> - If the current performance is around P1, simply ignore IO flag.
> - Once set wait till hold time, till remove boost. While the boost
> is on, another IO flags is notified, it will prolong boost.
> - If the IO flags are notified multiple ticks apart, this may not be
> IO bound task. Othewise idle system gets periodic boosts for one
> IO wake.
>
> Signed-off-by: Srinivas Pandruvada <[email protected]>
> ---
> drivers/cpufreq/intel_pstate.c | 75 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 75 insertions(+)
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index e200887..d418265 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -20,6 +20,7 @@
> #include <linux/tick.h>
> #include <linux/slab.h>
> #include <linux/sched/cpufreq.h>
> +#include <linux/sched/topology.h>
> #include <linux/list.h>
> #include <linux/cpu.h>
> #include <linux/cpufreq.h>
> @@ -224,6 +225,8 @@ struct global_params {
> * @hwp_req_cached: Cached value of the last HWP request MSR
> * @csd: A structure used to issue SMP async call, which
> * defines callback and arguments
> + * @hwp_boost_active: HWP performance is boosted on this CPU
> + * @last_io_update: Last time when IO wake flag was set
> *
> * This structure stores per CPU instance data for all CPUs.
> */
> @@ -258,6 +261,8 @@ struct cpudata {
> s16 epp_saved;
> u64 hwp_req_cached;
> call_single_data_t csd;
> + bool hwp_boost_active;
> + u64 last_io_update;
This structure has abysmal layout; you should look at that.
Also, mandatory complaint about using _Bool in composites.
> };
>
> static struct cpudata **all_cpu_data;
> @@ -1421,10 +1426,80 @@ static void csd_init(struct cpudata *cpu)
> cpu->csd.info = cpu;
> }
>
> +/*
> + * Long hold time will keep high perf limits for long time,
> + * which negatively impacts perf/watt for some workloads,
> + * like specpower. 3ms is based on experiements on some
> + * workoads.
> + */
> +static int hwp_boost_hold_time_ms = 3;
> +
> +/* Default: This will roughly around P1 on SKX */
> +#define BOOST_PSTATE_THRESHOLD (SCHED_CAPACITY_SCALE / 2)
Yuck.. why the need to hardcode this? Can't you simply read the P1 value
for the part at hand?
> +static int hwp_boost_pstate_threshold = BOOST_PSTATE_THRESHOLD;
> +
> +static inline bool intel_pstate_check_boost_threhold(struct cpudata *cpu)
> +{
> + /*
> + * If the last performance is above threshold, then return false,
> + * so that caller can ignore boosting.
> + */
> + if (arch_scale_freq_capacity(cpu->cpu) > hwp_boost_pstate_threshold)
> + return false;
> +
> + return true;
> +}
> +
> static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
> u64 time, unsigned int flags)
> {
> + struct cpudata *cpu = container_of(data, struct cpudata, update_util);
> +
> + if (flags & SCHED_CPUFREQ_IOWAIT) {
> + /*
> + * Set iowait_boost flag and update time. Since IO WAIT flag
> + * is set all the time, we can't just conclude that there is
> + * some IO bound activity is scheduled on this CPU with just
> + * one occurrence. If we receive at least two in two
> + * consecutive ticks, then we start treating as IO. So
> + * there will be one tick latency.
> + */
> + if (time_before64(time, cpu->last_io_update + 2 * TICK_NSEC) &&
> + intel_pstate_check_boost_threhold(cpu))
> + cpu->iowait_boost = true;
> +
> + cpu->last_io_update = time;
> + cpu->last_update = time;
> + }
>
> + /*
> + * If the boost is active, we will remove it after timeout on local
> + * CPU only.
> + */
> + if (cpu->hwp_boost_active) {
> + if (smp_processor_id() == cpu->cpu) {
> + bool expired;
> +
> + expired = time_after64(time, cpu->last_update +
> + (hwp_boost_hold_time_ms * NSEC_PER_MSEC));
> + if (expired) {
> + intel_pstate_hwp_boost_down(cpu);
> + cpu->hwp_boost_active = false;
> + cpu->iowait_boost = false;
> + }
> + }
> + return;
> + }
> +
> + cpu->last_update = time;
> +
> + if (cpu->iowait_boost) {
> + cpu->hwp_boost_active = true;
> + if (smp_processor_id() == cpu->cpu)
> + intel_pstate_hwp_boost_up(cpu);
> + else
> + smp_call_function_single_async(cpu->cpu, &cpu->csd);
> + }
> }
Hurmph, this looks like you're starting to duplicate the schedutil
iowait logic. Why didn't you copy the gradual boosting thing?
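For reference, the gradual scheme referred to here roughly ramps the boost
instead of jumping straight to max; a simplified sketch, not the actual
schedutil code:
static unsigned int gradual_iowait_boost(unsigned int cur_boost,
					 unsigned int min_boost,
					 unsigned int max_boost,
					 bool iowait)
{
	if (iowait) {
		/* Start small, then double on every further iowait wakeup */
		if (!cur_boost)
			return min_boost;
		cur_boost <<= 1;
		return cur_boost > max_boost ? max_boost : cur_boost;
	}
	/* Decay the boost when it is not re-triggered */
	return cur_boost >> 1;
}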
On Tue, May 15, 2018 at 09:49:09PM -0700, Srinivas Pandruvada wrote:
> +static inline void intel_pstate_update_busy_threshold(struct cpudata *cpu)
> +{
> + /* P1 percent out of total range of P-states */
> + if (cpu->pstate.max_freq != cpu->pstate.turbo_freq) {
> + hwp_boost_pstate_threshold =
> + cpu->pstate.max_freq * SCHED_CAPACITY_SCALE / cpu->pstate.turbo_freq;
> + pr_debug("hwp_boost_pstate_threshold = %d\n",
> + hwp_boost_pstate_threshold);
> + }
> +
> +}
> +
> static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
> u64 time, unsigned int flags)
> {
> @@ -2061,8 +2097,10 @@ static int __intel_pstate_cpu_init(struct cpufreq_policy *policy)
>
> policy->fast_switch_possible = true;
>
> - if (hwp_active)
> + if (hwp_active) {
> csd_init(cpu);
> + intel_pstate_update_busy_threshold(cpu);
> + }
>
> return 0;
> }
This should go in patch #5 and then you can remove that SKX hack. Which
you left in, even though you now did it right.
On Tue, May 15, 2018 at 09:49:09PM -0700, Srinivas Pandruvada wrote:
> +static inline void intel_pstate_update_busy_threshold(struct cpudata *cpu)
> +{
> + if (!hwp_boost_threshold_busy_pct) {
> + int min_freq, max_freq;
> +
> + min_freq = cpu->pstate.min_pstate * cpu->pstate.scaling;
> + update_turbo_state();
> + max_freq = global.turbo_disabled || global.no_turbo ?
> + cpu->pstate.max_freq : cpu->pstate.turbo_freq;
> +
> + /*
> + * We are guranteed to get atleast min P-state.
If we assume
> + * P-state is proportional to load (such that 10% load
> + * increase will result in 10% P-state increase),
we will
> + * get at least min P-state till we have atleast
> + * (min * 100/max) percent cpu load.
turbo makes that story less clear ofcourse.
So any load less than
> + * than this this we shouldn't do any boost. Then boosting
> + * is not free, we will add atleast 20% offset.
This I don't get.. so you want to remain at min P longer?
> + */
> + hwp_boost_threshold_busy_pct = min_freq * 100 / max_freq;
> + hwp_boost_threshold_busy_pct += 20;
> + pr_debug("hwp_boost_threshold_busy_pct = %d\n",
> + hwp_boost_threshold_busy_pct);
> + }
> +}
And then this part should go in the previous patch.
On Tue, May 15, 2018 at 09:49:11PM -0700, Srinivas Pandruvada wrote:
> Enable HWP boost on Skylake server platform.
Why not unconditionally enable it on everything HWP ?
On Tue, May 15, 2018 at 09:49:07PM -0700, Srinivas Pandruvada wrote:
> --- a/kernel/sched/cpufreq.c
> +++ b/kernel/sched/cpufreq.c
> @@ -60,3 +60,26 @@ void cpufreq_remove_update_util_hook(int cpu)
> rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), NULL);
> }
> EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook);
> +
> +/**
> + * cpufreq_get_sched_util - Get utilization values.
> + * @cpu: The targeted CPU.
> + *
> + * Get the CFS, DL and max utilization.
> + * This function allows cpufreq driver outside the kernel/sched to access
> + * utilization value for a CPUs run queue.
> + */
> +void cpufreq_get_sched_util(int cpu, unsigned long *util_cfs,
> + unsigned long *util_dl, unsigned long *max)
> +{
> +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> + struct rq *rq = cpu_rq(cpu);
> +
> + *max = arch_scale_cpu_capacity(NULL, cpu);
> + *util_cfs = cpu_util_cfs(rq);
> + *util_dl = cpu_util_dl(rq);
> +#else
> + *util_cfs = *util_dl = 1;
> +#endif
> +}
> +EXPORT_SYMBOL_GPL(cpufreq_get_sched_util);
So I _really_ hate this... I'd much rather you make schedutil work with
the hwp passive stuff.
Also, afaict intel_pstate is bool, not tristate, so no need for an
EXPORT at all.
On Wed, May 16, 2018 at 9:29 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, May 16, 2018 at 09:16:40AM +0200, Peter Zijlstra wrote:
>> On Tue, May 15, 2018 at 09:49:03PM -0700, Srinivas Pandruvada wrote:
>> > intel_pstate has two operating modes: active and passive. In "active"
>> > mode, the in-built scaling governor is used and in "passive" mode,
>> > the driver can be used with any governor like "schedutil". In "active"
>> > mode the utilization values from schedutil is not used and there is
>> > a requirement from high performance computing use cases, not to read
>> > any APERF/MPERF MSRs. In this case no need to use CPU cycles for
>> > frequency invariant accounting by reading APERF/MPERF MSRs.
>> > With this change frequency invariant account is only enabled in
>> > "passive" mode.
>>
>> WTH is active/passive? Is passive when we select performance governor?
>
> Bah, I cannot read it seems. active is when we use the intel_pstate
> governor and passive is when we use schedutil and only use intel_pstate
> as a driver.
>
>> Also; you have to explain why using APERF/MPERF is bad in that case. Why
>> do they care if we read those MSRs during the tick?
>
> That still stands.. this needs to be properly explained.
I guess this is from the intel_pstate perspective only.
The active mode is only used with HWP, so intel_pstate doesn't look at
the utilization (in any form) in the passive mode today.
Still, there are other reasons for PELT to be scale-invariant, so ...
On Wed, May 16, 2018 at 9:22 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, May 15, 2018 at 09:49:04PM -0700, Srinivas Pandruvada wrote:
>> Setup necessary infrastructure to be able to boost HWP performance on a
>> remote CPU. First initialize data structure to be able to use
>> smp_call_function_single_async(). The boost up function simply set HWP
>> min to HWP max value and EPP to 0. The boost down function simply restores
>> to last cached HWP Request value.
>>
>> To avoid reading HWP Request MSR during dynamic update, the HWP Request
>> MSR value is cached in the local memory. This caching is done whenever
>> HWP request MSR is modified during driver init on in setpolicy() callback
>> path.
>
> The patch does two independent things; best to split that.
>
>
>> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
>> index f686bbe..dc7dfa9 100644
>> --- a/drivers/cpufreq/intel_pstate.c
>> +++ b/drivers/cpufreq/intel_pstate.c
>> @@ -221,6 +221,9 @@ struct global_params {
>> * preference/bias
>> * @epp_saved: Saved EPP/EPB during system suspend or CPU offline
>> * operation
>> + * @hwp_req_cached: Cached value of the last HWP request MSR
>
> That's simply not true given the code below.
It looks like the last "not boosted EPP" value.
>> @@ -763,6 +768,7 @@ static void intel_pstate_hwp_set(unsigned int cpu)
>> intel_pstate_set_epb(cpu, epp);
>> }
>> skip_epp:
>> + cpu_data->hwp_req_cached = value;
>> wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
>> }
>>
>> @@ -1381,6 +1387,39 @@ static void intel_pstate_get_cpu_pstates(struct cpudata *cpu)
>> intel_pstate_set_min_pstate(cpu);
>> }
>>
>> +
>> +static inline void intel_pstate_hwp_boost_up(struct cpudata *cpu)
>> +{
>> + u64 hwp_req;
>> + u8 max;
>> +
>> + max = (u8) (cpu->hwp_req_cached >> 8);
>> +
>> + hwp_req = cpu->hwp_req_cached & ~GENMASK_ULL(31, 24);
>> + hwp_req = (hwp_req & ~GENMASK_ULL(7, 0)) | max;
>> +
>> + wrmsrl(MSR_HWP_REQUEST, hwp_req);
>> +}
>> +
>> +static inline void intel_pstate_hwp_boost_down(struct cpudata *cpu)
>> +{
>> + wrmsrl(MSR_HWP_REQUEST, cpu->hwp_req_cached);
>> +}
>
> This is not a traditional msr shadow; that would be updated on every
> wrmsr. There is not a comment in sight explaining why this one is
> different.
So if the "cached" thing is the last "not boosted EPP", that's why it
is not updated here.
On Wed, May 16, 2018 at 6:49 AM, Srinivas Pandruvada
<[email protected]> wrote:
> When a busy task migrates to a new CPU boost HWP prformance to max. This
> helps workloads on servers with per core P-states, which saturates all
> CPUs and then they migrate frequently. But changing limits has extra over
> head of issuing new HWP Request MSR, which takes 1000+
> cycles. So this change limits setting HWP Request MSR.
> Rate control in setting HWP Requests:
> - If the current performance is around P1, simply ignore.
> - Once set wait till hold time, till remove boost. While the boost
> is on, another flags is notified, it will prolong boost.
> - The task migrates needs to have some utilzation which is more
> than threshold utilization, which will trigger P-state above minimum.
>
> Signed-off-by: Srinivas Pandruvada <[email protected]>
> ---
> drivers/cpufreq/intel_pstate.c | 37 ++++++++++++++++++++++++++++++++++++-
> 1 file changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index d418265..ec455af 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -227,6 +227,7 @@ struct global_params {
> * defines callback and arguments
> * @hwp_boost_active: HWP performance is boosted on this CPU
> * @last_io_update: Last time when IO wake flag was set
> + * @migrate_hint: Set when scheduler indicates thread migration
> *
> * This structure stores per CPU instance data for all CPUs.
> */
> @@ -263,6 +264,7 @@ struct cpudata {
> call_single_data_t csd;
> bool hwp_boost_active;
> u64 last_io_update;
> + bool migrate_hint;
Why do you need this in the struct?
It looks like it is only used locally in intel_pstate_update_util_hwp().
> };
Thanks for posting this one; I meant to start a thread on this for a
while but never got around to doing so.
I left the 'important' parts of the patch for context but removed all
the arch fiddling to find the max freq, as that is not so important
here.
On Tue, May 15, 2018 at 09:49:02PM -0700, Srinivas Pandruvada wrote:
> From: Peter Zijlstra <[email protected]>
>
> Implement arch_scale_freq_capacity() for 'modern' x86. This function
> is used by the scheduler to correctly account usage in the face of
> DVFS.
>
> For example; suppose a CPU has two frequencies: 500 and 1000 Mhz. When
> running a task that would consume 1/3rd of a CPU at 1000 MHz, it would
> appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the
> false impression this CPU is almost at capacity, even though it can go
> faster [*].
>
> Since modern x86 has hardware control over the actual frequency we run
> at (because amongst other things, Turbo-Mode), we cannot simply use
> the frequency as requested through cpufreq.
>
> Instead we use the APERF/MPERF MSRs to compute the effective frequency
> over the recent past. Also, because reading MSRs is expensive, don't
> do so every time we need the value, but amortize the cost by doing it
> every tick.
>
> [*] this assumes a linear frequency/performance relation; which
> everybody knows to be false, but given realities its the best
> approximation we can make.
>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Suravee Suthikulpanit <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Srinivas Pandruvada <[email protected]>
> ---
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index c1d2a98..3fb5346 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -172,4 +172,33 @@ static inline void sched_clear_itmt_support(void)
> }
> #endif /* CONFIG_SCHED_MC_PRIO */
>
> +#ifdef CONFIG_SMP
> +#include <asm/cpufeature.h>
> +
> +#define arch_scale_freq_tick arch_scale_freq_tick
> +#define arch_scale_freq_capacity arch_scale_freq_capacity
> +
> +DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
> +
> +static inline long arch_scale_freq_capacity(int cpu)
> +{
> + if (static_cpu_has(X86_FEATURE_APERFMPERF))
> + return per_cpu(arch_cpu_freq, cpu);
> +
> + return 1024 /* SCHED_CAPACITY_SCALE */;
> +}
> +
> +extern void arch_scale_freq_tick(void);
> +extern void x86_arch_scale_freq_tick_enable(void);
> +extern void x86_arch_scale_freq_tick_disable(void);
> +#else
> +static inline void x86_arch_scale_freq_tick_enable(void)
> +{
> +}
> +
> +static inline void x86_arch_scale_freq_tick_disable(void)
> +{
> +}
> +#endif
> +
> #endif /* _ASM_X86_TOPOLOGY_H */
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 0f1cbb0..9e2cb82 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1676,3 +1680,193 @@ void native_play_dead(void)
> }
>
> #endif
> +
> +/*
> + * APERF/MPERF frequency ratio computation.
> + *
> + * The scheduler wants to do frequency invariant accounting and needs a <1
> + * ratio to account for the 'current' frequency.
> + *
> + * Since the frequency on x86 is controlled by micro-controller and our P-state
> + * setting is little more than a request/hint, we need to observe the effective
> + * frequency. We do this with APERF/MPERF.
> + *
> + * One complication is that the APERF/MPERF ratio can be >1, specifically
> + * APERF/MPERF gives the ratio relative to the max non-turbo P-state. Therefore
> + * we need to re-normalize the ratio.
> + *
> + * We do this by tracking the max APERF/MPERF ratio previously observed and
> + * scaling our MPERF delta with that. Every time our ratio goes over 1, we
> + * proportionally scale up our old max.
One very important point however is that I wrote this patch in the
context of Vincent's new scale invariance proposal:
https://lkml.kernel.org/r/[email protected]
The reason is that while this 'works' with the current scale invariance,
the way turbo is handled is not optimal for it.
At OSPM we briefly touched upon this subject, since ARM will also need
something like this for some of their chips, so this is of general
interest.
The problem with turbo of course is that our max frequency is variable;
but we would really rather have a unit value for scaling. Returning a >1
value results in weird things (think of running for 1.5ms in 1ms
wall-time for example).
This implementation simply finds the absolute max observed and scales
that as 1, with a result that when we're busy we'll always run at <1
because we cannot sustain turbo. This might result in the scheduler
thinking we're not fully busy, when in fact we are.
At OSPM it was suggested to instead track an average max or set 1 at the
sustainable freq and clip overshoot. The problem with that is that it is
actually hard to track an average max if you cannot tell what max even
is.
The problem with clipping of course is that we'll end up biasing the
frequencies higher than required -- which might be OK if the overshoot
is 'small' as it would typically be for an average max thing, but not
when we set 1 at the sustainable frequency.
I think the new scale invariance solves a bunch of these problems by
always saturating, irrespective of the actual frequency we run at.
Of course, IIRC it had other issues...
> + * The down-side to this runtime max search is that you have to trigger the
> + * actual max frequency before your scale is right. Therefore allow
> + * architectures to initialize the max ratio on CPU bringup.
> + */
> +
> +static DEFINE_PER_CPU(u64, arch_prev_aperf);
> +static DEFINE_PER_CPU(u64, arch_prev_mperf);
> +static DEFINE_PER_CPU(u64, arch_prev_max_freq) = SCHED_CAPACITY_SCALE;
> +
> +DEFINE_PER_CPU(unsigned long, arch_cpu_freq);
> +
> +static bool tick_disable;
> +
> +void arch_scale_freq_tick(void)
> +{
> + u64 freq, max_freq = this_cpu_read(arch_prev_max_freq);
> + u64 aperf, mperf;
> + u64 acnt, mcnt;
> +
> + if (!static_cpu_has(X86_FEATURE_APERFMPERF) || tick_disable)
> + return;
> +
> + rdmsrl(MSR_IA32_APERF, aperf);
> + rdmsrl(MSR_IA32_MPERF, mperf);
> +
> + acnt = aperf - this_cpu_read(arch_prev_aperf);
> + mcnt = mperf - this_cpu_read(arch_prev_mperf);
> + if (!mcnt)
> + return;
> +
> + this_cpu_write(arch_prev_aperf, aperf);
> + this_cpu_write(arch_prev_mperf, mperf);
> +
> + acnt <<= 2*SCHED_CAPACITY_SHIFT;
> + mcnt *= max_freq;
> +
> + freq = div64_u64(acnt, mcnt);
> +
> + if (unlikely(freq > SCHED_CAPACITY_SCALE)) {
> + max_freq *= freq;
> + max_freq >>= SCHED_CAPACITY_SHIFT;
> +
> + this_cpu_write(arch_prev_max_freq, max_freq);
> +
> + freq = SCHED_CAPACITY_SCALE;
> + }
> +
> + this_cpu_write(arch_cpu_freq, freq);
> +}
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 092f7c4..2bdef36 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3076,6 +3076,7 @@ void scheduler_tick(void)
> struct task_struct *curr = rq->curr;
> struct rq_flags rf;
>
> + arch_scale_freq_tick();
> sched_clock_tick();
>
> rq_lock(rq, &rf);
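To make the renormalization above easier to follow, here is a small
user-space model of the same arithmetic (a sketch with toy numbers, not the
kernel code; SCHED_CAPACITY_SHIFT is 10, so SCHED_CAPACITY_SCALE is 1024):

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT    10
#define SCHED_CAPACITY_SCALE    (1ULL << SCHED_CAPACITY_SHIFT)

static unsigned long long max_freq = SCHED_CAPACITY_SCALE; /* arch_prev_max_freq */

/* acnt/mcnt model the APERF/MPERF deltas seen since the previous tick */
static unsigned long long scale_tick(unsigned long long acnt,
                                     unsigned long long mcnt)
{
        unsigned long long freq;

        acnt <<= 2 * SCHED_CAPACITY_SHIFT;
        mcnt *= max_freq;

        freq = acnt / mcnt;

        if (freq > SCHED_CAPACITY_SCALE) {
                /* observed a new max: re-gauge so this ratio becomes 1 */
                max_freq = (max_freq * freq) >> SCHED_CAPACITY_SHIFT;
                freq = SCHED_CAPACITY_SCALE;
        }
        return freq;
}

int main(void)
{
        unsigned long long f;

        /* 20% single-core turbo over the max non-turbo ratio */
        f = scale_tick(1200, 1000);
        printf("turbo tick:     %llu (max_freq now %llu)\n", f, max_freq);

        /* the same turbo sustained now reads as 1024 (u = 1) */
        f = scale_tick(1200, 1000);
        printf("turbo again:    %llu\n", f);

        /* fully busy at the non-turbo max reads ~853, i.e. u < 1 */
        f = scale_tick(1000, 1000);
        printf("non-turbo busy: %llu\n", f);

        return 0;
}

The last case is the behaviour described above: once a single-core turbo
excursion has re-gauged max_freq, a CPU that is fully busy at a lower
sustained frequency keeps reporting less than SCHED_CAPACITY_SCALE.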
On Wed, May 16, 2018 at 6:49 AM, Srinivas Pandruvada
<[email protected]> wrote:
> When a task is woken up from IO wait, boost HWP performance to max. This
> helps IO workloads on servers with per core P-states. But changing limits
> has the extra overhead of issuing a new HWP Request MSR, which takes 1000+
> cycles. So this change rate limits setting the HWP Request MSR. Also the
> request can be for a remote CPU.
> Rate control in setting HWP Requests:
> - If the current performance is around P1, simply ignore the IO flag.
> - Once set, wait until the hold time expires before removing the boost. If
> another IO flag is notified while the boost is on, it will prolong the boost.
> - If the IO flags are notified multiple ticks apart, this may not be an
> IO bound task. Otherwise an idle system would get periodic boosts for one
> IO wake.
>
> Signed-off-by: Srinivas Pandruvada <[email protected]>
> ---
> drivers/cpufreq/intel_pstate.c | 75 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 75 insertions(+)
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index e200887..d418265 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -20,6 +20,7 @@
> #include <linux/tick.h>
> #include <linux/slab.h>
> #include <linux/sched/cpufreq.h>
> +#include <linux/sched/topology.h>
> #include <linux/list.h>
> #include <linux/cpu.h>
> #include <linux/cpufreq.h>
> @@ -224,6 +225,8 @@ struct global_params {
> * @hwp_req_cached: Cached value of the last HWP request MSR
> * @csd: A structure used to issue SMP async call, which
> * defines callback and arguments
> + * @hwp_boost_active: HWP performance is boosted on this CPU
> + * @last_io_update: Last time when IO wake flag was set
> *
> * This structure stores per CPU instance data for all CPUs.
> */
> @@ -258,6 +261,8 @@ struct cpudata {
> s16 epp_saved;
> u64 hwp_req_cached;
> call_single_data_t csd;
> + bool hwp_boost_active;
> + u64 last_io_update;
> };
>
> static struct cpudata **all_cpu_data;
> @@ -1421,10 +1426,80 @@ static void csd_init(struct cpudata *cpu)
> cpu->csd.info = cpu;
> }
>
> +/*
> + * Long hold time will keep high perf limits for long time,
> + * which negatively impacts perf/watt for some workloads,
> + * like specpower. 3ms is based on experiments on some
> + * workloads.
> + */
> +static int hwp_boost_hold_time_ms = 3;
> +
> +/* Default: This is roughly around P1 on SKX */
> +#define BOOST_PSTATE_THRESHOLD (SCHED_CAPACITY_SCALE / 2)
> +static int hwp_boost_pstate_threshold = BOOST_PSTATE_THRESHOLD;
> +
> +static inline bool intel_pstate_check_boost_threhold(struct cpudata *cpu)
> +{
> + /*
> + * If the last performance is above threshold, then return false,
> + * so that caller can ignore boosting.
> + */
> + if (arch_scale_freq_capacity(cpu->cpu) > hwp_boost_pstate_threshold)
> + return false;
> +
> + return true;
> +}
> +
> static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
> u64 time, unsigned int flags)
> {
> + struct cpudata *cpu = container_of(data, struct cpudata, update_util);
> +
> + if (flags & SCHED_CPUFREQ_IOWAIT) {
> + /*
> + * Set iowait_boost flag and update time. Since IO WAIT flag
> + * is set all the time, we can't just conclude that there is
> + * some IO bound activity scheduled on this CPU with just
> + * one occurrence. If we receive at least two in two
> + * consecutive ticks, then we start treating it as IO. So
> + * there will be one tick latency.
> + */
> + if (time_before64(time, cpu->last_io_update + 2 * TICK_NSEC) &&
> + intel_pstate_check_boost_threhold(cpu))
> + cpu->iowait_boost = true;
> +
> + cpu->last_io_update = time;
> + cpu->last_update = time;
This is a shared data structure and it gets updated without
synchronization, unless I'm missing something.
How much does the cross-CPU case matter?
> + }
>
> + /*
> + * If the boost is active, we will remove it after timeout on local
> + * CPU only.
> + */
> + if (cpu->hwp_boost_active) {
> + if (smp_processor_id() == cpu->cpu) {
> + bool expired;
> +
> + expired = time_after64(time, cpu->last_update +
> + (hwp_boost_hold_time_ms * NSEC_PER_MSEC));
> + if (expired) {
> + intel_pstate_hwp_boost_down(cpu);
> + cpu->hwp_boost_active = false;
> + cpu->iowait_boost = false;
> + }
> + }
> + return;
> + }
> +
> + cpu->last_update = time;
> +
> + if (cpu->iowait_boost) {
> + cpu->hwp_boost_active = true;
> + if (smp_processor_id() == cpu->cpu)
> + intel_pstate_hwp_boost_up(cpu);
> + else
> + smp_call_function_single_async(cpu->cpu, &cpu->csd);
> + }
> }
>
> static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu)
> --
> 2.9.5
>
On Wed, May 16, 2018 at 11:15:09AM +0200, Rafael J. Wysocki wrote:
> So if the "cached" thing is the last "not boosted EPP", that's why it
> is not updated here.
Sure, I see what it does, just saying that naming and comments are wrong
vs the actual code.
On 15/05/18 21:49, Srinivas Pandruvada wrote:
> intel_pstate has two operating modes: active and passive. In "active"
> mode, the in-built scaling governor is used and in "passive" mode,
> the driver can be used with any governor like "schedutil". In "active"
> mode the utilization values from schedutil are not used, and there is
> a requirement from high performance computing use cases not to read
> any APERF/MPERF MSRs. In that case there is no need to spend CPU cycles on
> frequency invariant accounting by reading APERF/MPERF MSRs.
> With this change frequency invariant accounting is only enabled in
> "passive" mode.
>
> Signed-off-by: Srinivas Pandruvada <[email protected]>
> ---
> [Note: The tick will be enabled later in the series when hwp dynamic
> boost is enabled]
>
> drivers/cpufreq/intel_pstate.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 17e566af..f686bbe 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -2040,6 +2040,8 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)
> {
> int ret;
>
> + x86_arch_scale_freq_tick_disable();
> +
> memset(&global, 0, sizeof(global));
> global.max_perf_pct = 100;
>
> @@ -2052,6 +2054,9 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)
>
> global.min_perf_pct = min_perf_pct_min();
>
> + if (driver == &intel_cpufreq)
> + x86_arch_scale_freq_tick_enable();
This will unconditionally trigger the reading/calculation at each tick
even though information is not actually consumed (e.g., running
performance or any other governor), right? Do we want that?
Anyway, FWIW I started testing this on a E5-2609 v3 and I'm not seeing
hackbench regressions so far (running with schedutil governor).
Best,
- Juri
On Wed, 2018-05-16 at 12:43 +0200, Peter Zijlstra wrote:
> On Wed, May 16, 2018 at 11:15:09AM +0200, Rafael J. Wysocki wrote:
>
> > So if the "cached" thing is the last "not boosted EPP", that's why
> > it
> > is not updated here.
>
> Sure, I see what it does, just saying that naming and comments are
> wrong
> vs the actual code.
I will fix in the next revision.
Thanks,
Srinivas
On Wed, 2018-05-16 at 09:22 +0200, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 09:49:04PM -0700, Srinivas Pandruvada wrote:
> > Setup necessary infrastructure to be able to boost HWP performance
> > on a
> > remote CPU. First initialize data structure to be able to use
> > smp_call_function_single_async(). The boost up function simply set
> > HWP
> > min to HWP max value and EPP to 0. The boost down function simply
> > restores
> > to last cached HWP Request value.
> >
> > To avoid reading HWP Request MSR during dynamic update, the HWP
> > Request
> > MSR value is cached in the local memory. This caching is done
> > whenever
> > HWP request MSR is modified during driver init on in setpolicy()
> > callback
> > path.
>
> The patch does two independent things; best to split that.
I had that split before, but thought no one was consuming the cached
value in that patch, so I combined them. If this is not a problem, it is
better to split the patch.
Thanks,
Srinivas
>
>
> > diff --git a/drivers/cpufreq/intel_pstate.c
> > b/drivers/cpufreq/intel_pstate.c
> > index f686bbe..dc7dfa9 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -221,6 +221,9 @@ struct global_params {
> > * preference/bias
> > * @epp_saved: Saved EPP/EPB during system suspend
> > or CPU offline
> > * operation
> > + * @hwp_req_cached: Cached value of the last HWP request
> > MSR
>
> That's simply not true given the code below.
>
> > @@ -763,6 +768,7 @@ static void intel_pstate_hwp_set(unsigned int
> > cpu)
> > intel_pstate_set_epb(cpu, epp);
> > }
> > skip_epp:
> > + cpu_data->hwp_req_cached = value;
> > wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
> > }
> >
> > @@ -1381,6 +1387,39 @@ static void
> > intel_pstate_get_cpu_pstates(struct cpudata *cpu)
> > intel_pstate_set_min_pstate(cpu);
> > }
> >
> > +
> > +static inline void intel_pstate_hwp_boost_up(struct cpudata *cpu)
> > +{
> > + u64 hwp_req;
> > + u8 max;
> > +
> > + max = (u8) (cpu->hwp_req_cached >> 8);
> > +
> > + hwp_req = cpu->hwp_req_cached & ~GENMASK_ULL(31, 24);
> > + hwp_req = (hwp_req & ~GENMASK_ULL(7, 0)) | max;
> > +
> > + wrmsrl(MSR_HWP_REQUEST, hwp_req);
> > +}
> > +
> > +static inline void intel_pstate_hwp_boost_down(struct cpudata
> > *cpu)
> > +{
> > + wrmsrl(MSR_HWP_REQUEST, cpu->hwp_req_cached);
> > +}
>
> This is not a traditional msr shadow; that would be updated on every
> wrmsr. There is not a comment in sight explaining why this one is
> different.
Hi Juri,
On Wed, 2018-05-16 at 08:49 +0200, Juri Lelli wrote:
> Hi Srinivas,
>
> On 15/05/18 21:49, Srinivas Pandruvada wrote:
>
> [...]
>
> >
> > Peter Zijlstra (1):
> > x86,sched: Add support for frequency invariance
>
> Cool! I was going to ask Peter about this patch. You beat me to it.
> :)
>
> I'll have a look at the set. BTW, just noticed that you Cc-ed me
> using
> my old address. Please use [email protected]. ;)
I didn't realize that. From next revision I will change to this
address.
Thanks,
Srinivas
>
> Best,
>
> - Juri
On Wed, 2018-05-16 at 09:49 +0200, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 09:49:11PM -0700, Srinivas Pandruvada wrote:
> > Enable HWP boost on Skylake server platform.
>
> Why not unconditionally enable it on everything HWP ?
I never noticed any performance difference on clients with HWP, nor did
anybody complain. Since clients use a single power domain, some other
CPU always gives a boost.
Thanks,
Srinivas
On Wed, May 16, 2018 at 05:19:25PM +0200, Juri Lelli wrote:
> Anyway, FWIW I started testing this on a E5-2609 v3 and I'm not seeing
> hackbench regressions so far (running with schedutil governor).
https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Server_processors
Lists the E5 2609 v3 as not having turbo at all, which is basically a
best case scenario for this patch.
As I wrote earlier today; when turbo exists, like say the 2699, then
when we're busy we'll run at U=2.3/3.6 ~ .64, which might confuse
things.
On Wed, May 16, 2018 at 08:46:24AM -0700, Srinivas Pandruvada wrote:
> On Wed, 2018-05-16 at 09:49 +0200, Peter Zijlstra wrote:
> > On Tue, May 15, 2018 at 09:49:11PM -0700, Srinivas Pandruvada wrote:
> > > Enable HWP boost on Skylake server platform.
> >
> > Why not unconditionally enable it on everything HWP ?
> Never noticed in any difference performance in clients with HWP nor
> anybody complained. Since clients uses single power domain, some other
> CPUs always gives boost.
Can't the hardware tell us about per-core stuff? Do we really have to
use tables for that?
On Wed, 2018-05-16 at 17:19 +0200, Juri Lelli wrote:
> On 15/05/18 21:49, Srinivas Pandruvada wrote:
> > intel_pstate has two operating modes: active and passive. In
> > "active"
> > mode, the in-built scaling governor is used and in "passive" mode,
> > the driver can be used with any governor like "schedutil". In
> > "active"
> > mode the utilization values from schedutil is not used and there is
> > a requirement from high performance computing use cases, not to
> > read
> > any APERF/MPERF MSRs. In this case no need to use CPU cycles for
> > frequency invariant accounting by reading APERF/MPERF MSRs.
> > With this change frequency invariant account is only enabled in
> > "passive" mode.
> >
> > Signed-off-by: Srinivas Pandruvada <[email protected]
> > .com>
> > ---
> > [Note: The tick will be enabled later in the series when hwp
> > dynamic
> > boost is enabled]
> >
> > drivers/cpufreq/intel_pstate.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/cpufreq/intel_pstate.c
> > b/drivers/cpufreq/intel_pstate.c
> > index 17e566af..f686bbe 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -2040,6 +2040,8 @@ static int
> > intel_pstate_register_driver(struct cpufreq_driver *driver)
> > {
> > int ret;
> >
> > + x86_arch_scale_freq_tick_disable();
> > +
> > memset(&global, 0, sizeof(global));
> > global.max_perf_pct = 100;
> >
> > @@ -2052,6 +2054,9 @@ static int
> > intel_pstate_register_driver(struct cpufreq_driver *driver)
> >
> > global.min_perf_pct = min_perf_pct_min();
> >
> > + if (driver == &intel_cpufreq)
> > + x86_arch_scale_freq_tick_enable();
>
> This will unconditionally trigger the reading/calculation at each
> tick
> even though information is not actually consumed (e.g., running
> performance or any other governor), right? Do we want that?
Good point. I should call x86_arch_scale_freq_tick_disable() when
switching to the performance policy in active mode.
Thanks,
Srinivas
>
> Anyway, FWIW I started testing this on a E5-2609 v3 and I'm not
> seeing
> hackbench regressions so far (running with schedutil governor).
>
> Best,
>
> - Juri
On 16/05/18 17:47, Peter Zijlstra wrote:
> On Wed, May 16, 2018 at 05:19:25PM +0200, Juri Lelli wrote:
>
> > Anyway, FWIW I started testing this on a E5-2609 v3 and I'm not seeing
> > hackbench regressions so far (running with schedutil governor).
>
> https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Server_processors
>
> Lists the E5 2609 v3 as not having turbo at all, which is basically a
> best case scenario for this patch.
>
> As I wrote earlier today; when turbo exists, like say the 2699, then
> when we're busy we'll run at U=2.3/3.6 ~ .64, which might confuse
> things.
Indeed. I was mostly trying to see if adding this to the tick might
introduce noticeable overhead.
On Wed, 2018-05-16 at 11:07 +0200, Rafael J. Wysocki wrote:
> On Wed, May 16, 2018 at 9:29 AM, Peter Zijlstra <[email protected]
> > wrote:
> > On Wed, May 16, 2018 at 09:16:40AM +0200, Peter Zijlstra wrote:
> > > On Tue, May 15, 2018 at 09:49:03PM -0700, Srinivas Pandruvada
> > > wrote:
> > > > intel_pstate has two operating modes: active and passive. In
> > > > "active"
> > > > mode, the in-built scaling governor is used and in "passive"
> > > > mode,
> > > > the driver can be used with any governor like "schedutil". In
> > > > "active"
> > > > mode the utilization values from schedutil is not used and
> > > > there is
> > > > a requirement from high performance computing use cases, not to
> > > > read
> > > > any APERF/MPERF MSRs. In this case no need to use CPU cycles
> > > > for
> > > > frequency invariant accounting by reading APERF/MPERF MSRs.
> > > > With this change frequency invariant account is only enabled in
> > > > "passive" mode.
> > >
> > > WTH is active/passive? Is passive when we select performance
> > > governor?
> >
> > Bah, I cannot read it seems. active is when we use the intel_pstate
> > governor and passive is when we use schedutil and only use
> > intel_pstate
> > as a driver.
> >
> > > Also; you have to explain why using APERF/MPERF is bad in that
> > > case. Why
> > > do they care if we read those MSRs during the tick?
> >
> > That still stands.. this needs to be properly explained.
>
> I guess this is from the intel_pstate perspective only.
>
> The active mode is only used with HWP, so intel_pstate doesn't look
> at
> the utilization (in any form) in the passive mode today.
>
> Still, there are other reasons for PELT to be scale-invariant, so ...
I am not sure about a use case in active mode other than the dynamic HWP
boost later in this series. If needed, I can remove this patch.
On Wed, 2018-05-16 at 09:37 +0200, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 09:49:06PM -0700, Srinivas Pandruvada wrote:
[...]
> >
> > @@ -258,6 +261,8 @@ struct cpudata {
> > s16 epp_saved;
> > u64 hwp_req_cached;
> > call_single_data_t csd;
> > + bool hwp_boost_active;
> > + u64 last_io_update;
>
> This structure has abysmal layout; you should look at that.
> Also, mandatory complaint about using _Bool in composites.
>
I will take care about this in next version.
[...]
> > +/* Default: This will roughly around P1 on SKX */
> > +#define BOOST_PSTATE_THRESHOLD (SCHED_CAPACITY_SCALE / 2)
>
> Yuck.. why the need to hardcode this? Can't you simply read the P1
> value
> for the part at hand?
Done later in the series. So need to reorder.
[..]
>
> > + if (cpu->iowait_boost) {
> > + cpu->hwp_boost_active = true;
> > + if (smp_processor_id() == cpu->cpu)
> > + intel_pstate_hwp_boost_up(cpu);
> > + else
> > + smp_call_function_single_async(cpu->cpu,
> > &cpu->csd);
> > + }
> > }
>
> Hurmph, this looks like you're starting to duplicate the schedutil
> iowait logic. Why didn't you copy the gradual boosting thing?
I tried what we implemented in intel_pstate in legacy mode, which gave
the best performance for servers (ramp up fast and ramp down slowly).
This caused regressions on some workloads, as each time we change
HWP_REQUEST we spend 1200+ cycles issuing the MSR. So we keep it at max
for the hold timeout. Even keeping it at P1 didn't help the power numbers.
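For reference, the "gradual boosting thing" mentioned above is the
schedutil-style shape where the boost roughly doubles on consecutive IO
wakes and decays once they stop; a simplified, self-contained sketch
(hypothetical names, not the schedutil code and not what this series does):

#include <stdbool.h>

/*
 * Rough sketch of schedutil-style gradual IO boosting applied to an HWP
 * min performance level. Hypothetical helper, not part of the patch set.
 */
struct io_boost {
        unsigned int min_perf;  /* lowest HWP performance level */
        unsigned int max_perf;  /* highest HWP performance level */
        unsigned int cur;       /* current boosted min perf, 0 = no boost */
};

static unsigned int io_boost_next(struct io_boost *b, bool io_wake)
{
        if (io_wake) {
                /* double the boost on every IO wake, capped at max_perf */
                b->cur = b->cur ? 2 * b->cur : b->min_perf;
                if (b->cur > b->max_perf)
                        b->cur = b->max_perf;
        } else if (b->cur) {
                /* halve the boost when a tick passes without an IO wake */
                b->cur /= 2;
                if (b->cur < b->min_perf)
                        b->cur = 0;
        }
        return b->cur;  /* value to program as HWP min perf (0 = restore) */
}

Every step of such a ramp is another HWP Request MSR write at 1000+ cycles,
which is why the series instead programs max once and relies on the hold time.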
On Wed, 2018-05-16 at 11:45 +0200, Rafael J. Wysocki wrote:
[...]
>
> > + if (time_before64(time, cpu->last_io_update + 2 * TICK_NSEC) &&
> > + intel_pstate_check_boost_threhold(cpu))
> > + cpu->iowait_boost = true;
> > +
> > + cpu->last_io_update = time;
> > + cpu->last_update = time;
>
> This is a shared data structure and it gets updated without
> synchronization, unless I'm missing something.
Good point.
>
> How much does the cross-CPU case matter?
I was under the impression that the IOWAIT flag is set on local CPU calls
only, but I see IOWAIT flags set for remote CPUs all the time. So we will
miss boosts if we don't take care of cross-CPU calls.
But I can lump them into the smp async call for all cross-CPU updates
to avoid the synchronization issue.
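A minimal sketch of that idea, reusing the csd added earlier in the series;
intel_pstate_hwp_boost_up_local() is a hypothetical helper that would do
the iowait_boost/last_io_update bookkeeping plus the MSR write on the
owning CPU:

/*
 * Sketch only, not the posted code: route every cross-CPU update through
 * the async call so that cpudata fields are written exclusively by the
 * CPU that owns them, avoiding unsynchronized remote writes.
 */
static void intel_pstate_hwp_boost_csd_func(void *arg)
{
        struct cpudata *cpu = arg;              /* always runs on cpu->cpu */

        intel_pstate_hwp_boost_up_local(cpu);   /* hypothetical helper */
}

static void intel_pstate_hwp_queue_boost(struct cpudata *cpu)
{
        if (smp_processor_id() == cpu->cpu)
                intel_pstate_hwp_boost_up_local(cpu);
        else
                smp_call_function_single_async(cpu->cpu, &cpu->csd);
}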
>
> > + }
> >
> > + /*
> > + * If the boost is active, we will remove it after timeout
> > on local
> > + * CPU only.
> > + */
> > + if (cpu->hwp_boost_active) {
> > + if (smp_processor_id() == cpu->cpu) {
> > + bool expired;
> > +
> > + expired = time_after64(time, cpu->last_update +
> > + (hwp_boost_hold_time_ms * NSEC_PER_MSEC));
> > + if (expired) {
> > + intel_pstate_hwp_boost_down(cpu);
> > + cpu->hwp_boost_active = false;
> > + cpu->iowait_boost = false;
> > + }
> > + }
> > + return;
> > + }
> > +
> > + cpu->last_update = time;
> > +
> > + if (cpu->iowait_boost) {
> > + cpu->hwp_boost_active = true;
> > + if (smp_processor_id() == cpu->cpu)
> > + intel_pstate_hwp_boost_up(cpu);
> > + else
> > + smp_call_function_single_async(cpu->cpu,
> > &cpu->csd);
> > + }
> > }
> >
> > static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu)
> > --
> > 2.9.5
> >
On Wed, 2018-05-16 at 11:49 +0200, Rafael J. Wysocki wrote:
> On Wed, May 16, 2018 at 6:49 AM, Srinivas Pandruvada
> <[email protected]> wrote:
> > When a busy task migrates to a new CPU, boost HWP performance to max. This
> > helps workloads on servers with per core P-states, which saturate all
> > CPUs and then migrate frequently. But changing limits has the extra
> > overhead of issuing a new HWP Request MSR, which takes 1000+
> > cycles. So this change rate limits setting the HWP Request MSR.
> > Rate control in setting HWP Requests:
> > - If the current performance is around P1, simply ignore.
> > - Once set, wait until the hold time expires before removing the boost. If
> > another flag is notified while the boost is on, it will prolong the boost.
> > - The migrated task needs to have some utilization which is more
> > than the threshold utilization, which will trigger a P-state above
> > minimum.
> >
> > Signed-off-by: Srinivas Pandruvada <[email protected]
> > .com>
> > ---
> > drivers/cpufreq/intel_pstate.c | 37
> > ++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 36 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/cpufreq/intel_pstate.c
> > b/drivers/cpufreq/intel_pstate.c
> > index d418265..ec455af 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -227,6 +227,7 @@ struct global_params {
> > * defines callback and arguments
> > * @hwp_boost_active: HWP performance is boosted on this CPU
> > * @last_io_update: Last time when IO wake flag was set
> > + * @migrate_hint: Set when scheduler indicates thread
> > migration
> > *
> > * This structure stores per CPU instance data for all CPUs.
> > */
> > @@ -263,6 +264,7 @@ struct cpudata {
> > call_single_data_t csd;
> > bool hwp_boost_active;
> > u64 last_io_update;
> > + bool migrate_hint;
>
> Why do you need this in the struct?
>
> It looks like it only is used locally in
> intel_pstate_update_util_hwp().
It is required because we just set a flag in the new CPU's cpudata
indicating that a task has migrated onto it. We update performance on the
next tick, on the local CPU only, once the rq utilization has been updated,
to check whether it is worth boosting.
Thanks,
Srinivas
On Wed, 2018-05-16 at 12:10 +0530, Viresh Kumar wrote:
> On 15-05-18, 21:49, Srinivas Pandruvada wrote:
> > Added cpufreq_get_sched_util() to get the CFS, DL and max
> > utilization
> > values for a CPU. This is required for getting utilization values
> > for cpufreq drivers outside of kernel/sched folder.
> >
> > Signed-off-by: Srinivas Pandruvada <[email protected]
> > .com>
> > ---
> > include/linux/sched/cpufreq.h | 2 ++
> > kernel/sched/cpufreq.c | 23 +++++++++++++++++++++++
> > 2 files changed, 25 insertions(+)
> >
> > diff --git a/include/linux/sched/cpufreq.h
> > b/include/linux/sched/cpufreq.h
> > index 5966744..a366600 100644
> > --- a/include/linux/sched/cpufreq.h
> > +++ b/include/linux/sched/cpufreq.h
> > @@ -20,6 +20,8 @@ void cpufreq_add_update_util_hook(int cpu, struct
> > update_util_data *data,
> > void (*func)(struct update_util_data *data,
> > u64 time,
> > unsigned int flags));
> > void cpufreq_remove_update_util_hook(int cpu);
> > +void cpufreq_get_sched_util(int cpu, unsigned long *util_cfs,
> > + unsigned long *util_dl, unsigned long
> > *max);
> > #endif /* CONFIG_CPU_FREQ */
> >
> > #endif /* _LINUX_SCHED_CPUFREQ_H */
> > diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
> > index 5e54cbc..36e2839 100644
> > --- a/kernel/sched/cpufreq.c
> > +++ b/kernel/sched/cpufreq.c
> > @@ -60,3 +60,26 @@ void cpufreq_remove_update_util_hook(int cpu)
> > rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu),
> > NULL);
> > }
> > EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook);
> > +
> > +/**
> > + * cpufreq_get_sched_util - Get utilization values.
> > + * @cpu: The targeted CPU.
> > + *
> > + * Get the CFS, DL and max utilization.
> > + * This function allows cpufreq driver outside the kernel/sched to
> > access
> > + * utilization value for a CPUs run queue.
> > + */
> > +void cpufreq_get_sched_util(int cpu, unsigned long *util_cfs,
> > + unsigned long *util_dl, unsigned long
> > *max)
> > +{
> > +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
>
> What will happen when schedutil is compiled in the kernel but
> ondemand
> is the one getting used currently ?
It should still work. The only reason I have to use the ifdef is compile
issues when CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not defined.
The reason for that is that cpu_util_cfs() and cpu_util_dl() are defined
under #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL.
The actual code inside cpu_util_cfs() and cpu_util_dl() uses
rq->cfs.avg.util_avg and rq->dl.running_bw, which are updated
irrespective of the cpufreq governor. Maybe it is better to remove the
ifdefs from cpu_util_dl() and cpu_util_cfs().
Thanks,
Srinivas
>
> > + struct rq *rq = cpu_rq(cpu);
> > +
> > + *max = arch_scale_cpu_capacity(NULL, cpu);
> > + *util_cfs = cpu_util_cfs(rq);
> > + *util_dl = cpu_util_dl(rq);
> > +#else
> > + *util_cfs = *util_dl = 1;
> > +#endif
> > +}
> > +EXPORT_SYMBOL_GPL(cpufreq_get_sched_util);
> > --
> > 2.9.5
>
>
On Wed, 2018-05-16 at 10:11 +0200, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 09:49:07PM -0700, Srinivas Pandruvada wrote:
> > --- a/kernel/sched/cpufreq.c
> > +++ b/kernel/sched/cpufreq.c
> > @@ -60,3 +60,26 @@ void cpufreq_remove_update_util_hook(int cpu)
> > rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu),
> > NULL);
> > }
> > EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook);
> > +
> > +/**
> > + * cpufreq_get_sched_util - Get utilization values.
> > + * @cpu: The targeted CPU.
> > + *
> > + * Get the CFS, DL and max utilization.
> > + * This function allows cpufreq driver outside the kernel/sched to
> > access
> > + * utilization value for a CPUs run queue.
> > + */
> > +void cpufreq_get_sched_util(int cpu, unsigned long *util_cfs,
> > + unsigned long *util_dl, unsigned long
> > *max)
> > +{
> > +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > + struct rq *rq = cpu_rq(cpu);
> > +
> > + *max = arch_scale_cpu_capacity(NULL, cpu);
> > + *util_cfs = cpu_util_cfs(rq);
> > + *util_dl = cpu_util_dl(rq);
> > +#else
> > + *util_cfs = *util_dl = 1;
> > +#endif
> > +}
> > +EXPORT_SYMBOL_GPL(cpufreq_get_sched_util);
>
> So I _really_ hate this... I'd much rather you make schedutil work
> with
> the hwp passive stuff.
Are you not happy with the ifdefs, or with the utility function itself? Can
you explain more how this should be done?
Utilization values are not passed with the scheduler update_util callback,
so intel_pstate needs access to rq->cfs.avg.util* and rq->dl.running_bw.
The ifdefs can be removed if we remove the ifdefs around cpu_util_dl() and
cpu_util_cfs(), which do nothing special other than accessing
rq->cfs.avg.util* and rq->dl.running_bw. I think it is better to remove
them from these functions.
>
> Also, afaict intel_pstate is bool, not tristate, so no need for an
> EXPORT at all.
Correct. I was just following the other interface functions for sched util
hooks in this file, which use EXPORT.
But I will remove the EXPORT in the next revision.
Thanks,
Srinivas
On Wed, 2018-05-16 at 17:54 +0200, Peter Zijlstra wrote:
> On Wed, May 16, 2018 at 08:46:24AM -0700, Srinivas Pandruvada wrote:
> > On Wed, 2018-05-16 at 09:49 +0200, Peter Zijlstra wrote:
> > > On Tue, May 15, 2018 at 09:49:11PM -0700, Srinivas Pandruvada
> > > wrote:
> > > > Enable HWP boost on Skylake server platform.
> > >
> > > Why not unconditionally enable it on everything HWP ?
> >
> > Never noticed in any difference performance in clients with HWP nor
> > anybody complained. Since clients uses single power domain, some
> > other
> > CPUs always gives boost.
>
> Can't the hardware tell us about per-core stuff? Do we really have to
> use tables for that?
AFAIK not in a standard way across all platforms. But I will check on
this.
On Wed, May 16, 2018 at 03:40:39PM -0700, Srinivas Pandruvada wrote:
> On Wed, 2018-05-16 at 10:11 +0200, Peter Zijlstra wrote:
> > So I _really_ hate this... I'd much rather you make schedutil work
> > with the hwp passive stuff.
> Are you not happy with ifdefs are utility function itself? Can you
> explain more how this should be done?
The reason I hate it is because it exports data that should not be. We
created the schedutil governor as part of the scheduler such that we
could tightly couple it.
So if you want to use all that, get schedutil to work for you.
On Wed, May 16, 2018 at 10:55:13AM -0700, Srinivas Pandruvada wrote:
> On Wed, 2018-05-16 at 09:37 +0200, Peter Zijlstra wrote:
> > On Tue, May 15, 2018 at 09:49:06PM -0700, Srinivas Pandruvada wrote:
> > Hurmph, this looks like you're starting to duplicate the schedutil
> > iowait logic. Why didn't you copy the gradual boosting thing?
> I tried what we implemented in intel_pstate in legacy mode, which gave
> the best performance for servers (ramp up faster and slow ramp down).
> This caused regression on some workloads, as each time we can call
> HWP_REQUEST, we spend 1200+ cycles in issuing MSR. So keeping it at max
> for timeout. Even keeping at P1 didn't help in power numbers.
> > > - If the IO flags are notified multiple ticks apart, this may not be
> > > IO bound task. Otherwise idle system gets periodic boosts for one
> > > IO wake.
That is what made me think of the gradual boosting. Because that is very
similar to the scenario that made us do that.
But like said elsewhere, ideally we'd use schedutil exclusively and
reduce intel_pstate to a pure driver. The normal way to deal with slow
transitions is rate limiting. Another option, for the slow HWP MSR driver,
is to only issue the update when we reach max_freq.
But if the MSR ever gets cheaper we can issue it more often/finer
grained.
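A generic sketch of the rate-limiting option (illustrative names only; the
"always honour max" test on its own corresponds to only issuing the update
when max_freq is reached):

#include <stdbool.h>
#include <stdint.h>

/*
 * Hedged sketch of rate limiting an expensive frequency-request write:
 * only write if the requested level changed and either we are asking for
 * the maximum or enough time has passed since the last write.
 */
struct perf_request {
        uint64_t last_write_ns;         /* time of the last MSR write */
        uint64_t rate_limit_ns;         /* minimum spacing between writes */
        unsigned int last_perf;         /* last value actually written */
        unsigned int max_perf;          /* highest performance level */
};

static bool should_write_request(struct perf_request *r, uint64_t now_ns,
                                 unsigned int want_perf)
{
        if (want_perf == r->last_perf)
                return false;                   /* nothing to change */

        if (want_perf == r->max_perf)
                return true;                    /* honour max immediately */

        return now_ns - r->last_write_ns >= r->rate_limit_ns;
}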
On 16/05/18 18:31, Juri Lelli wrote:
> On 16/05/18 17:47, Peter Zijlstra wrote:
> > On Wed, May 16, 2018 at 05:19:25PM +0200, Juri Lelli wrote:
> >
> > > Anyway, FWIW I started testing this on a E5-2609 v3 and I'm not seeing
> > > hackbench regressions so far (running with schedutil governor).
> >
> > https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Server_processors
> >
> > Lists the E5 2609 v3 as not having turbo at all, which is basically a
> > best case scenario for this patch.
> >
> > As I wrote earlier today; when turbo exists, like say the 2699, then
> > when we're busy we'll run at U=2.3/3.6 ~ .64, which might confuse
> > things.
>
> Indeed. I was mostly trying to see if adding this to the tick might
> introduce noticeable overhead.
Blindly testing on an i5-5200U (2.2/2.7 GHz) gave the following
# perf bench sched messaging --pipe --thread --group 2 --loop 20000
                        count       mean       std     min     50%       95%       99%     max
hostname kernel
i5-5200U test_after      30.0  13.843433  0.590605  12.369  13.810  14.85635  15.08205  15.127
         test_before     30.0  13.571167  0.999798  12.228  13.302  15.57805  16.40029  16.690
It might be interesting to see what happens when using a single CPU
only?
Also, I will look at how the util signals look when a single CPU is
busy..
On 17/05/18 12:59, Juri Lelli wrote:
> On 16/05/18 18:31, Juri Lelli wrote:
> > On 16/05/18 17:47, Peter Zijlstra wrote:
> > > On Wed, May 16, 2018 at 05:19:25PM +0200, Juri Lelli wrote:
> > >
> > > > Anyway, FWIW I started testing this on a E5-2609 v3 and I'm not seeing
> > > > hackbench regressions so far (running with schedutil governor).
> > >
> > > https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Server_processors
> > >
> > > Lists the E5 2609 v3 as not having turbo at all, which is basically a
> > > best case scenario for this patch.
> > >
> > > As I wrote earlier today; when turbo exists, like say the 2699, then
> > > when we're busy we'll run at U=2.3/3.6 ~ .64, which might confuse
> > > things.
> >
> > Indeed. I was mostly trying to see if adding this to the tick might
> > introduce noticeable overhead.
>
> Blindly testing on an i5-5200U (2.2/2.7 GHz) gave the following
>
> # perf bench sched messaging --pipe --thread --group 2 --loop 20000
>
> count mean std min 50% 95% 99% max
> hostname kernel
> i5-5200U test_after 30.0 13.843433 0.590605 12.369 13.810 14.85635 15.08205 15.127
> test_before 30.0 13.571167 0.999798 12.228 13.302 15.57805 16.40029 16.690
>
> It might be interesting to see what happens when using a single CPU
> only?
>
> Also, I will look at how the util signals look when a single CPU is
> busy..
And this is showing where the problem is (as you were saying [1]):
https://gist.github.com/jlelli/f5438221186e5ed3660194e4f645fe93
Just look at the plots (and ignore setup).
First one (pid:4483) shows a single task busy running on a single CPU,
which seems to be able to sustain turbo for 5 sec. So task util reaches
~1024.
Second one (pid:4283) shows the same task, but running together with
other 3 tasks (each one pinned to a different CPU). In this case util
saturates at ~943, which is due to the fact that max freq is still
considered to be the turbo one. :/
[1] https://marc.info/?l=linux-kernel&m=152646464017810&w=2
On Thu, 2018-05-17 at 17:04 +0200, Juri Lelli wrote:
> On 17/05/18 12:59, Juri Lelli wrote:
> > On 16/05/18 18:31, Juri Lelli wrote:
> > > On 16/05/18 17:47, Peter Zijlstra wrote:
> > > > On Wed, May 16, 2018 at 05:19:25PM +0200, Juri Lelli wrote:
> > > >
> > > > > Anyway, FWIW I started testing this on a E5-2609 v3 and I'm
> > > > > not seeing
> > > > > hackbench regressions so far (running with schedutil
> > > > > governor).
> > > >
> > > > https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Serve
> > > > r_processors
> > > >
> > > > Lists the E5 2609 v3 as not having turbo at all, which is
> > > > basically a
> > > > best case scenario for this patch.
> > > >
> > > > As I wrote earlier today; when turbo exists, like say the 2699,
> > > > then
> > > > when we're busy we'll run at U=2.3/3.6 ~ .64, which might
> > > > confuse
> > > > things.
> > >
> > > Indeed. I was mostly trying to see if adding this to the tick
> > > might
> > > introduce noticeable overhead.
> >
> > Blindly testing on an i5-5200U (2.2/2.7 GHz) gave the following
> >
> > # perf bench sched messaging --pipe --thread --group 2 --loop 20000
> >
> >                         count       mean       std     min     50%       95%       99%     max
> > hostname kernel
> > i5-5200U test_after      30.0  13.843433  0.590605  12.369  13.810  14.85635  15.08205  15.127
> >          test_before     30.0  13.571167  0.999798  12.228  13.302  15.57805  16.40029  16.690
> >
> > It might be interesting to see what happens when using a single CPU
> > only?
> >
> > Also, I will look at how the util signals look when a single CPU is
> > busy..
>
> And this is showing where the problem is (as you were saying [1]):
>
> https://gist.github.com/jlelli/f5438221186e5ed3660194e4f645fe93
>
> Just look at the plots (and ignore setup).
>
> First one (pid:4483) shows a single task busy running on a single
> CPU,
> which seems to be able to sustain turbo for 5 sec. So task util
> reaches
> ~1024.
>
> Second one (pid:4283) shows the same task, but running together with
> other 3 tasks (each one pinned to a different CPU). In this case util
> saturates at ~943, which is due to the fact that max freq is still
> considered to be the turbo one. :/
One more point to note. Even if we calculate some utilization based on
the freq-invariant accounting and arrive at a P-state, we will not be able
to control any P-state in the turbo region (not even as a cap) on several
Intel processors using PERF_CTL MSRs.
>
> [1] https://marc.info/?l=linux-kernel&m=152646464017810&w=2
On Thu, May 17, 2018 at 08:41:32AM -0700, Srinivas Pandruvada wrote:
> One more point to note. Even if we calculate some utilization based on
> the freq-invariant and arrive at a P-state, we will not be able to
> control any P-state in turbo region (not even as a cap) on several
> Intel processors using PERF_CTL MSRs.
Right, but don't we need to set the PERF_CTL to max P in order to access
the turbo bins? So we still need to compute a P state, but as soon as we
reach max P, we're done.
And it's not as if setting anything below max P is a firm setting either
anyway, it's hints all the way down.
On Thu, 2018-05-17 at 18:16 +0200, Peter Zijlstra wrote:
> On Thu, May 17, 2018 at 08:41:32AM -0700, Srinivas Pandruvada wrote:
> > One more point to note. Even if we calculate some utilization based
> > on
> > the freq-invariant and arrive at a P-state, we will not be able to
> > control any P-state in turbo region (not even as a cap) on several
> > Intel processors using PERF_CTL MSRs.
>
> Right, but don't we need to set the PERF_CTL to max P in order to
> access
> the turbo bins?
Any PERF_CTL setting above what we call the "Turbo Activation Ratio" (which
can be less than the P1 read from the platform info MSR) will do that on
these systems (most clients from Ivy Bridge).
> So we still need to compute a P state, but as soon as we
> reach max P, we're done.
What will happen if we look at all core turbo as max and cap any
utilization above this to 1024?
>
> And its not as if setting anything below max P is a firm setting
> either
> anyway, its hints all the way down.
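To make the capping idea concrete, a hedged sketch assuming the max
non-turbo (P1) and all-core turbo bus ratios can be read from the platform
at boot; the all-core turbo ratio is gauged as SCHED_CAPACITY_SCALE and
anything above it is clipped:

#define SCHED_CAPACITY_SHIFT    10
#define SCHED_CAPACITY_SCALE    (1ULL << SCHED_CAPACITY_SHIFT)

static unsigned long long scale_to_all_core_turbo(unsigned long long acnt,
                                                  unsigned long long mcnt,
                                                  unsigned long long base_ratio,
                                                  unsigned long long all_core_turbo)
{
        unsigned long long freq;

        /*
         * acnt/mcnt gives the effective ratio relative to the max non-turbo
         * (P1) ratio that MPERF counts at; re-express it relative to the
         * all-core turbo ratio, in capacity units.
         */
        freq = (acnt * base_ratio * SCHED_CAPACITY_SCALE) /
               (mcnt * all_core_turbo);

        /* one/two-core turbo can exceed this; clip so u never goes above 1 */
        if (freq > SCHED_CAPACITY_SCALE)
                freq = SCHED_CAPACITY_SCALE;

        return freq;
}

With this gauge a fully loaded system saturates at u=1, at the cost of
single-core turbo excursions above the all-core ratio becoming invisible,
which is the clipping bias discussed in the rest of the thread.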
On Thu, May 17, 2018 at 6:42 PM, Srinivas Pandruvada
<[email protected]> wrote:
> On Thu, 2018-05-17 at 18:16 +0200, Peter Zijlstra wrote:
>> On Thu, May 17, 2018 at 08:41:32AM -0700, Srinivas Pandruvada wrote:
>> > One more point to note. Even if we calculate some utilization based
>> > on
>> > the freq-invariant and arrive at a P-state, we will not be able to
>> > control any P-state in turbo region (not even as a cap) on several
>> > Intel processors using PERF_CTL MSRs.
>>
>> Right, but don't we need to set the PERF_CTL to max P in order to
>> access the turbo bins?
>
> Any PERF_CTL setting above what we call "Turbo Activation ratio" (which
> can be less than P1 read from platform info MSR) will do that on these
> systems (Most clients from Ivy bridge).
>
>> So we still need to compute a P state, but as soon as we
>> reach max P, we're done.
>
> What will happen if we look at all core turbo as max and cap any
> utilization above this to 1024?
I was going to suggest that.
Otherwise, if we ever get (say) the max one-core turbo at any point
and the system only runs parallel workloads after that, it will appear
as underutilized.
On Thu, May 17, 2018 at 06:56:37PM +0200, Rafael J. Wysocki wrote:
> On Thu, May 17, 2018 at 6:42 PM, Srinivas Pandruvada
> > What will happen if we look at all core turbo as max and cap any
> > utilization above this to 1024?
>
> I was going to suggest that.
The basic premise behind all our frequency scaling is that there's a
linear relation between utilization and frequency, where u=1 gets us the
fastest.
Now, we all know this is fairly crude, but it is what we work with.
OTOH, the whole premise of turbo is that you don't in fact know what the
fastest is, and in that respect setting u=1 at the guaranteed or
sustainable frequency makes sense.
The direct consequence of allowing clipping is that u=1 doesn't select
the highest frequency, but since we don't select anything anyway
(p-code does that for us) all we really need is to have u=1 above that
turbo activation point you mentioned.
For parts where we have to directly select frequency this obviously
comes apart.
However; what happens when the sustainable freq drops below our initial
'max'? Imagine us dropping below the all-core-turbo because of AVX. Then
we're back to running at u<1 at full tilt.
Or for mobile parts, the sustainable frequency could drop because of
severe thermal limits. Now I _think_ we have the possibility for getting
interrupts and reading the new guaranteed frequency, so we could
re-gauge.
So in theory I think it works, in practice we need to always be able to
find the actual max -- be it all-core turbo, AVX or thermal constrained
frequency. Can we do that in all cases?
I need to go back to see what the complaints against Vincent's proposal
were, because I really liked the fact that it did away with all this.
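To put rough numbers on the AVX case above (illustrative figures only):
with the scale gauged so that an all-core turbo of, say, 2.8 GHz reads as
u=1, an AVX-heavy phase that caps the sustainable clock at 2.4 GHz leaves a
fully busy CPU reporting only u = 2.4/2.8 ~ .86, i.e. u<1 at full tilt,
unless the scale is re-gauged.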
On Thu, May 17, 2018 at 8:28 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, May 17, 2018 at 06:56:37PM +0200, Rafael J. Wysocki wrote:
>> On Thu, May 17, 2018 at 6:42 PM, Srinivas Pandruvada
>
>> > What will happen if we look at all core turbo as max and cap any
>> > utilization above this to 1024?
>>
>> I was going to suggest that.
>
> To the basic premise behind all our frequency scaling is that there's a
> linear relation between utilization and frequency, where u=1 gets us the
> fastest.
>
> Now, we all know this is fairly crude, but it is what we work with.
>
> OTOH, the whole premise of turbo is that you don't in fact know what the
> fastest is, and in that respect setting u=1 at the guaranteed or
> sustainable frequency makes sense.
>
> The direct concequence of allowing clipping is that u=1 doesn't select
> the highest frequency, but since we don't select anything anyway
> (p-code does that for us) all we really need is to have u=1 above that
> turbo activation point you mentioned.
>
> For parts where we have to directly select frequency this obviously
> comes apart.
>
> However; what happens when the sustainable freq drops below our initial
> 'max'? Imagine us dropping below the all-core-turbo because of AVX. Then
> we're back to running at u<1 at full tilt.
>
> Or for mobile parts, the sustainable frequency could drop because of
> severe thermal limits. Now I _think_ we have the possibility for getting
> interrupts and reading the new guaranteed frequency, so we could
> re-guage.
>
> So in theory I think it works, in practise we need to always be able to
> find the actual max -- be it all-core turbo, AVX or thermal constrained
> frequency. Can we do that in all cases?
We should be, but unfortunately that's a dynamic thing.
For example, the AVX limit only kicks in when AVX instructions are executed.
> I need to go back to see what the complains against Vincent's proposal
> were, because I really liked the fact that it did away with all this.
That would be the best way to deal with this mess, I agree.
On 17-May 20:28, Peter Zijlstra wrote:
> On Thu, May 17, 2018 at 06:56:37PM +0200, Rafael J. Wysocki wrote:
> > On Thu, May 17, 2018 at 6:42 PM, Srinivas Pandruvada
>
> > > What will happen if we look at all core turbo as max and cap any
> > > utilization above this to 1024?
> >
> > I was going to suggest that.
>
> To the basic premise behind all our frequency scaling is that there's a
> linear relation between utilization and frequency, where u=1 gets us the
> fastest.
>
> Now, we all know this is fairly crude, but it is what we work with.
>
> OTOH, the whole premise of turbo is that you don't in fact know what the
> fastest is, and in that respect setting u=1 at the guaranteed or
> sustainable frequency makes sense.
Looking from the FAIR class standpoint, we can also argue that
although you know that the max possible utilization is 1024, you are
not always guaranteed to reach it because of RT and interrupt pressure.
Or in big.LITTLE systems, because of the arch scaling factor.
Is it not something quite similar to the problem of having
"some not always available OPPs"?
To track these "capacity limitations" we already have the two
different concepts of cpu_capacity_orig and cpu_capacity.
Are not "thermal constraints" and "some not always available OPPs"
just another form of "capacity limitations".
They are:
- transient
exactly like RT and Interrupt pressure
- HW related
which is the main different wrt RT and Interrupt pressure
But, apart from this last point (i.e.they have an HW related "nature"),
IMHO they seems quite similar concept... which are already addresses,
although only within the FAIR class perhaps.
Thus, my simple (maybe dumb) questions are:
- why can't we just fold turbo boost frequency into the existing concepts?
- what are the limitations of such a "simple" approach?
IOW: utilization always measures wrt the maximum possible capacity
(i.e. max turbo mode) and then there is a way to know what is, on each
CPU and at every decision time, the actual "transient maximum" we can
expect to reach for a "reasonable" future time.
> The direct concequence of allowing clipping is that u=1 doesn't select
> the highest frequency, but since we don't select anything anyway
> (p-code does that for us) all we really need is to have u=1 above that
> turbo activation point you mentioned.
If clipping means that we can also have >1024 values which are just
clamped at read/get time, this could maybe have some side-effects on
math (signals propagations across TG) and type ranges control?
> For parts where we have to directly select frequency this obviously
> comes apart.
Moreover, utilization is not (will not be) just for frequency driving.
We should take the task placement perspective into account.
On that side, I personally like the definition _I think_ we have now:
utilization is the amount of maximum capacity used
where maximum is a constant defined at boot time and representing the
absolute max you can expect to get...
... apart from "transient capacity limitations".
Scaling the maximum depending on these transient conditions to me reads
like "changing the scale", which I fear will make it more difficult, for
example, to compare in space (different CPUs) or time (different
scheduler events) what a utilization measure means.
For example, if you have a busy loop running on a CPU which is subject
to RT pressure, you will read a <100% utilization (let's say 60%). Still
it's interesting to know that maybe I can try to move that task to an
IDLE CPU to run it faster.
Should not be the same for turbo boost?
If the same task is generating only 60% utilization because turbo boost
OPPs are not available, would it not still be useful to see that there
is, for example, another CPU (maybe on a different NUMA node) which is
IDLE and cold, where we can move the task to exploit the 100% capacity
provided by the topmost turbo boost mode?
> However; what happens when the sustainable freq drops below our initial
> 'max'? Imagine us dropping below the all-core-turbo because of AVX. Then
> we're back to running at u<1 at full tilt.
>
> Or for mobile parts, the sustainable frequency could drop because of
> severe thermal limits. Now I _think_ we have the possibility for getting
> interrupts and reading the new guaranteed frequency, so we could
> re-guage.
>
> So in theory I think it works, in practise we need to always be able to
> find the actual max -- be it all-core turbo, AVX or thermal constrained
> frequency. Can we do that in all cases?
>
>
> I need to go back to see what the complains against Vincent's proposal
> were, because I really liked the fact that it did away with all this.
AFAIR Vincent's proposal was mainly addressing a different issue: fast
ramp-up... I don't recall if there was any specific intent to cover
the issue of "transient maximum capacities".
And still, based on my (maybe bogus) reasoning above, I think we are
discussing here a slightly different problem which has already a
(maybe partial) solution.
--
#include <best/regards.h>
Patrick Bellasi
On Fri, May 18, 2018 at 11:57:42AM +0100, Patrick Bellasi wrote:
> Thus, my simple (maybe dumb) questions are:
> - why can't we just fold turbo boost frequency into the existing concepts?
> - what are the limitations of such a "simple" approach?
Perhaps... but does this not further complicate the whole capacity vs
util thing we already have in say the misfit patches? And the
util_fits_capacity() thing from the EAS ones.
The thing is, we either need to dynamically scale the util or the
capacity or both. I think for Thermal there are patches out there that
drop the capacity.
But we'd then have to do the same for turbo/vector and all the other
stuff as well. Otherwise we risk things like running at low U with 0%
idle and not triggering the tipping point between eas and regular
balancing.
So either way around we need to know the 'true' max, either to fudge
util or to fudge capacity. And I'm not sure we can know in some of these
cases :/
And while Vincent's patches might have been inspired by another problem,
they do have the effect of always allowing util to go to 1, which is
nice for this.
On 18-May 13:29, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 11:57:42AM +0100, Patrick Bellasi wrote:
> > Thus, my simple (maybe dumb) questions are:
> > - why can't we just fold turbo boost frequency into the existing concepts?
> > - what are the limitations of such a "simple" approach?
>
> Perhaps... but does this not further complicate the whole capacity vs
> util thing we already have in say the misfit patches?
Not sure about that...
> And the util_fits_capacity() thing from the EAS ones.
In this case instead, if we can track somehow (not saying we can)
what is the currently available "transient maximum capacity"...
then a util_fits_capacity() should just look at that.
If the transient capacity is already folded into cpu_capacity, as it
is now for RT and IRQ pressure, then likely we don't have to change
anything.
> The thing is, we either need to dynamically scale the util or the
> capacity or both. I think for Thermal there are patches out there that
> drop the capacity.
Not sure... but I would feel more comfortable with something which caps
the maximum capacity. Meaning, eventually you can fill up the maximum
possible capacity only "up to" a given value, because of thermal or other
reasons that most of the scheduler maybe doesn't even have to know about.
> But we'd then have to do the same for turbo/vector and all the other
> stuff as well. Otherwise we risk things like running at low U with 0%
> idle and not triggering the tipping point between eas and regular
> balancing.
Interacting with the tipping point and/or OPP changes is indeed an
interesting side of the problem I was not considering so far...
But again, the tipping point could not be defined as a delta
with respect to the "transient maximum capacity" ?
> So either way around we need to know the 'true' max, either to fudge
> util or to fudge capacity.
Right, but what I see from a concepts standpoint is something like:
+--+--+ cpu_capacity_orig (CONSTANT at boot time)
| | |
| | | HW generated constraints
| v |
+-----+ cpu_capacity_max (depending on thermal/turbo boost)
| | |
| | | SW generated constraints
| v |
+-----+ cpu_capacity (depending on RT/IRQ pressure)
| | |
| | | tipping point delta
+--v--+
| | Energy Aware mode available capacity
+-----+
Where all the wkp/lb heuristics are updated to properly consider the
cpu_capacity_max metric whenever they need to know the max speed we can
reach right now on a CPU.
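Read as code, that layering is roughly the following (purely illustrative;
none of these helpers exist in the kernel, they just restate the diagram as
successive clamps):

static unsigned long cpu_capacity_max(int cpu)
{
        /* cpu_capacity_orig: constant, fixed at boot time */
        unsigned long cap = cpu_capacity_orig(cpu);
        /* HW generated constraints: thermal caps, available turbo headroom */
        unsigned long hw_cap = hw_capacity_constraint(cpu);

        return cap < hw_cap ? cap : hw_cap;
}

static unsigned long ea_available_capacity(int cpu)
{
        /* SW generated constraints: RT and IRQ pressure -> cpu_capacity */
        unsigned long cap = cpu_capacity_max(cpu) - rt_irq_pressure(cpu);

        /* Energy Aware available capacity sits a tipping-point delta below */
        return cap - tipping_point_delta(cpu);
}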
> And I'm not sure we can know in some of these cases :/
Right, this schema will eventually work only under the hypothesis that
"somehow" we can update cpu_capacity_max from HW events.
Not entirely sure that's possible and/or at which time granularity on
all different platforms.
> And while Vincent's patches might have been inspired by another problem,
> they do have the effect of always allowing util to go to 1, which is
> nice for this.
Sure, that's a nice point, but still I have the feeling that always
reaching u=1 can defeat other interesting properties of a task.
For example, comparing task requirements across different CPUs and/or at
different times, which plays a big role for energy aware task
placement decisions.
--
#include <best/regards.h>
Patrick Bellasi
Hi,
On 18/05/18 12:29, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 11:57:42AM +0100, Patrick Bellasi wrote:
>> Thus, my simple (maybe dumb) questions are:
>> - why can't we just fold turbo boost frequency into the existing concepts?
>> - what are the limitations of such a "simple" approach?
>
> Perhaps... but does this not further complicate the whole capacity vs
> util thing we already have in say the misfit patches?
What do you mean by that? Bear in mind, I'm a complete stranger to turbo
boost so I fail to see why, as Patrick puts it, it can't fit within the
existing concepts (i.e. max util is 1024 but is only reachable when turbo
boosted).
Hi Peter,
maybe you missed this previous response of mine:
20180518133353.GO30654@e110439-lin ?
I would like to have your thoughts about the concept of "transient maximum
capacity" I was describing...
On 18-May 14:33, Patrick Bellasi wrote:
> On 18-May 13:29, Peter Zijlstra wrote:
> > On Fri, May 18, 2018 at 11:57:42AM +0100, Patrick Bellasi wrote:
> > > Thus, my simple (maybe dumb) questions are:
> > > - why can't we just fold turbo boost frequency into the existing concepts?
> > > - what are the limitations of such a "simple" approach?
> >
> > Perhaps... but does this not further complicate the whole capacity vs
> > util thing we already have in say the misfit patches?
>
> Not sure about that...
>
> > And the util_fits_capacity() thing from the EAS ones.
>
> In this case instead, if we can track somehow (not saying we can)
> what is the currently available "transient maximum capacity"...
> then a util_fits_capacity() should just look at that.
>
> If the transient capacity is already folded into cpu_capacity, as it
> is now for RT and IRQ pressure, then likely we don't have to change
> anything.
>
> > The thing is, we either need to dynamically scale the util or the
> > capacity or both. I think for Thermal there are patches out there that
> > drop the capacity.
>
> Not sure... but I would feel more comfortable by something which caps
> the maximum capacity. Meaning, eventually you can fill up the maximum
> possible capacity only "up to" a given value, because of thermal or other
> reasons most of the scheduler maybe doesn't even have to know why?
>
> > But we'd then have to do the same for turbo/vector and all the other
> > stuff as well. Otherwise we risk things like running at low U with 0%
> > idle and not triggering the tipping point between eas and regular
> > balancing.
>
> Interacting with the tipping point and/or OPP changes is indeed an
> interesting side of the problem I was not considering so far...
>
> But again, the tipping point could not be defined as a delta
> with respect to the "transient maximum capacity" ?
>
> > So either way around we need to know the 'true' max, either to fudge
> > util or to fudge capacity.
>
> Right, but what I see from a concepts standpoint is something like:
>
> +--+--+ cpu_capacity_orig (CONSTANT at boot time)
> | | |
> | | | HW generated constraints
> | v |
> +-----+ cpu_capacity_max (depending on thermal/turbo boost)
> | | |
> | | | SW generated constraints
> | v |
> +-----+ cpu_capacity (depending on RT/IRQ pressure)
> | | |
> | | | tipping point delta
> +--v--+
> | | Energy Aware mode available capacity
> +-----+
>
> Where all the wkp/lb heuristics are updated to properly consider the
> cpu_capacity_max metrics whenever it comes to know what is the max
> speed we can reach now on a CPU.
>
> > And I'm not sure we can know in some of these cases :/
>
> Right, this schema will eventually work only under the hypothesis that
> "somehow" we can update cpu_capacity_max from HW events.
>
> Not entirely sure that's possible and/or at which time granularity on
> all different platforms.
>
> > And while Vincent's patches might have been inspired by another problem,
> > they do have the effect of always allowing util to go to 1, which is
> > nice for this.
>
> Sure, that's a nice point, but still I have the feeling that always
> reaching u=1 can defeat other interesting properties of a task,
> For example, comparing task requirements in different CPUs and/or at
> different times, which plays a big role for energy aware task
> placement decisions.
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
--
#include <best/regards.h>
Patrick Bellasi