From: Rafael J. Wysocki <[email protected]>
A severe performance regression on AMD EPYC processors when using
the schedutil scaling governor was discovered by Phoronix.com and
attributed to the following commits:
41ea667227ba ("x86, sched: Calculate frequency invariance for
AMD systems")
976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P
for frequency invariance on AMD EPYC")
The source of the problem is that the maximum performance level taken
for computing the arch_max_freq_ratio value used in the x86 scale-
invariance code is higher than the one corresponding to the
cpuinfo.max_freq value coming from the acpi_cpufreq driver.
This effectively causes the scale-invariant utilization to fall below
100% even if the CPU runs at cpuinfo.max_freq or slightly faster, so
the schedutil governor selects a frequency below cpuinfo.max_freq
then. That frequency corresponds to a frequency table entry below
the maximum performance level necessary to get to the "boost" range
of CPU frequencies.
However, if the cpuinfo.max_freq value coming from acpi_cpufreq was
higher, the schedutil governor would select higher frequencies which
in turn would allow acpi_cpufreq to set more adequate performance
levels and to get to the "boost" range of CPU frequencies more often.
This issue affects any systems where acpi_cpufreq is used and the
"boost" (or "turbo") frequencies are enabled, not just AMD EPYC.
Moreover, commit db865272d9c4 ("cpufreq: Avoid configuring old
governors as default with intel_pstate") from the 5.10 development
cycle made it extremely easy to default to schedutil even if the
preferred driver is acpi_cpufreq as long as intel_pstate is built
too, because the mere presence of the latter effectively removes the
ondemand governor from the defaults. Distro kernels are likely to
include both intel_pstate and acpi_cpufreq on x86, so their users
who cannot use intel_pstate or choose to use acpi_cpufreq may
easily be affectecd by this issue.
To address this issue, extend the frequency table constructed by
acpi_cpufreq for each CPU to cover the entire range of available
frequencies (including the "boost" ones) if CPPC is available and
indicates that "boost" (or "turbo") frequencies are enabled. That
causes cpuinfo.max_freq to become the maximum "boost" frequency of
the given CPU (instead of the maximum frequency returned by the ACPI
_PSS object that corresponds to the "nominal" performance level).
Fixes: 41ea667227ba ("x86, sched: Calculate frequency invariance for AMD systems")
Fixes: 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC")
Fixes: db865272d9c4 ("cpufreq: Avoid configuring old governors as default with intel_pstate")
Link: https://www.phoronix.com/scan.php?page=article&item=linux511-amd-schedutil&num=1
Link: https://lore.kernel.org/linux-pm/[email protected]/
Reported-by: Michael Larabel <[email protected]>
Diagnosed-by: Giovanni Gherdovich <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/cpufreq/acpi-cpufreq.c | 107 ++++++++++++++++++++++++++++++++++++-----
1 file changed, 95 insertions(+), 12 deletions(-)
Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
+++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
@@ -26,6 +26,7 @@
#include <linux/uaccess.h>
#include <acpi/processor.h>
+#include <acpi/cppc_acpi.h>
#include <asm/msr.h>
#include <asm/processor.h>
@@ -53,6 +54,7 @@ struct acpi_cpufreq_data {
unsigned int resume;
unsigned int cpu_feature;
unsigned int acpi_perf_cpu;
+ unsigned int first_perf_state;
cpumask_var_t freqdomain_cpus;
void (*cpu_freq_write)(struct acpi_pct_register *reg, u32 val);
u32 (*cpu_freq_read)(struct acpi_pct_register *reg);
@@ -221,10 +223,10 @@ static unsigned extract_msr(struct cpufr
perf = to_perf_data(data);
- cpufreq_for_each_entry(pos, policy->freq_table)
+ cpufreq_for_each_entry(pos, policy->freq_table + data->first_perf_state)
if (msr == perf->states[pos->driver_data].status)
return pos->frequency;
- return policy->freq_table[0].frequency;
+ return policy->freq_table[data->first_perf_state].frequency;
}
static unsigned extract_freq(struct cpufreq_policy *policy, u32 val)
@@ -363,6 +365,7 @@ static unsigned int get_cur_freq_on_cpu(
struct cpufreq_policy *policy;
unsigned int freq;
unsigned int cached_freq;
+ unsigned int state;
pr_debug("%s (%d)\n", __func__, cpu);
@@ -374,7 +377,11 @@ static unsigned int get_cur_freq_on_cpu(
if (unlikely(!data || !policy->freq_table))
return 0;
- cached_freq = policy->freq_table[to_perf_data(data)->state].frequency;
+ state = to_perf_data(data)->state;
+ if (state < data->first_perf_state)
+ state = data->first_perf_state;
+
+ cached_freq = policy->freq_table[state].frequency;
freq = extract_freq(policy, get_cur_val(cpumask_of(cpu), data));
if (freq != cached_freq) {
/*
@@ -628,16 +635,54 @@ static int acpi_cpufreq_blacklist(struct
}
#endif
+#ifdef CONFIG_ACPI_CPPC_LIB
+static u64 get_max_boost_ratio(unsigned int cpu)
+{
+ struct cppc_perf_caps perf_caps;
+ u64 highest_perf, nominal_perf;
+ int ret;
+
+ if (acpi_pstate_strict)
+ return 0;
+
+ ret = cppc_get_perf_caps(cpu, &perf_caps);
+ if (ret) {
+ pr_debug("CPU%d: Unable to get performance capabilities (%d)\n",
+ cpu, ret);
+ return 0;
+ }
+
+ highest_perf = perf_caps.highest_perf;
+ nominal_perf = perf_caps.nominal_perf;
+
+ if (!highest_perf || !nominal_perf) {
+ pr_debug("CPU%d: highest or nominal performance missing\n", cpu);
+ return 0;
+ }
+
+ if (highest_perf < nominal_perf) {
+ pr_debug("CPU%d: nominal performance above highest\n", cpu);
+ return 0;
+ }
+
+ return div_u64(highest_perf << SCHED_CAPACITY_SHIFT, nominal_perf);
+}
+#else
+static inline u64 get_max_boost_ratio(unsigned int cpu) { return 0; }
+#endif
+
static int acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
{
- unsigned int i;
- unsigned int valid_states = 0;
- unsigned int cpu = policy->cpu;
+ struct cpufreq_frequency_table *freq_table;
+ struct acpi_processor_performance *perf;
struct acpi_cpufreq_data *data;
+ unsigned int cpu = policy->cpu;
+ struct cpuinfo_x86 *c = &cpu_data(cpu);
+ unsigned int valid_states = 0;
unsigned int result = 0;
- struct cpuinfo_x86 *c = &cpu_data(policy->cpu);
- struct acpi_processor_performance *perf;
- struct cpufreq_frequency_table *freq_table;
+ unsigned int state_count;
+ u64 max_boost_ratio;
+ unsigned int i;
#ifdef CONFIG_SMP
static int blacklisted;
#endif
@@ -750,8 +795,20 @@ static int acpi_cpufreq_cpu_init(struct
goto err_unreg;
}
- freq_table = kcalloc(perf->state_count + 1, sizeof(*freq_table),
- GFP_KERNEL);
+ state_count = perf->state_count + 1;
+
+ max_boost_ratio = get_max_boost_ratio(cpu);
+ if (max_boost_ratio) {
+ /*
+ * Make a room for one more entry to represent the highest
+ * available "boost" frequency.
+ */
+ state_count++;
+ valid_states++;
+ data->first_perf_state = valid_states;
+ }
+
+ freq_table = kcalloc(state_count, sizeof(*freq_table), GFP_KERNEL);
if (!freq_table) {
result = -ENOMEM;
goto err_unreg;
@@ -785,6 +842,30 @@ static int acpi_cpufreq_cpu_init(struct
valid_states++;
}
freq_table[valid_states].frequency = CPUFREQ_TABLE_END;
+
+ if (max_boost_ratio) {
+ unsigned int state = data->first_perf_state;
+ unsigned int freq = freq_table[state].frequency;
+
+ /*
+ * Because the loop above sorts the freq_table entries in the
+ * descending order, freq is the maximum frequency in the table.
+ * Assume that it corresponds to the CPPC nominal frequency and
+ * use it to populate the frequency field of the extra "boost"
+ * frequency entry.
+ */
+ freq_table[0].frequency = freq * max_boost_ratio >> SCHED_CAPACITY_SHIFT;
+ /*
+ * The purpose of the extra "boost" frequency entry is to make
+ * the rest of cpufreq aware of the real maximum frequency, but
+ * the way to request it is the same as for the first_perf_state
+ * entry that is expected to cover the entire range of "boost"
+ * frequencies of the CPU, so copy the driver_data value from
+ * that entry.
+ */
+ freq_table[0].driver_data = freq_table[state].driver_data;
+ }
+
policy->freq_table = freq_table;
perf->state = 0;
@@ -858,8 +939,10 @@ static void acpi_cpufreq_cpu_ready(struc
{
struct acpi_processor_performance *perf = per_cpu_ptr(acpi_perf_data,
policy->cpu);
+ struct acpi_cpufreq_data *data = policy->driver_data;
+ unsigned int freq = policy->freq_table[data->first_perf_state].frequency;
- if (perf->states[0].core_frequency * 1000 != policy->cpuinfo.max_freq)
+ if (perf->states[0].core_frequency * 1000 != freq)
pr_warn(FW_WARN "P-state 0 is not max freq\n");
}
On Thu, 2021-02-04 at 18:25 +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> A severe performance regression on AMD EPYC processors when using
> the schedutil scaling governor was discovered by Phoronix.com and
> attributed to the following commits:
>
> 41ea667227ba ("x86, sched: Calculate frequency invariance for
> AMD systems")
>
> 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P
> for frequency invariance on AMD EPYC")
>
> [snip]
>
> Fixes: 41ea667227ba ("x86, sched: Calculate frequency invariance for AMD systems")
> Fixes: 976df7e5730e ("x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC")
> Fixes: db865272d9c4 ("cpufreq: Avoid configuring old governors as default with intel_pstate")
> Link: https://www.phoronix.com/scan.php?page=article&item=linux511-amd-schedutil&num=1
> Link: https://lore.kernel.org/linux-pm/[email protected]/
> Reported-by: Michael Larabel <[email protected]>
> Diagnosed-by: Giovanni Gherdovich <[email protected]>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
> drivers/cpufreq/acpi-cpufreq.c | 107 ++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 95 insertions(+), 12 deletions(-)
>
> Index: linux-pm/drivers/cpufreq/acpi-cpufreq.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/acpi-cpufreq.c
> +++ linux-pm/drivers/cpufreq/acpi-cpufreq.c
> [...]
Tested-by: Giovanni Gherdovich <[email protected]>
Reviewed-by: Giovanni Gherdovich <[email protected]>
Note there is also the Tested-by: Michael, from the other thread
https://lore.kernel.org/lkml/[email protected]/
I tested this patch with the "NASA Parallel Benchmarks" from [link below], the
results confirms that the 5.10 performance is recovered:
Ratios of completion times, lower is better (5.10 is the baseline)
5.10 5.11-rc6 5.11-rc6-ggherdov 5.11-rc6-rafael
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Integer Sort 1.00 1.21 0.91 0.93
Embarrassingly Parallel 1.00 1.60 1.00 1.00
Discrete FFT 1.00 1.68 0.67 0.67
CPU : MODEL : 2x AMD EPYC 7742
FREQUENCY TABLE : P2: 1.50 GHz
P1: 2.00 GHz
P0: 2.25 GHz
MAX BOOST : 3.40 GHz
[link] https://www.nas.nasa.gov/publications/npb.html
Thanks,
Giovanni