Hi Everyone,

The purpose of this series is to provide the scheduler with asymmetric CPU
capacity information on x86 hybrid systems based on Intel hardware.

The asymmetric CPU capacity information is important on hybrid systems as it
allows utilization to be computed for tasks in a consistent way across all
CPUs in the system, regardless of their capacity. This, in turn, allows
the schedutil cpufreq governor to set CPU performance levels consistently
in the cases when tasks migrate between CPUs of different capacities. It
should also help to improve task placement and load balancing decisions on
hybrid systems and it is key for anything along the lines of EAS.

The information in question comes from the MSR_HWP_CAPABILITIES register and
is provided to the scheduler by the intel_pstate driver, as per the changelog
of patch [3/3]. Patch [2/3] introduces the arch infrastructure needed for
that (in the form of a per-CPU capacity variable) and patch [1/3] is a
preliminary code adjustment.

The changes made by patch [2/3] are very simple, which is why this series is
being sent as an RFC: the patch increases overhead on non-hybrid as well as
on hybrid systems, which may be regarded as objectionable, even though the
increase is arguably not significant. The memory overhead is one unsigned long
variable per CPU, which is not a lot IMV, and there is also additional
memory access overhead at each arch_scale_cpu_capacity() call site, which I
don't expect to be noticeable. In any case, the extra
overhead can be avoided at the cost of making the code a bit more complex
(for example, the additional per-CPU memory can be allocated dynamically
on hybrid systems only and a static branch can be used for enabling access
to it when necessary). I'm just not sure if the extra complexity is really
worth it, so I'd like to know the x86 maintainers' take on this. If you'd
prefer the overhead to be avoided, please let me know.
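For illustration only, here is a rough userspace sketch of the more complex alternative mentioned above. It is not kernel code and not part of this series: a plain bool stands in for the static branch, malloc() stands in for the dynamic per-CPU allocation, and the names enable_asym_capacity(), set_cpu_capacity() and scale_cpu_capacity() are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define SCHED_CAPACITY_SCALE 1024ul

/* Stand-in for a static branch: false on non-hybrid systems. */
static bool asym_capacity_enabled;
/* Capacity table, allocated only when asymmetric capacity is enabled. */
static unsigned long *cpu_capacity_tbl;
static int cpu_capacity_nr;

/* Hypothetical setup helper: allocate the table and enable lookups. */
static int enable_asym_capacity(int nr_cpus)
{
	int i;

	assert(nr_cpus > 0);

	cpu_capacity_tbl = malloc((size_t)nr_cpus * sizeof(*cpu_capacity_tbl));
	if (!cpu_capacity_tbl)
		return -1;

	/* Every CPU starts with the default (flat) capacity. */
	for (i = 0; i < nr_cpus; i++)
		cpu_capacity_tbl[i] = SCHED_CAPACITY_SCALE;

	cpu_capacity_nr = nr_cpus;
	asym_capacity_enabled = true;
	return 0;
}

/* Hypothetical setter: a no-op unless the table has been set up. */
static void set_cpu_capacity(int cpu, unsigned long cap)
{
	if (asym_capacity_enabled && cpu >= 0 && cpu < cpu_capacity_nr)
		cpu_capacity_tbl[cpu] = cap;
}

/* Hypothetical getter: only touches extra memory when enabled. */
static unsigned long scale_cpu_capacity(int cpu)
{
	if (asym_capacity_enabled && cpu >= 0 && cpu < cpu_capacity_nr)
		return cpu_capacity_tbl[cpu];

	return SCHED_CAPACITY_SCALE;
}
```

Before enable_asym_capacity() is called, scale_cpu_capacity() returns SCHED_CAPACITY_SCALE without any extra memory access, which is the point of the trade-off: non-hybrid systems pay only for the flag check.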
Of course, any other feedback on the patches is welcome as well.
Thank you!
From: Rafael J. Wysocki <[email protected]>
Make intel_pstate use the HWP_HIGHEST_PERF values from
MSR_HWP_CAPABILITIES to set asymmetric CPU capacity information
via the previously introduced arch_set_cpu_capacity() on hybrid
systems without SMT.
Setting asymmetric CPU capacity is generally necessary to allow the
scheduler to compute task sizes in a consistent way across all CPUs
in a system where they differ by capacity. That, in turn, should help
to improve task placement and load balancing decisions. It is also
necessary for the schedutil cpufreq governor to operate as expected
on hybrid systems where tasks migrate between CPUs of different
capacities.
The underlying observation is that intel_pstate already uses
MSR_HWP_CAPABILITIES to get CPU performance information which is
exposed by it via sysfs and CPU performance scaling is based on it.
Thus using this information for setting asymmetric CPU capacity is
consistent with what the driver has been doing already. Moreover,
HWP_HIGHEST_PERF reflects the maximum capacity of a given CPU including
both the instructions-per-cycle (IPC) factor and the maximum turbo
frequency and the units in which that value is expressed are the same
for all CPUs in the system, so the maximum capacity ratio between two
CPUs can be obtained by computing the ratio of their HWP_HIGHEST_PERF
values. Of course, in principle that capacity ratio need not be
directly applicable at lower frequencies, so using it for providing the
asymmetric CPU capacity information to the scheduler is a rough
approximation, but it is as good as it gets. Also, measurements
indicate that this approximation is not too bad in practice.
If the given system is hybrid and non-SMT, the new code disables ITMT
support in the scheduler (because it may get in the way of asymmetric CPU
capacity code in the scheduler that automatically gets enabled by setting
asymmetric CPU capacity) after initializing all online CPUs and finds
the one with the maximum HWP_HIGHEST_PERF value. Next, it computes the
capacity number for each (online) CPU by dividing the product of its
HWP_HIGHEST_PERF and SCHED_CAPACITY_SCALE by the maximum HWP_HIGHEST_PERF.
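As a rough illustration of the computation described above (a standalone userspace sketch, not the driver code), the capacity of each CPU is its perf value multiplied by SCHED_CAPACITY_SCALE and divided by the largest perf value in the system. The helper names scaled_capacity() and compute_capacities() are made up for this example, and the perf values are assumed to be nonzero.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE 1024u

/* Scale one CPU's perf value against the largest perf value in the system. */
static unsigned int scaled_capacity(unsigned int perf, unsigned int max_perf)
{
	assert(max_perf != 0 && perf <= max_perf);

	/* 64-bit intermediate, mirroring the div_u64() use in the patch. */
	return (unsigned int)(((uint64_t)SCHED_CAPACITY_SCALE * perf) / max_perf);
}

/* Find the maximum perf value, then fill in the capacity of every CPU. */
static void compute_capacities(const unsigned int *perf, unsigned int *cap,
			       int nr_cpus)
{
	unsigned int max_perf = 0;
	int i;

	for (i = 0; i < nr_cpus; i++)
		if (perf[i] > max_perf)
			max_perf = perf[i];

	for (i = 0; i < nr_cpus; i++)
		cap[i] = scaled_capacity(perf[i], max_perf);
}
```

With sample perf values of 68, 69 and 42, compute_capacities() yields 1009, 1024 and 623, respectively, because the integer division truncates.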
When a CPU goes offline, its capacity is reset to SCHED_CAPACITY_SCALE
and if it is the one with the maximum HWP_HIGHEST_PERF value, the
capacity numbers for all of the other online CPUs are recomputed. This
also takes care of a cleanup during driver operation mode changes.
Analogously, when a new CPU goes online, its capacity number is updated
and if its HWP_HIGHEST_PERF value is greater than the current maximum
one, the capacity numbers for all of the other online CPUs are
recomputed.
The case when the driver is notified of a CPU capacity change, either
through the HWP interrupt or through an ACPI notification, is handled
similarly to the CPU online case above, except that if the target CPU
is the current highest-capacity one and its capacity is reduced, the
capacity numbers for all of the other online CPUs need to be recomputed
as well.
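The three cases of the capacity update handling can be sketched in userspace roughly as follows. This is a simplified model rather than the driver code: the struct and function names are made up for illustration, locking is omitted, the number of CPUs is fixed at 4, and when the maximum drops the sketch simply rescans every CPU.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE 1024u
#define NR_CPUS 4

struct hybrid_state {
	unsigned int perf[NR_CPUS];     /* last known perf value of each CPU */
	unsigned int capacity[NR_CPUS]; /* capacity derived from it */
	int max_cpu;                    /* CPU with the highest perf value */
};

static unsigned int scale(unsigned int perf, unsigned int max_perf)
{
	assert(max_perf != 0);
	return (unsigned int)(((uint64_t)SCHED_CAPACITY_SCALE * perf) / max_perf);
}

/* Recompute every capacity against the current maximum. */
static void rescale_all(struct hybrid_state *s)
{
	int i;

	for (i = 0; i < NR_CPUS; i++)
		s->capacity[i] = scale(s->perf[i], s->perf[s->max_cpu]);
}

/* Search for the maximum from scratch, then rescale everything. */
static void find_max_and_rescale(struct hybrid_state *s)
{
	int i;

	s->max_cpu = 0;
	for (i = 1; i < NR_CPUS; i++)
		if (s->perf[i] > s->perf[s->max_cpu])
			s->max_cpu = i;

	rescale_all(s);
}

static void update_capacity(struct hybrid_state *s, int cpu,
			    unsigned int new_perf)
{
	unsigned int old_max = s->perf[s->max_cpu];

	s->perf[cpu] = new_perf;

	if (new_perf > old_max) {
		/* New maximum: all CPUs are rescaled against it. */
		s->max_cpu = cpu;
		rescale_all(s);
	} else if (cpu == s->max_cpu && new_perf < old_max) {
		/* The maximum dropped: look for the new maximum. */
		find_max_and_rescale(s);
	} else {
		/* Maximum unchanged: only this CPU needs an update. */
		s->capacity[cpu] = scale(new_perf, s->perf[s->max_cpu]);
	}
}
```

Only the first two branches touch the capacity of other CPUs, which is why the common case of a non-maximum CPU changing its perf value stays cheap.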
If the driver's "no_turbo" sysfs attribute is updated, all of the CPU
capacity information is computed from scratch to reflect the new turbo
status.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/cpufreq/intel_pstate.c | 187 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 183 insertions(+), 4 deletions(-)
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -16,6 +16,7 @@
#include <linux/tick.h>
#include <linux/slab.h>
#include <linux/sched/cpufreq.h>
+#include <linux/sched/smt.h>
#include <linux/list.h>
#include <linux/cpu.h>
#include <linux/cpufreq.h>
@@ -215,6 +216,7 @@ struct global_params {
* @hwp_req_cached: Cached value of the last HWP Request MSR
* @hwp_cap_cached: Cached value of the last HWP Capabilities MSR
* @last_io_update: Last time when IO wake flag was set
+ * @capacity_perf: Perf from HWP_CAP used for capacity computations
* @sched_flags: Store scheduler flags for possible cross CPU update
* @hwp_boost_min: Last HWP boosted min performance
* @suspended: Whether or not the driver has been suspended.
@@ -253,6 +255,7 @@ struct cpudata {
u64 hwp_req_cached;
u64 hwp_cap_cached;
u64 last_io_update;
+ unsigned int capacity_perf;
unsigned int sched_flags;
u32 hwp_boost_min;
bool suspended;
@@ -295,6 +298,7 @@ static int hwp_mode_bdw __ro_after_init;
static bool per_cpu_limits __ro_after_init;
static bool hwp_forced __ro_after_init;
static bool hwp_boost __read_mostly;
+static bool hwp_is_hybrid;
static struct cpufreq_driver *intel_pstate_driver __read_mostly;
@@ -934,6 +938,93 @@ static struct freq_attr *hwp_cpufreq_att
NULL,
};
+static struct cpudata *hybrid_max_perf_cpu __read_mostly;
+/*
+ * This protects hybrid_max_perf_cpu, the @capacity_perf fields in struct
+ * cpudata, and the x86 arch capacity information from concurrent updates.
+ */
+static DEFINE_MUTEX(hybrid_capacity_lock);
+
+static unsigned int hybrid_get_cap_perf(struct cpudata *cpu)
+{
+ u64 hwp_cap = READ_ONCE(cpu->hwp_cap_cached);
+
+ if (READ_ONCE(global.no_turbo))
+ return HWP_GUARANTEED_PERF(hwp_cap);
+
+ return HWP_HIGHEST_PERF(hwp_cap);
+}
+
+static void hybrid_set_cpu_capacity(struct cpudata *cpu)
+{
+ u64 cap = div_u64((u64)SCHED_CAPACITY_SCALE * cpu->capacity_perf,
+ hybrid_max_perf_cpu->capacity_perf);
+
+ arch_set_cpu_capacity(cpu->cpu, cap);
+}
+
+static void hybrid_set_capacity_of_cpus(void)
+{
+ int cpunum;
+
+ for_each_online_cpu(cpunum) {
+ struct cpudata *cpu = all_cpu_data[cpunum];
+
+ /*
+ * Skip hybrid_max_perf_cpu because its capacity is the
+ * maximum and need not be computed.
+ */
+ if (cpu && cpu != hybrid_max_perf_cpu)
+ hybrid_set_cpu_capacity(cpu);
+ }
+}
+
+static void hybrid_update_cpu_scaling(void)
+{
+ struct cpudata *max_perf_cpu = NULL;
+ unsigned int max_cap_perf = 0;
+ int cpunum;
+
+ for_each_online_cpu(cpunum) {
+ struct cpudata *cpu = all_cpu_data[cpunum];
+ unsigned int cap_perf;
+
+ /*
+ * If hybrid_max_perf_cpu is not NULL at this point, it is
+ * being replaced, so skip it.
+ */
+ if (!cpu || cpu == hybrid_max_perf_cpu)
+ continue;
+
+ cap_perf = hybrid_get_cap_perf(cpu);
+ cpu->capacity_perf = cap_perf;
+ if (cap_perf > max_cap_perf) {
+ max_cap_perf = cap_perf;
+ max_perf_cpu = cpu;
+ }
+ }
+
+ if (max_perf_cpu) {
+ arch_set_cpu_capacity(max_perf_cpu->cpu, SCHED_CAPACITY_SCALE);
+ hybrid_max_perf_cpu = max_perf_cpu;
+ hybrid_set_capacity_of_cpus();
+ } else {
+ /* Revert to the flat CPU capacity structure. */
+ for_each_online_cpu(cpunum)
+ arch_set_cpu_capacity(cpunum, SCHED_CAPACITY_SCALE);
+ }
+}
+
+static void hybrid_init_cpu_scaling(void)
+{
+ mutex_lock(&hybrid_capacity_lock);
+
+ hybrid_max_perf_cpu = NULL;
+ hybrid_update_cpu_scaling();
+
+ mutex_unlock(&hybrid_capacity_lock);
+}
+
static void __intel_pstate_get_hwp_cap(struct cpudata *cpu)
{
u64 cap;
@@ -962,6 +1053,40 @@ static void intel_pstate_get_hwp_cap(str
}
}
+static void hybrid_update_capacity(struct cpudata *cpu)
+{
+ unsigned int max_cap_perf, cap_perf;
+
+ mutex_lock(&hybrid_capacity_lock);
+
+ if (!hybrid_max_perf_cpu)
+ goto unlock;
+
+ max_cap_perf = hybrid_max_perf_cpu->capacity_perf;
+
+ intel_pstate_get_hwp_cap(cpu);
+
+ cap_perf = hybrid_get_cap_perf(cpu);
+ cpu->capacity_perf = cap_perf;
+
+ if (cap_perf > max_cap_perf) {
+ arch_set_cpu_capacity(cpu->cpu, SCHED_CAPACITY_SCALE);
+ hybrid_max_perf_cpu = cpu;
+ hybrid_set_capacity_of_cpus();
+ goto unlock;
+ }
+
+ if (cpu == hybrid_max_perf_cpu && cap_perf < max_cap_perf) {
+ hybrid_update_cpu_scaling();
+ goto unlock;
+ }
+
+ hybrid_set_cpu_capacity(cpu);
+
+unlock:
+ mutex_unlock(&hybrid_capacity_lock);
+}
+
static void intel_pstate_hwp_set(unsigned int cpu)
{
struct cpudata *cpu_data = all_cpu_data[cpu];
@@ -1070,6 +1195,16 @@ static void intel_pstate_hwp_offline(str
value |= HWP_ENERGY_PERF_PREFERENCE(HWP_EPP_POWERSAVE);
wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
+
+ mutex_lock(&hybrid_capacity_lock);
+
+ if (hybrid_max_perf_cpu == cpu)
+ hybrid_update_cpu_scaling();
+
+ mutex_unlock(&hybrid_capacity_lock);
+
+ /* Reset the capacity of the CPU going offline to the initial value. */
+ arch_set_cpu_capacity(cpu->cpu, SCHED_CAPACITY_SCALE);
}
#define POWER_CTL_EE_ENABLE 1
@@ -1164,21 +1299,41 @@ static void __intel_pstate_update_max_fr
static void intel_pstate_update_limits(unsigned int cpu)
{
struct cpufreq_policy *policy = cpufreq_cpu_acquire(cpu);
+ struct cpudata *cpudata;
if (!policy)
return;
- __intel_pstate_update_max_freq(all_cpu_data[cpu], policy);
+ cpudata = all_cpu_data[cpu];
+
+ __intel_pstate_update_max_freq(cpudata, policy);
+
+ /* Prevent the driver from being unregistered now. */
+ mutex_lock(&intel_pstate_driver_lock);
cpufreq_cpu_release(policy);
+
+ hybrid_update_capacity(cpudata);
+
+ mutex_unlock(&intel_pstate_driver_lock);
}
static void intel_pstate_update_limits_for_all(void)
{
int cpu;
- for_each_possible_cpu(cpu)
- intel_pstate_update_limits(cpu);
+ for_each_possible_cpu(cpu) {
+ struct cpufreq_policy *policy = cpufreq_cpu_acquire(cpu);
+
+ if (!policy)
+ continue;
+
+ __intel_pstate_update_max_freq(all_cpu_data[cpu], policy);
+
+ cpufreq_cpu_release(policy);
+ }
+
+ hybrid_init_cpu_scaling();
}
/************************** sysfs begin ************************/
@@ -1612,6 +1767,13 @@ static void intel_pstate_notify_work(str
__intel_pstate_update_max_freq(cpudata, policy);
cpufreq_cpu_release(policy);
+
+ /*
+ * The driver will not be unregistered while this function is
+ * running, so update the capacity without acquiring the driver
+ * lock.
+ */
+ hybrid_update_capacity(cpudata);
}
wrmsrl_on_cpu(cpudata->cpu, MSR_HWP_STATUS, 0);
@@ -2013,8 +2175,10 @@ static void intel_pstate_get_cpu_pstates
if (pstate_funcs.get_cpu_scaling) {
cpu->pstate.scaling = pstate_funcs.get_cpu_scaling(cpu->cpu);
- if (cpu->pstate.scaling != perf_ctl_scaling)
+ if (cpu->pstate.scaling != perf_ctl_scaling) {
intel_pstate_hybrid_hwp_adjust(cpu);
+ hwp_is_hybrid = true;
+ }
} else {
cpu->pstate.scaling = perf_ctl_scaling;
}
@@ -2682,6 +2846,8 @@ static int intel_pstate_cpu_online(struc
*/
intel_pstate_hwp_reenable(cpu);
cpu->suspended = false;
+
+ hybrid_update_capacity(cpu);
}
return 0;
@@ -3124,6 +3290,19 @@ static int intel_pstate_register_driver(
global.min_perf_pct = min_perf_pct_min();
+ /*
+ * On hybrid systems, use asym capacity instead of ITMT, but because
+ * the capacity of SMT threads is not deterministic even approximately,
+ * do not do that when SMT is in use.
+ */
+ if (hwp_is_hybrid && !sched_smt_active()) {
+ sched_clear_itmt_support();
+
+ hybrid_init_cpu_scaling();
+
+ arch_rebuild_sched_domains();
+ }
+
return 0;
}
From: Rafael J. Wysocki <[email protected]>
Add arch_rebuild_sched_domains() for rebuilding scheduling domains and
updating topology on x86 and make the ITMT code use it.
First of all, this reduces code duplication somewhat and eliminates
the need for an extern variable, but it also lays the groundwork for
future work related to CPU capacity scaling.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
arch/x86/include/asm/topology.h | 6 ++++--
arch/x86/kernel/itmt.c | 12 ++++--------
arch/x86/kernel/smpboot.c | 10 +++++++++-
3 files changed, 17 insertions(+), 11 deletions(-)
Index: linux-pm/arch/x86/include/asm/topology.h
===================================================================
--- linux-pm.orig/arch/x86/include/asm/topology.h
+++ linux-pm/arch/x86/include/asm/topology.h
@@ -235,8 +235,6 @@ struct pci_bus;
int x86_pci_root_bus_node(int bus);
void x86_pci_root_bus_resources(int bus, struct list_head *resources);
-extern bool x86_topology_update;
-
#ifdef CONFIG_SCHED_MC_PRIO
#include <asm/percpu.h>
@@ -284,9 +282,13 @@ static inline long arch_scale_freq_capac
extern void arch_set_max_freq_ratio(bool turbo_disabled);
extern void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled);
+
+void arch_rebuild_sched_domains(void);
#else
static inline void arch_set_max_freq_ratio(bool turbo_disabled) { }
static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled) { }
+
+static inline void arch_rebuild_sched_domains(void) { }
#endif
extern void arch_scale_freq_tick(void);
Index: linux-pm/arch/x86/kernel/itmt.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/itmt.c
+++ linux-pm/arch/x86/kernel/itmt.c
@@ -54,10 +54,8 @@ static int sched_itmt_update_handler(str
old_sysctl = sysctl_sched_itmt_enabled;
ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
- if (!ret && write && old_sysctl != sysctl_sched_itmt_enabled) {
- x86_topology_update = true;
- rebuild_sched_domains();
- }
+ if (!ret && write && old_sysctl != sysctl_sched_itmt_enabled)
+ arch_rebuild_sched_domains();
mutex_unlock(&itmt_update_mutex);
@@ -114,8 +112,7 @@ int sched_set_itmt_support(void)
sysctl_sched_itmt_enabled = 1;
- x86_topology_update = true;
- rebuild_sched_domains();
+ arch_rebuild_sched_domains();
mutex_unlock(&itmt_update_mutex);
@@ -150,8 +147,7 @@ void sched_clear_itmt_support(void)
if (sysctl_sched_itmt_enabled) {
/* disable sched_itmt if we are no longer ITMT capable */
sysctl_sched_itmt_enabled = 0;
- x86_topology_update = true;
- rebuild_sched_domains();
+ arch_rebuild_sched_domains();
}
mutex_unlock(&itmt_update_mutex);
Index: linux-pm/arch/x86/kernel/smpboot.c
===================================================================
--- linux-pm.orig/arch/x86/kernel/smpboot.c
+++ linux-pm/arch/x86/kernel/smpboot.c
@@ -39,6 +39,7 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/cpuset.h>
#include <linux/init.h>
#include <linux/smp.h>
#include <linux/export.h>
@@ -125,7 +126,7 @@ static DEFINE_PER_CPU_ALIGNED(struct mwa
int __read_mostly __max_smt_threads = 1;
/* Flag to indicate if a complete sched domain rebuild is required */
-bool x86_topology_update;
+static bool x86_topology_update;
int arch_update_cpu_topology(void)
{
@@ -135,6 +136,13 @@ int arch_update_cpu_topology(void)
return retval;
}
+#ifdef CONFIG_X86_64
+void arch_rebuild_sched_domains(void) {
+ x86_topology_update = true;
+ rebuild_sched_domains();
+}
+#endif
+
static unsigned int smpboot_warm_reset_vector_count;
static inline void smpboot_setup_warm_reset_vector(unsigned long start_eip)
On 25/04/2024 21:06, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Make intel_pstate use the HWP_HIGHEST_PERF values from
> MSR_HWP_CAPABILITIES to set asymmetric CPU capacity information
> via the previously introduced arch_set_cpu_capacity() on hybrid
> systems without SMT.
Are there such systems around? My i7-13700K has P-cores (CPU0..CPU15)
with SMT.
> Setting asymmetric CPU capacity is generally necessary to allow the
> scheduler to compute task sizes in a consistent way across all CPUs
> in a system where they differ by capacity. That, in turn, should help
> to improve task placement and load balancing decisions. It is also
> necessary for the schedutil cpufreq governor to operate as expected
> on hybrid systems where tasks migrate between CPUs of different
> capacities.
>
> The underlying observation is that intel_pstate already uses
> MSR_HWP_CAPABILITIES to get CPU performance information which is
> exposed by it via sysfs and CPU performance scaling is based on it.
> Thus using this information for setting asymmetric CPU capacity is
> consistent with what the driver has been doing already. Moreover,
> HWP_HIGHEST_PERF reflects the maximum capacity of a given CPU including
> both the instructions-per-cycle (IPC) factor and the maximum turbo
> frequency and the units in which that value is expressed are the same
> for all CPUs in the system, so the maximum capacity ratio between two
> CPUs can be obtained by computing the ratio of their HWP_HIGHEST_PERF
> values. Of course, in principle that capacity ratio need not be
> directly applicable at lower frequencies, so using it for providing the
> asymmetric CPU capacity information to the scheduler is a rough
> approximation, but it is as good as it gets. Also, measurements
> indicate that this approximation is not too bad in practice.
So cpu_capacity has a direct mapping to itmt prio. cpu_capacity is itmt
prio with max itmt prio scaled to 1024.
Running it on i7-13700K (while allowing SMT) gives:
root@gulliver:~# dmesg | grep sched_set_itmt_core_prio
[ 3.957826] sched_set_itmt_core_prio() cpu=0 prio=68
[ 3.990401] sched_set_itmt_core_prio() cpu=1 prio=68
[ 4.015551] sched_set_itmt_core_prio() cpu=2 prio=68
[ 4.040720] sched_set_itmt_core_prio() cpu=3 prio=68
[ 4.065871] sched_set_itmt_core_prio() cpu=4 prio=68
[ 4.091018] sched_set_itmt_core_prio() cpu=5 prio=68
[ 4.116175] sched_set_itmt_core_prio() cpu=6 prio=68
[ 4.141374] sched_set_itmt_core_prio() cpu=7 prio=68
[ 4.166543] sched_set_itmt_core_prio() cpu=8 prio=69
[ 4.196289] sched_set_itmt_core_prio() cpu=9 prio=69
[ 4.214964] sched_set_itmt_core_prio() cpu=10 prio=69
[ 4.239281] sched_set_itmt_core_prio() cpu=11 prio=69
[ 4.263438] sched_set_itmt_core_prio() cpu=12 prio=68
[ 4.283790] sched_set_itmt_core_prio() cpu=13 prio=68
[ 4.308905] sched_set_itmt_core_prio() cpu=14 prio=68
[ 4.331751] sched_set_itmt_core_prio() cpu=15 prio=68
[ 4.356002] sched_set_itmt_core_prio() cpu=16 prio=42
[ 4.381639] sched_set_itmt_core_prio() cpu=17 prio=42
[ 4.395175] sched_set_itmt_core_prio() cpu=18 prio=42
[ 4.425625] sched_set_itmt_core_prio() cpu=19 prio=42
[ 4.449670] sched_set_itmt_core_prio() cpu=20 prio=42
[ 4.479681] sched_set_itmt_core_prio() cpu=21 prio=42
[ 4.506319] sched_set_itmt_core_prio() cpu=22 prio=42
[ 4.523774] sched_set_itmt_core_prio() cpu=23 prio=42
root@gulliver:~# dmesg | grep hybrid_set_cpu_capacity
[ 4.450883] hybrid_set_cpu_capacity() cpu=0 cap=1009
[ 4.455846] hybrid_set_cpu_capacity() cpu=1 cap=1009
[ 4.460806] hybrid_set_cpu_capacity() cpu=2 cap=1009
[ 4.465766] hybrid_set_cpu_capacity() cpu=3 cap=1009
[ 4.470730] hybrid_set_cpu_capacity() cpu=4 cap=1009
[ 4.475699] hybrid_set_cpu_capacity() cpu=5 cap=1009
[ 4.480664] hybrid_set_cpu_capacity() cpu=6 cap=1009
[ 4.485626] hybrid_set_cpu_capacity() cpu=7 cap=1009
[ 4.490588] hybrid_set_cpu_capacity() cpu=9 cap=1024
[ 4.495550] hybrid_set_cpu_capacity() cpu=10 cap=1024
[ 4.500598] hybrid_set_cpu_capacity() cpu=11 cap=1024
[ 4.505649] hybrid_set_cpu_capacity() cpu=12 cap=1009
[ 4.510701] hybrid_set_cpu_capacity() cpu=13 cap=1009
[ 4.515749] hybrid_set_cpu_capacity() cpu=14 cap=1009
[ 4.520802] hybrid_set_cpu_capacity() cpu=15 cap=1009
[ 4.525846] hybrid_set_cpu_capacity() cpu=16 cap=623
[ 4.530810] hybrid_set_cpu_capacity() cpu=17 cap=623
[ 4.535772] hybrid_set_cpu_capacity() cpu=18 cap=623
[ 4.540732] hybrid_set_cpu_capacity() cpu=19 cap=623
[ 4.545690] hybrid_set_cpu_capacity() cpu=20 cap=623
[ 4.550651] hybrid_set_cpu_capacity() cpu=21 cap=623
[ 4.555612] hybrid_set_cpu_capacity() cpu=22 cap=623
[ 4.560571] hybrid_set_cpu_capacity() cpu=23 cap=623
> If the given system is hybrid and non-SMT, the new code disables ITMT
> support in the scheduler (because it may get in the way of asymmetric CPU
> capacity code in the scheduler that automatically gets enabled by setting
> asymmetric CPU capacity) after initializing all online CPUs and finds
> the one with the maximum HWP_HIGHEST_PERF value. Next, it computes the
> capacity number for each (online) CPU by dividing the product of its
> HWP_HIGHEST_PERF and SCHED_CAPACITY_SCALE by the maximum HWP_HIGHEST_PERF.
So either CAS at wakeup and in load_balance or SIS at wakeup and ITMT in
load balance.
> When a CPU goes offline, its capacity is reset to SCHED_CAPACITY_SCALE
> and if it is the one with the maximum HWP_HIGHEST_PERF value, the
> capacity numbers for all of the other online CPUs are recomputed. This
> also takes care of a cleanup during driver operation mode changes.
>
> Analogously, when a new CPU goes online, its capacity number is updated
> and if its HWP_HIGHEST_PERF value is greater than the current maximum
> one, the capacity numbers for all of the other online CPUs are
> recomputed.
>
> The case when the driver is notified of a CPU capacity change, either
> through the HWP interrupt or through an ACPI notification, is handled
> similarly to the CPU online case above, except that if the target CPU
> is the current highest-capacity one and its capacity is reduced, the
> capacity numbers for all of the other online CPUs need to be recomputed
> as well.
>
> If the driver's "no_turbo" sysfs attribute is updated, all of the CPU
> capacity information is computed from scratch to reflect the new turbo
> status.
So if I do:
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
I get:
[ 1692.801368] hybrid_update_cpu_scaling() called
[ 1692.801381] hybrid_update_cpu_scaling() max_cap_perf=44, max_perf_cpu=0
[ 1692.801389] hybrid_set_cpu_capacity() cpu=1 cap=1024
[ 1692.801395] hybrid_set_cpu_capacity() cpu=2 cap=1024
[ 1692.801399] hybrid_set_cpu_capacity() cpu=3 cap=1024
[ 1692.801402] hybrid_set_cpu_capacity() cpu=4 cap=1024
[ 1692.801405] hybrid_set_cpu_capacity() cpu=5 cap=1024
[ 1692.801408] hybrid_set_cpu_capacity() cpu=6 cap=1024
[ 1692.801410] hybrid_set_cpu_capacity() cpu=7 cap=1024
[ 1692.801413] hybrid_set_cpu_capacity() cpu=8 cap=1024
[ 1692.801416] hybrid_set_cpu_capacity() cpu=9 cap=1024
[ 1692.801419] hybrid_set_cpu_capacity() cpu=10 cap=1024
[ 1692.801422] hybrid_set_cpu_capacity() cpu=11 cap=1024
[ 1692.801425] hybrid_set_cpu_capacity() cpu=12 cap=1024
[ 1692.801428] hybrid_set_cpu_capacity() cpu=13 cap=1024
[ 1692.801431] hybrid_set_cpu_capacity() cpu=14 cap=1024
[ 1692.801433] hybrid_set_cpu_capacity() cpu=15 cap=1024
[ 1692.801436] hybrid_set_cpu_capacity() cpu=16 cap=605
[ 1692.801439] hybrid_set_cpu_capacity() cpu=17 cap=605
[ 1692.801442] hybrid_set_cpu_capacity() cpu=18 cap=605
[ 1692.801445] hybrid_set_cpu_capacity() cpu=19 cap=605
[ 1692.801448] hybrid_set_cpu_capacity() cpu=20 cap=605
[ 1692.801451] hybrid_set_cpu_capacity() cpu=21 cap=605
[ 1692.801453] hybrid_set_cpu_capacity() cpu=22 cap=605
[ 1692.801456] hybrid_set_cpu_capacity() cpu=23 cap=605
So turbo on this machine accounts only for the cpu_capacity difference
between 1009 and 1024?
[...]
On Thu, May 02, 2024 at 12:42:54PM +0200, Dietmar Eggemann wrote:
> On 25/04/2024 21:06, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >
> > Make intel_pstate use the HWP_HIGHEST_PERF values from
> > MSR_HWP_CAPABILITIES to set asymmetric CPU capacity information
> > via the previously introduced arch_set_cpu_capacity() on hybrid
> > systems without SMT.
>
> Are there such systems around? My i7-13700K has P-cores (CPU0..CPU15)
> with SMT.
We have been experimenting with nosmt in the kernel command line.
>
> > Setting asymmetric CPU capacity is generally necessary to allow the
> > scheduler to compute task sizes in a consistent way across all CPUs
> > in a system where they differ by capacity. That, in turn, should help
> > to improve task placement and load balancing decisions. It is also
> > necessary for the schedutil cpufreq governor to operate as expected
> > on hybrid systems where tasks migrate between CPUs of different
> > capacities.
> >
> > The underlying observation is that intel_pstate already uses
> > MSR_HWP_CAPABILITIES to get CPU performance information which is
> > exposed by it via sysfs and CPU performance scaling is based on it.
> > Thus using this information for setting asymmetric CPU capacity is
> > consistent with what the driver has been doing already. Moreover,
> > HWP_HIGHEST_PERF reflects the maximum capacity of a given CPU including
> > both the instructions-per-cycle (IPC) factor and the maximum turbo
> > frequency and the units in which that value is expressed are the same
> > for all CPUs in the system, so the maximum capacity ratio between two
> > CPUs can be obtained by computing the ratio of their HWP_HIGHEST_PERF
> > values. Of course, in principle that capacity ratio need not be
> > directly applicable at lower frequencies, so using it for providing the
> > asymmetric CPU capacity information to the scheduler is a rough
> > approximation, but it is as good as it gets. Also, measurements
> > indicate that this approximation is not too bad in practice.
>
> So cpu_capacity has a direct mapping to itmt prio. cpu_capacity is itmt
> prio with max itmt prio scaled to 1024.
ITMT enables asym_packing in the load balancer. Since it only cares about
which CPU has higher priority, scaling to 1024 is not necessary.
>
> Running it on i7-13700K (while allowing SMT) gives:
>
> root@gulliver:~# dmesg | grep sched_set_itmt_core_prio
> [ 3.957826] sched_set_itmt_core_prio() cpu=0 prio=68
> [ 3.990401] sched_set_itmt_core_prio() cpu=1 prio=68
> [ 4.015551] sched_set_itmt_core_prio() cpu=2 prio=68
> [ 4.040720] sched_set_itmt_core_prio() cpu=3 prio=68
> [ 4.065871] sched_set_itmt_core_prio() cpu=4 prio=68
> [ 4.091018] sched_set_itmt_core_prio() cpu=5 prio=68
> [ 4.116175] sched_set_itmt_core_prio() cpu=6 prio=68
> [ 4.141374] sched_set_itmt_core_prio() cpu=7 prio=68
> [ 4.166543] sched_set_itmt_core_prio() cpu=8 prio=69
> [ 4.196289] sched_set_itmt_core_prio() cpu=9 prio=69
> [ 4.214964] sched_set_itmt_core_prio() cpu=10 prio=69
> [ 4.239281] sched_set_itmt_core_prio() cpu=11 prio=69
> [ 4.263438] sched_set_itmt_core_prio() cpu=12 prio=68
> [ 4.283790] sched_set_itmt_core_prio() cpu=13 prio=68
> [ 4.308905] sched_set_itmt_core_prio() cpu=14 prio=68
> [ 4.331751] sched_set_itmt_core_prio() cpu=15 prio=68
> [ 4.356002] sched_set_itmt_core_prio() cpu=16 prio=42
> [ 4.381639] sched_set_itmt_core_prio() cpu=17 prio=42
> [ 4.395175] sched_set_itmt_core_prio() cpu=18 prio=42
> [ 4.425625] sched_set_itmt_core_prio() cpu=19 prio=42
> [ 4.449670] sched_set_itmt_core_prio() cpu=20 prio=42
> [ 4.479681] sched_set_itmt_core_prio() cpu=21 prio=42
> [ 4.506319] sched_set_itmt_core_prio() cpu=22 prio=42
> [ 4.523774] sched_set_itmt_core_prio() cpu=23 prio=42
>
> root@gulliver:~# dmesg | grep hybrid_set_cpu_capacity
> [ 4.450883] hybrid_set_cpu_capacity() cpu=0 cap=1009
> [ 4.455846] hybrid_set_cpu_capacity() cpu=1 cap=1009
> [ 4.460806] hybrid_set_cpu_capacity() cpu=2 cap=1009
> [ 4.465766] hybrid_set_cpu_capacity() cpu=3 cap=1009
> [ 4.470730] hybrid_set_cpu_capacity() cpu=4 cap=1009
> [ 4.475699] hybrid_set_cpu_capacity() cpu=5 cap=1009
> [ 4.480664] hybrid_set_cpu_capacity() cpu=6 cap=1009
> [ 4.485626] hybrid_set_cpu_capacity() cpu=7 cap=1009
> [ 4.490588] hybrid_set_cpu_capacity() cpu=9 cap=1024
> [ 4.495550] hybrid_set_cpu_capacity() cpu=10 cap=1024
> [ 4.500598] hybrid_set_cpu_capacity() cpu=11 cap=1024
> [ 4.505649] hybrid_set_cpu_capacity() cpu=12 cap=1009
> [ 4.510701] hybrid_set_cpu_capacity() cpu=13 cap=1009
> [ 4.515749] hybrid_set_cpu_capacity() cpu=14 cap=1009
> [ 4.520802] hybrid_set_cpu_capacity() cpu=15 cap=1009
> [ 4.525846] hybrid_set_cpu_capacity() cpu=16 cap=623
> [ 4.530810] hybrid_set_cpu_capacity() cpu=17 cap=623
> [ 4.535772] hybrid_set_cpu_capacity() cpu=18 cap=623
> [ 4.540732] hybrid_set_cpu_capacity() cpu=19 cap=623
> [ 4.545690] hybrid_set_cpu_capacity() cpu=20 cap=623
> [ 4.550651] hybrid_set_cpu_capacity() cpu=21 cap=623
> [ 4.555612] hybrid_set_cpu_capacity() cpu=22 cap=623
> [ 4.560571] hybrid_set_cpu_capacity() cpu=23 cap=623
>
> > If the given system is hybrid and non-SMT, the new code disables ITMT
> > support in the scheduler (because it may get in the way of asymmetric CPU
> > capacity code in the scheduler that automatically gets enabled by setting
> > asymmetric CPU capacity) after initializing all online CPUs and finds
> > the one with the maximum HWP_HIGHEST_PERF value. Next, it computes the
> > capacity number for each (online) CPU by dividing the product of its
> > HWP_HIGHEST_PERF and SCHED_CAPACITY_SCALE by the maximum HWP_HIGHEST_PERF.
>
> So either CAS at wakeup and in load_balance or SIS at wakeup and ITMT in
> load balance.
May I know what CAS and SIS stand for?
Thanks and BR,
Ricardo
On 03/05/2024 05:32, Ricardo Neri wrote:
> On Thu, May 02, 2024 at 12:42:54PM +0200, Dietmar Eggemann wrote:
>> On 25/04/2024 21:06, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <[email protected]>
>>>
>>> Make intel_pstate use the HWP_HIGHEST_PERF values from
>>> MSR_HWP_CAPABILITIES to set asymmetric CPU capacity information
>>> via the previously introduced arch_set_cpu_capacity() on hybrid
>>> systems without SMT.
>>
>> Are there such systems around? My i7-13700K has P-cores (CPU0..CPU15)
>> with SMT.
>
> We have been experimenting with nosmt in the kernel command line.
OK.
[...]
>>> If the given system is hybrid and non-SMT, the new code disables ITMT
>>> support in the scheduler (because it may get in the way of asymmetric CPU
>>> capacity code in the scheduler that automatically gets enabled by setting
>>> asymmetric CPU capacity) after initializing all online CPUs and finds
>>> the one with the maximum HWP_HIGHEST_PERF value. Next, it computes the
>>> capacity number for each (online) CPU by dividing the product of its
>>> HWP_HIGHEST_PERF and SCHED_CAPACITY_SCALE by the maximum HWP_HIGHEST_PERF.
>>
>> So either CAS at wakeup and in load_balance or SIS at wakeup and ITMT in
>> load balance.
>
> May I know what CAS and SIS stand for?
Capacity Aware Scheduling and Select_Idle_Sibling().
Either select_idle_sibling() -> select_idle_capacity() (1)
or select_idle_sibling() -> select_idle_cpu() /* nosmt */ (2)
So my system, now booted with 'nosmt', takes path (1).
On Thu, May 2, 2024 at 12:43 PM Dietmar Eggemann
<[email protected]> wrote:
>
> On 25/04/2024 21:06, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >
> > Make intel_pstate use the HWP_HIGHEST_PERF values from
> > MSR_HWP_CAPABILITIES to set asymmetric CPU capacity information
> > via the previously introduced arch_set_cpu_capacity() on hybrid
> > systems without SMT.
>
> Are there such systems around? My i7-13700K has P-cores (CPU0..CPU15)
> with SMT.
As Ricardo said, nosmt is one way to run without SMT. Another one is
to disable SMT in the BIOS setup.
Anyway, the point here is that with SMT, accurate tracking of task
utilization is rather hopeless.
> > Setting asymmetric CPU capacity is generally necessary to allow the
> > scheduler to compute task sizes in a consistent way across all CPUs
> > in a system where they differ by capacity. That, in turn, should help
> > to improve task placement and load balancing decisions. It is also
> > necessary for the schedutil cpufreq governor to operate as expected
> > on hybrid systems where tasks migrate between CPUs of different
> > capacities.
> >
> > The underlying observation is that intel_pstate already uses
> > MSR_HWP_CAPABILITIES to get CPU performance information which is
> > exposed by it via sysfs and CPU performance scaling is based on it.
> > Thus using this information for setting asymmetric CPU capacity is
> > consistent with what the driver has been doing already. Moreover,
> > HWP_HIGHEST_PERF reflects the maximum capacity of a given CPU including
> > both the instructions-per-cycle (IPC) factor and the maximum turbo
> > frequency and the units in which that value is expressed are the same
> > for all CPUs in the system, so the maximum capacity ratio between two
> > CPUs can be obtained by computing the ratio of their HWP_HIGHEST_PERF
> > values. Of course, in principle that capacity ratio need not be
> > directly applicable at lower frequencies, so using it for providing the
> > asymmetric CPU capacity information to the scheduler is a rough
> > approximation, but it is as good as it gets. Also, measurements
> > indicate that this approximation is not too bad in practice.
>
> So cpu_capacity has a direct mapping to itmt prio. cpu_capacity is itmt
> prio with max itmt prio scaled to 1024.
Right.
The choice to make the ITMT prio reflect the capacity is deliberate,
although this code works with values retrieved via CPPC (which are the
same as the HWP_CAP values in the majority of cases but not always).
> Running it on i7-13700K (while allowing SMT) gives:
>
> root@gulliver:~# dmesg | grep sched_set_itmt_core_prio
> [ 3.957826] sched_set_itmt_core_prio() cpu=0 prio=68
> [ 3.990401] sched_set_itmt_core_prio() cpu=1 prio=68
> [ 4.015551] sched_set_itmt_core_prio() cpu=2 prio=68
> [ 4.040720] sched_set_itmt_core_prio() cpu=3 prio=68
> [ 4.065871] sched_set_itmt_core_prio() cpu=4 prio=68
> [ 4.091018] sched_set_itmt_core_prio() cpu=5 prio=68
> [ 4.116175] sched_set_itmt_core_prio() cpu=6 prio=68
> [ 4.141374] sched_set_itmt_core_prio() cpu=7 prio=68
> [ 4.166543] sched_set_itmt_core_prio() cpu=8 prio=69
> [ 4.196289] sched_set_itmt_core_prio() cpu=9 prio=69
> [ 4.214964] sched_set_itmt_core_prio() cpu=10 prio=69
> [ 4.239281] sched_set_itmt_core_prio() cpu=11 prio=69
CPUs 8 - 11 appear to be "favored cores" that can turbo up higher than
the other P-cores.
> [ 4.263438] sched_set_itmt_core_prio() cpu=12 prio=68
> [ 4.283790] sched_set_itmt_core_prio() cpu=13 prio=68
> [ 4.308905] sched_set_itmt_core_prio() cpu=14 prio=68
> [ 4.331751] sched_set_itmt_core_prio() cpu=15 prio=68
> [ 4.356002] sched_set_itmt_core_prio() cpu=16 prio=42
> [ 4.381639] sched_set_itmt_core_prio() cpu=17 prio=42
> [ 4.395175] sched_set_itmt_core_prio() cpu=18 prio=42
> [ 4.425625] sched_set_itmt_core_prio() cpu=19 prio=42
> [ 4.449670] sched_set_itmt_core_prio() cpu=20 prio=42
> [ 4.479681] sched_set_itmt_core_prio() cpu=21 prio=42
> [ 4.506319] sched_set_itmt_core_prio() cpu=22 prio=42
> [ 4.523774] sched_set_itmt_core_prio() cpu=23 prio=42
>
> root@gulliver:~# dmesg | grep hybrid_set_cpu_capacity
> [ 4.450883] hybrid_set_cpu_capacity() cpu=0 cap=1009
> [ 4.455846] hybrid_set_cpu_capacity() cpu=1 cap=1009
> [ 4.460806] hybrid_set_cpu_capacity() cpu=2 cap=1009
> [ 4.465766] hybrid_set_cpu_capacity() cpu=3 cap=1009
> [ 4.470730] hybrid_set_cpu_capacity() cpu=4 cap=1009
> [ 4.475699] hybrid_set_cpu_capacity() cpu=5 cap=1009
> [ 4.480664] hybrid_set_cpu_capacity() cpu=6 cap=1009
> [ 4.485626] hybrid_set_cpu_capacity() cpu=7 cap=1009
> [ 4.490588] hybrid_set_cpu_capacity() cpu=9 cap=1024
> [ 4.495550] hybrid_set_cpu_capacity() cpu=10 cap=1024
> [ 4.500598] hybrid_set_cpu_capacity() cpu=11 cap=1024
And the "favored cores" get the max capacity.
> [ 4.505649] hybrid_set_cpu_capacity() cpu=12 cap=1009
> [ 4.510701] hybrid_set_cpu_capacity() cpu=13 cap=1009
> [ 4.515749] hybrid_set_cpu_capacity() cpu=14 cap=1009
> [ 4.520802] hybrid_set_cpu_capacity() cpu=15 cap=1009
> [ 4.525846] hybrid_set_cpu_capacity() cpu=16 cap=623
> [ 4.530810] hybrid_set_cpu_capacity() cpu=17 cap=623
> [ 4.535772] hybrid_set_cpu_capacity() cpu=18 cap=623
> [ 4.540732] hybrid_set_cpu_capacity() cpu=19 cap=623
> [ 4.545690] hybrid_set_cpu_capacity() cpu=20 cap=623
> [ 4.550651] hybrid_set_cpu_capacity() cpu=21 cap=623
> [ 4.555612] hybrid_set_cpu_capacity() cpu=22 cap=623
> [ 4.560571] hybrid_set_cpu_capacity() cpu=23 cap=623
>
> > If the given system is hybrid and non-SMT, the new code disables ITMT
> > support in the scheduler (because it may get in the way of asymmetric CPU
> > capacity code in the scheduler that automatically gets enabled by setting
> > asymmetric CPU capacity) after initializing all online CPUs and finds
> > the one with the maximum HWP_HIGHEST_PERF value. Next, it computes the
> > capacity number for each (online) CPU by dividing the product of its
> > HWP_HIGHEST_PERF and SCHED_CAPACITY_SCALE by the maximum HWP_HIGHEST_PERF.
>
> So either CAS at wakeup and in load_balance or SIS at wakeup and ITMT in
> load balance.
Yup, at least for this version of the patch.
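The capacity computation quoted above can be checked against the logged values with a minimal sketch. perf_to_capacity() is a hypothetical helper name, not a function from the patch, and truncating integer division is an assumption (it is consistent with the numbers in the logs).

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL

/*
 * Scale a CPU's HWP_HIGHEST_PERF value so that the highest value in the
 * system maps to SCHED_CAPACITY_SCALE. Integer division truncates.
 */
static unsigned long perf_to_capacity(unsigned long perf,
				      unsigned long max_perf)
{
	assert(max_perf > 0 && perf <= max_perf);
	return perf * SCHED_CAPACITY_SCALE / max_perf;
}
```

For the i7-13700K numbers above, perf_to_capacity(68, 69) gives 1009 and perf_to_capacity(42, 69) gives 623, matching the hybrid_set_cpu_capacity() output.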
> > When a CPU goes offline, its capacity is reset to SCHED_CAPACITY_SCALE
> > and if it is the one with the maximum HWP_HIGHEST_PERF value, the
> > capacity numbers for all of the other online CPUs are recomputed. This
> > also takes care of a cleanup during driver operation mode changes.
> >
> > Analogously, when a new CPU goes online, its capacity number is updated
> > and if its HWP_HIGHEST_PERF value is greater than the current maximum
> > one, the capacity numbers for all of the other online CPUs are
> > recomputed.
> >
> > The case when the driver is notified of a CPU capacity change, either
> > through the HWP interrupt or through an ACPI notification, is handled
> > similarly to the CPU online case above, except that if the target CPU
> > is the current highest-capacity one and its capacity is reduced, the
> > capacity numbers for all of the other online CPUs need to be recomputed
> > as well.
> >
> > If the driver's "no_turbo" sysfs attribute is updated, all of the CPU
> > capacity information is computed from scratch to reflect the new turbo
> > status.
>
> So if I do:
>
> echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
>
> I get:
>
> [ 1692.801368] hybrid_update_cpu_scaling() called
> [ 1692.801381] hybrid_update_cpu_scaling() max_cap_perf=44, max_perf_cpu=0
> [ 1692.801389] hybrid_set_cpu_capacity() cpu=1 cap=1024
> [ 1692.801395] hybrid_set_cpu_capacity() cpu=2 cap=1024
> [ 1692.801399] hybrid_set_cpu_capacity() cpu=3 cap=1024
> [ 1692.801402] hybrid_set_cpu_capacity() cpu=4 cap=1024
> [ 1692.801405] hybrid_set_cpu_capacity() cpu=5 cap=1024
> [ 1692.801408] hybrid_set_cpu_capacity() cpu=6 cap=1024
> [ 1692.801410] hybrid_set_cpu_capacity() cpu=7 cap=1024
> [ 1692.801413] hybrid_set_cpu_capacity() cpu=8 cap=1024
> [ 1692.801416] hybrid_set_cpu_capacity() cpu=9 cap=1024
> [ 1692.801419] hybrid_set_cpu_capacity() cpu=10 cap=1024
> [ 1692.801422] hybrid_set_cpu_capacity() cpu=11 cap=1024
> [ 1692.801425] hybrid_set_cpu_capacity() cpu=12 cap=1024
> [ 1692.801428] hybrid_set_cpu_capacity() cpu=13 cap=1024
> [ 1692.801431] hybrid_set_cpu_capacity() cpu=14 cap=1024
> [ 1692.801433] hybrid_set_cpu_capacity() cpu=15 cap=1024
> [ 1692.801436] hybrid_set_cpu_capacity() cpu=16 cap=605
> [ 1692.801439] hybrid_set_cpu_capacity() cpu=17 cap=605
> [ 1692.801442] hybrid_set_cpu_capacity() cpu=18 cap=605
> [ 1692.801445] hybrid_set_cpu_capacity() cpu=19 cap=605
> [ 1692.801448] hybrid_set_cpu_capacity() cpu=20 cap=605
> [ 1692.801451] hybrid_set_cpu_capacity() cpu=21 cap=605
> [ 1692.801453] hybrid_set_cpu_capacity() cpu=22 cap=605
> [ 1692.801456] hybrid_set_cpu_capacity() cpu=23 cap=605
>
> So turbo on this machine only accounts for the cpu_capacity diff 1009 vs 1024?
Not really.
The capacity of the fastest CPU is always 1024 and the capacities of
all of the other CPUs are adjusted to that.
When turbo is disabled, the capacity of the "favored cores" is the
same as for the other P-cores (i.e. 1024) and the capacity of E-cores
is relative to that.
Of course, this means that task placement may be somewhat messed up
after disabling or enabling turbo (which is a global switch), but I
don't think that there is a way to avoid it.
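To make the rescaling concrete, here is a small sketch of recomputing all capacities from scratch relative to the current maximum. toy_recompute_capacities() is a hypothetical name, not the patch's function. The turbo perf values (68, 69, 42) are the logged ITMT priorities and 44 is the logged max_cap_perf with turbo disabled. The E-core value of 26 with turbo disabled is an assumption inferred from the logged cap=605, it is not reported anywhere in this thread.

```c
#include <assert.h>
#include <stddef.h>

#define SCHED_CAPACITY_SCALE 1024UL

/*
 * Find the highest perf value and express every CPU's capacity relative
 * to it, so the fastest CPU always ends up at SCHED_CAPACITY_SCALE.
 */
static void toy_recompute_capacities(const unsigned long *perf,
				     unsigned long *cap, size_t n)
{
	unsigned long max_perf = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (perf[i] > max_perf)
			max_perf = perf[i];

	assert(max_perf > 0);

	for (i = 0; i < n; i++)
		cap[i] = perf[i] * SCHED_CAPACITY_SCALE / max_perf;
}
```

Feeding in { 68, 69, 42 } (P-core, favored core, E-core) yields { 1009, 1024, 623 }, while { 44, 44, 26 } yields { 1024, 1024, 605 }. The ordinary P-cores move from 1009 to 1024 only because the reference maximum changed.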
On 06/05/2024 16:39, Rafael J. Wysocki wrote:
> On Thu, May 2, 2024 at 12:43 PM Dietmar Eggemann
> <[email protected]> wrote:
>>
>> On 25/04/2024 21:06, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <[email protected]>
[...]
>> So cpu_capacity has a direct mapping to itmt prio. cpu_capacity is itmt
>> prio with max itmt prio scaled to 1024.
>
> Right.
>
> The choice to make the ITMT prio reflect the capacity is deliberate,
> although this code works with values retrieved via CPPC (which are the
> same as the HWP_CAP values in the majority of cases but not always).
>
>> Running it on i7-13700K (while allowing SMT) gives:
>>
>> root@gulliver:~# dmesg | grep sched_set_itmt_core_prio
>> [ 3.957826] sched_set_itmt_core_prio() cpu=0 prio=68
>> [ 3.990401] sched_set_itmt_core_prio() cpu=1 prio=68
>> [ 4.015551] sched_set_itmt_core_prio() cpu=2 prio=68
>> [ 4.040720] sched_set_itmt_core_prio() cpu=3 prio=68
>> [ 4.065871] sched_set_itmt_core_prio() cpu=4 prio=68
>> [ 4.091018] sched_set_itmt_core_prio() cpu=5 prio=68
>> [ 4.116175] sched_set_itmt_core_prio() cpu=6 prio=68
>> [ 4.141374] sched_set_itmt_core_prio() cpu=7 prio=68
>> [ 4.166543] sched_set_itmt_core_prio() cpu=8 prio=69
>> [ 4.196289] sched_set_itmt_core_prio() cpu=9 prio=69
>> [ 4.214964] sched_set_itmt_core_prio() cpu=10 prio=69
>> [ 4.239281] sched_set_itmt_core_prio() cpu=11 prio=69
>
> CPUs 8 - 11 appear to be "favored cores" that can turbo up higher than
> the other P-cores.
>
>> [ 4.263438] sched_set_itmt_core_prio() cpu=12 prio=68
>> [ 4.283790] sched_set_itmt_core_prio() cpu=13 prio=68
>> [ 4.308905] sched_set_itmt_core_prio() cpu=14 prio=68
>> [ 4.331751] sched_set_itmt_core_prio() cpu=15 prio=68
>> [ 4.356002] sched_set_itmt_core_prio() cpu=16 prio=42
>> [ 4.381639] sched_set_itmt_core_prio() cpu=17 prio=42
>> [ 4.395175] sched_set_itmt_core_prio() cpu=18 prio=42
>> [ 4.425625] sched_set_itmt_core_prio() cpu=19 prio=42
>> [ 4.449670] sched_set_itmt_core_prio() cpu=20 prio=42
>> [ 4.479681] sched_set_itmt_core_prio() cpu=21 prio=42
>> [ 4.506319] sched_set_itmt_core_prio() cpu=22 prio=42
>> [ 4.523774] sched_set_itmt_core_prio() cpu=23 prio=42
I wonder how this HWP_CAP-based CPU capacity value relates to the
per-IPC-class performance values in the 'HFI performance and efficiency
score' table.
Running '[PATCH v3 00/24] sched: Introduce classes of tasks for load
balance' on i7-13700K w/ 'nosmt' I get:
                              Score
CPUs                 Class    0      1      2      3
                              SSE    AVX2   VNNI   PAUSE
0, 2, 4, 6, 12, 14            68     80     106    53
8, 10                         69     81     108    54
16-23                         42     42     42     42
It looks like the HWP_CAP values are in sync with the scores of IPC
class 0. I was expecting the HWP_CAP values to reflect more of an average
over all classes. Or am I misinterpreting this relation?
[...]
>>> If the driver's "no_turbo" sysfs attribute is updated, all of the CPU
>>> capacity information is computed from scratch to reflect the new turbo
>>> status.
>>
>> So if I do:
>>
>> echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
>>
>> I get:
>>
>> [ 1692.801368] hybrid_update_cpu_scaling() called
>> [ 1692.801381] hybrid_update_cpu_scaling() max_cap_perf=44, max_perf_cpu=0
>> [ 1692.801389] hybrid_set_cpu_capacity() cpu=1 cap=1024
>> [ 1692.801395] hybrid_set_cpu_capacity() cpu=2 cap=1024
>> [ 1692.801399] hybrid_set_cpu_capacity() cpu=3 cap=1024
>> [ 1692.801402] hybrid_set_cpu_capacity() cpu=4 cap=1024
>> [ 1692.801405] hybrid_set_cpu_capacity() cpu=5 cap=1024
>> [ 1692.801408] hybrid_set_cpu_capacity() cpu=6 cap=1024
>> [ 1692.801410] hybrid_set_cpu_capacity() cpu=7 cap=1024
>> [ 1692.801413] hybrid_set_cpu_capacity() cpu=8 cap=1024
>> [ 1692.801416] hybrid_set_cpu_capacity() cpu=9 cap=1024
>> [ 1692.801419] hybrid_set_cpu_capacity() cpu=10 cap=1024
>> [ 1692.801422] hybrid_set_cpu_capacity() cpu=11 cap=1024
>> [ 1692.801425] hybrid_set_cpu_capacity() cpu=12 cap=1024
>> [ 1692.801428] hybrid_set_cpu_capacity() cpu=13 cap=1024
>> [ 1692.801431] hybrid_set_cpu_capacity() cpu=14 cap=1024
>> [ 1692.801433] hybrid_set_cpu_capacity() cpu=15 cap=1024
>> [ 1692.801436] hybrid_set_cpu_capacity() cpu=16 cap=605
>> [ 1692.801439] hybrid_set_cpu_capacity() cpu=17 cap=605
>> [ 1692.801442] hybrid_set_cpu_capacity() cpu=18 cap=605
>> [ 1692.801445] hybrid_set_cpu_capacity() cpu=19 cap=605
>> [ 1692.801448] hybrid_set_cpu_capacity() cpu=20 cap=605
>> [ 1692.801451] hybrid_set_cpu_capacity() cpu=21 cap=605
>> [ 1692.801453] hybrid_set_cpu_capacity() cpu=22 cap=605
>> [ 1692.801456] hybrid_set_cpu_capacity() cpu=23 cap=605
>>
>> So turbo on this machine only accounts for the cpu_capacity diff 1009 vs 1024?
>
> Not really.
>
> The capacity of the fastest CPU is always 1024 and the capacities of
> all of the other CPUs are adjusted to that.
>
> When turbo is disabled, the capacity of the "favored cores" is the
> same as for the other P-cores (i.e. 1024) and the capacity of E-cores
> is relative to that.
>
> Of course, this means that task placement may be somewhat messed up
> after disabling or enabling turbo (which is a global switch), but I
> don't think that there is a way to avoid it.
I assume that this is OK. In task placement we don't deal with a system
of perfectly aligned values (including their sums) anyway.
And we recreate the sched domains (including updating the capacity sums
on sched groups) after this, so load balancing (SMP nice etc.) should
be fine too.
On 25/04/2024 21:06, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Make intel_pstate use the HWP_HIGHEST_PERF values from
> MSR_HWP_CAPABILITIES to set asymmetric CPU capacity information
> via the previously introduced arch_set_cpu_capacity() on hybrid
> systems without SMT.
>
> Setting asymmetric CPU capacity is generally necessary to allow the
> scheduler to compute task sizes in a consistent way across all CPUs
> in a system where they differ by capacity. That, in turn, should help
> to improve task placement and load balancing decisions. It is also
> necessary for the schedutil cpufreq governor to operate as expected
> on hybrid systems where tasks migrate between CPUs of different
> capacities.
[...]
For Arm64 we expose the cpu_capacity under:
/sys/devices/system/cpu/cpu*/cpu_capacity
Might be handy for X86 hybrid as well.
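For reference, consuming such an attribute from user space could look like the sketch below. The helper names are hypothetical, and the path only exists on kernels that expose cpu_capacity for the given CPU (arm64 today, x86 only if something like the snippet that follows were applied).

```c
#include <assert.h>
#include <stdio.h>

/* Parse one unsigned long from a sysfs-style file; 0 on success, -1 on error. */
static int read_capacity_file(const char *path, unsigned long *cap)
{
	FILE *f;
	int ok;

	assert(path != NULL && cap != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	ok = fscanf(f, "%lu", cap) == 1;
	fclose(f);

	return ok ? 0 : -1;
}

/* Read /sys/devices/system/cpu/cpu<N>/cpu_capacity for the given CPU. */
static int read_cpu_capacity(unsigned int cpu, unsigned long *cap)
{
	char path[64];

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%u/cpu_capacity", cpu);

	return read_capacity_file(path, cap);
}
```

A caller would loop over the online CPUs and treat a -1 return as "attribute not available" rather than as a fatal error.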
Code snippet copied from drivers/base/arch_topology.c:
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 9e94b3f05a57..c445e5d1efc8 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -3746,5 +3746,49 @@ static int __init intel_pstate_setup(char *str)
 }
 early_param("intel_pstate", intel_pstate_setup);
+static ssize_t cpu_capacity_show(struct device *dev,
+				 struct device_attribute *attr,
+				 char *buf)
+{
+	struct cpu *cpu = container_of(dev, struct cpu, dev);
+
+	return sysfs_emit(buf, "%lu\n", arch_scale_cpu_capacity(cpu->dev.id));
+}
+
+static DEVICE_ATTR_RO(cpu_capacity);
+
+static int cpu_capacity_sysctl_add(unsigned int cpu)
+{
+	struct device *cpu_dev = get_cpu_device(cpu);
+
+	if (!cpu_dev)
+		return -ENOENT;
+
+	device_create_file(cpu_dev, &dev_attr_cpu_capacity);
+
+	return 0;
+}
+
+static int cpu_capacity_sysctl_remove(unsigned int cpu)
+{
+	struct device *cpu_dev = get_cpu_device(cpu);
+
+	if (!cpu_dev)
+		return -ENOENT;
+
+	device_remove_file(cpu_dev, &dev_attr_cpu_capacity);
+
+	return 0;
+}
+
+static int register_cpu_capacity_sysctl(void)
+{
+	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "topology/cpu-capacity",
+			  cpu_capacity_sysctl_add, cpu_capacity_sysctl_remove);
+
+	return 0;
+}
+subsys_initcall(register_cpu_capacity_sysctl);
+
[...]