Hi,
This is the fourth version of this patchset. Previous versions can be found
here [1], here [2], and here [3]. For brevity, I did not include the cover
letter from the original posting. You can read it here [1].
This patchset depends on a separate patchset to better handle asym_packing
between SMT cores [4], which was recently merged into the tip tree.
For your convenience, you can retrieve this patchset from [5]. It is based
on the master branch of the tip tree as of 12th June 2023.
Changes since v3:
Vincent highlighted the issue of using the current tasks of runqueues to
collect IPCC statistics. Since these are the last tasks to be pulled during
load balance, the proposed tie breakers would have little effect. Instead,
I have reworked the collection of IPCC statistics to look at the back of
the runqueues. This makes the IPCC scores more useful, as these are the
first tasks to be pulled during load balance.
Rafael argued that the HFI driver should not handle tasks directly. I have
reworked the code to let the HFI driver only read the classification result.
A new sched_ipcc.c file under arch/x86 is in charge of handling tasks and
postprocessing the classification.
Rafael also raised concerns about the need for a memory barrier to order
reads and writes of the cached IPCC scores that the scheduler can access
without holding the HFI lock. I decided to use a seqcount. The seqcount
provides store-release and load-acquire semantics, ensuring proper
ordering of reads and writes of the cached IPCC scores. It also provides a
compiler barrier. A CPU memory barrier is not needed.
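For reference, the pairing looks roughly like this (condensed from the
write and read sides in patches 14 and 15, set_hfi_ipcc_scores() and
intel_hfi_get_ipcc_score()):

	/* Write side, under the HFI table lock: */
	write_seqcount_begin(&hfi_ipcc_seqcount);
	/* ... store the cached per-CPU IPCC scores ... */
	write_seqcount_end(&hfi_ipcc_seqcount);

	/* Read side, lockless in the scheduler: */
	do {
		seq = read_seqcount_begin(&hfi_ipcc_seqcount);
		score = scores[ipcc - 1];
	} while (read_seqcount_retry(&hfi_ipcc_seqcount, seq));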
Zhao noticed that there is now a CPUID_7_1_EAX leaf, so defining features
via cpuid_bits[] in scattered.c is not needed. I fixed it.
[email protected] found a build break due to a misplaced definition of
MSR_IA32_HW_FEEDBACK_CHAR. I moved the definition to patch 13.
Updated patches: 6, 10, 13, 14, 15, 17, 21, 22
New patches: none
Unchanged patches: 1, 2, 3, 4, 5, 7, 8, 9, 11, 12, 16, 18, 19, 20, 23, 24
Thanks in advance for your kind feedback!
BR,
Ricardo
[1]. https://lore.kernel.org/lkml/[email protected]/
[2]. https://lore.kernel.org/lkml/[email protected]/
[3]. https://lore.kernel.org/all/[email protected]/
[4]. https://lore.kernel.org/all/[email protected]/
[5]. https://github.com/ricardon/tip/tree/rneri/ipc_classes_v4
Ricardo Neri (24):
sched/task_struct: Introduce IPC classes of tasks
sched: Add interfaces for IPC classes
sched/core: Initialize the IPC class of a new task
sched/core: Add user_tick as argument to scheduler_tick()
sched/core: Update the IPC class of the current task
sched/fair: Collect load-balancing stats for IPC classes
sched/fair: Compute IPC class scores for load balancing
sched/fair: Use IPCC stats to break ties between asym_packing sched
groups
sched/fair: Use IPCC stats to break ties between fully_busy SMT groups
sched/fair: Use IPCC scores to select a busiest runqueue
thermal: intel: hfi: Introduce Intel Thread Director classes
x86/cpufeatures: Add the Intel Thread Director feature definitions
x86/sched: Update the IPC class of the current task
thermal: intel: hfi: Store per-CPU IPCC scores
thermal: intel: hfi: Report the IPC class score of a CPU
thermal: intel: hfi: Define a default class for unclassified tasks
thermal: intel: hfi: Enable the Intel Thread Director
sched/task_struct: Add helpers for IPC classification
sched/core: Initialize helpers of task classification
sched/fair: Introduce sched_smt_siblings_idle()
x86/sched/ipcc: Implement model-specific checks for task
classification
x86/cpufeatures: Add feature bit for HRESET
x86/hreset: Configure history reset
x86/process: Reset hardware history in context switch
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/hreset.h | 30 +++
arch/x86/include/asm/msr-index.h | 5 +
arch/x86/include/asm/topology.h | 15 ++
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/cpu/common.c | 30 ++-
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
arch/x86/kernel/process_32.c | 3 +
arch/x86/kernel/process_64.c | 3 +
arch/x86/kernel/sched_ipcc.c | 93 +++++++
drivers/thermal/intel/Kconfig | 1 +
drivers/thermal/intel/intel_hfi.c | 219 +++++++++++++++-
include/linux/sched.h | 24 +-
include/linux/sched/topology.h | 6 +
init/Kconfig | 12 +
kernel/sched/core.c | 10 +-
kernel/sched/fair.c | 316 ++++++++++++++++++++++-
kernel/sched/sched.h | 66 +++++
kernel/sched/topology.c | 9 +
kernel/time/timer.c | 2 +-
21 files changed, 840 insertions(+), 17 deletions(-)
create mode 100644 arch/x86/include/asm/hreset.h
create mode 100644 arch/x86/kernel/sched_ipcc.c
--
2.25.1
New tasks shall start life as unclassified. They will be classified by
hardware when they run.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Perry Yuan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Zhao Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* None
Changes since v2:
* None
Changes since v1:
* None
---
kernel/sched/core.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d9a04faa48a1..77b769c23186 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4500,6 +4500,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.prev_sum_exec_runtime = 0;
p->se.nr_migrations = 0;
p->se.vruntime = 0;
+#ifdef CONFIG_IPC_CLASSES
+ p->ipcc = IPC_CLASS_UNCLASSIFIED;
+#endif
INIT_LIST_HEAD(&p->se.group_node);
#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.25.1
On hybrid processors, the architecture differences between the types of
CPUs result in different instructions-per-cycle (IPC) rates for each type
of CPU. IPCs may vary further by the type of instructions being executed.
Instructions can be grouped into classes of similar IPCs.
Tasks can be classified into groups based on the type of instructions they
execute.
Add a new member task_struct::ipcc to associate a particular task to
an IPC class that depends on the instructions it executes.
The scheduler may use the IPC class of a task and data about the
performance among CPUs of a given IPC class to improve throughput. It
may, for instance, place certain classes of tasks on CPUs of higher
performance.
The methods to determine the classification of a task and its relative
IPC score are specific to each CPU architecture.
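A rough sketch of the shape of these hooks (illustrative defaults only, not
the exact interfaces that a subsequent patch in this series adds):

	#ifndef arch_update_ipcc
	/* Without arch support, updating the IPC class is a no-op. */
	static __always_inline void arch_update_ipcc(struct task_struct *curr) { }
	#endif

	#ifndef arch_get_ipcc_score
	/* Without arch support, report that no score is available. */
	static __always_inline
	unsigned long arch_get_ipcc_score(unsigned short ipcc, int cpu)
	{
		return -EINVAL;
	}
	#endif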
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Perry Yuan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Zhao Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* None
Changes since v2:
* Changed the type of task_struct::ipcc to unsigned short. A subsequent
patch uses bit fields to use 9 bits, along with other auxiliary
members.
Changes since v1:
* Renamed task_struct::class as task_struct::ipcc. (Joel)
* Use task_struct::ipcc = 0 for unclassified tasks. (PeterZ)
* Renamed CONFIG_SCHED_TASK_CLASSES as CONFIG_IPC_CLASSES. (PeterZ, Joel)
---
include/linux/sched.h | 10 ++++++++++
init/Kconfig | 12 ++++++++++++
2 files changed, 22 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1292d38d66cc..9fdee040f450 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -129,6 +129,8 @@ struct user_event_mm;
__TASK_TRACED | EXIT_DEAD | EXIT_ZOMBIE | \
TASK_PARKED)
+#define IPC_CLASS_UNCLASSIFIED 0
+
#define task_is_running(task) (READ_ONCE((task)->__state) == TASK_RUNNING)
#define task_is_traced(task) ((READ_ONCE(task->jobctl) & JOBCTL_TRACED) != 0)
@@ -1534,6 +1536,14 @@ struct task_struct {
struct user_event_mm *user_event_mm;
#endif
+#ifdef CONFIG_IPC_CLASSES
+ /*
+	 * A hardware-defined classification of a task that reflects but is
+ * not identical to the number of instructions per cycle.
+ */
+ unsigned short ipcc;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/init/Kconfig b/init/Kconfig
index 32c24950c4ce..ea3371ccb530 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -839,6 +839,18 @@ config UCLAMP_BUCKETS_COUNT
If in doubt, use the default value.
+config IPC_CLASSES
+ bool "IPC classes of tasks"
+ depends on SMP
+ help
+ If selected, each task is assigned a classification value that
+ reflects the type of instructions that the task executes. This
+ classification reflects but is not equal to the number of
+ instructions retired per cycle.
+
+ The scheduler uses the classification value to improve the placement
+ of tasks.
+
endmenu
#
--
2.25.1
Enable Intel Thread Director from the CPU hotplug callback: globally from
CPU0 and then enable the thread-classification hardware in each logical
processor individually.
Also, initialize the number of classes supported.
Let the scheduler know that it can start using IPC classes.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Perry Yuan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Zhao Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Acked-by: Rafael J. Wysocki <[email protected]> # intel_hfi.c
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* Dropped the definition of MSR_IA32_HW_FEEDBACK_CHAR. It is now added in
patch 14 to fix a build break during bisection.
(Reported-by: kernel test robot <[email protected]>).
* Added Acked-by from Rafael.
Changes since v2:
* Use the new sched_enable_ipc_classes() interface to enable the use of
IPC classes in the scheduler.
Changes since v1:
* None
---
arch/x86/include/asm/msr-index.h | 1 +
drivers/thermal/intel/intel_hfi.c | 41 +++++++++++++++++++++++++++++--
2 files changed, 40 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 0bc4ed0ff787..7823b87bf383 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1108,6 +1108,7 @@
/* Hardware Feedback Interface */
#define MSR_IA32_HW_FEEDBACK_PTR 0x17d0
#define MSR_IA32_HW_FEEDBACK_CONFIG 0x17d1
+#define MSR_IA32_HW_FEEDBACK_THREAD_CONFIG 0x17d4
#define MSR_IA32_HW_FEEDBACK_CHAR 0x17d2
/* x2APIC locked status */
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index e23c49da02ee..75bf18dc7f51 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -33,6 +33,7 @@
#include <linux/percpu-defs.h>
#include <linux/printk.h>
#include <linux/processor.h>
+#include <linux/sched/topology.h>
#include <linux/seqlock.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
@@ -50,6 +51,8 @@
/* Hardware Feedback Interface MSR configuration bits */
#define HW_FEEDBACK_PTR_VALID_BIT BIT(0)
#define HW_FEEDBACK_CONFIG_HFI_ENABLE_BIT BIT(0)
+#define HW_FEEDBACK_CONFIG_ITD_ENABLE_BIT BIT(1)
+#define HW_FEEDBACK_THREAD_CONFIG_ENABLE_BIT BIT(0)
/* CPUID detection and enumeration definitions for HFI */
@@ -74,6 +77,15 @@ union cpuid6_edx {
u32 full;
};
+union cpuid6_ecx {
+ struct {
+ u32 dont_care0:8;
+ u32 nr_classes:8;
+ u32 dont_care1:16;
+ } split;
+ u32 full;
+};
+
union hfi_thread_feedback_char_msr {
struct {
u64 classid : 8;
@@ -541,6 +553,11 @@ void intel_hfi_online(unsigned int cpu)
init_hfi_cpu_index(info);
+ if (cpu_feature_enabled(X86_FEATURE_ITD)) {
+ msr_val = HW_FEEDBACK_THREAD_CONFIG_ENABLE_BIT;
+ wrmsrl(MSR_IA32_HW_FEEDBACK_THREAD_CONFIG, msr_val);
+ }
+
/*
* Now check if the HFI instance of the package/die of @cpu has been
* initialized (by checking its header). In such case, all we have to
@@ -596,8 +613,22 @@ void intel_hfi_online(unsigned int cpu)
*/
rdmsrl(MSR_IA32_HW_FEEDBACK_CONFIG, msr_val);
msr_val |= HW_FEEDBACK_CONFIG_HFI_ENABLE_BIT;
+
+ if (cpu_feature_enabled(X86_FEATURE_ITD))
+ msr_val |= HW_FEEDBACK_CONFIG_ITD_ENABLE_BIT;
+
wrmsrl(MSR_IA32_HW_FEEDBACK_CONFIG, msr_val);
+ /*
+ * We have all we need to support IPC classes. Task classification is
+ * now working.
+ *
+ * All class scores are zero until after the first HFI update. That is
+ * OK. The scheduler queries these scores at every load balance.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_ITD))
+ sched_enable_ipc_classes();
+
unlock:
mutex_unlock(&hfi_instance_lock);
return;
@@ -675,8 +706,14 @@ static __init int hfi_parse_features(void)
*/
hfi_features.class_stride = nr_capabilities;
- /* For now, use only one class of the HFI table */
- hfi_features.nr_classes = 1;
+ if (cpu_feature_enabled(X86_FEATURE_ITD)) {
+ union cpuid6_ecx ecx;
+
+ ecx.full = cpuid_ecx(CPUID_HFI_LEAF);
+ hfi_features.nr_classes = ecx.split.nr_classes;
+ } else {
+ hfi_features.nr_classes = 1;
+ }
/*
* The header contains change indications for each supported feature.
--
2.25.1
Just as tasks start life unclassified, initialize the classification
auxiliary variables.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Perry Yuan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Zhao Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* None
Changes since v2:
* None
Changes since v1:
* None
---
kernel/sched/core.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 876396b1d077..9c28be596938 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4502,6 +4502,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.vruntime = 0;
#ifdef CONFIG_IPC_CLASSES
p->ipcc = IPC_CLASS_UNCLASSIFIED;
+ p->ipcc_tmp = IPC_CLASS_UNCLASSIFIED;
+ p->ipcc_cntr = 0;
#endif
INIT_LIST_HEAD(&p->se.group_node);
--
2.25.1
Intel Thread Director provides a classification value based on the type of
instructions that the CPU is currently executing. Use this classification to
update the IPC class of the current task.
The responsibility for configuring and enabling both the Hardware Feedback
Interface and Intel Thread Director lies with the HFI driver, but it should
not directly handle tasks.
Update the HFI driver to read the register that provides the classification
result. Implement the arch_update_ipcc() interface of the scheduler under
arch/x86 code to update the IPC class of individual tasks.
Task classification only makes sense when used along with the HFI driver.
Make the HFI driver select CONFIG_IPC_CLASSES. However, users may still
select CONFIG_IPC_CLASSES without enabling the HFI driver. Add function
stubs to prevent build errors in that case.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Perry Yuan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Zhao Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* Relocated the implementation of arch_update_ipcc() from drivers/thermal
to arch/x86. (Rafael)
* Moved the definition of MSR_IA32_HW_FEEDBACK_CHAR into this patch.
(Reported-by: kernel test robot <[email protected]>)
* Select CONFIG_IPC_CLASSES when CONFIG_INTEL_HFI_THERMAL is selected to
reduce the configuration burden of the user/administrator. (Srinivas)
Changes since v2:
* Removed the implementation of arch_has_ipc_classes().
Changes since v1:
* Adjusted the result of the classification of Intel Thread Director to
  start at class 1. Class 0 for the scheduler means that the task is
  unclassified.
* Redefined union hfi_thread_feedback_char_msr to ensure all
bit-fields are packed. (PeterZ)
* Removed CONFIG_INTEL_THREAD_DIRECTOR. (PeterZ)
* Shortened the names of the functions that implement IPC classes.
* Removed argument smt_siblings_idle from intel_hfi_update_ipcc().
(PeterZ)
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/topology.h | 11 +++++++++
arch/x86/kernel/Makefile | 2 ++
arch/x86/kernel/sched_ipcc.c | 35 +++++++++++++++++++++++++++++
drivers/thermal/intel/Kconfig | 1 +
drivers/thermal/intel/intel_hfi.c | 37 +++++++++++++++++++++++++++++++
6 files changed, 87 insertions(+)
create mode 100644 arch/x86/kernel/sched_ipcc.c
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3aedae61af4f..0bc4ed0ff787 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1108,6 +1108,7 @@
/* Hardware Feedback Interface */
#define MSR_IA32_HW_FEEDBACK_PTR 0x17d0
#define MSR_IA32_HW_FEEDBACK_CONFIG 0x17d1
+#define MSR_IA32_HW_FEEDBACK_CHAR 0x17d2
/* x2APIC locked status */
#define MSR_IA32_XAPIC_DISABLE_STATUS 0xBD
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index caf41c4869a0..7d2ed7f7bb8c 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -235,4 +235,15 @@ void init_freq_invariance_cppc(void);
#define arch_init_invariance_cppc init_freq_invariance_cppc
#endif
+#ifdef CONFIG_INTEL_HFI_THERMAL
+int intel_hfi_read_classid(u8 *classid);
+#else
+static inline int intel_hfi_read_classid(u8 *classid) { return -ENODEV; }
+#endif
+
+#ifdef CONFIG_IPC_CLASSES
+void intel_update_ipcc(struct task_struct *curr);
+#define arch_update_ipcc intel_update_ipcc
+#endif
+
#endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 4070a01c11b7..f9b9d213ddaa 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -145,6 +145,8 @@ obj-$(CONFIG_CFI_CLANG) += cfi.o
obj-$(CONFIG_CALL_THUNKS) += callthunks.o
+obj-$(CONFIG_IPC_CLASSES) += sched_ipcc.o
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/sched_ipcc.c b/arch/x86/kernel/sched_ipcc.c
new file mode 100644
index 000000000000..685e7b3b5375
--- /dev/null
+++ b/arch/x86/kernel/sched_ipcc.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Intel support for scheduler IPC classes
+ *
+ * Copyright (c) 2023, Intel Corporation.
+ *
+ * Author: Ricardo Neri <[email protected]>
+ *
+ * On hybrid processors, the architecture differences between types of CPUs
+ * lead to different number of retired instructions per cycle (IPC). IPCs may
+ * differ further by classes of instructions.
+ *
+ * The scheduler assigns an IPC class to every task with arch_update_ipcc()
+ * from data that hardware provides. Implement this interface for x86.
+ *
+ * See kernel/sched/sched.h for details.
+ */
+
+#include <linux/sched.h>
+
+#include <asm/topology.h>
+
+void intel_update_ipcc(struct task_struct *curr)
+{
+ u8 hfi_class;
+
+ if (intel_hfi_read_classid(&hfi_class))
+ return;
+
+ /*
+ * 0 is a valid classification for Intel Thread Director. A scheduler
+ * IPCC class of 0 means that the task is unclassified. Adjust.
+ */
+ curr->ipcc = hfi_class + 1;
+}
diff --git a/drivers/thermal/intel/Kconfig b/drivers/thermal/intel/Kconfig
index ecd7e07eece0..418db04dc876 100644
--- a/drivers/thermal/intel/Kconfig
+++ b/drivers/thermal/intel/Kconfig
@@ -109,6 +109,7 @@ config INTEL_HFI_THERMAL
depends on CPU_SUP_INTEL
depends on X86_THERMAL_VECTOR
select THERMAL_NETLINK
+ select IPC_CLASSES
help
Select this option to enable the Hardware Feedback Interface. If
selected, hardware provides guidance to the operating system on
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index b41547912fda..20ee4264dcd4 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -72,6 +72,15 @@ union cpuid6_edx {
u32 full;
};
+union hfi_thread_feedback_char_msr {
+ struct {
+ u64 classid : 8;
+ u64 __reserved : 55;
+ u64 valid : 1;
+ } split;
+ u64 full;
+};
+
/**
* struct hfi_cpu_data - HFI capabilities per CPU
* @perf_cap: Performance capability
@@ -171,6 +180,34 @@ static struct workqueue_struct *hfi_updates_wq;
#define HFI_UPDATE_INTERVAL HZ
#define HFI_MAX_THERM_NOTIFY_COUNT 16
+/**
+ * intel_hfi_read_classid() - Read the current classid
+ * @classid: Variable to which the classid will be written.
+ *
+ * Read the classification that Intel Thread Director has produced when this
+ * function is called. Thread classification must be enabled before calling
+ * this function.
+ *
+ * Return: 0 if the produced classification is valid. Error otherwise.
+ */
+int intel_hfi_read_classid(u8 *classid)
+{
+ union hfi_thread_feedback_char_msr msr;
+
+ /* We should not be here if ITD is not supported. */
+ if (!cpu_feature_enabled(X86_FEATURE_ITD)) {
+ pr_warn_once("task classification requested but not supported!");
+ return -ENODEV;
+ }
+
+ rdmsrl(MSR_IA32_HW_FEEDBACK_CHAR, msr.full);
+ if (!msr.split.valid)
+ return -EINVAL;
+
+ *classid = msr.split.classid;
+ return 0;
+}
+
static void get_hfi_caps(struct hfi_instance *hfi_instance,
struct thermal_genl_cpu_caps *cpu_caps)
{
--
2.25.1
A task may be unclassified if it has been recently created, spends most of
its lifetime sleeping, or hardware has not provided a classification.
Most tasks will eventually be classified as the scheduler's IPC class 1
(HFI class 0). This class corresponds to the capabilities in the legacy,
classless, HFI table.
IPC class 1 is a reasonable choice until hardware provides an actual
classification. Meanwhile, the scheduler will place classes of tasks with
higher IPC scores on higher-performance CPUs.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Perry Yuan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Zhao Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Acked-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* Added Acked-by tag from Rafael.
Changes since v2:
* None
Changes since v1:
* Now the default class is 1.
---
drivers/thermal/intel/intel_hfi.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index fec2c01fda1b..e23c49da02ee 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -182,6 +182,19 @@ static struct workqueue_struct *hfi_updates_wq;
#define HFI_UPDATE_INTERVAL HZ
#define HFI_MAX_THERM_NOTIFY_COUNT 16
+/*
+ * A task may be unclassified if it has been recently created, spends most of
+ * its lifetime sleeping, or hardware has not provided a classification.
+ *
+ * Most tasks will eventually be classified as the scheduler's IPC class 1
+ * (HFI class 0). Meanwhile, the scheduler will place classes of tasks with
+ * higher IPC scores on higher-performance CPUs.
+ *
+ * IPC class 1 is a reasonable choice. It matches the performance capability
+ * of the legacy, classless, HFI table.
+ */
+#define HFI_UNCLASSIFIED_DEFAULT 1
+
/* A cache of the HFI perf capabilities for lockless access. */
static int __percpu *hfi_ipcc_scores;
/* Sequence counter for hfi_ipcc_scores */
@@ -212,7 +225,7 @@ unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
return -EINVAL;
if (ipcc == IPC_CLASS_UNCLASSIFIED)
- return -EINVAL;
+ ipcc = HFI_UNCLASSIFIED_DEFAULT;
/*
* Scheduler IPC classes start at 1. HFI classes start at 0.
--
2.25.1
Implement the arch_get_ipcc_score() interface of the scheduler. Use the
performance capabilities of the extended Hardware Feedback Interface table
as the IPC score.
Use the cached per-CPU IPCC scores. A seqcount provides lockless access and
the required memory ordering to avoid stale data.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Perry Yuan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Zhao Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* Ordered memory loads using a seqcount.
* Removed local variable hfi_class from intel_hfi_get_ipcc_score().
(Rafael).
Changes since v2:
* None
Changes since v1:
* Adjusted the returned HFI class (which starts at 0) to match the
scheduler IPCC class (which starts at 1). (PeterZ)
* Used the new interface names.
---
arch/x86/include/asm/topology.h | 4 ++++
drivers/thermal/intel/intel_hfi.c | 40 +++++++++++++++++++++++++++++--
2 files changed, 42 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 7d2ed7f7bb8c..8dd328cdb5cf 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -237,13 +237,17 @@ void init_freq_invariance_cppc(void);
#ifdef CONFIG_INTEL_HFI_THERMAL
int intel_hfi_read_classid(u8 *classid);
+unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu);
#else
static inline int intel_hfi_read_classid(u8 *classid) { return -ENODEV; }
+static inline unsigned long
+intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu) { return -ENODEV; }
#endif
#ifdef CONFIG_IPC_CLASSES
void intel_update_ipcc(struct task_struct *curr);
#define arch_update_ipcc intel_update_ipcc
+#define arch_get_ipcc_score intel_hfi_get_ipcc_score
#endif
#endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index d822ed0bb5c1..fec2c01fda1b 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -199,6 +199,42 @@ static int alloc_hfi_ipcc_scores(void)
return hfi_ipcc_scores ? 0 : -ENOMEM;
}
+unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
+{
+ int *scores, score;
+ unsigned long seq;
+
+	if (cpu < 0 || cpu >= nr_cpu_ids)
+		return -EINVAL;
+
+	scores = per_cpu_ptr(hfi_ipcc_scores, cpu);
+	if (!scores)
+		return -ENODEV;
+
+ if (ipcc == IPC_CLASS_UNCLASSIFIED)
+ return -EINVAL;
+
+ /*
+ * Scheduler IPC classes start at 1. HFI classes start at 0.
+ * See note intel_hfi_update_ipcc().
+ */
+ if (ipcc >= hfi_features.nr_classes + 1)
+ return -EINVAL;
+
+ /*
+ * The seqcount implies load-acquire semantics to order loads with
+ * lockless stores of the write side in set_hfi_ipcc_score(). It
+ * also implies a compiler barrier.
+ */
+ do {
+ seq = read_seqcount_begin(&hfi_ipcc_seqcount);
+ /* @ipcc is never 0. */
+ score = scores[ipcc - 1];
+ } while (read_seqcount_retry(&hfi_ipcc_seqcount, seq));
+
+ return score;
+}
+
static void set_hfi_ipcc_scores(struct hfi_instance *hfi_instance)
{
int cpu;
@@ -213,8 +249,8 @@ static void set_hfi_ipcc_scores(struct hfi_instance *hfi_instance)
raw_spin_lock_irq(&hfi_instance->table_lock);
/*
* The seqcount implies store-release semantics to order stores with
- * lockless loads from the seqcount read side. It also implies a
- * compiler barrier.
+ * lockless loads from the seqcount read side in
+ * intel_hfi_get_ipcc_score(). It also implies a compiler barrier.
*/
write_seqcount_begin(&hfi_ipcc_seqcount);
for_each_cpu(cpu, hfi_instance->cpus) {
--
2.25.1
The raw classification that hardware provides for a task may not
be directly usable by the scheduler: the classification may change too
frequently or architecture-specific implementations may need to consider
additional factors. For instance, some processors with Intel Thread
Director need to consider the state of the SMT siblings of a core.
Provide per-task helper variables that architectures can use to
postprocess the classification that hardware provides.
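For illustration only, a filter built on these helpers could debounce
transient classifications as in the sketch below (hypothetical; the
threshold and function name are made up, and the real model-specific logic
comes in a later patch of this series):

	#define IPCC_DEBOUNCE_THRESHOLD	4	/* Made-up value. */

	static void ipcc_debounce(struct task_struct *p, unsigned short new_ipcc)
	{
		if (p->ipcc_tmp != new_ipcc) {
			/* New candidate class: restart the filter. */
			p->ipcc_tmp = new_ipcc;
			p->ipcc_cntr = 1;
			return;
		}

		if (p->ipcc_cntr < IPCC_DEBOUNCE_THRESHOLD) {
			p->ipcc_cntr++;
			return;
		}

		/* The candidate class has persisted: commit it. */
		p->ipcc = new_ipcc;
	}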
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Perry Yuan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Zhao Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* None
Changes since v2:
* None
Changes since v1:
* Used bit-fields to fit all the IPC class data in 4 bytes. (PeterZ)
* Shortened names of the helpers.
* Renamed helpers with the ipcc_ prefix.
* Reworded commit message for clarity
---
include/linux/sched.h | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0e1c38ad09c2..719147460ca8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1541,7 +1541,17 @@ struct task_struct {
	 * A hardware-defined classification of a task that reflects but is
* not identical to the number of instructions per cycle.
*/
- unsigned short ipcc;
+ unsigned int ipcc : 9;
+ /*
+ * A candidate classification that arch-specific implementations
+ * qualify for correctness.
+ */
+ unsigned int ipcc_tmp : 9;
+ /*
+ * Counter to filter out transient candidate classifications
+ * of a task.
+ */
+ unsigned int ipcc_cntr : 14;
#endif
/*
--
2.25.1
X86 needs to know the idle state of the SMT siblings of a CPU to improve
the accuracy of IPCC classification. X86 implements support for IPC classes
in the thermal HFI driver.
Rename is_core_idle() as sched_smt_siblings_idle() and make it available
outside the scheduler code.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Perry Yuan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Zhao Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
is_core_idle() is no longer an inline function after this patch. To rule
out performance degradation, I compared the execution time of the inline
and non-inline versions on a 4-socket Cascade Lake system using the NUMA
stressor of stress-ng.
I used this test command:
$ stress-ng --numa 1500 -t 10m
During the test, is_core_idle() was called ~200,000 times. To measure the
execution time, I recorded the value of the TSC counter before and after
calling is_core_idle() and calculated the difference.
I arbitrarily removed outliers (defined as any delta larger than 5000
counts). This required removing ~40 samples.
The table below summarizes the difference in execution time. All values
are expressed in TSC counts, except for the standard deviation, expressed
as a percentage of the average.
                       Average   Median   Std(%)   Mode
TSCdelta inline         668.76      626    67.24     42
TSCdelta non-inline     677.64      624    67.67     46
The metrics show that both the inline and non-inline versions exhibit
similar performance characteristics.
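Conceptually, each sample was taken as in the sketch below (illustrative,
not the exact harness I used):

	/* One TSC-delta sample for the table above. */
	u64 t0, t1, delta;

	preempt_disable();
	t0 = rdtsc_ordered();
	(void)is_core_idle(cpu);
	t1 = rdtsc_ordered();
	preempt_enable();
	delta = t1 - t0;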
---
Changes since v3:
* None
Changes since v2:
* Brought back this previously dropped patch.
* Profiled inline vs non-inline is_core_idle(). I found no major penalty.
* Merged is_core_idle() and sched_smt_siblings_idle() into a single
function. (Dietmar)
Changes since v1:
* Dropped this patch.
---
include/linux/sched.h | 2 ++
kernel/sched/fair.c | 13 ++++++++++---
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 719147460ca8..986344ebf2f6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2461,4 +2461,6 @@ static inline void sched_core_fork(struct task_struct *p) { }
extern void sched_set_stop_task(int cpu, struct task_struct *stop);
+extern bool sched_smt_siblings_idle(int cpu);
+
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da3e009eef42..a52589a6c561 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1064,7 +1064,14 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
* Scheduling class queueing methods:
*/
-static inline bool is_core_idle(int cpu)
+/**
+ * sched_smt_siblings_idle - Check whether SMT siblings of a CPU are idle
+ * @cpu: The CPU to check
+ *
+ * Returns true if all the SMT siblings of @cpu are idle or @cpu does not have
+ * SMT siblings. The idle state of @cpu is not considered.
+ */
+bool sched_smt_siblings_idle(int cpu)
{
#ifdef CONFIG_SCHED_SMT
int sibling;
@@ -1767,7 +1774,7 @@ static inline int numa_idle_core(int idle_core, int cpu)
* Prefer cores instead of packing HT siblings
* and triggering future load balancing.
*/
- if (is_core_idle(cpu))
+ if (sched_smt_siblings_idle(cpu))
idle_core = cpu;
return idle_core;
@@ -9652,7 +9659,7 @@ static bool sched_use_asym_prio(struct sched_domain *sd, int cpu)
if (!sched_smt_active())
return true;
- return sd->flags & SD_SHARE_CPUCAPACITY || is_core_idle(cpu);
+ return sd->flags & SD_SHARE_CPUCAPACITY || sched_smt_siblings_idle(cpu);
}
/**
--
2.25.1
update_sd_pick_busiest() selects as busiest the candidate group passed to
it as argument if it has the same priority as the current busiest. Either
group is a good choice. IPCC statistics reflect the class of work on a
scheduling group. Use this data to break the priority tie between the
candidate and current busiest groups.
Pick as busiest the scheduling group that yields a higher IPCC score
after load balancing.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Perry Yuan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Zhao Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* None
Changes since v2:
* None
Changes since v1:
* Added a comment to clarify why sched_asym_prefer() needs a tie breaker
only in update_sd_pick_busiest(). (PeterZ)
* Renamed functions for accuracy:
sched_asym_class_prefer() >> sched_asym_ipcc_prefer()
sched_asym_class_pick() >> sched_asym_ipcc_pick()
* Reworded commit message for clarity.
---
kernel/sched/fair.c | 72 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 72 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a51c65c9335f..fb3d793fe9ad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9510,6 +9510,60 @@ static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
sgs->ipcc_score_before = before;
}
+/**
+ * sched_asym_ipcc_prefer - Select a sched group based on its IPCC score
+ * @a: Load balancing statistics of a sched group
+ * @b: Load balancing statistics of a second sched group
+ *
+ * Returns: true if @a has a higher IPCC score than @b after load balance.
+ * False otherwise.
+ */
+static bool sched_asym_ipcc_prefer(struct sg_lb_stats *a,
+ struct sg_lb_stats *b)
+{
+ if (!sched_ipcc_enabled())
+ return false;
+
+ /* @a increases overall throughput after load balance. */
+ if (a->ipcc_score_after > b->ipcc_score_after)
+ return true;
+
+ /*
+ * If @a and @b yield the same overall throughput, pick @a if
+ * its current throughput is lower than that of @b.
+ */
+ if (a->ipcc_score_after == b->ipcc_score_after)
+ return a->ipcc_score_before < b->ipcc_score_before;
+
+ return false;
+}
+
+/**
+ * sched_asym_ipcc_pick - Select a sched group based on its IPCC score
+ * @a: A scheduling group
+ * @b: A second scheduling group
+ * @a_stats: Load balancing statistics of @a
+ * @b_stats: Load balancing statistics of @b
+ *
+ * Returns: true if @a has the same priority and @a has tasks with IPC classes
+ * that yield higher overall throughput after load balance. False otherwise.
+ */
+static bool sched_asym_ipcc_pick(struct sched_group *a,
+ struct sched_group *b,
+ struct sg_lb_stats *a_stats,
+ struct sg_lb_stats *b_stats)
+{
+ /*
+ * Only use the class-specific preference selection if both sched
+ * groups have the same priority.
+ */
+ if (arch_asym_cpu_priority(a->asym_prefer_cpu) !=
+ arch_asym_cpu_priority(b->asym_prefer_cpu))
+ return false;
+
+ return sched_asym_ipcc_prefer(a_stats, b_stats);
+}
+
#else /* CONFIG_IPC_CLASSES */
static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
struct rq *rq)
@@ -9526,6 +9580,14 @@ static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
{
}
+static bool sched_asym_ipcc_pick(struct sched_group *a,
+ struct sched_group *b,
+ struct sg_lb_stats *a_stats,
+ struct sg_lb_stats *b_stats)
+{
+ return false;
+}
+
#endif /* CONFIG_IPC_CLASSES */
/**
@@ -9759,6 +9821,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
/* Prefer to move from lowest priority CPU's work */
if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
return false;
+
+ /*
+ * Unlike other callers of sched_asym_prefer(), here both @sg
+ * and @sds::busiest have tasks running. When they have equal
+ * priority, their IPC class scores can be used to select a
+ * better busiest.
+ */
+ if (sched_asym_ipcc_pick(sds->busiest, sg, &sds->busiest_stat, sgs))
+ return false;
+
break;
case group_misfit_task:
--
2.25.1
The HRESET instruction isolates the classification of individual
tasks when they run sequentially on the same logical processor. It resets
the classification history that the logical processor maintains.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Perry Yuan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Zhao Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* Moved definition of HRESET to its correct leaf: CPUID_7_1_EAX (Zhao)
Changes since v2:
* None
Changes since v1:
* None
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 3 +++
2 files changed, 4 insertions(+)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 98a84cbf4261..5edb63af2996 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -319,6 +319,7 @@
#define X86_FEATURE_FSRC (12*32+12) /* "" Fast short REP {CMPSB,SCASB} */
#define X86_FEATURE_LKGS (12*32+18) /* "" Load "kernel" (userspace) GS */
#define X86_FEATURE_AMX_FP16 (12*32+21) /* "" AMX fp16 Support */
+#define X86_FEATURE_HRESET (12*32+22) /* Hardware history reset instruction */
#define X86_FEATURE_AVX_IFMA (12*32+23) /* "" Support for VPMADD52[H,L]UQ */
#define X86_FEATURE_LAM (12*32+26) /* Linear Address Masking */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 7823b87bf383..605c01539b0d 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1111,6 +1111,9 @@
#define MSR_IA32_HW_FEEDBACK_THREAD_CONFIG 0x17d4
#define MSR_IA32_HW_FEEDBACK_CHAR 0x17d2
+/* Hardware History Reset */
+#define MSR_IA32_HW_HRESET_ENABLE 0x17da
+
/* x2APIC locked status */
#define MSR_IA32_XAPIC_DISABLE_STATUS 0xBD
#define LEGACY_XAPIC_DISABLED BIT(0) /*
--
2.25.1
Intel Thread Director (ITD) provides hardware resources to classify
the current task. The classification reflects the type of instructions that
a task currently executes.
ITD extends the Hardware Feedback Interface table to provide performance
and energy efficiency capabilities for each of the supported classes of
tasks.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Perry Yuan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Zhao Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* None
Changes since v2:
* None
Changes since v1:
* Removed dependency on CONFIG_INTEL_THREAD_DIRECTOR. Instead, depend on
CONFIG_IPC_CLASSES.
* Added DISABLE_ITD to the correct DISABLE_MASK: 14 instead of 13.
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +++++++-
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
3 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index cb8ca46213be..98a84cbf4261 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -353,6 +353,7 @@
#define X86_FEATURE_HWP_EPP (14*32+10) /* HWP Energy Perf. Preference */
#define X86_FEATURE_HWP_PKG_REQ (14*32+11) /* HWP Package Level Request */
#define X86_FEATURE_HFI (14*32+19) /* Hardware Feedback Interface */
+#define X86_FEATURE_ITD (14*32+23) /* Intel Thread Director */
/* AMD SVM Feature Identification, CPUID level 0x8000000a (EDX), word 15 */
#define X86_FEATURE_NPT (15*32+ 0) /* Nested Page Table support */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index fafe9be7a6f4..fad78bd840cd 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -105,6 +105,12 @@
# define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
#endif
+#ifdef CONFIG_IPC_CLASSES
+# define DISABLE_ITD 0
+#else
+# define DISABLE_ITD (1 << (X86_FEATURE_ITD & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -123,7 +129,7 @@
DISABLE_CALL_DEPTH_TRACKING)
#define DISABLED_MASK12 (DISABLE_LAM)
#define DISABLED_MASK13 0
-#define DISABLED_MASK14 0
+#define DISABLED_MASK14 (DISABLE_ITD)
#define DISABLED_MASK15 0
#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
DISABLE_ENQCMD)
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index f6748c8bd647..7a87b823eef3 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -81,6 +81,7 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_XFD, X86_FEATURE_XSAVES },
{ X86_FEATURE_XFD, X86_FEATURE_XGETBV1 },
{ X86_FEATURE_AMX_TILE, X86_FEATURE_XFD },
+ { X86_FEATURE_ITD, X86_FEATURE_HFI },
{}
};
--
2.25.1
When using IPCC scores to break ties between two scheduling groups, it is
necessary to consider both the current score and the score that would
result after load balancing.
Compute the combined IPC class score of a scheduling group and the local
scheduling group. Compute both the current score and the prospective score.
Collect IPCC statistics only for asym_packing and fully_busy scheduling
groups. These are the only cases that use IPCC scores.
These IPCC statistics are used during idle load balancing. The candidate
scheduling group will have one fewer busy CPU after load balancing. This
observation is important for cores with SMT support.
The IPCC score of scheduling groups composed of SMT siblings needs to
consider that the siblings share CPU resources. When computing the total
IPCC score of the scheduling group, divide the score of each sibling by
the number of busy siblings.
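As a worked example (with invented numbers): consider a fully busy SMT-2
core whose two siblings run tasks with IPCC scores 100 and 80. Since the
siblings share the core, the score before load balancing is
(100 + 80) / 2 = 90. After the task with the minimum score is pulled, the
remaining sibling no longer shares the core: the score of the tasks left
behind becomes (180 - 80) / 1 = 100, and the prospective score adds the
score of the pulled task's class on the destination CPU.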
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Perry Yuan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Zhao Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* None
Changes since v2:
* Also collect IPCC stats for fully_busy sched groups.
* Restrict use of IPCC stats to SD_ASYM_PACKING. (Ionela)
* Handle errors of arch_get_ipcc_score(). (Ionela)
Changes since v1:
* Implemented cleanups and reworks from PeterZ. I took all his
suggestions, except the computation of the IPC score before and after
load balancing. We are computing not the average score, but the *total*.
* Check for the SD_SHARE_CPUCAPACITY to compute the throughput of the SMT
siblings of a physical core.
* Used the new interface names.
* Reworded commit message for clarity.
---
kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 68 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0cab5e501b6..a51c65c9335f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9114,6 +9114,8 @@ struct sg_lb_stats {
unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
+ long ipcc_score_after; /* Prospective IPCC score after load balancing */
+ unsigned long ipcc_score_before; /* IPCC score before load balancing */
#endif
};
@@ -9452,6 +9454,62 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
}
}
+static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
+ struct sched_group *sg,
+ struct lb_env *env)
+{
+ unsigned long score_on_dst_cpu, before;
+ int busy_cpus;
+ long after;
+
+ if (!sched_ipcc_enabled())
+ return;
+
+ /*
+ * IPCC scores are only useful during idle load balancing. For now,
+ * only asym_packing uses IPCC scores.
+ */
+ if (!(env->sd->flags & SD_ASYM_PACKING) ||
+ env->idle == CPU_NOT_IDLE)
+ return;
+
+ /*
+ * IPCC scores are used to break ties only between these types of
+ * groups.
+ */
+ if (sgs->group_type != group_fully_busy &&
+ sgs->group_type != group_asym_packing)
+ return;
+
+ busy_cpus = sgs->group_weight - sgs->idle_cpus;
+
+ /* No busy CPUs in the group. No tasks to move. */
+ if (!busy_cpus)
+ return;
+
+ score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
+
+ /*
+ * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
+ * and not used.
+ */
+ if (IS_ERR_VALUE(score_on_dst_cpu))
+ return;
+
+ before = sgs->sum_score;
+ after = before - sgs->min_score;
+
+ /* SMT siblings share throughput. */
+ if (busy_cpus > 1 && sg->flags & SD_SHARE_CPUCAPACITY) {
+ before /= busy_cpus;
+ /* One sibling will become idle after load balance. */
+ after /= busy_cpus - 1;
+ }
+
+ sgs->ipcc_score_after = after + score_on_dst_cpu;
+ sgs->ipcc_score_before = before;
+}
+
#else /* CONFIG_IPC_CLASSES */
static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
struct rq *rq)
@@ -9461,6 +9519,13 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
{
}
+
+static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
+ struct sched_group *sg,
+ struct lb_env *env)
+{
+}
+
#endif /* CONFIG_IPC_CLASSES */
/**
@@ -9620,6 +9685,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
+ if (!local_group)
+ update_sg_lb_stats_scores(sgs, group, env);
+
/* Computing avg_load makes sense only when group is overloaded */
if (sgs->group_type == group_overloaded)
sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
--
2.25.1
Configure the MSR that controls the behavior of HRESET on each logical
processor.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Perry Yuan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Zhao Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* None
Changes since v2:
* None
Changes since v1:
* Marked hardware_history_features as __ro_after_init instead of
__read_mostly. (PeterZ)
---
arch/x86/kernel/cpu/common.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8f284e185aea..d47a442900ad 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -396,6 +396,26 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
cr4_clear_bits(X86_CR4_UMIP);
}
+static u32 hardware_history_features __ro_after_init;
+
+static __always_inline void setup_hreset(struct cpuinfo_x86 *c)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_HRESET))
+ return;
+
+ /*
+ * Use on all CPUs the hardware history features that the boot
+ * CPU supports.
+ */
+ if (c == &boot_cpu_data)
+ hardware_history_features = cpuid_ebx(0x20);
+
+ if (!hardware_history_features)
+ return;
+
+ wrmsrl(MSR_IA32_HW_HRESET_ENABLE, hardware_history_features);
+}
+
/* These bits should not change their value after CPU init is finished. */
static const unsigned long cr4_pinned_mask =
X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
@@ -1834,10 +1854,11 @@ static void identify_cpu(struct cpuinfo_x86 *c)
/* Disable the PN if appropriate */
squash_the_stupid_serial_number(c);
- /* Set up SMEP/SMAP/UMIP */
+ /* Set up SMEP/SMAP/UMIP/HRESET */
setup_smep(c);
setup_smap(c);
setup_umip(c);
+ setup_hreset(c);
/* Enable FSGSBASE instructions if available. */
if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
--
2.25.1
Reset the classification history of the current task when switching to the
next task. Hardware will start the classification of the next task from
scratch.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Perry Yuan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Zhao Liu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v3:
* None
Changes since v2:
* None
Changes since v1:
* Measurements of the cost of the HRESET instruction
Methodology:
I created a tight loop with interrupts and preemption disabled. I
recorded the value of the TSC counter before and after executing
HRESET or RDTSC. I repeated the measurement 100,000 times. (See the
sketch after this change log.)
I performed the experiment using an Alder Lake S system. I set the
frequency of the CPUs at a fixed value.
The table below compares the cost of HRESET with RDTSC (expressed in
the elapsed TSC count). The cost of the two instructions is
comparable.
                    PCore   ECore
Frequency (GHz)       5.0     3.8
HRESET (avg)         28.5    44.7
HRESET (stdev %)      3.6     2.3
RDTSC (avg)          25.2    35.7
RDTSC (stdev %)       3.9     2.6
* Used an ALTERNATIVE macro instead of static_cpu_has() to execute HRESET
when supported. (PeterZ)
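An illustrative sketch of taking one sample in the loop described above
(not the exact harness I used; the RDTSC baseline times an empty body):

	/* One HRESET sample; interrupts and preemption are disabled. */
	u64 t0, t1, delta;

	local_irq_disable();
	preempt_disable();
	t0 = rdtsc_ordered();
	reset_hardware_history();	/* Executes HRESET via ALTERNATIVE. */
	t1 = rdtsc_ordered();
	preempt_enable();
	local_irq_enable();
	delta = t1 - t0;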
---
arch/x86/include/asm/hreset.h | 31 +++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/common.c | 7 +++++++
arch/x86/kernel/process_32.c | 3 +++
arch/x86/kernel/process_64.c | 3 +++
4 files changed, 44 insertions(+)
create mode 100644 arch/x86/include/asm/hreset.h
diff --git a/arch/x86/include/asm/hreset.h b/arch/x86/include/asm/hreset.h
new file mode 100644
index 000000000000..d68ca2fb8642
--- /dev/null
+++ b/arch/x86/include/asm/hreset.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_HRESET_H
+#define _ASM_X86_HRESET_H
+
+/**
+ * HRESET - History reset. Available since binutils v2.36.
+ *
+ * Request the processor to reset the history of task classification on the
+ * current logical processor. The history components to be
+ * reset are specified in %eax. Only bits specified in CPUID(0x20).EBX
+ * and enabled in the IA32_HRESET_ENABLE MSR can be selected.
+ *
+ * The assembly code looks like:
+ *
+ * hreset %eax
+ *
+ * The corresponding machine code looks like:
+ *
+ * F3 0F 3A F0 ModRM Imm
+ *
+ * The value of ModRM is 0xc0 to specify %eax register addressing.
+ * The ignored immediate operand is set to 0.
+ *
+ * The instruction is documented in the Intel SDM.
+ */
+
+#define __ASM_HRESET ".byte 0xf3, 0xf, 0x3a, 0xf0, 0xc0, 0x0"
+
+void reset_hardware_history(void);
+
+#endif /* _ASM_X86_HRESET_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d47a442900ad..8cdd88dab7ed 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -53,6 +53,7 @@
#include <asm/mce.h>
#include <asm/msr.h>
#include <asm/cacheinfo.h>
+#include <asm/hreset.h>
#include <asm/memtype.h>
#include <asm/microcode.h>
#include <asm/microcode_intel.h>
@@ -398,6 +399,12 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
static u32 hardware_history_features __ro_after_init;
+void reset_hardware_history(void)
+{
+ asm_inline volatile (ALTERNATIVE("", __ASM_HRESET, X86_FEATURE_HRESET)
+ : : "a" (hardware_history_features) : "memory");
+}
+
static __always_inline void setup_hreset(struct cpuinfo_x86 *c)
{
if (!cpu_feature_enabled(X86_FEATURE_HRESET))
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 708c87b88cc1..7353bb119e79 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -52,6 +52,7 @@
#include <asm/switch_to.h>
#include <asm/vm86.h>
#include <asm/resctrl.h>
+#include <asm/hreset.h>
#include <asm/proto.h>
#include "process.h"
@@ -214,6 +215,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/* Load the Intel cache allocation PQR MSR. */
resctrl_sched_in(next_p);
+ reset_hardware_history();
+
return prev_p;
}
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d181c16a2f6..fcf14cba2a9d 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -54,6 +54,7 @@
#include <asm/xen/hypervisor.h>
#include <asm/vdso.h>
#include <asm/resctrl.h>
+#include <asm/hreset.h>
#include <asm/unistd.h>
#include <asm/fsgsbase.h>
#ifdef CONFIG_IA32_EMULATION
@@ -659,6 +660,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/* Load the Intel cache allocation PQR MSR. */
resctrl_sched_in(next_p);
+ reset_hardware_history();
+
return prev_p;
}
--
2.25.1
On Monday 12 Jun 2023 at 21:24:05 (-0700), Ricardo Neri wrote:
> When using IPCC scores to break ties between two scheduling groups, it is
> necessary to consider both the current score and the score that would
> result after load balancing.
>
> Compute the combined IPC class score of a scheduling group and the local
> scheduling group. Compute both the current score and the prospective score.
>
> Collect IPCC statistics only for asym_packing and fully_busy scheduling
> groups. These are the only cases that use IPCC scores.
>
> These IPCC statistics are used during idle load balancing. The candidate
> scheduling group will have one fewer busy CPU after load balancing. This
> observation is important for cores with SMT support.
>
> The IPCC score of scheduling groups composed of SMT siblings needs to
> consider that the siblings share CPU resources. When computing the total
> IPCC score of the scheduling group, divide the score of each sibling by
> the number of busy siblings.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Ionela Voinescu <[email protected]>
> Cc: Joel Fernandes (Google) <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Lukasz Luba <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Perry Yuan <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Zhao Liu <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> Changes since v3:
> * None
>
> Changes since v2:
> * Also collect IPCC stats for fully_busy sched groups.
> * Restrict use of IPCC stats to SD_ASYM_PACKING. (Ionela)
> * Handle errors of arch_get_ipcc_score(). (Ionela)
>
> Changes since v1:
> * Implemented cleanups and reworks from PeterZ. I took all his
> suggestions, except the computation of the IPC score before and after
> load balancing. We are computing not the average score, but the *total*.
> * Check for the SD_SHARE_CPUCAPACITY to compute the throughput of the SMT
> siblings of a physical core.
> * Used the new interface names.
> * Reworded commit message for clarity.
> ---
> kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 68 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c0cab5e501b6..a51c65c9335f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9114,6 +9114,8 @@ struct sg_lb_stats {
> unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
> unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
> unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
> + long ipcc_score_after; /* Prospective IPCC score after load balancing */
> + unsigned long ipcc_score_before; /* IPCC score before load balancing */
> #endif
> };
>
> @@ -9452,6 +9454,62 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> }
> }
>
> +static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
> + struct sched_group *sg,
> + struct lb_env *env)
> +{
> + unsigned long score_on_dst_cpu, before;
> + int busy_cpus;
> + long after;
> +
> + if (!sched_ipcc_enabled())
> + return;
> +
> + /*
> + * IPCC scores are only useful during idle load balancing. For now,
> + * only asym_packing uses IPCC scores.
> + */
> + if (!(env->sd->flags & SD_ASYM_PACKING) ||
> + env->idle == CPU_NOT_IDLE)
> + return;
> +
> + /*
> + * IPCC scores are used to break ties only between these types of
> + * groups.
> + */
> + if (sgs->group_type != group_fully_busy &&
> + sgs->group_type != group_asym_packing)
> + return;
> +
> + busy_cpus = sgs->group_weight - sgs->idle_cpus;
> +
> + /* No busy CPUs in the group. No tasks to move. */
> + if (!busy_cpus)
> + return;
> +
> + score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
> +
> + /*
> + * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
> + * and not used.
> + */
> + if (IS_ERR_VALUE(score_on_dst_cpu))
> + return;
> +
> + before = sgs->sum_score;
> + after = before - sgs->min_score;
I don't believe this can end up being negative, as the sum of all
scores should be higher than or equal to the min score, right?
I'm just wondering if ipcc_score_after can be made unsigned long as well,
just for consistency.
> +
> + /* SMT siblings share throughput. */
> + if (busy_cpus > 1 && sg->flags & SD_SHARE_CPUCAPACITY) {
> + before /= busy_cpus;
> + /* One sibling will become idle after load balance. */
> + after /= busy_cpus - 1;
> + }
> +
> + sgs->ipcc_score_after = after + score_on_dst_cpu;
> + sgs->ipcc_score_before = before;
Shouldn't the score_on_dst_cpu be added to "after" before being divided
between the SMT siblings?
Thanks,
Ionela.
> +}
> +
> #else /* CONFIG_IPC_CLASSES */
> static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> struct rq *rq)
> @@ -9461,6 +9519,13 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
> {
> }
> +
> +static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
> + struct sched_group *sg,
> + struct lb_env *env)
> +{
> +}
> +
> #endif /* CONFIG_IPC_CLASSES */
>
> /**
> @@ -9620,6 +9685,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>
> sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>
> + if (!local_group)
> + update_sg_lb_stats_scores(sgs, group, env);
> +
> /* Computing avg_load makes sense only when group is overloaded */
> if (sgs->group_type == group_overloaded)
> sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
> --
> 2.25.1
>
Hi,
On Monday 12 Jun 2023 at 21:24:16 (-0700), Ricardo Neri wrote:
> The raw classification that hardware provides for a task may not
> be directly usable by the scheduler: the classification may change too
> frequently or architecture-specific implementations may need to consider
> additional factors. For instance, some processors with Intel Thread
> Director need to consider the state of the SMT siblings of a core.
>
> Provide per-task helper variables that architectures can use to
> postprocess the classification that hardware provides.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Ionela Voinescu <[email protected]>
> Cc: Joel Fernandes (Google) <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Lukasz Luba <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Perry Yuan <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Zhao Liu <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> Changes since v3:
> * None
>
> Changes since v2:
> * None
>
> Changes since v1:
> * Used bit-fields to fit all the IPC class data in 4 bytes. (PeterZ)
> * Shortened names of the helpers.
> * Renamed helpers with the ipcc_ prefix.
> * Reworded commit message for clarity
> ---
> include/linux/sched.h | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0e1c38ad09c2..719147460ca8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1541,7 +1541,17 @@ struct task_struct {
> * A hardware-defined classification of task that reflects but is
> * not identical to the number of instructions per cycle.
> */
> - unsigned short ipcc;
> + unsigned int ipcc : 9;
> + /*
> + * A candidate classification that arch-specific implementations
> + * qualify for correctness.
> + */
> + unsigned int ipcc_tmp : 9;
> + /*
> + * Counter to filter out transient candidate classifications
> + * of a task.
> + */
> + unsigned int ipcc_cntr : 14;
> #endif
>
Why does this need to be split in task_struct? Isn't this architecture
specific? IMO the scheduler should never make use of class information
itself. It only receives the value through the call of an arch function
and passes it as an argument to an arch function to get a performance
score. So the way one interprets the class value (splits it into relevant
and helper bits) should be defined and considered in the architecture
specific code.
Thanks,
Ionela.
> /*
> --
> 2.25.1
>
On Thu, Jun 22, 2023 at 10:02:44AM +0100, Ionela Voinescu wrote:
> On Monday 12 Jun 2023 at 21:24:05 (-0700), Ricardo Neri wrote:
> > When using IPCC scores to break ties between two scheduling groups, it is
> > necessary to consider both the current score and the score that would
> > result after load balancing.
> >
> > Compute the combined IPC class score of a scheduling group and the local
> > scheduling group. Compute both the current score and the prospective score.
> >
> > Collect IPCC statistics only for asym_packing and fully_busy scheduling
> > groups. These are the only cases that use IPCC scores.
> >
> > These IPCC statistics are used during idle load balancing. The candidate
> > scheduling group will have one fewer busy CPU after load balancing. This
> > observation is important for cores with SMT support.
> >
> > The IPCC score of scheduling groups composed of SMT siblings needs to
> > consider that the siblings share CPU resources. When computing the total
> > IPCC score of the scheduling group, divide the score of each sibling by
> > the number of busy siblings.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Ionela Voinescu <[email protected]>
> > Cc: Joel Fernandes (Google) <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Lukasz Luba <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Perry Yuan <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: Zhao Liu <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Ricardo Neri <[email protected]>
> > ---
> > Changes since v3:
> > * None
> >
> > Changes since v2:
> > * Also collect IPCC stats for fully_busy sched groups.
> > * Restrict use of IPCC stats to SD_ASYM_PACKING. (Ionela)
> > * Handle errors of arch_get_ipcc_score(). (Ionela)
> >
> > Changes since v1:
> > * Implemented cleanups and reworks from PeterZ. I took all his
> > suggestions, except the computation of the IPC score before and after
> > load balancing. We are computing not the average score, but the *total*.
> > * Check for the SD_SHARE_CPUCAPACITY to compute the throughput of the SMT
> > siblings of a physical core.
> > * Used the new interface names.
> > * Reworded commit message for clarity.
> > ---
> > kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 68 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c0cab5e501b6..a51c65c9335f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9114,6 +9114,8 @@ struct sg_lb_stats {
> > unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
> > unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
> > unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
> > + long ipcc_score_after; /* Prospective IPCC score after load balancing */
> > + unsigned long ipcc_score_before; /* IPCC score before load balancing */
> > #endif
> > };
> >
> > @@ -9452,6 +9454,62 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> > }
> > }
> >
> > +static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
> > + struct sched_group *sg,
> > + struct lb_env *env)
> > +{
> > + unsigned long score_on_dst_cpu, before;
> > + int busy_cpus;
> > + long after;
> > +
> > + if (!sched_ipcc_enabled())
> > + return;
> > +
> > + /*
> > + * IPCC scores are only useful during idle load balancing. For now,
> > + * only asym_packing uses IPCC scores.
> > + */
> > + if (!(env->sd->flags & SD_ASYM_PACKING) ||
> > + env->idle == CPU_NOT_IDLE)
> > + return;
> > +
> > + /*
> > + * IPCC scores are used to break ties only between these types of
> > + * groups.
> > + */
> > + if (sgs->group_type != group_fully_busy &&
> > + sgs->group_type != group_asym_packing)
> > + return;
> > +
> > + busy_cpus = sgs->group_weight - sgs->idle_cpus;
> > +
> > + /* No busy CPUs in the group. No tasks to move. */
> > + if (!busy_cpus)
> > + return;
> > +
> > + score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
> > +
> > + /*
> > + * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
> > + * and not used.
> > + */
> > + if (IS_ERR_VALUE(score_on_dst_cpu))
> > + return;
> > +
> > + before = sgs->sum_score;
> > + after = before - sgs->min_score;
>
> I don't believe this can end up being negative, as the sum of all
> scores should be higher than or equal to the min score, right?
Yes, I agree. `after` cannot be negative.
>
> I'm just wondering if ipcc_score_after can be made unsigned long as well,
> just for consistency.
Sure. I can make it of type unsigned long as well.
>
> > +
> > + /* SMT siblings share throughput. */
> > + if (busy_cpus > 1 && sg->flags & SD_SHARE_CPUCAPACITY) {
> > + before /= busy_cpus;
> > + /* One sibling will become idle after load balance. */
> > + after /= busy_cpus - 1;
> > + }
> > +
> > + sgs->ipcc_score_after = after + score_on_dst_cpu;
> > + sgs->ipcc_score_before = before;
>
> Shouldn't the score_on_dst_cpu be added to "after" before being divided
> between the SMT siblings?
No, because ipcc_score_after represents the joint score of the busiest
core and the destination core after load balance has taken place. The
destination core was previously idle and now contributes to the joint
score.
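To illustrate with made-up numbers: consider a busiest core with two
busy SMT siblings, sum_score = 1000 and min_score = 400, and a
destination CPU with score_on_dst_cpu = 450. Then before = 1000 / 2 =
500, and after = (1000 - 400) / (2 - 1) = 600, because the remaining
sibling no longer shares the core. Adding the destination yields
ipcc_score_after = 600 + 450 = 1050. Dividing after adding
score_on_dst_cpu would wrongly halve the destination's score, as if it
shared the busiest core's resources.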
Thanks and BR,
Ricardo
On Thu, Jun 22, 2023 at 11:20:47AM +0100, Ionela Voinescu wrote:
> Hi,
>
> On Monday 12 Jun 2023 at 21:24:16 (-0700), Ricardo Neri wrote:
> > The raw classification that hardware provides for a task may not
> > be directly usable by the scheduler: the classification may change too
> > frequently or architecture-specific implementations may need to consider
> > additional factors. For instance, some processors with Intel Thread
> > Director need to consider the state of the SMT siblings of a core.
> >
> > Provide per-task helper variables that architectures can use to
> > postprocess the classification that hardware provides.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Ionela Voinescu <[email protected]>
> > Cc: Joel Fernandes (Google) <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Lukasz Luba <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Perry Yuan <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: Zhao Liu <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Ricardo Neri <[email protected]>
> > ---
> > Changes since v3:
> > * None
> >
> > Changes since v2:
> > * None
> >
> > Changes since v1:
> > * Used bit-fields to fit all the IPC class data in 4 bytes. (PeterZ)
> > * Shortened names of the helpers.
> > * Renamed helpers with the ipcc_ prefix.
> > * Reworded commit message for clarity
> > ---
> > include/linux/sched.h | 12 +++++++++++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 0e1c38ad09c2..719147460ca8 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1541,7 +1541,17 @@ struct task_struct {
> > * A hardware-defined classification of task that reflects but is
> > * not identical to the number of instructions per cycle.
> > */
> > - unsigned short ipcc;
> > + unsigned int ipcc : 9;
> > + /*
> > + * A candidate classification that arch-specific implementations
> > + * qualify for correctness.
> > + */
> > + unsigned int ipcc_tmp : 9;
> > + /*
> > + * Counter to filter out transient candidate classifications
> > + * of a task.
> > + */
> > + unsigned int ipcc_cntr : 14;
> > #endif
> >
>
> Why does this need to be split in task_struct? Isn't this architecture
> specific? IMO the scheduler should never make use of class information
> itself. It only receives the value through the call of an arch function
> and passes it as an argument to an arch function to get a performance
> score. So the way one interprets the class value (splits it into relevant
> and helper bits) should be defined and considered in the architecture
> specific code.
This is an excellent observation. The scheduler is unconcerned with what
the classes mean. I'll remove these auxiliary members.
This also removes 2 patches from the series.
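For illustration only, the arch-private helper state could look like
this hypothetical sketch (field names borrowed from the current patch;
the exact placement, e.g. in x86's struct thread_struct, is to be
decided):

	struct ipcc_arch_state {
		/* Candidate classification pending qualification. */
		unsigned int tmp : 9;
		/* Counter to debounce transient classifications. */
		unsigned int cntr : 14;
	};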
Thanks and BR,
Ricardo
On Sun, 2023-06-25 at 13:11 -0700, Ricardo Neri wrote:
>
> > > +
> > > + score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
> > > +
> > > + /*
> > > + * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
> > > + * and not used.
> > > + */
The comment does not match the check below. If zero
is not used, the check should reflect that case.
> > > + if (IS_ERR_VALUE(score_on_dst_cpu))
> > > + return;
> > > +
> > > + before = sgs->sum_score;
> > > + after = before - sgs->min_score;
> >
>
Tim
Hey,
On Sunday 25 Jun 2023 at 13:11:55 (-0700), Ricardo Neri wrote:
> On Thu, Jun 22, 2023 at 10:02:44AM +0100, Ionela Voinescu wrote:
> > On Monday 12 Jun 2023 at 21:24:05 (-0700), Ricardo Neri wrote:
> > > When using IPCC scores to break ties between two scheduling groups, it is
> > > necessary to consider both the current score and the score that would
> > > result after load balancing.
> > >
> > > Compute the combined IPC class score of a scheduling group and the local
> > > scheduling group. Compute both the current score and the prospective score.
> > >
> > > Collect IPCC statistics only for asym_packing and fully_busy scheduling
> > > groups. These are the only cases that use IPCC scores.
> > >
> > > These IPCC statistics are used during idle load balancing. The candidate
> > > scheduling group will have one fewer busy CPU after load balancing. This
> > > observation is important for cores with SMT support.
> > >
> > > The IPCC score of scheduling groups composed of SMT siblings needs to
> > > consider that the siblings share CPU resources. When computing the total
> > > IPCC score of the scheduling group, divide the score of each sibling by
> > > the number of busy siblings.
> > >
> > > Cc: Ben Segall <[email protected]>
> > > Cc: Daniel Bristot de Oliveira <[email protected]>
> > > Cc: Dietmar Eggemann <[email protected]>
> > > Cc: Ionela Voinescu <[email protected]>
> > > Cc: Joel Fernandes (Google) <[email protected]>
> > > Cc: Len Brown <[email protected]>
> > > Cc: Lukasz Luba <[email protected]>
> > > Cc: Mel Gorman <[email protected]>
> > > Cc: Perry Yuan <[email protected]>
> > > Cc: Rafael J. Wysocki <[email protected]>
> > > Cc: Srinivas Pandruvada <[email protected]>
> > > Cc: Steven Rostedt <[email protected]>
> > > Cc: Tim C. Chen <[email protected]>
> > > Cc: Valentin Schneider <[email protected]>
> > > Cc: Zhao Liu <[email protected]>
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Signed-off-by: Ricardo Neri <[email protected]>
> > > ---
> > > Changes since v3:
> > > * None
> > >
> > > Changes since v2:
> > > * Also collect IPCC stats for fully_busy sched groups.
> > > * Restrict use of IPCC stats to SD_ASYM_PACKING. (Ionela)
> > > * Handle errors of arch_get_ipcc_score(). (Ionela)
> > >
> > > Changes since v1:
> > > * Implemented cleanups and reworks from PeterZ. I took all his
> > > suggestions, except the computation of the IPC score before and after
> > > load balancing. We are computing not the average score, but the *total*.
> > > * Check for the SD_SHARE_CPUCAPACITY to compute the throughput of the SMT
> > > siblings of a physical core.
> > > * Used the new interface names.
> > > * Reworded commit message for clarity.
> > > ---
> > > kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 68 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index c0cab5e501b6..a51c65c9335f 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -9114,6 +9114,8 @@ struct sg_lb_stats {
> > > unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
> > > unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
> > > unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
> > > + long ipcc_score_after; /* Prospective IPCC score after load balancing */
> > > + unsigned long ipcc_score_before; /* IPCC score before load balancing */
> > > #endif
> > > };
> > >
> > > @@ -9452,6 +9454,62 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> > > }
> > > }
> > >
> > > +static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
> > > + struct sched_group *sg,
> > > + struct lb_env *env)
> > > +{
> > > + unsigned long score_on_dst_cpu, before;
> > > + int busy_cpus;
> > > + long after;
> > > +
> > > + if (!sched_ipcc_enabled())
> > > + return;
> > > +
> > > + /*
> > > + * IPCC scores are only useful during idle load balancing. For now,
> > > + * only asym_packing uses IPCC scores.
> > > + */
> > > + if (!(env->sd->flags & SD_ASYM_PACKING) ||
> > > + env->idle == CPU_NOT_IDLE)
> > > + return;
> > > +
> > > + /*
> > > + * IPCC scores are used to break ties only between these types of
> > > + * groups.
> > > + */
> > > + if (sgs->group_type != group_fully_busy &&
> > > + sgs->group_type != group_asym_packing)
> > > + return;
> > > +
> > > + busy_cpus = sgs->group_weight - sgs->idle_cpus;
> > > +
> > > + /* No busy CPUs in the group. No tasks to move. */
> > > + if (!busy_cpus)
> > > + return;
> > > +
> > > + score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
> > > +
> > > + /*
> > > + * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
> > > + * and not used.
> > > + */
> > > + if (IS_ERR_VALUE(score_on_dst_cpu))
> > > + return;
> > > +
> > > + before = sgs->sum_score;
> > > + after = before - sgs->min_score;
> >
> > I don't believe this can end up being negative, as the sum of all
> > scores should be higher than or equal to the min score, right?
>
> Yes, I agree. `after` cannot be negative.
>
> >
> > I'm just wondering if ipcc_score_after can be made unsigned long as well,
> > just for consistency.
>
> Sure. I can make it of type unsigned long as well.
>
> >
> > > +
> > > + /* SMT siblings share throughput. */
> > > + if (busy_cpus > 1 && sg->flags & SD_SHARE_CPUCAPACITY) {
> > > + before /= busy_cpus;
> > > + /* One sibling will become idle after load balance. */
> > > + after /= busy_cpus - 1;
> > > + }
> > > +
> > > + sgs->ipcc_score_after = after + score_on_dst_cpu;
> > > + sgs->ipcc_score_before = before;
> >
> > Shouldn't the score_on_dst_cpu be added to "after" before being divided
> > between the SMT siblings?
>
> No, because ipcc_score_after represents the joint score of the busiest
> core and the destination core after load balance has taken place. The
> destination core was previously idle and now contributes to the joint
> score.
>
Right! score_on_dst_cpu does not contribute to the per-cpu throughput of
the busiest core, but it reflects the improvement in score gained by the
move to the destination.
Thanks,
Ionela.
> Thanks and BR,
> Ricardo
On Tue, Jun 13, 2023 at 6:23 AM Ricardo Neri
<[email protected]> wrote:
>
> Implement the arch_get_ipcc_score() interface of the scheduler. Use the
> performance capabilities of the extended Hardware Feedback Interface table
> as the IPC score.
>
> Use the cached per-CPU IPCC scores. A seqcount provides lockless access and
> the required memory ordering to avoid stale data.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Ionela Voinescu <[email protected]>
> Cc: Joel Fernandes (Google) <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Lukasz Luba <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Perry Yuan <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Zhao Liu <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Ricardo Neri <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
> ---
> Changes since v3:
> * Ordered memory loads using a seqcount.
> * Removed local variable hfi_class from intel_hfi_get_ipcc_score().
> (Rafael).
>
> Changes since v2:
> * None
>
> Changes since v1:
> * Adjusted the returned HFI class (which starts at 0) to match the
> scheduler IPCC class (which starts at 1). (PeterZ)
> * Used the new interface names.
> ---
> arch/x86/include/asm/topology.h | 4 ++++
> drivers/thermal/intel/intel_hfi.c | 40 +++++++++++++++++++++++++++++--
> 2 files changed, 42 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index 7d2ed7f7bb8c..8dd328cdb5cf 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -237,13 +237,17 @@ void init_freq_invariance_cppc(void);
>
> #ifdef CONFIG_INTEL_HFI_THERMAL
> int intel_hfi_read_classid(u8 *classid);
> +unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu);
> #else
> static inline int intel_hfi_read_classid(u8 *classid) { return -ENODEV; }
> +static inline unsigned long
> +intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu) { return -ENODEV; }
> #endif
>
> #ifdef CONFIG_IPC_CLASSES
> void intel_update_ipcc(struct task_struct *curr);
> #define arch_update_ipcc intel_update_ipcc
> +#define arch_get_ipcc_score intel_hfi_get_ipcc_score
> #endif
>
> #endif /* _ASM_X86_TOPOLOGY_H */
> diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> index d822ed0bb5c1..fec2c01fda1b 100644
> --- a/drivers/thermal/intel/intel_hfi.c
> +++ b/drivers/thermal/intel/intel_hfi.c
> @@ -199,6 +199,42 @@ static int alloc_hfi_ipcc_scores(void)
> return hfi_ipcc_scores ? 0 : -ENOMEM;
> }
>
> +unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
> +{
> + int *scores, score;
> + unsigned long seq;
> +
> + if (cpu < 0 || cpu >= nr_cpu_ids)
> + return -EINVAL;
> +
> + scores = per_cpu_ptr(hfi_ipcc_scores, cpu);
> + if (!scores)
> + return -ENODEV;
> +
> + if (ipcc == IPC_CLASS_UNCLASSIFIED)
> + return -EINVAL;
> +
> + /*
> + * Scheduler IPC classes start at 1. HFI classes start at 0.
> + * See note in intel_hfi_update_ipcc().
> + */
> + if (ipcc >= hfi_features.nr_classes + 1)
> + return -EINVAL;
> +
> + /*
> + * The seqcount implies load-acquire semantics to order loads with
> + * lockless stores of the write side in set_hfi_ipcc_score(). It
> + * also implies a compiler barrier.
> + */
> + do {
> + seq = read_seqcount_begin(&hfi_ipcc_seqcount);
> + /* @ipcc is never 0. */
> + score = scores[ipcc - 1];
> + } while (read_seqcount_retry(&hfi_ipcc_seqcount, seq));
> +
> + return score;
> +}
> +
> static void set_hfi_ipcc_scores(struct hfi_instance *hfi_instance)
> {
> int cpu;
> @@ -213,8 +249,8 @@ static void set_hfi_ipcc_scores(struct hfi_instance *hfi_instance)
> raw_spin_lock_irq(&hfi_instance->table_lock);
> /*
> * The seqcount implies store-release semantics to order stores with
> - * lockless loads from the seqcount read side. It also implies a
> - * compiler barrier.
> + * lockless loads from the seqcount read side in
> + * intel_hfi_get_ipcc_score(). It also implies a compiler barrier.
> */
> write_seqcount_begin(&hfi_ipcc_seqcount);
> for_each_cpu(cpu, hfi_instance->cpus) {
> --
> 2.25.1
>
On Thu, Jun 29, 2023 at 08:56:36PM +0200, Rafael J. Wysocki wrote:
> On Tue, Jun 13, 2023 at 6:23 AM Ricardo Neri
> <[email protected]> wrote:
> >
> > Implement the arch_get_ipcc_score() interface of the scheduler. Use the
> > performance capabilities of the extended Hardware Feedback Interface table
> > as the IPC score.
> >
> > Use the cached per-CPU IPCC scores. A seqcount provides lockless access and
> > the required memory ordering to avoid stale data.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Ionela Voinescu <[email protected]>
> > Cc: Joel Fernandes (Google) <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Lukasz Luba <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Perry Yuan <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: Zhao Liu <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Ricardo Neri <[email protected]>
>
> Acked-by: Rafael J. Wysocki <[email protected]>
>
Thank you Rafael!
On Mon, Jun 26, 2023 at 02:01:25PM -0700, Tim Chen wrote:
> On Sun, 2023-06-25 at 13:11 -0700, Ricardo Neri wrote:
> >
> > > > +
> > > > + score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
> > > > +
> > > > + /*
> > > > + * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
> > > > + * and not used.
> > > > + */
>
> The comment does not match the check below. If zero
> is not used, the check should reflect that case.
Agreed. This comment is not clear. I meant to say that returning here
has the effect of leaving the `before` and `after` scores of this group as
zero.
Since zero is the minimum possible score, this group will not be selected
during the tie breaker, unless the statistics of all other groups are also
zero.
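Perhaps something along these lines would be clearer (a suggestion
only):

	/*
	 * If the IPCC score of the destination CPU is unusable, leave
	 * sgs::ipcc_score_{before, after} as zero. Since zero is the
	 * lowest possible score, this group will lose the tie breaker
	 * unless all other candidate groups also score zero.
	 */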
>
> > > > + if (IS_ERR_VALUE(score_on_dst_cpu))
> > > > + return;
> > > > +
> > > > + before = sgs->sum_score;
> > > > + after = before - sgs->min_score;
> > >
> >
> Tim