2022-09-09 23:29:30

by Ricardo Neri

Subject: [RFC PATCH 00/23] sched: Introduce classes of tasks for load balance

+++ Introduction

On hybrid processors, the microarchitectural properties of the different
types of CPUs cause them to have different instructions-per-cycle (IPC)
capabilities. IPC can be higher on some CPUs for advanced instructions.
Figure 1 illustrates this concept. It plots hypothetical workloads
grouped by classes of instructions vs the IPC ratio between high and low
performance CPUs.

IPC ratio
^
| Class0 . Class1 . ClassN-1 . ClassN
| . . . +
| . . . +
| . . . +
| . . + + + + + + + +
| . . .
| . . .
| . + + + + + + + + + + + .
| + + + + + + + . .
| + . . .
| + . . .
| + . . .
| + . . .
1 |-------------------------------------------//---------------------------->
| workloads of interest

Figure 1. Hypothetical workloads sorted by IPC ratio


The load balancer can discover the use of advanced instructions and prefer
CPUs with higher IPC for tasks running those instructions.

Hardware is free to partition its instruction set into an arbitrary number
of classes. It must provide a mechanism to identify the class of the
currently running task and inform the kernel about the performance of each
class of task on each type of CPU.
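
As a reference, this contract reduces to three architecture hooks; the
outline below is a simplified sketch of the interfaces proposed in patch 2
of this series (signatures taken from that patch, implementation details
omitted):

	/* True if the hardware can classify running tasks. */
	bool arch_has_task_classes(void);

	/*
	 * Read the latest hardware classification of @curr and store it in
	 * curr->class. On some CPUs the result is only reliable when all
	 * SMT siblings of the CPU of @curr are idle.
	 */
	void arch_update_task_class(struct task_struct *curr,
				    bool smt_siblings_idle);

	/*
	 * Performance score of a class of tasks when running on @cpu.
	 * Higher scores indicate higher performance.
	 */
	int arch_get_task_class_score(int class, int cpu);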

This patchset introduces the concept of classes of tasks, proposes the
interfaces that hardware needs to implement and proposes changes to the
load balancer to leverage this extra information in combination with
asymmetric packing.

This patchset includes a full implementation for Intel hybrid processors
using Intel Thread Director technology [1].


+++ Structure of this series

Patches 1-6 introduce the concept of classes of tasks. They also introduce
interfaces that architectures implement to update the class of a task and
to inform the scheduler about the class-specific performance scores.

Patches 7-10 use the class of the current task of each runqueue to break
ties between two otherwise identical group_asym_packing scheduling groups.

Patches 11-17 implement support for classes of tasks on Intel hybrid
processors using Intel Thread Director technology.

Patches 18-20 introduce extra helper members to task_struct to deal with
transient classification of tasks and arch-specific implementation
vagaries.

Patches 21-23 are specific to Intel Thread Director. They reset the
classification hardware when switching to a new task.


+++ Classes of tasks

The first step to leverage the asymmetries in IPC ratios is to assign a
class label to each individual task. Hardware can monitor the instruction
stream and classify a task as it runs. At user tick, the kernel reads the
resulting classification and assigns it to the currently running task. It
stores the class in the proposed task_struct::class.
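
A simplified sketch of the tick path (mirroring the hooks added in patches
4 and 6 of this series, with everything unrelated to task classification
elided) illustrates where the classification is read:

	void scheduler_tick(bool user_tick)
	{
		int cpu = smp_processor_id();
		struct rq *rq = cpu_rq(cpu);
		struct task_struct *curr = rq->curr;

		/*
		 * On user ticks, read the most recent hardware classification
		 * of the running task. The arch callback stores the result in
		 * curr->class.
		 */
		if (sched_task_classes_enabled() && user_tick)
			arch_update_task_class(curr, is_core_idle(cpu));

		/* ... the rest of the regular tick processing ... */
	}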


+++ Balancing load using classes of tasks. Theory of operation

Intel hybrid processors rely on asymmetric packing to place tasks on
higher performance CPUs first. The class of the current task on each
runqueue can be used to break ties between two otherwise identical
scheduling groups.

Consider these scenarios (for simplicity, assume that task-class
performance score is such that

score(Cl0) < score(Cl1) < ... < score(Cl(n-1)) < score(Cln)). (Eq I)

Boxes depict scheduling groups being considered during load balance.
Labels inside the box depict the class of rq->curr, or the CPU being
idle.

asym
packing
priorities 50 50 30 30
_____________
| i . i |
| d . d |
| l . l | _______ _______
| e . e | | Cl0 | | Cl1 |
|___________| |_____| |_____|

^
dst_cpu sgA sgB
^
busiest

Figure 2. Scenario A

In Figure 2, dst_cpu is a group of SMT siblings, has become idle, has
higher priority, and is identifying the busiest group. sgA and sgB are of
group_asym_packing type, have the same priority, have a single CPU, and
have the same number of running tasks. By checking the class of the task
currently running in each scheduling group, the load balancer selects sgB as
the busiest because its running task belongs to a class with a higher
performance score if placed on dst_cpu.
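
As a worked example with made-up numbers consistent with Eq. I, suppose
that on dst_cpu score(Cl0) = 2 and score(Cl1) = 4. Pulling the task of sgB
contributes 4 units of throughput on dst_cpu, whereas pulling the task of
sgA contributes only 2. Hence, sgB is the better choice of busiest group.
(The numbers are purely illustrative; actual scores come from hardware.)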

asym
packing
priorities 50 50 50 50 30
_____________ _____________
| . | | . |
| . | | . | idle
| cl0 . cl1 | | cl0 . cl2 | _______
| . | | . | | |
|___________| |___________| |_____|

^
sgA sgB dst_cpu
^
busiest group

^
busiest queue

Figure 3. Scenario B

In Figure 3, dst_cpu has become idle, has lower priority, and is identifying
the busiest group. sgA and sgB are groups of SMT siblings. Both siblings of
each group are busy; the groups are therefore classified as
group_asym_packing [2], have the same priority, and have the same number of
running tasks. The load balancer computes the class-specific performance
score (scaled by the number of busy siblings) by observing the currently
running task on each runqueue.

As per Eq. I, cl0+cl2 has a higher throughput than cl0+cl1. So, it selects
sgB as the busiest group. If cl2 is left to run with the whole big core to
itself, it would deliver higher throughput than cl0. Hence, the runqueue of
cl0 is selected as the busiest.
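
A worked example with made-up numbers, again consistent with Eq. I: suppose
that on the big cores score(cl0) = 1, score(cl1) = 2 and score(cl2) = 4, and
that every class scores 1 on the lower-priority dst_cpu. With both SMT
siblings busy, sgA scores (1 + 2) / 2 = 1.5 and sgB scores (1 + 4) / 2 = 2.5
before load balancing. After pulling the lowest-scoring task (cl0) to
dst_cpu, sgA would deliver 2 + 1 = 3 while sgB would deliver 4 + 1 = 5.
sgB yields the higher prospective throughput and is picked as the busiest
group; within it, the runqueue running cl0 is picked as the busiest queue.
(Scores are purely illustrative.)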


+++ Dependencies
These patches assume that both SMT siblings of a core have the same
priority, as proposed in [3]. Also, they rely on the existing support for
the Hardware Feedback Interface [4].


I look forward to your review and thank you in advance!

These patches have been Acked-by: Len Brown <[email protected]>

Thanks and BR,
Ricardo

[1]. Intel Software Developer Manual, Volume 3, Section 14.6
https://intel.com/sdm
[2]. https://lkml.kernel.org/r/[email protected]
[3]. https://lore.kernel.org/lkml/[email protected]/
[4]. https://lore.kernel.org/linux-pm/[email protected]/

Ricardo Neri (23):
sched/task_struct: Introduce classes of tasks
sched: Add interfaces for classes of tasks
sched/core: Initialize the class of a new task
sched/core: Add user_tick as argument to scheduler_tick()
sched/core: Move is_core_idle() out of fair.c
sched/core: Update the classification of the current task
sched/fair: Collect load-balancing stats for task classes
sched/fair: Compute task-class performance scores for load balancing
sched/fair: Use task-class performance score to pick the busiest group
sched/fair: Use classes of tasks when selecting a busiest runqueue
thermal: intel: hfi: Introduce Hardware Feedback Interface classes
thermal: intel: hfi: Convert table_lock to use flags-handling variants
x86/cpufeatures: Add the Intel Thread Director feature definitions
thermal: intel: hfi: Update the class of the current task
thermal: intel: hfi: Report per-cpu class-specific performance scores
thermal: intel: hfi: Define a default classification for unclassified
tasks
thermal: intel: hfi: Enable the Intel Thread Director
sched/task_struct: Add helpers for task classification
sched/core: Initialize helpers of task classification
thermal: intel: hfi: Implement model-specific checks for task
classification
x86/cpufeatures: Add feature bit for HRESET
x86/hreset: Configure history reset
x86/process: Reset hardware history in context switch

arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/hreset.h | 30 ++++
arch/x86/include/asm/msr-index.h | 4 +
arch/x86/include/asm/topology.h | 10 ++
arch/x86/kernel/cpu/common.c | 35 +++-
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
arch/x86/kernel/cpu/scattered.c | 1 +
arch/x86/kernel/process_32.c | 3 +
arch/x86/kernel/process_64.c | 3 +
drivers/thermal/intel/Kconfig | 12 ++
drivers/thermal/intel/intel_hfi.c | 215 +++++++++++++++++++++--
include/linux/sched.h | 19 +-
init/Kconfig | 9 +
kernel/sched/core.c | 10 +-
kernel/sched/fair.c | 214 ++++++++++++++++++++--
kernel/sched/sched.h | 81 +++++++++
kernel/sched/topology.c | 8 +
kernel/time/timer.c | 2 +-
19 files changed, 635 insertions(+), 32 deletions(-)
create mode 100644 arch/x86/include/asm/hreset.h

--
2.25.1


2022-09-09 23:31:06

by Ricardo Neri

Subject: [RFC PATCH 01/23] sched/task_struct: Introduce classes of tasks

On hybrid processors, the architecture differences between the types of
CPUs lead to different instructions-per-cycle (IPC) on each type of CPU.
IPCs may differ further by the type of instructions. Instructions can be
grouped into classes of similar IPCs.

Hence, tasks may be classified into groups depending on the type of
instructions they execute.

Add a new member, task_struct::class, to associate a particular task with
a classification that depends on the instructions it executes.

The scheduler may use the classification of a task and information about
the relative performance among CPUs of a particular class of task to
improve throughput. It may, for instance, place certain tasks on CPUs that
have higher performance.

The methods to determine the classification of a task and its relative
IPC are specific to each CPU architecture.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
include/linux/sched.h | 7 +++++++
init/Kconfig | 9 +++++++++
2 files changed, 16 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e7b2f8a5c711..acc33dbaa47c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -117,6 +117,8 @@ struct task_group;
__TASK_TRACED | EXIT_DEAD | EXIT_ZOMBIE | \
TASK_PARKED)

+#define TASK_CLASS_UNCLASSIFIED -1
+
#define task_is_running(task) (READ_ONCE((task)->__state) == TASK_RUNNING)

#define task_is_traced(task) ((READ_ONCE(task->jobctl) & JOBCTL_TRACED) != 0)
@@ -1512,6 +1514,11 @@ struct task_struct {
union rv_task_monitor rv[RV_PER_TASK_MONITORS];
#endif

+#ifdef CONFIG_SCHED_TASK_CLASSES
+ /* Class of task that the scheduler uses for task placement decisions */
+ short class;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/init/Kconfig b/init/Kconfig
index 532362fcfe31..edfa27f8717a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -854,6 +854,15 @@ config UCLAMP_BUCKETS_COUNT

If in doubt, use the default value.

+config SCHED_TASK_CLASSES
+ bool "Classes of tasks"
+ depends on SMP
+ help
+ If selected, each task is assigned a classification value that
+ reflects the types of instructions that the task executes. The
+ scheduler uses the classification value to improve the placement of
+ tasks.
+
endmenu

#
--
2.25.1

2022-09-09 23:31:53

by Ricardo Neri

Subject: [RFC PATCH 12/23] thermal: intel: hfi: Convert table_lock to use flags-handling variants

Currently, the table of an HFI instance is only accessed from the HFI
thermal interrupt handler and from the delayed work that sends the thermal
netlink event to user space.

When using Intel Thread Director to support classes of tasks in the
scheduler, the HFI table will also be accessed from the timer interrupt
handler.

As two interrupt handlers will concurrently access the table, update locks
to use raw_spin_[un]lock_irq[save|restore].

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
drivers/thermal/intel/intel_hfi.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index 405495dad0b2..4bafe6848d5d 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -175,9 +175,10 @@ static struct workqueue_struct *hfi_updates_wq;
static void get_hfi_caps(struct hfi_instance *hfi_instance,
struct thermal_genl_cpu_caps *cpu_caps)
{
+ unsigned long flags;
int cpu, i = 0;

- raw_spin_lock_irq(&hfi_instance->table_lock);
+ raw_spin_lock_irqsave(&hfi_instance->table_lock, flags);
for_each_cpu(cpu, hfi_instance->cpus) {
struct hfi_cpu_data *caps;
s16 index;
@@ -199,7 +200,7 @@ static void get_hfi_caps(struct hfi_instance *hfi_instance,

++i;
}
- raw_spin_unlock_irq(&hfi_instance->table_lock);
+ raw_spin_unlock_irqrestore(&hfi_instance->table_lock, flags);
}

/*
@@ -262,6 +263,7 @@ void intel_hfi_process_event(__u64 pkg_therm_status_msr_val)
struct hfi_instance *hfi_instance;
int cpu = smp_processor_id();
struct hfi_cpu_info *info;
+ unsigned long flags;
u64 new_timestamp;

if (!pkg_therm_status_msr_val)
@@ -298,7 +300,7 @@ void intel_hfi_process_event(__u64 pkg_therm_status_msr_val)
return;
}

- raw_spin_lock(&hfi_instance->table_lock);
+ raw_spin_lock_irqsave(&hfi_instance->table_lock, flags);

/*
* Copy the updated table into our local copy. This includes the new
@@ -307,7 +309,7 @@ void intel_hfi_process_event(__u64 pkg_therm_status_msr_val)
memcpy(hfi_instance->local_table, hfi_instance->hw_table,
hfi_features.nr_table_pages << PAGE_SHIFT);

- raw_spin_unlock(&hfi_instance->table_lock);
+ raw_spin_unlock_irqrestore(&hfi_instance->table_lock, flags);
raw_spin_unlock(&hfi_instance->event_lock);

/*
--
2.25.1

2022-09-09 23:38:00

by Ricardo Neri

Subject: [RFC PATCH 20/23] thermal: intel: hfi: Implement model-specific checks for task classification

In Alderlake and Raptorlake, the result of thread classification is more
accurate when only one SMT sibling is busy. Classification results for
classes 2 and 3 are always reliable.

To avoid unnecessary migrations, only update the class of a task if it has
been the same during 4 consecutive ticks.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
drivers/thermal/intel/intel_hfi.c | 58 ++++++++++++++++++++++++++++++-
1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index 9ddd3047eb39..972ccfd392cf 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -39,6 +39,7 @@
#include <linux/workqueue.h>

#include <asm/msr.h>
+#include <asm/intel-family.h>

#include "../thermal_core.h"
#include "intel_hfi.h"
@@ -214,6 +215,60 @@ int intel_hfi_has_task_classes(void)
return cpu_feature_enabled(X86_FEATURE_ITD);
}

+#define CLASS_DEBOUNCER_SKIPS 4
+
+/**
+ * debounce_and_update_class() - Process and update a task's classification
+ *
+ * @p: The task of which the classification will be updated
+ * @new_class: The new class that hardware provides
+ *
+ * Update the classification of @p with the new value that hardware provides.
+ * Only update the classification of @p if it has been the same during
+ * CLASS_DEBOUNCER_SKIPS consecutive ticks.
+ */
+static void debounce_and_update_class(struct task_struct *p, u8 new_class)
+{
+ char debounce_skip;
+
+ /* The class of @p changed, only restart the debounce counter. */
+ if (p->class_candidate != new_class) {
+ p->class_debounce_counter = 1;
+ goto out;
+ }
+
+ /*
+ * The class of @p did not change. Update it if it has been the same
+ * for CLASS_DEBOUNCER_SKIPS user ticks.
+ */
+ debounce_skip = p->class_debounce_counter + 1;
+ if (debounce_skip < CLASS_DEBOUNCER_SKIPS)
+ p->class_debounce_counter++;
+ else
+ p->class = new_class;
+
+out:
+ p->class_candidate = new_class;
+}
+
+static bool classification_is_accurate(u8 class, bool smt_siblings_idle)
+{
+ switch (boot_cpu_data.x86_model) {
+ case INTEL_FAM6_ALDERLAKE:
+ case INTEL_FAM6_ALDERLAKE_L:
+ case INTEL_FAM6_RAPTORLAKE:
+ case INTEL_FAM6_RAPTORLAKE_P:
+ case INTEL_FAM6_RAPTORLAKE_S:
+ if (class == 3 || class == 2 || smt_siblings_idle)
+ return true;
+
+ return false;
+
+ default:
+ return true;
+ }
+}
+
void intel_hfi_update_task_class(struct task_struct *curr, bool smt_siblings_idle)
{
union hfi_thread_feedback_char_msr msr;
@@ -228,7 +283,8 @@ void intel_hfi_update_task_class(struct task_struct *curr, bool smt_siblings_idl
if (!msr.split.valid)
return;

- curr->class = msr.split.classid;
+ if (classification_is_accurate(msr.split.classid, smt_siblings_idle))
+ debounce_and_update_class(curr, msr.split.classid);
}

static void get_one_hfi_cap(struct hfi_instance *hfi_instance, s16 index,
--
2.25.1

2022-09-09 23:38:35

by Ricardo Neri

Subject: [RFC PATCH 17/23] thermal: intel: hfi: Enable the Intel Thread Director

Enable Intel Thread Director from the CPU hotplug callback: globally from
CPU0 and then enable the thread-classification hardware in each logical
processor individually.

Also, initialize the number of classes supported.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
arch/x86/include/asm/msr-index.h | 2 ++
drivers/thermal/intel/Kconfig | 12 ++++++++++++
drivers/thermal/intel/intel_hfi.c | 30 ++++++++++++++++++++++++++++--
3 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 6674bdb096f3..810b950dc2e0 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1053,5 +1053,7 @@
/* Hardware Feedback Interface */
#define MSR_IA32_HW_FEEDBACK_PTR 0x17d0
#define MSR_IA32_HW_FEEDBACK_CONFIG 0x17d1
+#define MSR_IA32_HW_FEEDBACK_THREAD_CONFIG 0x17d4
+#define MSR_IA32_HW_FEEDBACK_CHAR 0x17d2

#endif /* _ASM_X86_MSR_INDEX_H */
diff --git a/drivers/thermal/intel/Kconfig b/drivers/thermal/intel/Kconfig
index f0c845679250..8848bf45ffbc 100644
--- a/drivers/thermal/intel/Kconfig
+++ b/drivers/thermal/intel/Kconfig
@@ -113,3 +113,15 @@ config INTEL_HFI_THERMAL
These capabilities may change as a result of changes in the operating
conditions of the system such power and thermal limits. If selected,
the kernel relays updates in CPUs' capabilities to userspace.
+
+config INTEL_THREAD_DIRECTOR
+ bool "Intel Thread Director"
+ depends on INTEL_HFI_THERMAL
+ depends on SMP
+ select SCHED_TASK_CLASSES
+ help
+ Select this option to enable the Intel Thread Director. If selected,
+ hardware classifies tasks based on the type of instructions they
+ execute. It also provides performance capabilities for each class of
+ task. On hybrid processors, the scheduler uses this data to place
+ tasks of classes of higher performance on higher-performance CPUs.
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index 967899d2c01f..9ddd3047eb39 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -49,6 +49,8 @@
/* Hardware Feedback Interface MSR configuration bits */
#define HW_FEEDBACK_PTR_VALID_BIT BIT(0)
#define HW_FEEDBACK_CONFIG_HFI_ENABLE_BIT BIT(0)
+#define HW_FEEDBACK_CONFIG_ITD_ENABLE_BIT BIT(1)
+#define HW_FEEDBACK_THREAD_CONFIG_ENABLE_BIT BIT(0)

/* CPUID detection and enumeration definitions for HFI */

@@ -73,6 +75,15 @@ union cpuid6_edx {
u32 full;
};

+union cpuid6_ecx {
+ struct {
+ u32 dont_care0:8;
+ u32 nr_classes:8;
+ u32 dont_care1:16;
+ } split;
+ u32 full;
+};
+
#ifdef CONFIG_INTEL_THREAD_DIRECTOR
union hfi_thread_feedback_char_msr {
struct {
@@ -481,6 +492,11 @@ void intel_hfi_online(unsigned int cpu)

init_hfi_cpu_index(info);

+ if (cpu_feature_enabled(X86_FEATURE_ITD)) {
+ msr_val = HW_FEEDBACK_THREAD_CONFIG_ENABLE_BIT;
+ wrmsrl(MSR_IA32_HW_FEEDBACK_THREAD_CONFIG, msr_val);
+ }
+
/*
* Now check if the HFI instance of the package/die of @cpu has been
* initialized (by checking its header). In such case, all we have to
@@ -536,6 +552,10 @@ void intel_hfi_online(unsigned int cpu)
*/
rdmsrl(MSR_IA32_HW_FEEDBACK_CONFIG, msr_val);
msr_val |= HW_FEEDBACK_CONFIG_HFI_ENABLE_BIT;
+
+ if (cpu_feature_enabled(X86_FEATURE_ITD))
+ msr_val |= HW_FEEDBACK_CONFIG_ITD_ENABLE_BIT;
+
wrmsrl(MSR_IA32_HW_FEEDBACK_CONFIG, msr_val);

unlock:
@@ -615,8 +635,14 @@ static __init int hfi_parse_features(void)
*/
hfi_features.class_stride = nr_capabilities;

- /* For now, use only one class of the HFI table */
- hfi_features.nr_classes = 1;
+ if (cpu_feature_enabled(X86_FEATURE_ITD)) {
+ union cpuid6_ecx ecx;
+
+ ecx.full = cpuid_ecx(CPUID_HFI_LEAF);
+ hfi_features.nr_classes = ecx.split.nr_classes;
+ } else {
+ hfi_features.nr_classes = 1;
+ }

/*
* The header contains change indications for each supported feature.
--
2.25.1

2022-09-09 23:40:13

by Ricardo Neri

Subject: [RFC PATCH 03/23] sched/core: Initialize the class of a new task

New tasks shall start life as unclassified. They will be classified by
hardware when they run.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
kernel/sched/core.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee28253c9ac0..db548c1a25ef 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4336,6 +4336,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.prev_sum_exec_runtime = 0;
p->se.nr_migrations = 0;
p->se.vruntime = 0;
+#ifdef CONFIG_SCHED_TASK_CLASSES
+ p->class = TASK_CLASS_UNCLASSIFIED;
+#endif
INIT_LIST_HEAD(&p->se.group_node);

#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.25.1

2022-09-09 23:42:33

by Ricardo Neri

Subject: [RFC PATCH 21/23] x86/cpufeatures: Add feature bit for HRESET

The HRESET instruction prevents the classification of the
current task from influencing the classification of the next task
when running serially on the same logical processor.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 2 ++
arch/x86/kernel/cpu/scattered.c | 1 +
3 files changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d3202d665ac0..b5e64203c6b2 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -304,6 +304,7 @@
#define X86_FEATURE_UNRET (11*32+15) /* "" AMD BTB untrain return */
#define X86_FEATURE_USE_IBPB_FW (11*32+16) /* "" Use IBPB during runtime firmware calls */
#define X86_FEATURE_RSB_VMEXIT_LITE (11*32+17) /* "" Fill RSB on VM exit when EIBRS is enabled */
+#define X86_FEATURE_HRESET (11*32+18) /* Hardware history reset instruction */

/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 810b950dc2e0..18f741499465 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1056,4 +1056,6 @@
#define MSR_IA32_HW_FEEDBACK_THREAD_CONFIG 0x17d4
#define MSR_IA32_HW_FEEDBACK_CHAR 0x17d2

+/* Hardware History Reset */
+#define MSR_IA32_HW_HRESET_ENABLE 0x17da
#endif /* _ASM_X86_MSR_INDEX_H */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index fd44b54c90d5..7d268377b03a 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -28,6 +28,7 @@ static const struct cpuid_bit cpuid_bits[] = {
{ X86_FEATURE_EPB, CPUID_ECX, 3, 0x00000006, 0 },
{ X86_FEATURE_INTEL_PPIN, CPUID_EBX, 0, 0x00000007, 1 },
{ X86_FEATURE_RRSBA_CTRL, CPUID_EDX, 2, 0x00000007, 2 },
+ { X86_FEATURE_HRESET, CPUID_EAX, 22, 0x00000007, 1 },
{ X86_FEATURE_CQM_LLC, CPUID_EDX, 1, 0x0000000f, 0 },
{ X86_FEATURE_CQM_OCCUP_LLC, CPUID_EDX, 0, 0x0000000f, 1 },
{ X86_FEATURE_CQM_MBM_TOTAL, CPUID_EDX, 1, 0x0000000f, 1 },
--
2.25.1

2022-09-09 23:42:48

by Ricardo Neri

Subject: [RFC PATCH 09/23] sched/fair: Use task-class performance score to pick the busiest group

update_sd_pick_busiest() keeps selecting, as the busiest group, scheduling
groups of identical asym_packing priority. Since both groups have the same
priority, either group is a good choice. The classes of the tasks running in
the scheduling groups can break this tie.

Pick as busiest the scheduling group that yields a higher task-class
performance score after load balancing.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 68 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 97731f81b570..7368a0b453ee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8777,6 +8777,60 @@ static void compute_ilb_sg_task_class_scores(struct sg_lb_task_class_stats *clas
sgs->task_class_score_before = group_score;
}

+/**
+ * sched_asym_class_prefer - Select a sched group based on its classes of tasks
+ * @a: Load balancing statistics of @sg_a
+ * @b: Load balancing statistics of @sg_b
+ *
+ * Returns: true if preferring @a yields a higher overall throughput after
+ * balancing load. Returns false otherwise.
+ */
+static bool sched_asym_class_prefer(struct sg_lb_stats *a,
+ struct sg_lb_stats *b)
+{
+ if (!sched_task_classes_enabled())
+ return false;
+
+ /* @a increases overall throughput after load balance. */
+ if (a->task_class_score_after > b->task_class_score_after)
+ return true;
+
+ /*
+ * If @a and @b yield the same overall throughput, pick @a if
+ * its current throughput is lower than that of @b.
+ */
+ if (a->task_class_score_after == b->task_class_score_after)
+ return a->task_class_score_before < b->task_class_score_before;
+
+ return false;
+}
+
+/**
+ * sched_asym_class_pick - Select a sched group based on classes of tasks
+ * @a: A scheduling group
+ * @b: A second scheduling group
+ * @a_stats: Load balancing statistics of @a
+ * @b_stats: Load balancing statistics of @b
+ *
+ * Returns: true if @a and @b have the same priority and @a has classes of
+ * tasks that yield higher overall throughput after load balance. False otherwise.
+ */
+static bool sched_asym_class_pick(struct sched_group *a,
+ struct sched_group *b,
+ struct sg_lb_stats *a_stats,
+ struct sg_lb_stats *b_stats)
+{
+ /*
+ * Only use the class-specific preference selection if both sched
+ * groups have the same priority.
+ */
+ if (arch_asym_cpu_priority(a->asym_prefer_cpu) !=
+ arch_asym_cpu_priority(b->asym_prefer_cpu))
+ return false;
+
+ return sched_asym_class_prefer(a_stats, b_stats);
+}
+
#else /* CONFIG_SCHED_TASK_CLASSES */
static void update_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sgs,
struct rq *rq)
@@ -8793,6 +8847,14 @@ static void compute_ilb_sg_task_class_scores(struct sg_lb_task_class_stats *clas
{
}

+static bool sched_asym_class_pick(struct sched_group *a,
+ struct sched_group *b,
+ struct sg_lb_stats *a_stats,
+ struct sg_lb_stats *b_stats)
+{
+ return false;
+}
+
#endif /* CONFIG_SCHED_TASK_CLASSES */

/**
@@ -9049,6 +9111,12 @@ static bool update_sd_pick_busiest(struct lb_env *env,
/* Prefer to move from lowest priority CPU's work */
if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
return false;
+
+ /* @sg and @sds::busiest have the same priority. */
+ if (sched_asym_class_pick(sds->busiest, sg, &sds->busiest_stat, sgs))
+ return false;
+
+ /* @sg has lower priority than @sds::busiest. */
break;

case group_misfit_task:
--
2.25.1

2022-09-09 23:42:49

by Ricardo Neri

Subject: [RFC PATCH 06/23] sched/core: Update the classification of the current task

When supported, hardware monitors the instruction stream to classify the
current task. Hence, at userspace tick, we are ready to read the most
recent classification result for the current task.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
kernel/sched/core.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 76015dbc45c5..477a90bddcd5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5445,6 +5445,9 @@ void scheduler_tick(bool user_tick)
unsigned long thermal_pressure;
u64 resched_latency;

+ if (sched_task_classes_enabled() && user_tick)
+ arch_update_task_class(curr, is_core_idle(cpu));
+
arch_scale_freq_tick();
sched_clock_tick();

--
2.25.1

2022-09-09 23:43:01

by Ricardo Neri

Subject: [RFC PATCH 02/23] sched: Add interfaces for classes of tasks

Add the interfaces that architectures shall implement to convey needed data
to support classes of tasks.

arch_update_task_class() updates the classification of the current task as
given by dedicated hardware resources.

arch_get_task_class_score() provides a performance score for a given class
of task when placed on a specific CPU relative to other CPUs. Higher scores
indicate higher performance.

The number of classes and the score of each class of task are determined
by hardware.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
kernel/sched/sched.h | 64 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/topology.c | 8 ++++++
2 files changed, 72 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e26688d387ae..719bdba660e2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2510,6 +2510,70 @@ void arch_scale_freq_tick(void)
}
#endif

+#ifdef CONFIG_SCHED_TASK_CLASSES
+DECLARE_STATIC_KEY_FALSE(sched_task_classes);
+
+static inline bool sched_task_classes_enabled(void)
+{
+ return static_branch_unlikely(&sched_task_classes);
+}
+
+#ifndef arch_has_task_classes
+/**
+ * arch_has_task_classes() - Check whether hardware supports classes of tasks
+ *
+ * Returns: true if classes of tasks are supported.
+ */
+static __always_inline
+bool arch_has_task_classes(void)
+{
+ return false;
+}
+#endif
+
+#ifndef arch_update_task_class
+/**
+ * arch_update_task_class() - Update the classification of the current task
+ * @curr: The current task
+ * @smt_siblings_idle: True if all of the SMT siblings of the CPU of @curr
+ * are idle.
+ *
+ * Request that the classification of @curr is updated. On certain CPUs, the
+ * classification is only reliable if all of the SMT siblings of the CPU of
+ * @curr are idle.
+ *
+ * Returns: none
+ */
+static __always_inline
+void arch_update_task_class(struct task_struct *curr, bool smt_siblings_idle)
+{
+}
+#endif
+
+#ifndef arch_get_task_class_score
+/**
+ * arch_get_task_class_score() - Get the performance score of a class
+ * @class: A classification value
+ * @cpu: A CPU number
+ *
+ * Returns the performance score of a class of tasks when running on @cpu.
+ * Error when either @class or @cpu are invalid.
+ */
+static __always_inline
+int arch_get_task_class_score(int class, int cpu)
+{
+ return 1;
+}
+#endif
+#else /* CONFIG_SCHED_TASK_CLASSES */
+
+#define arch_get_task_class_score(class, cpu) (-EINVAL)
+#define arch_update_task_class(curr, smt_siblings_idle)
+
+static inline bool sched_task_classes_enabled(void) { return false; }
+
+#endif /* CONFIG_SCHED_TASK_CLASSES */
+
#ifndef arch_scale_freq_capacity
/**
* arch_scale_freq_capacity - get the frequency scale factor of a given CPU.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..8d886dbe6318 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -669,6 +669,9 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
+#ifdef CONFIG_SCHED_TASK_CLASSES
+DEFINE_STATIC_KEY_FALSE(sched_task_classes);
+#endif

static void update_top_cache_domain(int cpu)
{
@@ -2384,6 +2387,11 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
if (has_asym)
static_branch_inc_cpuslocked(&sched_asym_cpucapacity);

+#ifdef CONFIG_SCHED_TASK_CLASSES
+ if (arch_has_task_classes())
+ static_branch_enable_cpuslocked(&sched_task_classes);
+#endif
+
if (rq && sched_debug_verbose) {
pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
--
2.25.1

2022-09-09 23:43:07

by Ricardo Neri

Subject: [RFC PATCH 13/23] x86/cpufeatures: Add the Intel Thread Director feature definitions

Intel Thread Director (ITD) provides hardware resources to categorize
the currently running task with a classification value. The classification
reflects the type of instructions that a task executes.

ITD extends the Hardware Feedback Interface table to provide performance
and energy efficiency capabilities for each of the supported classes of
tasks.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +++++++-
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index ef4775c6db01..d3202d665ac0 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -339,6 +339,7 @@
#define X86_FEATURE_HWP_EPP (14*32+10) /* HWP Energy Perf. Preference */
#define X86_FEATURE_HWP_PKG_REQ (14*32+11) /* HWP Package Level Request */
#define X86_FEATURE_HFI (14*32+19) /* Hardware Feedback Interface */
+#define X86_FEATURE_ITD (14*32+23) /* Intel Thread Director */

/* AMD SVM Feature Identification, CPUID level 0x8000000a (EDX), word 15 */
#define X86_FEATURE_NPT (15*32+ 0) /* Nested Page Table support */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 33d2cd04d254..225657aff476 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -87,6 +87,12 @@
# define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
#endif

+#ifdef CONFIG_INTEL_THREAD_DIRECTOR
+# define DISABLE_ITD 0
+#else
+# define DISABLE_ITD (1 << (X86_FEATURE_ITD & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -103,7 +109,7 @@
#define DISABLED_MASK10 0
#define DISABLED_MASK11 (DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET)
#define DISABLED_MASK12 0
-#define DISABLED_MASK13 0
+#define DISABLED_MASK13 (DISABLE_ITD)
#define DISABLED_MASK14 0
#define DISABLED_MASK15 0
#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index c881bcafba7d..f6f8a3cd4f2c 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -78,6 +78,7 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_XFD, X86_FEATURE_XSAVES },
{ X86_FEATURE_XFD, X86_FEATURE_XGETBV1 },
{ X86_FEATURE_AMX_TILE, X86_FEATURE_XFD },
+ { X86_FEATURE_ITD, X86_FEATURE_HFI },
{}
};

--
2.25.1

2022-09-09 23:43:29

by Ricardo Neri

Subject: [RFC PATCH 14/23] thermal: intel: hfi: Update the class of the current task

Support the interfaces of the scheduler for classes of tasks. If an Intel
processor supports Intel Thread Director, then it supports classes of
tasks.

Use the hardware resources that Intel Thread Director provides to assign
a class to the currently running task.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
arch/x86/include/asm/topology.h | 8 +++++++
drivers/thermal/intel/intel_hfi.c | 35 +++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 458c891a8273..9c6df4fd9414 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -227,4 +227,12 @@ void init_freq_invariance_cppc(void);
#define arch_init_invariance_cppc init_freq_invariance_cppc
#endif

+#ifdef CONFIG_INTEL_THREAD_DIRECTOR
+int intel_hfi_has_task_classes(void);
+void intel_hfi_update_task_class(struct task_struct *curr, bool smt_siblings_idle);
+
+#define arch_has_task_classes intel_hfi_has_task_classes
+#define arch_update_task_class intel_hfi_update_task_class
+#endif
+
#endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index 4bafe6848d5d..f46d9331f912 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -73,6 +73,17 @@ union cpuid6_edx {
u32 full;
};

+#ifdef CONFIG_INTEL_THREAD_DIRECTOR
+union hfi_thread_feedback_char_msr {
+ struct {
+ u8 classid;
+ u64 __reserved:55;
+ u8 valid:1;
+ } split;
+ u64 full;
+};
+#endif
+
/**
* struct hfi_cpu_data - HFI capabilities per CPU
* @perf_cap: Performance capability
@@ -172,6 +183,30 @@ static struct workqueue_struct *hfi_updates_wq;
#define HFI_UPDATE_INTERVAL HZ
#define HFI_MAX_THERM_NOTIFY_COUNT 16

+#ifdef CONFIG_INTEL_THREAD_DIRECTOR
+int intel_hfi_has_task_classes(void)
+{
+ return cpu_feature_enabled(X86_FEATURE_ITD);
+}
+
+void intel_hfi_update_task_class(struct task_struct *curr, bool smt_siblings_idle)
+{
+ union hfi_thread_feedback_char_msr msr;
+
+ /* We should not be here if ITD is not supported. */
+ if (!cpu_feature_enabled(X86_FEATURE_ITD)) {
+ pr_warn_once("task classification requested but not supported!");
+ return;
+ }
+
+ rdmsrl(MSR_IA32_HW_FEEDBACK_CHAR, msr.full);
+ if (!msr.split.valid)
+ return;
+
+ curr->class = msr.split.classid;
+}
+#endif /* CONFIG_INTEL_THREAD_DIRECTOR */
+
static void get_hfi_caps(struct hfi_instance *hfi_instance,
struct thermal_genl_cpu_caps *cpu_caps)
{
--
2.25.1

2022-09-09 23:43:44

by Ricardo Neri

Subject: [RFC PATCH 05/23] sched/core: Move is_core_idle() out of fair.c

Information about the idle state of the SMT siblings of a core can
be used to improve the accuracy of the classification of the current task.

Move is_core_idle() to sched.h to make it available to scheduler_tick().

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
kernel/sched/fair.c | 17 -----------------
kernel/sched/sched.h | 17 +++++++++++++++++
2 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0089b33ea7b8..2f2a6bb5990d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1536,23 +1536,6 @@ struct numa_stats {
int idle_cpu;
};

-static inline bool is_core_idle(int cpu)
-{
-#ifdef CONFIG_SCHED_SMT
- int sibling;
-
- for_each_cpu(sibling, cpu_smt_mask(cpu)) {
- if (cpu == sibling)
- continue;
-
- if (!idle_cpu(sibling))
- return false;
- }
-#endif
-
- return true;
-}
-
struct task_numa_env {
struct task_struct *p;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 719bdba660e2..2bddedc55ee9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3212,6 +3212,23 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
}
#endif

+static inline bool is_core_idle(int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+ int sibling;
+
+ for_each_cpu(sibling, cpu_smt_mask(cpu)) {
+ if (cpu == sibling)
+ continue;
+
+ if (!idle_cpu(sibling))
+ return false;
+ }
+#endif
+
+ return true;
+}
+
extern void swake_up_all_locked(struct swait_queue_head *q);
extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);

--
2.25.1

2022-09-09 23:43:52

by Ricardo Neri

Subject: [RFC PATCH 07/23] sched/fair: Collect load-balancing stats for task classes

When selecting a busiest scheduling group, the class of the current task
can be used to select between two scheduling groups of equal asym_packing
priority and number of running tasks.

Compute a new task-class performance score for a scheduling group. It
is the sum of the performance scores of the current tasks of all the
runqueues.

Also, keep track of the task with the lowest performance score in the
scheduling group.

These two metrics will be used during idle load balancing to compute the
current and the prospective task-class performance of a scheduling
group.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
kernel/sched/fair.c | 61 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f2a6bb5990d..58a435a04c1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8686,6 +8686,63 @@ group_type group_classify(unsigned int imbalance_pct,
return group_has_spare;
}

+struct sg_lb_task_class_stats {
+ /*
+ * Score of the task with lowest score among the current tasks (i.e.,
+ * runqueue::curr) of all runqueues in the scheduling group.
+ */
+ int min_score;
+ /*
+ * Sum of the scores of the current tasks of all runqueues in the
+ * scheduling group.
+ */
+ long sum_score;
+ /* The task with score equal to @min_score */
+ struct task_struct *p_min_score;
+};
+
+#ifdef CONFIG_SCHED_TASK_CLASSES
+static void init_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sgs)
+{
+ class_sgs->min_score = INT_MAX;
+ class_sgs->sum_score = 0;
+ class_sgs->p_min_score = NULL;
+}
+
+/** Called only if cpu_of(@rq) is not idle and has tasks running. */
+static void update_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sgs,
+ struct rq *rq)
+{
+ int score;
+
+ if (!sched_task_classes_enabled())
+ return;
+
+ /*
+ * TODO: if nr_running > 1 we may want to go through all the tasks behind
+ * rq->curr.
+ */
+ score = arch_get_task_class_score(rq->curr->class, cpu_of(rq));
+
+ class_sgs->sum_score += score;
+
+ if (score >= class_sgs->min_score)
+ return;
+
+ class_sgs->min_score = score;
+ class_sgs->p_min_score = rq->curr;
+}
+#else /* CONFIG_SCHED_TASK_CLASSES */
+static void update_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sgs,
+ struct rq *rq)
+{
+}
+
+static void init_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sgs)
+{
+}
+#endif /* CONFIG_SCHED_TASK_CLASSES */
+
/**
* asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
* @dst_cpu: Destination CPU of the load balancing
@@ -8797,9 +8854,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
struct sg_lb_stats *sgs,
int *sg_status)
{
+ struct sg_lb_task_class_stats class_stats;
int i, nr_running, local_group;

memset(sgs, 0, sizeof(*sgs));
+ init_rq_task_classes_stats(&class_stats);

local_group = group == sds->local;

@@ -8849,6 +8908,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (sgs->group_misfit_task_load < load)
sgs->group_misfit_task_load = load;
}
+
+ update_rq_task_classes_stats(&class_stats, rq);
}

sgs->group_capacity = group->sgc->capacity;
--
2.25.1

2022-09-09 23:44:49

by Ricardo Neri

Subject: [RFC PATCH 08/23] sched/fair: Compute task-class performance scores for load balancing

Compute both the current and the prospective task-class performance of a
scheduling group. As task-class statistics are used during asym_packing
load balancing, the scheduling group will become idle.

For a scheduling group with only one CPU, the prospective performance is
the performance of its current task if placed on the destination CPU.

In a scheduling group composed of SMT siblings, the current tasks of all
CPUs share the resources of the core. Divide the task-class performance of
the scheduling group by the number of busy CPUs.

After load balancing, the throughput of the siblings that remain busy
increases. Plus, the destination CPU now contributes to the overall
throughput.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
kernel/sched/fair.c | 53 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 53 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 58a435a04c1c..97731f81b570 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8405,6 +8405,8 @@ struct sg_lb_stats {
enum group_type group_type;
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+ long task_class_score_after; /* Prospective task-class score after load balancing */
+ long task_class_score_before; /* Task-class score before load balancing */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -8732,6 +8734,49 @@ static void update_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sg
class_sgs->min_score = score;
class_sgs->p_min_score = rq->curr;
}
+
+static void compute_ilb_sg_task_class_scores(struct sg_lb_task_class_stats *class_sgs,
+ struct sg_lb_stats *sgs,
+ int dst_cpu)
+{
+ int group_score, group_score_without, score_on_dst_cpu;
+ int busy_cpus = sgs->group_weight - sgs->idle_cpus;
+
+ if (!sched_task_classes_enabled())
+ return;
+
+ /* No busy CPUs in the group. No tasks to move. */
+ if (!busy_cpus)
+ return;
+
+ score_on_dst_cpu = arch_get_task_class_score(class_sgs->p_min_score->class,
+ dst_cpu);
+
+ /*
+ * The simplest case. The single busy CPU in the current group will
+ * become idle after pulling its current task. The destination CPU is
+ * idle.
+ */
+ if (busy_cpus == 1) {
+ sgs->task_class_score_before = class_sgs->sum_score;
+ sgs->task_class_score_after = score_on_dst_cpu;
+ return;
+ }
+
+ /*
+ * Now compute the group score with and without the task with the
+ * lowest score. We assume that the tasks that remain in the group share
+ * the CPU resources equally.
+ */
+ group_score = class_sgs->sum_score / busy_cpus;
+
+ group_score_without = (class_sgs->sum_score - class_sgs->min_score) /
+ (busy_cpus - 1);
+
+ sgs->task_class_score_after = group_score_without + score_on_dst_cpu;
+ sgs->task_class_score_before = group_score;
+}
+
#else /* CONFIG_SCHED_TASK_CLASSES */
static void update_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sgs,
struct rq *rq)
@@ -8741,6 +8786,13 @@ static void update_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sg
static void init_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sgs)
{
}
+
+static void compute_ilb_sg_task_class_scores(struct sg_lb_task_class_stats *class_sgs,
+ struct sg_lb_stats *sgs,
+ int dst_cpu)
+{
+}
+
#endif /* CONFIG_SCHED_TASK_CLASSES */

/**
@@ -8920,6 +8972,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
sched_asym(env, sds, sgs, group)) {
+ compute_ilb_sg_task_class_scores(&class_stats, sgs, env->dst_cpu);
sgs->group_asym_packing = 1;
}

--
2.25.1

2022-09-09 23:44:50

by Ricardo Neri

Subject: [RFC PATCH 22/23] x86/hreset: Configure history reset

Configure the MSR that controls the behavior of HRESET on each logical
processor.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
arch/x86/kernel/cpu/common.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3e508f239098..108642fe6761 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -411,6 +411,28 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
cr4_clear_bits(X86_CR4_UMIP);
}

+static u32 hardware_history_features __read_mostly;
+
+static __always_inline void setup_hreset(struct cpuinfo_x86 *c)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_HRESET))
+ return;
+
+ /*
+ * Use on all CPUs the hardware history features that the boot
+ * CPU supports.
+ */
+ if (c == &boot_cpu_data)
+ hardware_history_features = cpuid_ebx(0x20);
+
+ if (!hardware_history_features)
+ return;
+
+ wrmsrl(MSR_IA32_HW_HRESET_ENABLE, hardware_history_features);
+
+ pr_info_once("x86/cpu: Intel History Reset (HRESET) activated\n");
+}
+
/* These bits should not change their value after CPU init is finished. */
static const unsigned long cr4_pinned_mask =
X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
@@ -1823,10 +1845,11 @@ static void identify_cpu(struct cpuinfo_x86 *c)
/* Disable the PN if appropriate */
squash_the_stupid_serial_number(c);

- /* Set up SMEP/SMAP/UMIP */
+ /* Set up SMEP/SMAP/UMIP/HRESET */
setup_smep(c);
setup_smap(c);
setup_umip(c);
+ setup_hreset(c);

/* Enable FSGSBASE instructions if available. */
if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
--
2.25.1

2022-09-09 23:52:06

by Ricardo Neri

Subject: [RFC PATCH 23/23] x86/process: Reset hardware history in context switch

Reset the classification history of the current task when switching to
the next task. Hardware will start anew the classification of the next
running task.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
arch/x86/include/asm/hreset.h | 30 ++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/common.c | 10 ++++++++++
arch/x86/kernel/process_32.c | 3 +++
arch/x86/kernel/process_64.c | 3 +++
4 files changed, 46 insertions(+)
create mode 100644 arch/x86/include/asm/hreset.h

diff --git a/arch/x86/include/asm/hreset.h b/arch/x86/include/asm/hreset.h
new file mode 100644
index 000000000000..d68ca2fb8642
--- /dev/null
+++ b/arch/x86/include/asm/hreset.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_HRESET_H
+#define _ASM_X86_HRESET_H
+
+/**
+ * HRESET - History reset. Available since binutils v2.36.
+ *
+ * Request the processor to reset the history of task classification on the
+ * current logical processor. The history components to be
+ * reset are specified in %eax. Only bits specified in CPUID(0x20).EBX
+ * and enabled in the IA32_HRESET_ENABLE MSR can be selected.
+ *
+ * The assembly code looks like:
+ *
+ * hreset %eax
+ *
+ * The corresponding machine code looks like:
+ *
+ * F3 0F 3A F0 ModRM Imm
+ *
+ * The value of ModRM is 0xc0 to specify %eax register addressing.
+ * The ignored immediate operand is set to 0.
+ *
+ * The instruction is documented in the Intel SDM.
+ */
+
+#define __ASM_HRESET ".byte 0xf3, 0xf, 0x3a, 0xf0, 0xc0, 0x0"
+
+void reset_hardware_history(void);
+
+#endif /* _ASM_X86_HRESET_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 108642fe6761..4622a0ebf8a1 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -52,6 +52,7 @@
#include <asm/cpu.h>
#include <asm/mce.h>
#include <asm/msr.h>
+#include <asm/hreset.h>
#include <asm/memtype.h>
#include <asm/microcode.h>
#include <asm/microcode_intel.h>
@@ -413,6 +414,15 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)

static u32 hardware_history_features __read_mostly;

+void reset_hardware_history(void)
+{
+ if (!static_cpu_has(X86_FEATURE_HRESET))
+ return;
+
+ asm volatile("mov %0, %%eax;" __ASM_HRESET "\n" : :
+ "r" (hardware_history_features) : "%rax");
+}
+
static __always_inline void setup_hreset(struct cpuinfo_x86 *c)
{
if (!cpu_feature_enabled(X86_FEATURE_HRESET))
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 2f314b170c9f..74d8ad83e0b3 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -52,6 +52,7 @@
#include <asm/switch_to.h>
#include <asm/vm86.h>
#include <asm/resctrl.h>
+#include <asm/hreset.h>
#include <asm/proto.h>

#include "process.h"
@@ -214,6 +215,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/* Load the Intel cache allocation PQR MSR. */
resctrl_sched_in();

+ reset_hardware_history();
+
return prev_p;
}

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 1962008fe743..0b175f30f359 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -53,6 +53,7 @@
#include <asm/xen/hypervisor.h>
#include <asm/vdso.h>
#include <asm/resctrl.h>
+#include <asm/hreset.h>
#include <asm/unistd.h>
#include <asm/fsgsbase.h>
#ifdef CONFIG_IA32_EMULATION
@@ -657,6 +658,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/* Load the Intel cache allocation PQR MSR. */
resctrl_sched_in();

+ reset_hardware_history();
+
return prev_p;
}

--
2.25.1

2022-09-10 00:33:55

by Ricardo Neri

[permalink] [raw]
Subject: [RFC PATCH 15/23] thermal: intel: hfi: Report per-cpu class-specific performance scores

Support the arch_get_task_class_score() interface of the scheduler. Use the
data that Intel Thread Director provides to inform the scheduler of the
performance of a class of tasks when placed on a given CPU.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
arch/x86/include/asm/topology.h | 2 ++
drivers/thermal/intel/intel_hfi.c | 40 +++++++++++++++++++++++++++++++
2 files changed, 42 insertions(+)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 9c6df4fd9414..2ed234104ef4 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -230,9 +230,11 @@ void init_freq_invariance_cppc(void);
#ifdef CONFIG_INTEL_THREAD_DIRECTOR
int intel_hfi_has_task_classes(void);
void intel_hfi_update_task_class(struct task_struct *curr, bool smt_siblings_idle);
+int intel_hfi_get_task_class_score(int class, int cpu);

#define arch_has_task_classes intel_hfi_has_task_classes
#define arch_update_task_class intel_hfi_update_task_class
+#define arch_get_task_class_score intel_hfi_get_task_class_score
#endif

#endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index f46d9331f912..1b6072c828ff 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -205,6 +205,46 @@ void intel_hfi_update_task_class(struct task_struct *curr, bool smt_siblings_idl

curr->class = msr.split.classid;
}
+
+static void get_one_hfi_cap(struct hfi_instance *hfi_instance, s16 index,
+ struct hfi_cpu_data *hfi_caps, int class)
+{
+ struct hfi_cpu_data *caps;
+
+ /* Find the capabilities of @cpu */
+ caps = hfi_instance->data + index * hfi_features.cpu_stride +
+ class * hfi_features.class_stride;
+ memcpy(hfi_caps, caps, sizeof(*hfi_caps));
+}
+
+int intel_hfi_get_task_class_score(int class, int cpu)
+{
+ struct hfi_cpu_info *info = &per_cpu(hfi_cpu_info, cpu);
+ struct hfi_instance *instance;
+ struct hfi_cpu_data caps;
+ unsigned long flags;
+ int cap;
+
+ if (cpu < 0 || cpu >= nr_cpu_ids)
+ return -EINVAL;
+
+ if (class == TASK_CLASS_UNCLASSIFIED)
+ return -EINVAL;
+
+ if (class >= (int)hfi_features.nr_classes)
+ return -EINVAL;
+
+ instance = info->hfi_instance;
+ if (!instance)
+ return -ENOENT;
+
+ raw_spin_lock_irqsave(&instance->table_lock, flags);
+ get_one_hfi_cap(instance, info->index, &caps, class);
+ cap = caps.perf_cap;
+ raw_spin_unlock_irqrestore(&instance->table_lock, flags);
+
+ return cap;
+}
#endif /* CONFIG_INTEL_THREAD_DIRECTOR */

static void get_hfi_caps(struct hfi_instance *hfi_instance,
--
2.25.1

2022-09-14 14:01:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 01/23] sched/task_struct: Introduce classes of tasks

On Fri, Sep 09, 2022 at 04:11:43PM -0700, Ricardo Neri wrote:

> include/linux/sched.h | 7 +++++++
> init/Kconfig | 9 +++++++++
> 2 files changed, 16 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e7b2f8a5c711..acc33dbaa47c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -117,6 +117,8 @@ struct task_group;
> __TASK_TRACED | EXIT_DEAD | EXIT_ZOMBIE | \
> TASK_PARKED)
>
> +#define TASK_CLASS_UNCLASSIFIED -1

> +#ifdef CONFIG_SCHED_TASK_CLASSES
> + /* Class of task that the scheduler uses for task placement decisions */
> + short class;
> +#endif

You're missing a hunk for init/init_task.c for this non-zero init value.

Do we really have to use a signed type and non-zero init value?

2022-09-14 14:05:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 06/23] sched/core: Update the classification of the current task

On Fri, Sep 09, 2022 at 04:11:48PM -0700, Ricardo Neri wrote:

> + if (sched_task_classes_enabled() && user_tick)
> + arch_update_task_class(curr, is_core_idle(cpu));

This evaluates is_core_idle() even if the hardware improves.

2022-09-16 15:19:23

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 06/23] sched/core: Update the classification of the current task

On Wed, Sep 14, 2022 at 03:44:29PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 09, 2022 at 04:11:48PM -0700, Ricardo Neri wrote:
>
> > + if (sched_task_classes_enabled() && user_tick)
> > + arch_update_task_class(curr, is_core_idle(cpu));
>
> This evaluates is_core_idle() even if the hardware improves.

Yes, this is true. Do you think it would make sense to expose is_core_idle()
outside the scheduler? In this manner, only hardware that needs it would
call it.

Thanks and BR,
Ricardo
>

2022-09-16 15:20:43

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 01/23] sched/task_struct: Introduce classes of tasks

On Wed, Sep 14, 2022 at 03:46:34PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 09, 2022 at 04:11:43PM -0700, Ricardo Neri wrote:
>
> > include/linux/sched.h | 7 +++++++
> > init/Kconfig | 9 +++++++++
> > 2 files changed, 16 insertions(+)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index e7b2f8a5c711..acc33dbaa47c 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -117,6 +117,8 @@ struct task_group;
> > __TASK_TRACED | EXIT_DEAD | EXIT_ZOMBIE | \
> > TASK_PARKED)
> >
> > +#define TASK_CLASS_UNCLASSIFIED -1
>
> > +#ifdef CONFIG_SCHED_TASK_CLASSES
> > + /* Class of task that the scheduler uses for task placement decisions */
> > + short class;
> > +#endif

Thanks for your feedback Peter!
>
> You're missing a hunk for init/init_task.c for this non-zero init value.

Ah, yes. I'll add it.
>
> Do we really have to use a signed type and non-zero init value?

At least on Intel processors, class 0 is a valid class. The scheduler needs to
have a notion of unclassified tasks and decide how to handle them, IMO.

Intel processors currently support 8-bit, unsigned classes. I doubt other
architectures will ever support more than 256 classes. Short can handle all the
possible classification values and also the unclassified case.

On the other hand, class 0 could be the default classification unless hardware
classifies differently. 0 would be special and need to be documented clearly.
This would work for Intel processors.

Thanks and BR,
Ricardo

2022-09-16 15:37:25

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 06/23] sched/core: Update the classification of the current task

On Sat, Sep 10, 2022 at 03:21:20PM +0800, Hillf Danton wrote:
> On 9 Sep 2022 16:11:48 -0700 Ricardo Neri <[email protected]> wrote:
> > @@ -5445,6 +5445,9 @@ void scheduler_tick(bool user_tick)
> > unsigned long thermal_pressure;
> > u64 resched_latency;
> >
> > + if (sched_task_classes_enabled() && user_tick)
> > + arch_update_task_class(curr, is_core_idle(cpu));
> > +
> > arch_scale_freq_tick();
> > sched_clock_tick();
>
Thank you very much for your feedback Hillf!

> Given user_tick == true, core is not idle regardless of SMT.
> IOW I doubt is_core_idle() helps here.

Perhaps is_core_idle() is a bad name? The second argument of
arch_update_task_class() is smt_siblings_idle. is_core_idle() gives us the
answer we want.

is_core_idle() only checks the siblings of @cpu; it explicitly excludes @cpu
itself from the idle-state check. We are only interested in the idle state of
the siblings.
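
For reference, a simplified sketch of the check in question (the real
is_core_idle() lives in kernel/sched/fair.c and is guarded by
CONFIG_SCHED_SMT):

static inline bool is_core_idle(int cpu)
{
	int sibling;

	for_each_cpu(sibling, cpu_smt_mask(cpu)) {
		/* @cpu itself is deliberately skipped... */
		if (cpu == sibling)
			continue;

		/* ...only the idle state of its SMT siblings matters. */
		if (!idle_cpu(sibling))
			return false;
	}

	return true;
}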

Am I missing anything?

Thanks and BR,
Ricardo

2022-09-26 16:41:53

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 03/23] sched/core: Initialize the class of a new task

Hi Ricardo,

On Fri, Sep 09, 2022 at 04:11:45PM -0700, Ricardo Neri wrote:
> New tasks shall start life as unclassified. They will be classified by
> hardware when they run.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> kernel/sched/core.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ee28253c9ac0..db548c1a25ef 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4336,6 +4336,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> p->se.prev_sum_exec_runtime = 0;
> p->se.nr_migrations = 0;
> p->se.vruntime = 0;
> +#ifdef CONFIG_SCHED_TASK_CLASSES
> + p->class = TASK_CLASS_UNCLASSIFIED;
> +#endif

I find the term 'class' very broad and unclear what kind of class (without
further reading). So I am worried about how this generic term usage plays
with Linux source code in the long-term (like what if someone else comes up
with a usage of term 'class' that is unrelated to IPC.)

To that end, I was wondering if it could be renamed to p->ipc_class, and
CONFIG_SCHED_TASK_IPC_CLASSES, or something.

thanks,

- Joel



> INIT_LIST_HEAD(&p->se.group_node);
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> --
> 2.25.1
>

2022-09-26 22:19:12

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 03/23] sched/core: Initialize the class of a new task

On Mon, Sep 26, 2022 at 02:57:29PM +0000, Joel Fernandes wrote:
> Hi Ricardo,
>
> On Fri, Sep 09, 2022 at 04:11:45PM -0700, Ricardo Neri wrote:
> > New tasks shall start life as unclassified. They will be classified by
> > hardware when they run.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Ricardo Neri <[email protected]>
> > ---
> > kernel/sched/core.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index ee28253c9ac0..db548c1a25ef 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4336,6 +4336,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> > p->se.prev_sum_exec_runtime = 0;
> > p->se.nr_migrations = 0;
> > p->se.vruntime = 0;
> > +#ifdef CONFIG_SCHED_TASK_CLASSES
> > + p->class = TASK_CLASS_UNCLASSIFIED;
> > +#endif
>
> I find the term 'class' very broad and unclear what kind of class (without
> further reading). So I am worried about how this generic term usage plays
> with Linux source code in the long-term (like what if someone else comes up
> with a usage of term 'class' that is unrelated to IPC.)

Thank you very much for your review, Joel! Yes, 'class' seems too generic. It is
meant to read, for instance, task_struct::class, p->class, or rq->curr->class.
This should imply that we are referring to the class of a task. But yes, I agree
that it is too generic.

>
> To that end, I was wondering if it could be renamed to p->ipc_class, and
> CONFIG_SCHED_TASK_IPC_CLASSES, or something.

This is a good suggestion. I will take it, unless others disagree.

BR,
Ricardo

2022-09-27 09:47:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 08/23] sched/fair: Compute task-class performance scores for load balancing

On Fri, Sep 09, 2022 at 04:11:50PM -0700, Ricardo Neri wrote:

> +static void compute_ilb_sg_task_class_scores(struct sg_lb_task_class_stats *class_sgs,
> + struct sg_lb_stats *sgs,
> + int dst_cpu)
> +{
> + int group_score, group_score_without, score_on_dst_cpu;
> + int busy_cpus = sgs->group_weight - sgs->idle_cpus;
> +
> + if (!sched_task_classes_enabled())
> + return;
> +
> + /* No busy CPUs in the group. No tasks to move. */
> + if (!busy_cpus)
> + return;
> +
> + score_on_dst_cpu = arch_get_task_class_score(class_sgs->p_min_score->class,
> + dst_cpu);
> +
> + /*
> + * The simplest case. The single busy CPU in the current group will
> + * become idle after pulling its current task. The destination CPU is
> + * idle.
> + */
> + if (busy_cpus == 1) {
> + sgs->task_class_score_before = class_sgs->sum_score;
> + sgs->task_class_score_after = score_on_dst_cpu;
> + return;
> + }
> +
> + /*
> + * Now compute the group score with and without the task with the
> + * lowest score. We assume that the tasks that remain in the group share
> + * the CPU resources equally.
> + */
> + group_score = class_sgs->sum_score / busy_cpus;
> +
> + group_score_without = (class_sgs->sum_score - class_sgs->min_score) /
> + (busy_cpus - 1);
> +
> + sgs->task_class_score_after = group_score_without + score_on_dst_cpu;
> + sgs->task_class_score_before = group_score;
> +}

That's just plain broken; also lots of cleanups done...

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8405,12 +8405,14 @@ struct sg_lb_stats {
enum group_type group_type;
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
- long task_class_score_after; /* Prospective task-class score after load balancing */
- long task_class_score_before; /* Task-class score before load balancing */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
#endif
+#ifdef CONFIG_SCHED_TASK_CLASSES
+ long task_class_score_after; /* Prospective task-class score after load balancing */
+ long task_class_score_before; /* Task-class score before load balancing */
+#endif
};

/*
@@ -8689,58 +8691,54 @@ group_type group_classify(unsigned int i
}

struct sg_lb_task_class_stats {
- /*
- * Score of the task with lowest score among the current tasks (i.e.,
- * runqueue::curr) of all runqueues in the scheduling group.
- */
- int min_score;
- /*
- * Sum of the scores of the current tasks of all runqueues in the
- * scheduling group.
- */
- long sum_score;
- /* The task with score equal to @min_score */
- struct task_struct *p_min_score;
+ int min_score; /* Min(rq->curr->score) */
+ int min_class;
+ long sum_score; /* Sum(rq->curr->score) */
};

#ifdef CONFIG_SCHED_TASK_CLASSES
-static void init_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sgs)
+static void init_sg_lb_task_class_stats(struct sg_lb_task_class_stats *sgcs)
{
- class_sgs->min_score = INT_MAX;
- class_sgs->sum_score = 0;
- class_sgs->p_min_score = NULL;
+ *sgcs = (struct sg_lb_task_class_stats){
+ .min_score = INT_MAX,
+ };
}

/** Called only if cpu_of(@rq) is not idle and has tasks running. */
-static void update_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sgs,
- struct rq *rq)
+static void update_sg_lb_task_class_stats(struct sg_lb_task_class_stats *sgcs,
+ struct rq *rq)
{
- int score;
+ struct task_struct *curr;
+ int class, score;

if (!sched_task_classes_enabled())
return;

+ curr = rcu_dereference(rq->curr);
+ if (!curr || (curr->flags & PF_EXITING) || is_idle_task(curr))
+ return;
+
/*
 * TODO: if nr_running > 1 we may want to go through all the tasks behind
* rq->curr.
*/
- score = arch_get_task_class_score(rq->curr->class, cpu_of(rq));
-
- class_sgs->sum_score += score;
+ class = curr->class;
+ score = arch_get_task_class_score(class, cpu_of(rq));

- if (score >= class_sgs->min_score)
- return;
+ sgcs->sum_score += score;

- class_sgs->min_score = score;
- class_sgs->p_min_score = rq->curr;
+ if (score < sgcs->min_score) {
+ sgcs->min_score = score;
+ sgcs->min_class = class;
+ }
}

-static void compute_ilb_sg_task_class_scores(struct sg_lb_task_class_stats *class_sgs,
- struct sg_lb_stats *sgs,
- int dst_cpu)
+static void update_sg_lb_stats_scores(struct sg_lb_task_class_stats *sgcs,
+ struct sg_lb_stats *sgs,
+ int dst_cpu)
{
- int group_score, group_score_without, score_on_dst_cpu;
int busy_cpus = sgs->group_weight - sgs->idle_cpus;
+ int score_on_dst_cpu;
+ long before, after;

if (!sched_task_classes_enabled())
return;
@@ -8749,32 +8747,18 @@ static void compute_ilb_sg_task_class_sc
if (!busy_cpus)
return;

- score_on_dst_cpu = arch_get_task_class_score(class_sgs->p_min_score->class,
- dst_cpu);
+ score_on_dst_cpu = arch_get_task_class_score(sgcs->min_class, dst_cpu);

- /*
- * The simplest case. The single busy CPU in the current group will
- * become idle after pulling its current task. The destination CPU is
- * idle.
- */
- if (busy_cpus == 1) {
- sgs->task_class_score_before = class_sgs->sum_score;
- sgs->task_class_score_after = score_on_dst_cpu;
- return;
- }
+ before = sgcs->sum_score;
+ after = before - sgcs->min_score + score_on_dst_cpu;

- /*
- * Now compute the group score with and without the task with the
- * lowest score. We assume that the tasks that remain in the group share
- * the CPU resources equally.
- */
- group_score = class_sgs->sum_score / busy_cpus;
-
- group_score_without = (class_sgs->sum_score - class_sgs->min_score) /
- (busy_cpus - 1);
+ if (busy_cpus > 1) {
+ before /= busy_cpus;
+ after /= busy_cpus;
+ }

- sgs->task_class_score_after = group_score_without + score_on_dst_cpu;
- sgs->task_class_score_before = group_score;
+ sgs->task_class_score_before = before;
+ sgs->task_class_score_after = after;
}

/**
@@ -8832,18 +8816,19 @@ static bool sched_asym_class_pick(struct
}

#else /* CONFIG_SCHED_TASK_CLASSES */
-static void update_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sgs,
- struct rq *rq)
+
+static void init_sg_lb_task_class_stats(struct sg_lb_task_class_stats *sgcs)
{
}

-static void init_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sgs)
+static void update_sg_lb_task_class_stats(struct sg_lb_task_class_stats *sgcs,
+ struct rq *rq)
{
}

-static void compute_ilb_sg_task_class_scores(struct sg_lb_task_class_stats *class_sgs,
- struct sg_lb_stats *sgs,
- int dst_cpu)
+static void update_sg_lb_stats_scores(struct sg_lb_task_class_stats *sgcs,
+ struct sg_lb_stats *sgs,
+ int dst_cpu)
{
}

@@ -8854,7 +8839,6 @@ static bool sched_asym_class_pick(struct
{
return false;
}
-
#endif /* CONFIG_SCHED_TASK_CLASSES */

/**
@@ -8979,11 +8963,11 @@ static inline void update_sg_lb_stats(st
struct sg_lb_stats *sgs,
int *sg_status)
{
- struct sg_lb_task_class_stats class_stats;
+ struct sg_lb_task_class_stats sgcs;
int i, nr_running, local_group;

memset(sgs, 0, sizeof(*sgs));
- init_rq_task_classes_stats(&class_stats);
+ init_sg_lb_task_class_stats(&sgcs);

local_group = group == sds->local;

@@ -9034,7 +9018,7 @@ static inline void update_sg_lb_stats(st
sgs->group_misfit_task_load = load;
}

- update_rq_task_classes_stats(&class_stats, rq);
+ update_sg_lb_task_class_stats(&sgcs, rq);
}

sgs->group_capacity = group->sgc->capacity;
@@ -9045,7 +9029,7 @@ static inline void update_sg_lb_stats(st
if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
sched_asym(env, sds, sgs, group)) {
- compute_ilb_sg_task_class_scores(&class_stats, sgs, env->dst_cpu);
+ update_sg_lb_stats_scores(&sgcs, sgs, env->dst_cpu);
sgs->group_asym_packing = 1;
}

2022-09-27 11:09:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 09/23] sched/fair: Use task-class performance score to pick the busiest group

On Fri, Sep 09, 2022 at 04:11:51PM -0700, Ricardo Neri wrote:
> update_sd_pick_busiest() keeps on selecting as the busiest group scheduling
> groups of identical priority. Since both groups have the same priority,
> either group is a good choice. The classes of tasks in the scheduling
> groups can break this tie.
>
> Pick as busiest the scheduling group that yields a higher task-class
> performance score after load balancing.

> +/**
> + * sched_asym_class_pick - Select a sched group based on classes of tasks
> + * @a: A scheduling group
> + * @b: A second scheduling group
> + * @a_stats: Load balancing statistics of @a
> + * @b_stats: Load balancing statistics of @b
> + *
> + * Returns: true if @a has the same priority and @a has classes of tasks that
> + * yield higher overall throughput after load balance. Returns false otherwise.
> + */
> +static bool sched_asym_class_pick(struct sched_group *a,
> + struct sched_group *b,
> + struct sg_lb_stats *a_stats,
> + struct sg_lb_stats *b_stats)
> +{
> + /*
> + * Only use the class-specific preference selection if both sched
> + * groups have the same priority.
> + */
> + if (arch_asym_cpu_priority(a->asym_prefer_cpu) !=
> + arch_asym_cpu_priority(b->asym_prefer_cpu))
> + return false;
> +
> + return sched_asym_class_prefer(a_stats, b_stats);
> +}
> +
> #else /* CONFIG_SCHED_TASK_CLASSES */
> static void update_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sgs,
> struct rq *rq)

> @@ -9049,6 +9111,12 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> /* Prefer to move from lowest priority CPU's work */
> if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
> return false;
> +
> + /* @sg and @sds::busiest have the same priority. */
> + if (sched_asym_class_pick(sds->busiest, sg, &sds->busiest_stat, sgs))
> + return false;
> +
> + /* @sg has lower priority than @sds::busiest. */
> break;
>
> case group_misfit_task:

So why does only this one instance of asym_prefer() require tie
breaking?

I must also re-iterate how much I hate having two different means of
dealing with big-little topologies.

And while looking through this, I must ask about the comment that goes
with sched_set_itmt_core_prio() vs the sg->asym_prefer_cpu assignment in
init_sched_groups_capacity(), what-up ?!


2022-09-27 12:17:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 12/23] thermal: intel: hfi: Convert table_lock to use flags-handling variants

On Tue, Sep 27, 2022 at 01:34:07PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 09, 2022 at 04:11:54PM -0700, Ricardo Neri wrote:
>
> > --- a/drivers/thermal/intel/intel_hfi.c
> > +++ b/drivers/thermal/intel/intel_hfi.c
> > @@ -175,9 +175,10 @@ static struct workqueue_struct *hfi_updates_wq;
> > static void get_hfi_caps(struct hfi_instance *hfi_instance,
> > struct thermal_genl_cpu_caps *cpu_caps)
> > {
> > + unsigned long flags;
> > int cpu, i = 0;
> >
> > - raw_spin_lock_irq(&hfi_instance->table_lock);
> > + raw_spin_lock_irqsave(&hfi_instance->table_lock, flags);
> > for_each_cpu(cpu, hfi_instance->cpus) {
> > struct hfi_cpu_data *caps;
> > s16 index;
>
> ^^^^ Anti-pattern alert!
>
> Now your IRQ latency depends on nr_cpus -- which is a fair fail. The
> existing code is already pretty crap in that it has the preemption
> latency depend on nr_cpus.
>
> While I'm here looking at the HFI stuff, did they fix that HFI interrupt
> broadcast mess already? Sending an interrupt to *all* CPUs is quite
> insane.

Anyway; given the existence of this here loop; why not have something
like:

DEFINE_PER_CPU(int, hfi_ipc_class);

class = // extract from HFI mess
WRITE_ONCE(per_cpu(hfi_ipc_class, cpu), class);

And then have the tick use this_cpu_read(hfi_ipc_class)? No extra
locking required.
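
A minimal sketch of that idea (hfi_update_cpu_class() and
hfi_read_local_class() are hypothetical helper names; the write would sit
wherever the HFI table update is parsed):

DEFINE_PER_CPU(int, hfi_ipc_class);

/* HFI update path: record the class info for each CPU covered by the table. */
static void hfi_update_cpu_class(int cpu, int class)
{
	WRITE_ONCE(per_cpu(hfi_ipc_class, cpu), class);
}

/* scheduler_tick(): read the local CPU's value, no table_lock needed. */
static int hfi_read_local_class(void)
{
	return this_cpu_read(hfi_ipc_class);
}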

2022-09-27 12:21:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 17/23] thermal: intel: hfi: Enable the Intel Thread Director

On Fri, Sep 09, 2022 at 04:11:59PM -0700, Ricardo Neri wrote:

> +config INTEL_THREAD_DIRECTOR
> + bool "Intel Thread Director"
> + depends on INTEL_HFI_THERMAL
> + depends on SMP
> + select SCHED_TASK_CLASSES
> + help
> + Select this option to enable the Intel Thread Director. If selected,
> + hardware classifies tasks based on the type of instructions they
> + execute. It also provides performance capabilities for each class of
> + task. On hybrid processors, the scheduler uses this data to place
> + tasks of classes of higher performance on higher-performance CPUs.

Do we really need yet another CONFIG symbol for all this!? AFAICT this
Thread Director crud simply extends the HFI table and doesn't actually
carry that much code with it.

Best to always have it on when HFI is on, no?

2022-09-27 12:22:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 15/23] thermal: intel: hfi: Report per-cpu class-specific performance scores

On Fri, Sep 09, 2022 at 04:11:57PM -0700, Ricardo Neri wrote:
> Support the arch_get_task_class_score() interface of the scheduler. Use the
> data that Intel Thread Director provides to inform the scheduler the
> performance of a class of tasks when placed on a given CPU.
>

> +static void get_one_hfi_cap(struct hfi_instance *hfi_instance, s16 index,
> + struct hfi_cpu_data *hfi_caps, int class)
> +{
> + struct hfi_cpu_data *caps;
> +
> + /* Find the capabilities of @cpu */
> + caps = hfi_instance->data + index * hfi_features.cpu_stride +
> + class * hfi_features.class_stride;
> + memcpy(hfi_caps, caps, sizeof(*hfi_caps));
> +}
> +
> +int intel_hfi_get_task_class_score(int class, int cpu)
> +{
> + struct hfi_cpu_info *info = &per_cpu(hfi_cpu_info, cpu);
> + struct hfi_instance *instance;
> + struct hfi_cpu_data caps;
> + unsigned long flags;
> + int cap;
> +
> + if (cpu < 0 || cpu >= nr_cpu_ids)
> + return -EINVAL;
> +
> + if (class == TASK_CLASS_UNCLASSIFIED)
> + return -EINVAL;
> +
> + if (class >= (int)hfi_features.nr_classes)
> + return -EINVAL;
> +
> + instance = info->hfi_instance;
> + if (!instance)
> + return -ENOENT;
> +
> + raw_spin_lock_irqsave(&instance->table_lock, flags);
> + get_one_hfi_cap(instance, info->index, &caps, class);
> + cap = caps.perf_cap;
> + raw_spin_unlock_irqrestore(&instance->table_lock, flags);
> +
> + return cap;
> +}

Does any of that data actually ever change? Isn't the class score fixed
per CPU type?

2022-09-27 12:37:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 14/23] thermal: intel: hfi: Update the class of the current task

On Fri, Sep 09, 2022 at 04:11:56PM -0700, Ricardo Neri wrote:
> +union hfi_thread_feedback_char_msr {
> + struct {
> + u8 classid;
> + u64 __reserved:55;
> + u8 valid:1;
> + } split;
> + u64 full;
> +};

Urgh, did you perhaps mean:

struct {
u64 classid :8;
u64 __reserved :55;
u64 valid :1;
};

?

Because yes, GCC does fold that into a single u64, but that's
implementation defined behaviour; the C spec doesn't require one to pack
adjacent bitfields of different types together.

A layout of:

u8 class; // offset 0
u64 __reserved : 55; // offset 8
u8 valid : 1; // offset 16

with a total size of 24 bytes is, AFAIU, a valid result of what you
wrote.

2022-09-27 12:41:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 12/23] thermal: intel: hfi: Convert table_lock to use flags-handling variants

On Fri, Sep 09, 2022 at 04:11:54PM -0700, Ricardo Neri wrote:

> --- a/drivers/thermal/intel/intel_hfi.c
> +++ b/drivers/thermal/intel/intel_hfi.c
> @@ -175,9 +175,10 @@ static struct workqueue_struct *hfi_updates_wq;
> static void get_hfi_caps(struct hfi_instance *hfi_instance,
> struct thermal_genl_cpu_caps *cpu_caps)
> {
> + unsigned long flags;
> int cpu, i = 0;
>
> - raw_spin_lock_irq(&hfi_instance->table_lock);
> + raw_spin_lock_irqsave(&hfi_instance->table_lock, flags);
> for_each_cpu(cpu, hfi_instance->cpus) {
> struct hfi_cpu_data *caps;
> s16 index;

^^^^ Anti-pattern alert!

Now your IRQ latency depends on nr_cpus -- which is a fair fail. The
existing code is already pretty crap in that it has the preemption
latency depend on nr_cpus.

While I'm here looking at the HFI stuff, did they fix that HFI interrupt
broadcast mess already? Sending an interrupt to *all* CPUs is quite
insane.

2022-09-27 13:08:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 03/23] sched/core: Initialize the class of a new task

On Mon, Sep 26, 2022 at 02:57:29PM +0000, Joel Fernandes wrote:

> > +#ifdef CONFIG_SCHED_TASK_CLASSES
> > + p->class = TASK_CLASS_UNCLASSIFIED;
> > +#endif
>
> I find the term 'class' very broad and unclear what kind of class (without
> further reading). So I am worried about how this generic term usage plays
> with Linux source code in the long-term (like what if someone else comes up
> with a usage of term 'class' that is unrelated to IPC.)

However much I like making a pain for people using C++ to compile the
kernel, I do think ipcc might be better here
(instructions_per_cycle_class for those of the novel per identifier
school of thought).

> To that end, I was wondering if it could be renamed to p->ipc_class, and
> CONFIG_SCHED_TASK_IPC_CLASSES, or something.

Can we *please* shorten those things instead of writing a novel?
CONFIG_SCHED_IPC_CLASS works just as well, no? Or TASK_IPC, whatever.

2022-09-27 13:42:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 23/23] x86/process: Reset hardware history in context switch

On Fri, Sep 09, 2022 at 04:12:05PM -0700, Ricardo Neri wrote:

> +void reset_hardware_history(void)
> +{
> + if (!static_cpu_has(X86_FEATURE_HRESET))
> + return;
> +
> + asm volatile("mov %0, %%eax;" __ASM_HRESET "\n" : :
> + "r" (hardware_history_features) : "%rax");
> +}

asm_inline volatile (ALTERNATIVE("", __ASM_HRESET, X86_FEATURE_HRESET)
: : "a" (hardware_history_features) : "memory");

2022-09-27 13:46:47

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH 23/23] x86/process: Reset hardware history in context switch

On Fri, Sep 09, 2022 at 04:12:05PM -0700, Ricardo Neri wrote:
> +void reset_hardware_history(void)
> +{
> + if (!static_cpu_has(X86_FEATURE_HRESET))

In your whole patchset:

s/static_cpu_has/cpu_feature_enabled/g

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-09-27 13:51:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 22/23] x86/hreset: Configure history reset

On Fri, Sep 09, 2022 at 04:12:04PM -0700, Ricardo Neri wrote:

> +static u32 hardware_history_features __read_mostly;

__ro_after_init ?

> +static __always_inline void setup_hreset(struct cpuinfo_x86 *c)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_HRESET))
> + return;
> +
> + /*
> + * Use on all CPUs the hardware history features that the boot
> + * CPU supports.
> + */
> + if (c == &boot_cpu_data)
> + hardware_history_features = cpuid_ebx(0x20);
> +
> + if (!hardware_history_features)
> + return;
> +
> + wrmsrl(MSR_IA32_HW_HRESET_ENABLE, hardware_history_features);
> +
> + pr_info_once("x86/cpu: Intel History Reset (HRESET) activated\n");
> +}
> +
> /* These bits should not change their value after CPU init is finished. */
> static const unsigned long cr4_pinned_mask =
> X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |

2022-09-27 13:51:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 23/23] x86/process: Reset hardware history in context switch

On Fri, Sep 09, 2022 at 04:12:05PM -0700, Ricardo Neri wrote:
> Reset the classification history of the current task when switching to
> the next task. Hardware will start anew the classification of the next
> running task.

Please quantify the cost of this HRESET instruction.

> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index 2f314b170c9f..74d8ad83e0b3 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -52,6 +52,7 @@
> #include <asm/switch_to.h>
> #include <asm/vm86.h>
> #include <asm/resctrl.h>
> +#include <asm/hreset.h>
> #include <asm/proto.h>
>
> #include "process.h"
> @@ -214,6 +215,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> /* Load the Intel cache allocation PQR MSR. */
> resctrl_sched_in();
>
> + reset_hardware_history();
> +
> return prev_p;
> }
>
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 1962008fe743..0b175f30f359 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -53,6 +53,7 @@
> #include <asm/xen/hypervisor.h>
> #include <asm/vdso.h>
> #include <asm/resctrl.h>
> +#include <asm/hreset.h>
> #include <asm/unistd.h>
> #include <asm/fsgsbase.h>
> #ifdef CONFIG_IA32_EMULATION
> @@ -657,6 +658,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> /* Load the Intel cache allocation PQR MSR. */
> resctrl_sched_in();
>
> + reset_hardware_history();
> +
> return prev_p;
> }
>
> --
> 2.25.1
>

2022-09-27 14:06:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 01/23] sched/task_struct: Introduce classes of tasks

On Fri, Sep 16, 2022 at 07:41:13AM -0700, Ricardo Neri wrote:

> At least on Intel processors, class 0 is a valid class. The scheduler needs to
> have a notion of unclassified tasks and decide how to handle them, IMO.
>
> Intel processors currently support 8-bit, unsigned classes. I doubt other
> architectures will ever support more than 256 classes. Short can handle all the
> possible classification values and also the unclassified case.
>
> On the other hand, class 0 could be the default classification unless hardware
> classifies differently. 0 would be special and need to be documented clearly.
> This would work for Intel processors.

You can always do: class = hw_class + 1; that makes 0 'special' and the
hardware class can be trivially reconstructed by subtracting 1.
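
Spelled out as (purely illustrative) helpers, the convention would be:

/* 0 means "unclassified"; hardware classes are stored shifted by one. */
#define TASK_CLASS_UNCLASSIFIED	0

static inline int task_class_from_hw(unsigned int hw_class)
{
	return hw_class + 1;
}

static inline unsigned int hw_class_from_task(int class)
{
	/* Only valid when class != TASK_CLASS_UNCLASSIFIED. */
	return class - 1;
}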

2022-09-27 16:01:23

by Joel Fernandes

[permalink] [raw]
Subject: Re: [RFC PATCH 03/23] sched/core: Initialize the class of a new task

On Tue, Sep 27, 2022 at 03:04:39PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 26, 2022 at 02:57:29PM +0000, Joel Fernandes wrote:
>
> > > +#ifdef CONFIG_SCHED_TASK_CLASSES
> > > + p->class = TASK_CLASS_UNCLASSIFIED;
> > > +#endif
> >
> > I find the term 'class' very broad and unclear what kind of class (without
> > further reading). So I am worried about how this generic term usage plays
> > with Linux source code in the long-term (like what if someone else comes up
> > with a usage of term 'class' that is unrelated to IPC.)
>
> However much I like making a pain for people using C++ to compile the
> kernel, I do think ipcc might be better here
> (instructions_per_cycle_class for those of the novel per identifier
> school of thought).

Yes, ipcc sounds fine to me.

> > To that end, I was wondering if it could be renamed to p->ipc_class, and
> > CONFIG_SCHED_TASK_IPC_CLASSES, or something.
>
> Can we *please* shorten those things instead of writing a novel?
> CONFIG_SCHED_IPC_CLASS works just as well, no? Or TASK_IPC, whatever.

CONFIG_SCHED_IPC_CLASS also sounds fine, or:

CONFIG_SCHED_IPC_CLASS_SHORTENED_VERSION_TO_NOT_ANNOY_PETER.


thanks,

- Joel

2022-10-01 21:20:03

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 03/23] sched/core: Initialize the class of a new task

On Tue, Sep 27, 2022 at 03:04:39PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 26, 2022 at 02:57:29PM +0000, Joel Fernandes wrote:
>
> > > +#ifdef CONFIG_SCHED_TASK_CLASSES
> > > + p->class = TASK_CLASS_UNCLASSIFIED;
> > > +#endif
> >
> > I find the term 'class' very broad and unclear what kind of class (without
> > further reading). So I am worried about how this generic term usage plays
> > with Linux source code in the long-term (like what if someone else comes up
> > with a usage of term 'class' that is unrelated to IPC.)
>
> However much I like making a pain for people using C++ to compile the
> kernel, I do think ipcc might be better here
> (instructions_per_cycle_class for those of the novel per identifier
> school of thought).

Sure, I will use ipcc
>
> > To that end, I was wondering if it could be renamed to p->ipc_class, and
> > CONFIG_SCHED_TASK_IPC_CLASSES, or something.
>
> Can we *please* shorten those things instead of writing a novel?
> CONFIG_SCHED_IPC_CLASS works just as well, no? Or TASK_IPC, whatever.

... and avoid the novel-style identifiers :) .

Thanks and BR,
Ricardo

2022-10-02 22:19:36

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 23/23] x86/process: Reset hardware history in context switch

On Tue, Sep 27, 2022 at 02:53:29PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 09, 2022 at 04:12:05PM -0700, Ricardo Neri wrote:
> > Reset the classification history of the current task when switching to
> > the next task. Hardware will start anew the classification of the next
> > running task.
>
> Please quantify the cost of this HRESET instruction.

Sure Peter. I will.

Thanks and BR,
Ricardo

2022-10-02 22:22:05

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 23/23] x86/process: Reset hardware history in context switch

On Tue, Sep 27, 2022 at 03:15:47PM +0200, Borislav Petkov wrote:
> On Fri, Sep 09, 2022 at 04:12:05PM -0700, Ricardo Neri wrote:
> > +void reset_hardware_history(void)
> > +{
> > + if (!static_cpu_has(X86_FEATURE_HRESET))
>
> In your whole patchset:
>
> s/static_cpu_has/cpu_feature_enabled/g

Sure I can do this, Boris. I guess this implies that I also need to add the
DISABLE_MASK bits. Otherwise, IIUC, cpu_feature_enabled() falls back to
static_cpu_has().

Thanks and BR,
Ricardo

2022-10-02 22:42:52

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH 23/23] x86/process: Reset hardware history in context switch

On Sun, Oct 02, 2022 at 03:12:38PM -0700, Ricardo Neri wrote:
> Sure I can do this, Boris. I guess this implies that I also need to
> add the DISABLE_MASK bits.

Are you adding a CONFIG_ item which can control the DISABLE_MASK bit
too?

> Otherwise, IIUC, cpu_feature_enabled() falls back to static_cpu_has().

And?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-10-02 22:50:30

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 22/23] x86/hreset: Configure history reset

On Tue, Sep 27, 2022 at 02:03:59PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 09, 2022 at 04:12:04PM -0700, Ricardo Neri wrote:
>
> > +static u32 hardware_history_features __read_mostly;
>
> __ro_after_init ?

Yes! You are correct. The HRESET features are only set once.

Thanks and BR,
Ricardo

2022-10-02 23:05:03

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 01/23] sched/task_struct: Introduce classes of tasks

On Tue, Sep 27, 2022 at 03:01:07PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 16, 2022 at 07:41:13AM -0700, Ricardo Neri wrote:
>
> > At least on Intel processors, class 0 is a valid class. The scheduler needs to
> > have a notion of unclassified tasks and decide how to handle them, IMO.
> >
> > Intel processors currently support 8-bit, unsigned classes. I doubt other
> > architectures will ever support more than 256 classes. Short can handle all the
> > possible classification values and also the unclassified case.
> >
> > On the other hand, class 0 could be the default classification unless hardware
> > classifies differently. 0 would be special and need to be documented clearly.
> > This would work for Intel processors.
>
> You can always do: class = hw_class + 1; that makes 0 'special' and the
> hardware class can be trivially reconstructed by subtracting 1.

This makes sense to me. I will implement as you suggest.

Thanks and BR,
Ricardo

2022-10-03 20:11:59

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC PATCH 23/23] x86/process: Reset hardware history in context switch

On Mon, Oct 03, 2022 at 12:49:35PM -0700, Ricardo Neri wrote:
> Since I did not implement a DISABLE_MASK bit nor a CONFIG_ option for
> HRESET, I thought that static_cpu_has() was sufficient.

The only function that should be used to query X86_FEATURE_ flags is
cpu_feature_enabled().

> I am not against using cpu_feature_enabled(), I just want to confirm that
> I also need to implement the DISABLE_MASK bit and the CONFIG_ option.

You don't have to.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-10-03 20:28:10

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 23/23] x86/process: Reset hardware history in context switch

On Mon, Oct 03, 2022 at 12:15:32AM +0200, Borislav Petkov wrote:
> On Sun, Oct 02, 2022 at 03:12:38PM -0700, Ricardo Neri wrote:
> > Sure I can do this, Boris. I guess this implies that I also need to
> > add the DISABLE_MASK bits.
>
> Are you adding a CONFIG_ item which can control the DISABLE_MASK bit
> too?

I am not. I could use CONFIG_INTEL_HFI_THERMAL, as using HRESET only makes
sense with Intel Thread Director (Peter argued in a separate email against
having a config option specific to Intel Thread Director).

>
> > Otherwise, IIUC, cpu_feature_enabled() falls back to static_cpu_has().
>
> And?

Since I did not implement a DISABLE_MASK bit nor a CONFIG_ option for
HRESET, I thought that static_cpu_has() was sufficient.

I am not against using cpu_feature_enabled(), I just want to confirm that
I also need to implement the DISABLE_MASK bit and the CONFIG_ option.

Thanks and BR,
Ricardo

2022-10-03 23:40:02

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 23/23] x86/process: Reset hardware history in context switch

On Tue, Sep 27, 2022 at 02:52:24PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 09, 2022 at 04:12:05PM -0700, Ricardo Neri wrote:
>
> > +void reset_hardware_history(void)
> > +{
> > + if (!static_cpu_has(X86_FEATURE_HRESET))
> > + return;

If I used cpu_feature_enabled(X86_FEATURE_ITD) along with CONFIG_INTEL_HFI_THERMAL
and its corresponding DISABLE_MASK bit, the code below would be compiled out.

> > +
> > + asm volatile("mov %0, %%eax;" __ASM_HRESET "\n" : :
> > + "r" (hardware_history_features) : "%rax");
> > +}
>
> asm_inline volatile (ALTERNATIVE("", __ASM_HRESET, X86_FEATURE_HRESET)
> : : "a" (hardware_history_features) : "memory");

Do you still prefer to have it implemented as an ALTERNATIVE?

Thanks and BR,
Ricardo
>

2022-10-05 23:47:13

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 09/23] sched/fair: Use task-class performance score to pick the busiest group

On Tue, Sep 27, 2022 at 01:01:32PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 09, 2022 at 04:11:51PM -0700, Ricardo Neri wrote:
> > update_sd_pick_busiest() keeps on selecting as the busiest group scheduling
> > groups of identical priority. Since both groups have the same priority,
> > either group is a good choice. The classes of tasks in the scheduling
> > groups can break this tie.
> >
> > Pick as busiest the scheduling group that yields a higher task-class
> > performance score after load balancing.
>
> > +/**
> > + * sched_asym_class_pick - Select a sched group based on classes of tasks
> > + * @a: A scheduling group
> > + * @b: A second scheduling group
> > + * @a_stats: Load balancing statistics of @a
> > + * @b_stats: Load balancing statistics of @b
> > + *
> > + * Returns: true if @a has the same priority and @a has classes of tasks that
> > + * yield higher overall throughput after load balance. Returns false otherwise.
> > + */
> > +static bool sched_asym_class_pick(struct sched_group *a,
> > + struct sched_group *b,
> > + struct sg_lb_stats *a_stats,
> > + struct sg_lb_stats *b_stats)
> > +{
> > + /*
> > + * Only use the class-specific preference selection if both sched
> > + * groups have the same priority.
> > + */
> > + if (arch_asym_cpu_priority(a->asym_prefer_cpu) !=
> > + arch_asym_cpu_priority(b->asym_prefer_cpu))
> > + return false;
> > +
> > + return sched_asym_class_prefer(a_stats, b_stats);
> > +}
> > +
> > #else /* CONFIG_SCHED_TASK_CLASSES */
> > static void update_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sgs,
> > struct rq *rq)
>
> > @@ -9049,6 +9111,12 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> > /* Prefer to move from lowest priority CPU's work */
> > if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
> > return false;
> > +
> > + /* @sg and @sds::busiest have the same priority. */
> > + if (sched_asym_class_pick(sds->busiest, sg, &sds->busiest_stat, sgs))
> > + return false;
> > +
> > + /* @sg has lower priority than @sds::busiest. */
> > break;
> >
> > case group_misfit_task:
>
> So why does only this one instance of asym_prefer() require tie
> breaking?

This is the only place in which two sched groups with running tasks and of
equal priority are compared.

In all other places sched_asym_prefer() is used to compare the destination
CPU with others. Since asym_packing is done only when the destination CPU is
idle, there is no need to break this tie.

>
> I must also re-iterate how much I hate having two different means of
> dealing with big-little topologies.

Yes, that is true. We discussed the challenges of using EAS on x86 during
this year's Linux Plumbers Conference. For the discussion at hand, I think
that the most relevant part is the difficulty of assessing CPU capacity
on Intel processors given the presence of SMT, a large range of turbo
frequencies and hardware-controlled frequency scaling.

We are looking into these challenges.

This patchset focuses mainly on the infrastructure to support IPC classes of
tasks in the scheduler. We use it for this tie break when using asym_packing,
but it could also be used in capacity computations.

>
> And while looking through this, I must ask about the comment that goes
> with sched_set_itmt_core_prio() vs the sg->asym_prefer_cpu assignment in
> init_sched_groups_capacity(), what-up ?!

Are you referring to this comment?

"No need to rebuild sched domain after updating
the CPU priorities. The sched domains have no
dependency on CPU priorities"

If yes, then it looks wrong to me. Sched domains are rebuilt after updating
priorities.

Thanks and BR,
Ricardo

2022-10-05 23:57:58

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 15/23] thermal: intel: hfi: Report per-cpu class-specific performance scores

On Tue, Sep 27, 2022 at 01:59:15PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 09, 2022 at 04:11:57PM -0700, Ricardo Neri wrote:
> > Support the arch_get_task_class_score() interface of the scheduler. Use the
> > data that Intel Thread Director provides to inform the scheduler the
> > performance of a class of tasks when placed on a given CPU.
> >
>
> > +static void get_one_hfi_cap(struct hfi_instance *hfi_instance, s16 index,
> > + struct hfi_cpu_data *hfi_caps, int class)
> > +{
> > + struct hfi_cpu_data *caps;
> > +
> > + /* Find the capabilities of @cpu */
> > + caps = hfi_instance->data + index * hfi_features.cpu_stride +
> > + class * hfi_features.class_stride;
> > + memcpy(hfi_caps, caps, sizeof(*hfi_caps));
> > +}
> > +
> > +int intel_hfi_get_task_class_score(int class, int cpu)
> > +{
> > + struct hfi_cpu_info *info = &per_cpu(hfi_cpu_info, cpu);
> > + struct hfi_instance *instance;
> > + struct hfi_cpu_data caps;
> > + unsigned long flags;
> > + int cap;
> > +
> > + if (cpu < 0 || cpu >= nr_cpu_ids)
> > + return -EINVAL;
> > +
> > + if (class == TASK_CLASS_UNCLASSIFIED)
> > + return -EINVAL;
> > +
> > + if (class >= (int)hfi_features.nr_classes)
> > + return -EINVAL;
> > +
> > + instance = info->hfi_instance;
> > + if (!instance)
> > + return -ENOENT;
> > +
> > + raw_spin_lock_irqsave(&instance->table_lock, flags);
> > + get_one_hfi_cap(instance, info->index, &caps, class);
> > + cap = caps.perf_cap;
> > + raw_spin_unlock_irqrestore(&instance->table_lock, flags);
> > +
> > + return cap;
> > +}
>
> Does any of that data actually ever change? Isn't the class score fixed
> per CPU type?

Yes, data can change. The Intel SDM Vol 3 Section 14.6.7 states that the
table can be updated during runtime.

Thanks and BR,
Ricardo

2022-10-06 02:08:16

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 17/23] thermal: intel: hfi: Enable the Intel Thread Director

On Tue, Sep 27, 2022 at 02:00:39PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 09, 2022 at 04:11:59PM -0700, Ricardo Neri wrote:
>
> > +config INTEL_THREAD_DIRECTOR
> > + bool "Intel Thread Director"
> > + depends on INTEL_HFI_THERMAL
> > + depends on SMP
> > + select SCHED_TASK_CLASSES
> > + help
> > + Select this option to enable the Intel Thread Director. If selected,
> > + hardware classifies tasks based on the type of instructions they
> > + execute. It also provides performance capabilities for each class of
> > + task. On hybrid processors, the scheduler uses this data to place
> > + tasks of classes of higher performance on higher-performance CPUs.
>
> Do we really need yet another CONFIG symbol for all this!? AFAICT this
> Thread Director crud simply extends the HFI table and doesn't actually
> carry that much code with it.
>
> Best to always have it on when HFI is on, no?

I decided to add CONFIG_INTEL_THREAD_DIRECTOR mainly to select CONFIG_IPC_CLASS
and add the needed task_struct members only when required. Legacy, classless
HFI can work with the Thread Director part on some processors (e.g., Sapphire
Rapids or, FWIW, Lakefield).

I could get rid of CONFIG_INTEL_THREAD_DIRECTOR and instead wrap the Thread
Director code in an #ifdef(CONFIG_IPC_CLASS) block instead.

Thanks and BR,
Ricardo

2022-10-06 08:54:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 23/23] x86/process: Reset hardware history in context switch

On Mon, Oct 03, 2022 at 04:07:58PM -0700, Ricardo Neri wrote:
> On Tue, Sep 27, 2022 at 02:52:24PM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 09, 2022 at 04:12:05PM -0700, Ricardo Neri wrote:
> >
> > > +void reset_hardware_history(void)
> > > +{
> > > + if (!static_cpu_has(X86_FEATURE_HRESET))
> > > + return;
>
> If I used cpu_feature_enabled(X86_FEATURE_ITD) along with the CONFIG_HFI_
> THERMAL and its corresponding DISABLE_MASK bit the code below would be
> compiled out.

Nobody cares about compiled out -- distros must enable all this. So
what counts is the code size, and the alternative is smaller.

> > > +
> > > + asm volatile("mov %0, %%eax;" __ASM_HRESET "\n" : :
> > > + "r" (hardware_history_features) : "%rax");
> > > +}
> >
> > asm_inline volatile (ALTERNATIVE("", __ASM_HRESET, X86_FEATURE_HRESET)
> > : : "a" (hardware_history_features) : "memory");
>
> Do you still prefer have implemented as an ALTERNATIVE?

Yes, look at the generated code.

2022-10-06 08:57:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 09/23] sched/fair: Use task-class performance score to pick the busiest group

On Wed, Oct 05, 2022 at 04:38:41PM -0700, Ricardo Neri wrote:
> On Tue, Sep 27, 2022 at 01:01:32PM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 09, 2022 at 04:11:51PM -0700, Ricardo Neri wrote:

> > > @@ -9049,6 +9111,12 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> > > /* Prefer to move from lowest priority CPU's work */
> > > if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
> > > return false;
> > > +
> > > + /* @sg and @sds::busiest have the same priority. */
> > > + if (sched_asym_class_pick(sds->busiest, sg, &sds->busiest_stat, sgs))
> > > + return false;
> > > +
> > > + /* @sg has lower priority than @sds::busiest. */
> > > break;
> > >
> > > case group_misfit_task:
> >
> > So why does only this one instance of asym_prefer() require tie
> > breaking?
>
> This is the only place in which two sched groups with running tasks and of
> equal priority are compared.
>
> In all other places sched_asym_prefer() is used to compare the destination
> CPU with others. Since asym_packing is done only when the destination CPU is
> idle, there is no need to break this tie.

That would make for a fine comment, no? Because as presented one is left
wondering why, if asym_prefer() needs tie breaking, only this one site
needs it.
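
Something along these lines in update_sd_pick_busiest() would capture it
(comment wording is illustrative only):

	case group_asym_packing:
		/* Prefer to move from lowest priority CPU's work */
		if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
			return false;

		/*
		 * Unlike the other sched_asym_prefer() call sites, which compare
		 * candidates against an idle destination CPU, both @sg and
		 * @sds::busiest have running tasks here and may have equal
		 * priority. Use the IPC classes of their current tasks to break
		 * the tie.
		 */
		if (sched_asym_class_pick(sds->busiest, sg, &sds->busiest_stat, sgs))
			return false;

		break;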

> > And while looking through this, I must ask about the comment that goes
> > with sched_set_itmt_core_prio() vs the sg->asym_prefer_cpu assignment in
> > init_sched_groups_capacity(), what-up ?!
>
> Are you referring to this comment?
>
> "No need to rebuild sched domain after updating
> the CPU priorities. The sched domains have no
> dependency on CPU priorities"
>
> If yes, then it looks wrong to me. Sched domains are rebuilt after updating
> priorities.

Right.

2022-10-06 09:27:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 15/23] thermal: intel: hfi: Report per-cpu class-specific performance scores

On Thu, Oct 06, 2022 at 10:52:16AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 05, 2022 at 04:59:59PM -0700, Ricardo Neri wrote:
> > On Tue, Sep 27, 2022 at 01:59:15PM +0200, Peter Zijlstra wrote:
> > > On Fri, Sep 09, 2022 at 04:11:57PM -0700, Ricardo Neri wrote:
> > > > Support the arch_get_task_class_score() interface of the scheduler. Use the
> > > > data that Intel Thread Director provides to inform the scheduler the
> > > > performance of a class of tasks when placed on a given CPU.
> > > >
> > >
> > > > +static void get_one_hfi_cap(struct hfi_instance *hfi_instance, s16 index,
> > > > + struct hfi_cpu_data *hfi_caps, int class)
> > > > +{
> > > > + struct hfi_cpu_data *caps;
> > > > +
> > > > + /* Find the capabilities of @cpu */
> > > > + caps = hfi_instance->data + index * hfi_features.cpu_stride +
> > > > + class * hfi_features.class_stride;
> > > > + memcpy(hfi_caps, caps, sizeof(*hfi_caps));
> > > > +}
> > > > +
> > > > +int intel_hfi_get_task_class_score(int class, int cpu)
> > > > +{
> > > > + struct hfi_cpu_info *info = &per_cpu(hfi_cpu_info, cpu);
> > > > + struct hfi_instance *instance;
> > > > + struct hfi_cpu_data caps;
> > > > + unsigned long flags;
> > > > + int cap;
> > > > +
> > > > + if (cpu < 0 || cpu >= nr_cpu_ids)
> > > > + return -EINVAL;
> > > > +
> > > > + if (class == TASK_CLASS_UNCLASSIFIED)
> > > > + return -EINVAL;
> > > > +
> > > > + if (class >= (int)hfi_features.nr_classes)
> > > > + return -EINVAL;
> > > > +
> > > > + instance = info->hfi_instance;
> > > > + if (!instance)
> > > > + return -ENOENT;
> > > > +
> > > > + raw_spin_lock_irqsave(&instance->table_lock, flags);
> > > > + get_one_hfi_cap(instance, info->index, &caps, class);
> > > > + cap = caps.perf_cap;
> > > > + raw_spin_unlock_irqrestore(&instance->table_lock, flags);
> > > > +
> > > > + return cap;
> > > > +}
> > >
> > > Does any of that data actually ever change? Isn't the class score fixed
> > > per CPU type?
> >
> > Yes, data can change. The Intel SDM Vol 3 Section 14.6.7 states that the
> > table can be updated during runtime.
>
> I find the SDM is often unreadable gibberish; this part doesn't disappoint.
>
> There's a ton of might and maybe there; what does it actually do and how
> often does it do it? Given the thermal interrupt is such a shitshow, we
> really, as in *REALLY* don't want this to happen at any frequency at
> all.
>
> And if it barely happens, why do we care if it happens at all?

I enabled this HFI crud on my ADL (INTEL_HFI_THERMAL -- because
apparently Debian doesn't enable this by default) and now I get all of _1_
interrupt during boot.

Building a kernel on that machine doesn't manage to trip another one.

TRM: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Thermal event interrupts

So yeah.. *can* change, but doesn't.

2022-10-06 10:01:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 15/23] thermal: intel: hfi: Report per-cpu class-specific performance scores

On Wed, Oct 05, 2022 at 04:59:59PM -0700, Ricardo Neri wrote:
> On Tue, Sep 27, 2022 at 01:59:15PM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 09, 2022 at 04:11:57PM -0700, Ricardo Neri wrote:
> > > Support the arch_get_task_class_score() interface of the scheduler. Use the
> > > data that Intel Thread Director provides to inform the scheduler the
> > > performance of a class of tasks when placed on a given CPU.
> > >
> >
> > > +static void get_one_hfi_cap(struct hfi_instance *hfi_instance, s16 index,
> > > + struct hfi_cpu_data *hfi_caps, int class)
> > > +{
> > > + struct hfi_cpu_data *caps;
> > > +
> > > + /* Find the capabilities of @cpu */
> > > + caps = hfi_instance->data + index * hfi_features.cpu_stride +
> > > + class * hfi_features.class_stride;
> > > + memcpy(hfi_caps, caps, sizeof(*hfi_caps));
> > > +}
> > > +
> > > +int intel_hfi_get_task_class_score(int class, int cpu)
> > > +{
> > > + struct hfi_cpu_info *info = &per_cpu(hfi_cpu_info, cpu);
> > > + struct hfi_instance *instance;
> > > + struct hfi_cpu_data caps;
> > > + unsigned long flags;
> > > + int cap;
> > > +
> > > + if (cpu < 0 || cpu >= nr_cpu_ids)
> > > + return -EINVAL;
> > > +
> > > + if (class == TASK_CLASS_UNCLASSIFIED)
> > > + return -EINVAL;
> > > +
> > > + if (class >= (int)hfi_features.nr_classes)
> > > + return -EINVAL;
> > > +
> > > + instance = info->hfi_instance;
> > > + if (!instance)
> > > + return -ENOENT;
> > > +
> > > + raw_spin_lock_irqsave(&instance->table_lock, flags);
> > > + get_one_hfi_cap(instance, info->index, &caps, class);
> > > + cap = caps.perf_cap;
> > > + raw_spin_unlock_irqrestore(&instance->table_lock, flags);
> > > +
> > > + return cap;
> > > +}
> >
> > Does any of that data actually ever change? Isn't the class score fixed
> > per CPU type?
>
> Yes, data can change. The Intel SDM Vol 3 Section 14.6.7 states that the
> table can be updated during runtime.

I find the SDM is often unreadable gibberish; this part doesn't disappoint.

There's a ton of might and maybe there; what does it actually do and how
often does it do it? Given the thermal interrupt is such a shitshow, we
really, as in *REALLY* don't want this to happen at any frequency at
all.

And if it barely happens, why do we care if it happens at all?

2022-10-06 16:07:45

by Brown, Len

[permalink] [raw]
Subject: RE: [RFC PATCH 15/23] thermal: intel: hfi: Report per-cpu class-specific performance scores

> > > Does any of that data actually ever change? Isn't the class score
> > > fixed per CPU type?

Depends on the chip.

As we described at LPC, the ADL chips shipping today update their tables in response
to RAPL working to keep the system below PL1. Linux or user-space can scribble on PL1
at any time, so technically, this table update can happen at any time.

That said, it is true that, say, an ADL desktop part that operates with plenty of power and cooling
will send the initial table and never have a need to update the table after that.

Upcoming chips are smarter and will give us more dynamic information.
We expect the P-unit to send only "meaningful" changes, and that they
shall not occur more often than every 10ms.

Re: class score fixed per CPU type
Usually (but not always) the class score for every CPU of a type will be the same.
They can change, but likely will all change in concert.
However, the scores for the different types can change relative to each other.

The not-always bit is because CPUs of the same type may not actually be identical.
Recall that ITMT was originally implemented to favor some Pcores that are
faster than other Pcores of the same type. Similarly, I would expect that
Ecores within a module will be uniform, but some Ecore modules may
differ from other modules -- and how they differ can change at run-time.

Cheers,
-Len



2022-10-06 16:42:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 15/23] thermal: intel: hfi: Report per-cpu class-specific performance scores

On Thu, Oct 06, 2022 at 03:05:18PM +0000, Brown, Len wrote:
> > > > Does any of that data actually ever change? Isn't the class score
> > > > fixed per CPU type?
>
> Depends on the chip.
>
> As we described at LPC, the ADL chips shipping today update their
> tables in response to RAPL working to keep the system below PL1.
> Linux or user-space can scribble on PL1 at any time, so technically,
> this table update can happen at any time.
>
> That said, it is true that, say, an ADL desktop part that operates
> with plenty of power and cooling will send the initial table and never
> have a need to update the table after that.

I have a NUC, so laptop part with limited thermals (the Lenovo P360
Ultra was announced just after I ordered the P360 Tiny). Still, I wasn't
able to trigger this during normal operation.

> Upcoming chips are smarter and will give us more dynamic information.
> We expect the P-unit to send only "meaningful" changes, and that they
> shall not occur more often than every 10ms.

Make *very* sure those upcoming chips don't broadcast that interrupt.
Broadcast interrupts are unconditional crap.

Broadcast interrupts every 10ms is terrifying crap and a good reason for
people to force disable this stuff.


2022-10-06 20:04:43

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 09/23] sched/fair: Use task-class performance score to pick the busiest group

On Thu, Oct 06, 2022 at 10:37:52AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 05, 2022 at 04:38:41PM -0700, Ricardo Neri wrote:
> > On Tue, Sep 27, 2022 at 01:01:32PM +0200, Peter Zijlstra wrote:
> > > On Fri, Sep 09, 2022 at 04:11:51PM -0700, Ricardo Neri wrote:
>
> > > > @@ -9049,6 +9111,12 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> > > > /* Prefer to move from lowest priority CPU's work */
> > > > if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
> > > > return false;
> > > > +
> > > > + /* @sg and @sds::busiest have the same priority. */
> > > > + if (sched_asym_class_pick(sds->busiest, sg, &sds->busiest_stat, sgs))
> > > > + return false;
> > > > +
> > > > + /* @sg has lower priority than @sds::busiest. */
> > > > break;
> > > >
> > > > case group_misfit_task:
> > >
> > > So why does only this one instance of asym_prefer() require tie
> > > breaking?
> >
> > This is the only place in which two sched groups with running tasks and of
> > equal priority are compared.
> >
> > In all other places sched_asym_prefer() is used to compare the destination
> > CPU with others. Since asym_packing is done only when the destination CPU is
> > idle, there is no need to break this tie.

>
> That would make for a fine comment, no? Because as presented one is left
> wondering why, if asym_prefer() needs tie breaking, only this one site
> needs it.

Sure. I will add this comment.

>
> > > And while looking through this, I must ask about the comment that goes
> > > with sched_set_itmt_core_prio() vs the sg->asym_prefer_cpu assignment in
> > > init_sched_groups_capacity(), what-up ?!
> >
> > Are you referring to this comment?
> >
> > "No need to rebuild sched domain after updating
> > the CPU priorities. The sched domains have no
> > dependency on CPU priorities"
> >
> > If yes, then it looks wrong to me. Sched domains are rebuilt after updating
> > priorities.

I can include in the series a patch removing this comment.

Thanks and BR,
Ricardo

2022-10-06 23:10:18

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 23/23] x86/process: Reset hardware history in context switch

On Thu, Oct 06, 2022 at 10:35:36AM +0200, Peter Zijlstra wrote:
> On Mon, Oct 03, 2022 at 04:07:58PM -0700, Ricardo Neri wrote:
> > On Tue, Sep 27, 2022 at 02:52:24PM +0200, Peter Zijlstra wrote:
> > > On Fri, Sep 09, 2022 at 04:12:05PM -0700, Ricardo Neri wrote:
> > >
> > > > +void reset_hardware_history(void)
> > > > +{
> > > > + if (!static_cpu_has(X86_FEATURE_HRESET))
> > > > + return;
> >
> > If I used cpu_feature_enabled(X86_FEATURE_ITD) along with CONFIG_INTEL_HFI_THERMAL
> > and its corresponding DISABLE_MASK bit, the code below would be compiled out.
>
> Nobody cares about compiled out -- distro's must enable all this.

I see.

> So
> what counts is the code size, and the alternative is smaller.
>
> > > > +
> > > > + asm volatile("mov %0, %%eax;" __ASM_HRESET "\n" : :
> > > > + "r" (hardware_history_features) : "%rax");
> > > > +}
> > >
> > > asm_inline volatile (ALTERNATIVE("", __ASM_HRESET, X86_FEATURE_HRESET)
> > > : : "a" (hardware_history_features) : "memory");
> >
> > Do you still prefer to have it implemented as an ALTERNATIVE?
>
> Yes, look at the generated code.

I did compare the sizes of the two options; several NOPs are added at
the end. I will take your code.

Thanks and BR,
Ricardo

2022-10-07 11:50:52

by Len Brown

[permalink] [raw]
Subject: Re: [RFC PATCH 15/23] thermal: intel: hfi: Report per-cpu class-specific performance scores

On Thu, Oct 6, 2022 at 6:15 PM Peter Zijlstra <[email protected]> wrote:

> > That said, it is true that, say, an ADL desktop part that operates
> > with plenty of power and cooling will send the initial table and never
> > have a need to update the table after that.
>
> I have a NUC, so laptop part with limited thermals (the Lenovo P360
> Ultra was announed just after I ordered the P360 Tiny). Still I wasn't
> able to trigger this during normal operation.

To force an update, scribble on the software power limit:

# echo 0 > /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw
# echo 0 > /sys/class/powercap/intel-rapl:0/constraint_1_power_limit_uw

Note that your P-unit may not instantaneously respond to this stimulus.
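
If an update does arrive, it should show up as an extra count on the
"Thermal event interrupts" (TRM) line in /proc/interrupts, the same counter
quoted earlier in this thread.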

> > Upcoming chips are smarter and will give us more dynamic information.
> > We expect the P-unit to send only "meaningful" changes, and that they
> > shall not occur more often than every 10ms.
>
> Make *very* sure those upcoming chips don't broadcast that interrupt.
> Broadcast interrupts are unconditional crap.
>
> Broadcast interrupts every 10ms is terrifying crap and a good reason for
> people to force disable this stuff.

The package-wide broadcast interrupt is indeed ugly, and we've
been screaming bloody murder about it from day 1. Unfortunately
the chip pipeline is long and deep, and so more chips with this issue will
materialize before the broadcast interrupt issue completely goes away.
--
Len Brown, Intel

2022-10-07 21:18:15

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 14/23] thermal: intel: hfi: Update the class of the current task

On Tue, Sep 27, 2022 at 01:46:59PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 09, 2022 at 04:11:56PM -0700, Ricardo Neri wrote:
> > +union hfi_thread_feedback_char_msr {
> > + struct {
> > + u8 classid;
> > + u64 __reserved:55;
> > + u8 valid:1;
> > + } split;
> > + u64 full;
> > +};
>
> Urgh, did you perhaps mean:
>
> struct {
> u64 classid :8;
> u64 __reserved :55;
> u64 valid :1
> };
>
> ?
>
> Because yes, GCC does fold that into a single u64, but that's
> implementation defined behaviour; the C spec doesn't require one to pack
> adjacent bitfields of different types together.
>
> A layout of:
>
> u8 class; // offset 0
> u64 __reserver : 55; // offset 8
> u8 valid : 1; // offset 16
>
> with a total size of 24 bytes is, AFAIU, a valid result of what you
> wrote.

I checked the C99 and C11 specs and, IIUC, they do prescribe how to handle
adjacent bit-fields:

"An implementation may allocate any addressable storage unit large
enough to hold a bitfield. If enough space remains, a bit-field
that immediately follows another bit-field in a structure shall be
packed into adjacent bits of the same unit."

Hence, __reserved and valid should be packed. classid, however, is not
guaranteed to be adjacent to __reserved.

I will implement the struct as you have described.
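
Something like this (an untested sketch of the layout you suggested):

union hfi_thread_feedback_char_msr {
	struct {
		u64	classid		: 8;
		u64	__reserved	: 55;
		u64	valid		: 1;
	} split;
	u64 full;
};

Since all three bit-fields now have the same declared type and add up to
exactly 64 bits, they should be packed into a single 64-bit storage unit.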

Thanks and BR,
Ricardo

2022-10-11 19:32:28

by Bilbao, Carlos

[permalink] [raw]
Subject: Trying to apply patch set

Could you please tell me on top of what tree this applies? It would be
great if you could provide a link to a patched repository, given the
dependencies on other patch sets.

Thanks in advance,
Carlos

2022-10-18 03:07:21

by Ricardo Neri

[permalink] [raw]
Subject: Re: Trying to apply patch set

On Tue, Oct 11, 2022 at 02:12:21PM -0500, Carlos Bilbao wrote:
> Could you please tell me on top of what tree this applies? It would be
> great if you could provide a link to a patched repository, given the
> dependencies on other patch sets.

Hi Carlos,

These patches apply on top of v6.0. Also, you need to apply these patches
first:

https://lore.kernel.org/lkml/[email protected]/

I will push these patches to a repo to make it convenient.

Thanks and BR,
Ricardo

2022-10-26 04:24:34

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 12/23] thermal: intel: hfi: Convert table_lock to use flags-handling variants

On Tue, Sep 27, 2022 at 01:34:07PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 09, 2022 at 04:11:54PM -0700, Ricardo Neri wrote:
>
> > --- a/drivers/thermal/intel/intel_hfi.c
> > +++ b/drivers/thermal/intel/intel_hfi.c
> > @@ -175,9 +175,10 @@ static struct workqueue_struct *hfi_updates_wq;
> > static void get_hfi_caps(struct hfi_instance *hfi_instance,
> > struct thermal_genl_cpu_caps *cpu_caps)
> > {
> > + unsigned long flags;
> > int cpu, i = 0;
> >
> > - raw_spin_lock_irq(&hfi_instance->table_lock);
> > + raw_spin_lock_irqsave(&hfi_instance->table_lock, flags);
> > for_each_cpu(cpu, hfi_instance->cpus) {
> > struct hfi_cpu_data *caps;
> > s16 index;

(Another email I thought I had sent but did not. Sorry!)

>
> ^^^^ Anti-pattern alert!
>
> Now your IRQ latency depends on nr_cpus -- which is a fair fail. The
> existing code is already pretty crap in that it has the preemption
> latency depend on nr_cpus.

I see.

>
> While I'm here looking at the HFI stuff, did they fix that HFI interrupt
> broadcast mess already? Sending an interrupt to *all* CPUs is quite
> insane.

This issue has been raised with hardware teams and they are looking into
fixes. The issue, however, may persist on several models while a fix
propagates.

Thanks and BR,
Ricardo

2022-10-26 04:28:08

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 08/23] sched/fair: Compute task-class performance scores for load balancing

On Tue, Sep 27, 2022 at 11:15:46AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 09, 2022 at 04:11:50PM -0700, Ricardo Neri wrote:
>
> > +static void compute_ilb_sg_task_class_scores(struct sg_lb_task_class_stats *class_sgs,
> > + struct sg_lb_stats *sgs,
> > + int dst_cpu)
> > +{
> > + int group_score, group_score_without, score_on_dst_cpu;
> > + int busy_cpus = sgs->group_weight - sgs->idle_cpus;
> > +
> > + if (!sched_task_classes_enabled())
> > + return;
> > +
> > + /* No busy CPUs in the group. No tasks to move. */
> > + if (!busy_cpus)
> > + return;
> > +
> > + score_on_dst_cpu = arch_get_task_class_score(class_sgs->p_min_score->class,
> > + dst_cpu);
> > +
> > + /*
> > + * The simpest case. The single busy CPU in the current group will
> > + * become idle after pulling its current task. The destination CPU is
> > + * idle.
> > + */
> > + if (busy_cpus == 1) {
> > + sgs->task_class_score_before = class_sgs->sum_score;
> > + sgs->task_class_score_after = score_on_dst_cpu;
> > + return;
> > + }
> > +
> > + /*
> > + * Now compute the group score with and without the task with the
> > + * lowest score. We assume that the tasks that remain in the group share
> > + * the CPU resources equally.
> > + */
> > + group_score = class_sgs->sum_score / busy_cpus;
> > +
> > + group_score_without = (class_sgs->sum_score - class_sgs->min_score) /
> > + (busy_cpus - 1);
> > +
> > + sgs->task_class_score_after = group_score_without + score_on_dst_cpu;
> > + sgs->task_class_score_before = group_score;
> > +}

(I am sorry, Peter. I just found that several emails were sitting in my
drafts directory.)

>
> That's just plain broken; also lots of cleanups done...

Thank you very much for your suggestions. They make sense to me. I only
have a comment...

Do you want me to add your Signed-off-by and Co-developed-by tags?

>
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8405,12 +8405,14 @@ struct sg_lb_stats {
> enum group_type group_type;
> unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
> - long task_class_score_after; /* Prospective task-class score after load balancing */
> - long task_class_score_before; /* Task-class score before load balancing */
> #ifdef CONFIG_NUMA_BALANCING
> unsigned int nr_numa_running;
> unsigned int nr_preferred_running;
> #endif
> +#ifdef CONFIG_SCHED_TASK_CLASSES
> + long task_class_score_after; /* Prospective task-class score after load balancing */
> + long task_class_score_before; /* Task-class score before load balancing */
> +#endif
> };
>
> /*
> @@ -8689,58 +8691,54 @@ group_type group_classify(unsigned int i
> }
>
> struct sg_lb_task_class_stats {
> - /*
> - * Score of the task with lowest score among the current tasks (i.e.,
> - * runqueue::curr) of all runqueues in the scheduling group.
> - */
> - int min_score;
> - /*
> - * Sum of the scores of the current tasks of all runqueues in the
> - * scheduling group.
> - */
> - long sum_score;
> - /* The task with score equal to @min_score */
> - struct task_struct *p_min_score;
> + int min_score; /* Min(rq->curr->score) */
> + int min_class;
> + long sum_score; /* Sum(rq->curr->score) */
> };
>
> #ifdef CONFIG_SCHED_TASK_CLASSES
> -static void init_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sgs)
> +static void init_sg_lb_task_class_stats(struct sg_lb_task_class_stats *sgcs)
> {
> - class_sgs->min_score = INT_MAX;
> - class_sgs->sum_score = 0;
> - class_sgs->p_min_score = NULL;
> + *sgcs = (struct sg_lb_task_class_stats){
> + .min_score = INT_MAX,
> + };
> }
>
> /** Called only if cpu_of(@rq) is not idle and has tasks running. */
> -static void update_rq_task_classes_stats(struct sg_lb_task_class_stats *class_sgs,
> - struct rq *rq)
> +static void update_sg_lb_task_class_stats(struct sg_lb_task_class_stats *sgcs,
> + struct rq *rq)
> {
> - int score;
> + struct task_struct *curr;
> + int class, score;
>
> if (!sched_task_classes_enabled())
> return;
>
> + curr = rcu_dereference(rq->curr);
> + if (!curr || (curr->flags & PF_EXITING) || is_idle_task(curr))
> + return;
> +
> /*
> * TODO: if nr_running > 1 we may want go through all the tasks behind
> * rq->curr.
> */
> - score = arch_get_task_class_score(rq->curr->class, cpu_of(rq));
> -
> - class_sgs->sum_score += score;
> + class = curr->class;
> + score = arch_get_task_class_score(class, cpu_of(rq));
>
> - if (score >= class_sgs->min_score)
> - return;
> + sgcs->sum_score += score;
>
> - class_sgs->min_score = score;
> - class_sgs->p_min_score = rq->curr;
> + if (score < sgcs->min_score) {
> + sgcs->min_score = score;
> + sgcs->min_class = class;
> + }
> }
>
> -static void compute_ilb_sg_task_class_scores(struct sg_lb_task_class_stats *class_sgs,
> - struct sg_lb_stats *sgs,
> - int dst_cpu)
> +static void update_sg_lb_stats_scores(struct sg_lb_task_class_stats *sgcs,
> + struct sg_lb_stats *sgs,
> + int dst_cpu)
> {
> - int group_score, group_score_without, score_on_dst_cpu;
> int busy_cpus = sgs->group_weight - sgs->idle_cpus;
> + long before, after;
>
> if (!sched_task_classes_enabled())
> return;
> @@ -8749,32 +8747,18 @@ static void compute_ilb_sg_task_class_sc
> if (!busy_cpus)
> return;
>
> - score_on_dst_cpu = arch_get_task_class_score(class_sgs->p_min_score->class,
> - dst_cpu);
> + score_on_dst_cpu = arch_get_task_class_score(sgcs->min_class, dst_cpu);
>
> - /*
> - * The simpest case. The single busy CPU in the current group will
> - * become idle after pulling its current task. The destination CPU is
> - * idle.
> - */
> - if (busy_cpus == 1) {
> - sgs->task_class_score_before = class_sgs->sum_score;
> - sgs->task_class_score_after = score_on_dst_cpu;
> - return;
> - }
> +	before = sgcs->sum_score;
> + after = before - sgcs->min_score + score_on_dst_cpu;

This works when the sched group being evaluated has only one busy CPU
because it will become idle if the destination CPU (which was idle) pulls
the current task.

>
> - /*
> - * Now compute the group score with and without the task with the
> - * lowest score. We assume that the tasks that remain in the group share
> - * the CPU resources equally.
> - */
> - group_score = class_sgs->sum_score / busy_cpus;
> -
> - group_score_without = (class_sgs->sum_score - class_sgs->min_score) /
> - (busy_cpus - 1);
> + if (busy_cpus > 1) {
> + before /= busy_cpus;
> + after /= busy_cpus;


However, I don't think this works when the sched group has more than one
busy CPU. 'before' and 'after' reflect the total throughput score of both
the sched group *and* the destination CPU.

One of the CPUs in the sched group will become idle after the balance.

Also, at this point we have already added score_on_dst_cpu. We are incorrectly
scaling it by the number of busy CPUs in the sched group.

We instead must scale 'after' by busy_cpus - 1 and then add score_on_dst_cpu.

> + }
>
> - sgs->task_class_score_after = group_score_without + score_on_dst_cpu;
> - sgs->task_class_score_before = group_score;
> + sgs->task_class_score_before = before;
> + sgs->task_class_score_after = after;
>

Thanks and BR,
Ricardo
>

2022-10-26 04:28:18

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 12/23] thermal: intel: hfi: Convert table_lock to use flags-handling variants

On Tue, Sep 27, 2022 at 01:36:53PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 27, 2022 at 01:34:07PM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 09, 2022 at 04:11:54PM -0700, Ricardo Neri wrote:
> >
> > > --- a/drivers/thermal/intel/intel_hfi.c
> > > +++ b/drivers/thermal/intel/intel_hfi.c
> > > @@ -175,9 +175,10 @@ static struct workqueue_struct *hfi_updates_wq;
> > > static void get_hfi_caps(struct hfi_instance *hfi_instance,
> > > struct thermal_genl_cpu_caps *cpu_caps)
> > > {
> > > + unsigned long flags;
> > > int cpu, i = 0;
> > >
> > > - raw_spin_lock_irq(&hfi_instance->table_lock);
> > > + raw_spin_lock_irqsave(&hfi_instance->table_lock, flags);
> > > for_each_cpu(cpu, hfi_instance->cpus) {
> > > struct hfi_cpu_data *caps;
> > > s16 index;
> >
> > ^^^^ Anti-pattern alert!
> >
> > Now your IRQ latency depends on nr_cpus -- which is a fair fail. The
> > existing code is already pretty crap in that it has the preemption
> > latency depend on nr_cpus.
> >
> > While I'm here looking at the HFI stuff, did they fix that HFI interrupt
> > broadcast mess already? Sending an interrupt to *all* CPUs is quite
> > insane.
>
> Anyway; given the existence of this here loop; why not have something
> like:
>
> DEFINE_PER_CPU(int, hfi_ipc_class);
>
> class = // extract from HFI mess
> WRITE_ONCE(per_cpu(hfi_ipc_class, cpu), class);
>
> And then have the tick use this_cpu_read(hfi_ipc_class)? No extra
> locking required.

Thanks Peter. I think this is a good solution. I will implement it.
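
For reference, a rough sketch of what I have in mind (the helper names here
are just placeholders):

static DEFINE_PER_CPU(int, hfi_ipc_class);

/* HFI update path: store the class parsed from the table for @cpu. */
static void hfi_set_ipc_class(int cpu, int class)
{
	WRITE_ONCE(per_cpu(hfi_ipc_class, cpu), class);
}

/* Tick path: reads the local CPU's copy, so no extra locking is needed. */
int intel_hfi_read_ipc_class(void)
{
	return this_cpu_read(hfi_ipc_class);
}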

BR,
Ricardo

2022-10-26 09:12:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 08/23] sched/fair: Compute task-class performance scores for load balancing

On Tue, Oct 25, 2022 at 08:57:24PM -0700, Ricardo Neri wrote:

> Do you want me to add your Signed-off-by and Co-developed-by tags?

Nah; who cares ;-)


> > @@ -8749,32 +8747,18 @@ static void compute_ilb_sg_task_class_sc
> > if (!busy_cpus)
> > return;
> >
> > - score_on_dst_cpu = arch_get_task_class_score(class_sgs->p_min_score->class,
> > - dst_cpu);
> > + score_on_dst_cpu = arch_get_task_class_score(sgcs->min_class, dst_cpu);
> >
> > - /*
> > - * The simpest case. The single busy CPU in the current group will
> > - * become idle after pulling its current task. The destination CPU is
> > - * idle.
> > - */
> > - if (busy_cpus == 1) {
> > - sgs->task_class_score_before = class_sgs->sum_score;
> > - sgs->task_class_score_after = score_on_dst_cpu;
> > - return;
> > - }
> > +	before = sgcs->sum_score;
> > + after = before - sgcs->min_score + score_on_dst_cpu;
>
> This works when the sched group being evaluated has only one busy CPU
> because it will become idle if the destination CPU (which was idle) pulls
> the current task.
>
> >
> > - /*
> > - * Now compute the group score with and without the task with the
> > - * lowest score. We assume that the tasks that remain in the group share
> > - * the CPU resources equally.
> > - */
> > - group_score = class_sgs->sum_score / busy_cpus;
> > -
> > - group_score_without = (class_sgs->sum_score - class_sgs->min_score) /
> > - (busy_cpus - 1);
> > + if (busy_cpus > 1) {
> > + before /= busy_cpus;
> > + after /= busy_cpus;
>
>
> However, I don't think this works when the sched group has more than one
> busy CPU. 'before' and 'after' reflect the total throughput score of both
> the sched group *and* the destination CPU.
>
> One of the CPUs in the sched group will become idle after the balance.
>
> Also, at this point we have already added score_on_dst_cpu. We are incorrectly
> scaling it by the number of busy CPUs in the sched group.
>
> We instead must scale 'after' by busy_cpus - 1 and then add score_on_dst_cpu.

So none of that makes sense.

'x/n + y' != '(x+y)/(n+1)'

IOW:

> > + }
> >
> > - sgs->task_class_score_after = group_score_without + score_on_dst_cpu;
> > - sgs->task_class_score_before = group_score;
> > + sgs->task_class_score_before = before;
> > + sgs->task_class_score_after = after;

your task_class_score_after is a sum value for 2 cpus worth, not a value
for a single cpu, while your task_class_score_before is a single cpu
average. You can't compare these numbers and have a sensible outcome.

If you have a number of values: x_1....x_n, their average is
Sum(x_1...x_n) / n, which is a single value again.

If you want to update one of the x's, say x_i->x'_i, then the average
changes like:

Sum(x_1...x_n-x_i+x'_i) / n

If you want to remove one of the x's, then you get:

Sum(x_1...x_n-x_i) / (n-1) ; 0<i<n

if you want to add an x:

Sum(x_1...x_n+x_i) / (n+1) ; i>n

Nowhere would you ever get:

Sum(x_1...x_n) / n + x_i

That's just straight up nonsense.
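
(To make that concrete with made-up numbers: take x = 30 spread over n = 2
CPUs and y = 10. Then 'x/n + y' is 25, while '(x+y)/(n+1)' is about 13.3;
the two expressions do not measure the same quantity.)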

So I might buy an argument for:

if (busy_cpus > 1) {
before /= (busy_cpus-1);
after /= (busy_cpus+1);
}

Or something along those lines (where you remove an entry from before
and add it to after), but not this.


2022-10-27 03:57:41

by Ricardo Neri

[permalink] [raw]
Subject: Re: [RFC PATCH 08/23] sched/fair: Compute task-class performance scores for load balancing

On Wed, Oct 26, 2022 at 10:55:11AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 25, 2022 at 08:57:24PM -0700, Ricardo Neri wrote:
>
> > Do you want me to add your Signed-off-by and Co-developed-by tags?
>
> Nah; who cares ;-)
>
>
> > > @@ -8749,32 +8747,18 @@ static void compute_ilb_sg_task_class_sc
> > > if (!busy_cpus)
> > > return;
> > >
> > > - score_on_dst_cpu = arch_get_task_class_score(class_sgs->p_min_score->class,
> > > - dst_cpu);
> > > + score_on_dst_cpu = arch_get_task_class_score(sgcs->min_class, dst_cpu);
> > >
> > > - /*
> > > - * The simpest case. The single busy CPU in the current group will
> > > - * become idle after pulling its current task. The destination CPU is
> > > - * idle.
> > > - */
> > > - if (busy_cpus == 1) {
> > > - sgs->task_class_score_before = class_sgs->sum_score;
> > > - sgs->task_class_score_after = score_on_dst_cpu;
> > > - return;
> > > - }
> > > +	before = sgcs->sum_score;
> > > + after = before - sgcs->min_score + score_on_dst_cpu;
> >
> > This works when the sched group being evaluated has only one busy CPU
> > because it will become idle if the destination CPU (which was idle) pulls
> > the current task.
> >
> > >
> > > - /*
> > > - * Now compute the group score with and without the task with the
> > > - * lowest score. We assume that the tasks that remain in the group share
> > > - * the CPU resources equally.
> > > - */
> > > - group_score = class_sgs->sum_score / busy_cpus;
> > > -
> > > - group_score_without = (class_sgs->sum_score - class_sgs->min_score) /
> > > - (busy_cpus - 1);
> > > + if (busy_cpus > 1) {
> > > + before /= busy_cpus;
> > > + after /= busy_cpus;
> >
> >
> > However, I don't think this works when the sched group has more than one
> > busy CPU. 'before' and 'after' reflect the total throughput score of both
> > the sched group *and* the destination CPU.
> >
> > One of the CPUs in the sched group will become idle after the balance.
> >
> > Also, at this point we have already added score_on_dst_cpu. We are incorrectly
> > scaling it by the number of busy CPUs in the sched group.
> >
> > We instead must scale 'after' by busy_cpus - 1 and then add score_on_dst_cpu.
>
> So none of that makes sense.
>
> 'x/n + y' != '(x+y)/(n+1)'
>
> IOW:
>
> > > + }
> > >
> > > - sgs->task_class_score_after = group_score_without + score_on_dst_cpu;
> > > - sgs->task_class_score_before = group_score;
> > > + sgs->task_class_score_before = before;
> > > + sgs->task_class_score_after = after;
>
> your task_class_score_after is a sum value for 2 cpus worth not a value
> for a single cpu,

Agreed.

> while your task_class_score_before is a single cpu
> average.

Agreed. You can also regard task_class_score_before as a value for 2 CPUs'
worth, except that the contribution of dst_cpu to the throughput is 0.

> You can't compare these numbers and have a sensible outcome.
>
> If you have a number of values: x_1....x_n, their average is
> Sum(x_1...x_n) / n, which is a single value again.
>
> If you want to update one of the x's, say x_i->x'_i, then the average
> changes like:
>
> Sum(x_1...x_n-x_i+x'_i) / n
>
> If you want to remove one of the x's, then you get:
>
> Sum(x_1...x_n-x_i) / (n-1) ; 0<i<n
>
> if you want to add an x:
>
> Sum(x_1...x_n+x_i) / (n+1) ; i>n
>
> Nowhere would you ever get:
>
> Sum(x_1...x_n) / n + x_i
>
> That's just straight up nonsense.

But we are not computing the average throughput. We are computing the
*total* throughput of two CPUs. Hence, what we need is the sum of the
throughput score of both CPUs.

We may be here because the sched group of sgcs is composed of SMT
siblings. Hence, we divide by busy_cpus assuming that all busy siblings
share the core resources evenly. (For non-SMT sched groups, busy_cpus is 1
at most).
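
As a made-up example: two busy SMT siblings whose current tasks score 10 and
20 share the core, so 'before' = (10 + 20) / 2 = 15, with dst_cpu contributing
0. If dst_cpu pulls the min-score task and would run it at, say, 12, the
remaining sibling gets the core to itself, so 'after' = 20 / 1 + 12 = 32.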

>
> So I might buy an argument for:
>
> if (busy_cpus > 1) {
> before /= (busy_cpus-1);
> after /= (busy_cpus+1);
> }
>
> Or something along those lines (where you remove an entry from before
> and add it to after), but not this.

The entry that we remove from 'before' will go to 'after'. The entry will be
placed in the local group. This group has a different busy_cpus: all of its
SMT siblings, if any, are idle (otherwise, we would not be here). Its
busy_cpus will become 1 after the balance.