Hi,
This is the third version of this patchset. Previous versions can be found
here [1] and here [2]. For brevity, I did not include the cover letter
from the original posting. You can read it here [1].
This patchset depends on a separate series that improves asym_packing
between SMT cores [3].
For convenience, this patchset and [3] can be retrieved from [4] and are
based on the tip tree as of Feb 6th, 2023.
Changes since v2:
Ionela pointed out that the definition of the IPCC performance score was
vague. I provided a clearer definition and guidance on how architectures
should implement support for it.
Ionela mentioned that other architectures or scheduling schemes may want
to use IPC classes differently. I restricted their current use to
asym_packing.
Lukasz raised the issue that hardware may not be ready to support IPC
classes early after boot. I added a new interface that drivers or
enablement code can call to enable the use of IPC classes when ready.
Vincent provided multiple suggestions on how to balance non-SMT and SMT
sched groups. His feedback was incorporated in [3]. As a result, now
IPCC statistics are also used to break ties between fully_busy groups.
Dietmar indicated that neither real-time nor deadline tasks should
influence CFS load balancing. I implemented that change. Also, per his
suggestion, I folded the IPCC statistics into the existing struct sg_lb_stats.
Updated patches: 2, 6, 7, 10, 13, 14, 17
New patches: 9, 20
Unchanged patches: 1, 3, 4, 5, 8, 11, 12, 15, 16, 18, 19, 21, 22, 23, 24
Hopefully, this series is one step closer to being merged.
Thanks in advance for your kind feedback!
BR,
Ricardo
[1]. https://lore.kernel.org/lkml/[email protected]/
[2]. https://lore.kernel.org/lkml/[email protected]/
[3]. https://lore.kernel.org/lkml/[email protected]/
[4]. https://github.com/ricardon/tip/tree/rneri/ipc_classes_v3
Ricardo Neri (24):
sched/task_struct: Introduce IPC classes of tasks
sched: Add interfaces for IPC classes
sched/core: Initialize the IPC class of a new task
sched/core: Add user_tick as argument to scheduler_tick()
sched/core: Update the IPC class of the current task
sched/fair: Collect load-balancing stats for IPC classes
sched/fair: Compute IPC class scores for load balancing
sched/fair: Use IPCC stats to break ties between asym_packing sched
groups
sched/fair: Use IPCC stats to break ties between fully_busy SMT groups
sched/fair: Use IPCC scores to select a busiest runqueue
thermal: intel: hfi: Introduce Intel Thread Director classes
x86/cpufeatures: Add the Intel Thread Director feature definitions
thermal: intel: hfi: Store per-CPU IPCC scores
thermal: intel: hfi: Update the IPC class of the current task
thermal: intel: hfi: Report the IPC class score of a CPU
thermal: intel: hfi: Define a default class for unclassified tasks
thermal: intel: hfi: Enable the Intel Thread Director
sched/task_struct: Add helpers for IPC classification
sched/core: Initialize helpers of task classification
sched/fair: Introduce sched_smt_siblings_idle()
thermal: intel: hfi: Implement model-specific checks for task
classification
x86/cpufeatures: Add feature bit for HRESET
x86/hreset: Configure history reset
x86/process: Reset hardware history in context switch
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/hreset.h | 30 +++
arch/x86/include/asm/msr-index.h | 6 +-
arch/x86/include/asm/topology.h | 8 +
arch/x86/kernel/cpu/common.c | 30 ++-
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
arch/x86/kernel/cpu/scattered.c | 1 +
arch/x86/kernel/process_32.c | 3 +
arch/x86/kernel/process_64.c | 3 +
drivers/thermal/intel/intel_hfi.c | 242 +++++++++++++++++-
include/linux/sched.h | 24 +-
include/linux/sched/topology.h | 6 +
init/Kconfig | 12 +
kernel/sched/core.c | 10 +-
kernel/sched/fair.c | 309 ++++++++++++++++++++++-
kernel/sched/sched.h | 66 +++++
kernel/sched/topology.c | 9 +
kernel/time/timer.c | 2 +-
19 files changed, 751 insertions(+), 21 deletions(-)
create mode 100644 arch/x86/include/asm/hreset.h
--
2.25.1
On hybrid processors, the architectural differences between the types of
CPUs lead to different instructions-per-cycle (IPC) on each type of CPU.
IPCs may differ further by the type of instructions. Instructions can be
grouped into classes of similar IPCs.
Hence, tasks can be classified into groups based on the type of
instructions they execute.
Add a new member task_struct::ipcc to associate a particular task to
an IPC class that depends on the instructions it executes.
The scheduler may use the IPC class of a task and data about the
performance among CPUs of a given IPC class to improve throughput. It
may, for instance, place certain classes of tasks on CPUs of higher
performance.
The methods to determine the classification of a task and its relative
IPC score are specific to each CPU architecture.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Changed the type of task_struct::ipcc to unsigned short. A subsequent
  patch uses bit fields, taking 9 bits for ipcc along with other auxiliary
  members.
Changes since v1:
* Renamed task_struct::class as task_struct::ipcc. (Joel)
* Use task_struct::ipcc = 0 for unclassified tasks. (PeterZ)
* Renamed CONFIG_SCHED_TASK_CLASSES as CONFIG_IPC_CLASSES. (PeterZ, Joel)
---
include/linux/sched.h | 10 ++++++++++
init/Kconfig | 12 ++++++++++++
2 files changed, 22 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4df2b3e76b30..98f84f90a01d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -127,6 +127,8 @@ struct task_group;
__TASK_TRACED | EXIT_DEAD | EXIT_ZOMBIE | \
TASK_PARKED)
+#define IPC_CLASS_UNCLASSIFIED 0
+
#define task_is_running(task) (READ_ONCE((task)->__state) == TASK_RUNNING)
#define task_is_traced(task) ((READ_ONCE(task->jobctl) & JOBCTL_TRACED) != 0)
@@ -1528,6 +1530,14 @@ struct task_struct {
union rv_task_monitor rv[RV_PER_TASK_MONITORS];
#endif
+#ifdef CONFIG_IPC_CLASSES
+ /*
+ * A hardware-defined classification of a task that reflects but is
+ * not identical to the number of instructions per cycle.
+ */
+ unsigned short ipcc;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/init/Kconfig b/init/Kconfig
index e76dc579cfa2..731a4b652030 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -867,6 +867,18 @@ config UCLAMP_BUCKETS_COUNT
If in doubt, use the default value.
+config IPC_CLASSES
+ bool "IPC classes of tasks"
+ depends on SMP
+ help
+ If selected, each task is assigned a classification value that
+ reflects the type of instructions that the task executes. This
+ classification reflects but is not equal to the number of
+ instructions retired per cycle.
+
+ The scheduler uses the classification value to improve the placement
+ of tasks.
+
endmenu
#
--
2.25.1
Add the interfaces that architectures shall implement to convey the data
to support IPC classes.
arch_update_ipcc() updates the IPC classification of the current task as
given by hardware.
arch_get_ipcc_score() provides a performance score for a given IPC class
when placed on a specific CPU. Higher scores indicate higher performance.
When a driver or equivalent enablement code has configured the necessary
hardware to support IPC classes, it should call sched_enable_ipc_classes()
to notify the scheduler that it can start using IPC classes data.
The number of classes and the score of each class of task are determined
by hardware.
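For illustration only, a minimal sketch of how an architecture might wire
these hooks and enable them. The example_* names are hypothetical:

  /* Hypothetical arch header (e.g., asm/topology.h): override the stubs. */
  void example_update_ipcc(struct task_struct *curr);
  unsigned long example_get_ipcc_score(unsigned short ipcc, int cpu);
  #define arch_update_ipcc	example_update_ipcc
  #define arch_get_ipcc_score	example_get_ipcc_score

  /* Hypothetical enablement code, run once the hardware is ready. */
  #include <linux/cpu.h>
  #include <linux/init.h>
  #include <linux/sched/topology.h>

  static int __init example_ipcc_init(void)
  {
	if (!example_hw_has_ipc_classes())	/* hypothetical check */
		return -ENODEV;

	/* The underlying static branch is flipped cpuslocked. */
	cpus_read_lock();
	sched_enable_ipc_classes();
	cpus_read_unlock();

	return 0;
  }
  late_initcall(example_ipcc_init);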
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Clarified the properties of the IPC score: abstract and linear. It can
  be normalized when needed. (Ionela)
* Selected a better default IPC score. (Ionela)
* Removed arch_has_ipc_classes(). It is not suitable for hardware that is
not ready to support IPC classes after boot. (Lukasz)
* Added a new sched_enable_ipc_classes() interface that drivers or
enablement code can call when ready to support IPC classes. (Lukasz)
Changes since v1:
* Shortened the names of the IPCC interfaces (PeterZ):
sched_task_classes_enabled >> sched_ipcc_enabled
arch_has_task_classes >> arch_has_ipc_classes
arch_update_task_class >> arch_update_ipcc
arch_get_task_class_score >> arch_get_ipcc_score
* Removed smt_siblings_idle argument from arch_update_ipcc(). (PeterZ)
---
include/linux/sched/topology.h | 6 ++++
kernel/sched/sched.h | 66 ++++++++++++++++++++++++++++++++++
kernel/sched/topology.c | 9 +++++
3 files changed, 81 insertions(+)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 816df6cc444e..5b084d3c9ad1 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -280,4 +280,10 @@ static inline int task_node(const struct task_struct *p)
return cpu_to_node(task_cpu(p));
}
+#ifdef CONFIG_IPC_CLASSES
+extern void sched_enable_ipc_classes(void);
+#else
+static inline void sched_enable_ipc_classes(void) { }
+#endif
+
#endif /* _LINUX_SCHED_TOPOLOGY_H */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1072502976df..0a9c3024326d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2532,6 +2532,72 @@ void arch_scale_freq_tick(void)
}
#endif
+#ifdef CONFIG_IPC_CLASSES
+DECLARE_STATIC_KEY_FALSE(sched_ipcc);
+
+static inline bool sched_ipcc_enabled(void)
+{
+ return static_branch_unlikely(&sched_ipcc);
+}
+
+#ifndef arch_update_ipcc
+/**
+ * arch_update_ipcc() - Update the IPC class of the current task
+ * @curr: The current task
+ *
+ * Request that the IPC classification of @curr is updated.
+ *
+ * Returns: none
+ */
+static __always_inline
+void arch_update_ipcc(struct task_struct *curr)
+{
+}
+#endif
+
+#ifndef arch_get_ipcc_score
+
+#define SCHED_IPCC_SCORE_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)
+/**
+ * arch_get_ipcc_score() - Get the IPC score of a class of task
+ * @ipcc: The IPC class
+ * @cpu: A CPU number
+ *
+ * The IPC performance score reflects (but is not identical to) the number
+ * of instructions retired per cycle for a given IPC class. It is a linear and
+ * abstract metric. Higher scores reflect better performance.
+ *
+ * The IPC score can be normalized with respect to the class, i, with the
+ * highest IPC score on the CPU, c, with highest performance:
+ *
+ * IPC(i, c)
+ * ------------------------------------ * SCHED_IPCC_SCORE_SCALE
+ * max(IPC(i, c) : (i, c))
+ *
+ * Scheduling schemes that want to use the IPC score along with other
+ * normalized metrics for scheduling (e.g., CPU capacity) may need to normalize
+ * it.
+ *
+ * Other scheduling schemes (e.g., asym_packing) do not need normalization.
+ *
+ * Returns the performance score of an IPC class, @ipcc, when running on @cpu.
+ * Error when either @ipcc or @cpu is invalid.
+ */
+static __always_inline
+unsigned long arch_get_ipcc_score(unsigned short ipcc, int cpu)
+{
+ return SCHED_IPCC_SCORE_SCALE;
+}
+#endif
+#else /* CONFIG_IPC_CLASSES */
+
+#define arch_get_ipcc_score(ipcc, cpu) (-EINVAL)
+#define arch_update_ipcc(curr)
+
+static inline bool sched_ipcc_enabled(void) { return false; }
+
+#endif /* CONFIG_IPC_CLASSES */
+
#ifndef arch_scale_freq_capacity
/**
* arch_scale_freq_capacity - get the frequency scale factor of a given CPU.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d93c3379e901..8380bb7f0cd9 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -670,6 +670,15 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
+#ifdef CONFIG_IPC_CLASSES
+DEFINE_STATIC_KEY_FALSE(sched_ipcc);
+
+void sched_enable_ipc_classes(void)
+{
+ static_branch_enable_cpuslocked(&sched_ipcc);
+}
+#endif
+
static void update_top_cache_domain(int cpu)
{
struct sched_domain_shared *sds = NULL;
--
2.25.1
New tasks shall start life as unclassified. They will be classified by
hardware when they run.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* None
Changes since v1:
* None
---
kernel/sched/core.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4580fe3e1d0c..ed1549d28090 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4428,6 +4428,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.prev_sum_exec_runtime = 0;
p->se.nr_migrations = 0;
p->se.vruntime = 0;
+#ifdef CONFIG_IPC_CLASSES
+ p->ipcc = IPC_CLASS_UNCLASSIFIED;
+#endif
INIT_LIST_HEAD(&p->se.group_node);
#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.25.1
Differentiate between user and kernel ticks so that the scheduler updates
the IPC class of the current task during the former.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Corrected error in the changeset description: the IPC class of the
current task is updated at user tick. (Dietmar)
Changes since v1:
* None
---
include/linux/sched.h | 2 +-
kernel/sched/core.c | 2 +-
kernel/time/timer.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 98f84f90a01d..10c6abdc3465 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -293,7 +293,7 @@ enum {
TASK_COMM_LEN = 16,
};
-extern void scheduler_tick(void);
+extern void scheduler_tick(bool user_tick);
#define MAX_SCHEDULE_TIMEOUT LONG_MAX
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ed1549d28090..39d218a2f243 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5555,7 +5555,7 @@ static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*/
-void scheduler_tick(void)
+void scheduler_tick(bool user_tick)
{
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 63a8ce7177dd..e15e24105891 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -2073,7 +2073,7 @@ void update_process_times(int user_tick)
if (in_irq())
irq_work_tick();
#endif
- scheduler_tick();
+ scheduler_tick(user_tick);
if (IS_ENABLED(CONFIG_POSIX_TIMERS))
run_posix_cpu_timers();
}
--
2.25.1
When supported, hardware monitors the instruction stream to classify the
current task. Hence, at user tick, we are ready to read the most
recent classification result for the current task.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* None
Changes since v1:
* Removed argument smt_siblings_idle from call to arch_ipcc_update().
* Used the new IPCC interfaces names.
---
kernel/sched/core.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 39d218a2f243..9f4e9cc16df8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5567,6 +5567,9 @@ void scheduler_tick(bool user_tick)
if (housekeeping_cpu(cpu, HK_TYPE_TICK))
arch_scale_freq_tick();
+ if (sched_ipcc_enabled() && user_tick)
+ arch_update_ipcc(curr);
+
sched_clock_tick();
rq_lock(rq, &rf);
--
2.25.1
When selecting a busiest scheduling group, the IPC class of the current
task can be used to select between two otherwise identical scheduling
groups of type asym_packing or fully_busy.
Compute the IPC class performance score for a scheduling group. It
is the sum of the scores of the current tasks of all the runqueues.
Also, keep track of the class of the task with the lowest IPC class score
in the scheduling group.
These two metrics will be used during idle load balancing to compute the
current and the prospective IPC class score of a scheduling group.
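As an illustration, with made-up scores: a scheduling group with two busy
CPUs whose current tasks score 380 and 600 gets sum_score = 980,
min_score = 380, and min_ipcc set to the class of the task scoring 380. A
subsequent patch uses these to estimate the score of the group after its
lowest-scoring task has been moved away.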
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Also excluded deadline and realtime tasks from IPCC stats. (Dietmar)
* Also excluded tasks that cannot run on the destination CPU from the
IPCC stats.
* Folded struct sg_lb_ipcc_stats into struct sg_lb_stats. (Dietmar)
* Reworded description sg_lb_stats::min_ipcc. (Ionela)
* Handle errors of arch_get_ipcc_score(). (Ionela)
Changes since v1:
* Implemented cleanups and reworks from PeterZ. Thanks!
* Used the new interface names.
---
kernel/sched/fair.c | 61 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0ada2d18b934..d773380a95b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8897,6 +8897,11 @@ struct sg_lb_stats {
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
#endif
+#ifdef CONFIG_IPC_CLASSES
+ unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
+ unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
+ unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
+#endif
};
/*
@@ -9240,6 +9245,59 @@ group_type group_classify(unsigned int imbalance_pct,
return group_has_spare;
}
+#ifdef CONFIG_IPC_CLASSES
+static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
+{
+ /* All IPCC stats have been set to zero in update_sg_lb_stats(). */
+ sgs->min_score = ULONG_MAX;
+}
+
+/* Called only if cpu_of(@rq) is not idle and has tasks running. */
+static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
+ struct rq *rq)
+{
+ struct task_struct *curr;
+ unsigned short ipcc;
+ unsigned long score;
+
+ if (!sched_ipcc_enabled())
+ return;
+
+ curr = rcu_dereference(rq->curr);
+ if (!curr || (curr->flags & PF_EXITING) || is_idle_task(curr) ||
+ task_is_realtime(curr) ||
+ !cpumask_test_cpu(dst_cpu, curr->cpus_ptr))
+ return;
+
+ ipcc = curr->ipcc;
+ score = arch_get_ipcc_score(ipcc, cpu_of(rq));
+
+ /*
+ * Ignore tasks with invalid scores. When finding the busiest group, we
+ * prefer those with higher sum_score. This group will not be selected.
+ */
+ if (IS_ERR_VALUE(score))
+ return;
+
+ sgs->sum_score += score;
+
+ if (score < sgs->min_score) {
+ sgs->min_score = score;
+ sgs->min_ipcc = ipcc;
+ }
+}
+
+#else /* CONFIG_IPC_CLASSES */
+static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
+ struct rq *rq)
+{
+}
+
+static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
+{
+}
+#endif /* CONFIG_IPC_CLASSES */
+
/**
* asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
* @dst_cpu: Destination CPU of the load balancing
@@ -9332,6 +9390,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
int i, nr_running, local_group;
memset(sgs, 0, sizeof(*sgs));
+ init_rq_ipcc_stats(sgs);
local_group = group == sds->local;
@@ -9381,6 +9440,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (sgs->group_misfit_task_load < load)
sgs->group_misfit_task_load = load;
}
+
+ update_sg_lb_ipcc_stats(env->dst_cpu, sgs, rq);
}
sgs->group_capacity = group->sgc->capacity;
--
2.25.1
Compute the total IPC class score, both current and prospective, of a
scheduling group and of the local scheduling group.
These IPCC statistics are used during idle load balancing. The candidate
scheduling group will have one fewer busy CPU after load balancing. This
observation is important for cores with SMT support.
The IPCC score of scheduling groups composed of SMT siblings needs to
consider that the siblings share CPU resources. When computing the total
IPCC score of the scheduling group, divide the score of each sibling by the
number of busy siblings.
Collect IPCC statistics for asym_packing and fully_busy scheduling groups.
When picking a busiest group, they are used to break ties between otherwise
identical groups.
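As an illustration, with made-up numbers: take an SMT group with two busy
siblings whose current tasks score 380 and 600 (sum_score = 980,
min_score = 380), and assume the lowest-scoring class would score 500 on
the destination CPU. Since the siblings share the core,
ipcc_score_before = 980 / 2 = 490. After load balancing one sibling
becomes idle, so ipcc_score_after = (980 - 380) / (2 - 1) + 500 = 1100.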
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Also collect IPCC stats for fully_busy sched groups.
* Restrict use of IPCC stats to SD_ASYM_PACKING. (Ionela)
* Handle errors of arch_get_ipcc_score(). (Ionela)
Changes since v1:
* Implemented cleanups and reworks from PeterZ. I took all his
suggestions, except the computation of the IPC score before and after
load balancing. We are computing not the average score, but the *total*.
* Check for the SD_SHARE_CPUCAPACITY to compute the throughput of the SMT
siblings of a physical core.
* Used the new interface names.
* Reworded commit message for clarity.
---
kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 68 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d773380a95b3..b6165aa8a376 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8901,6 +8901,8 @@ struct sg_lb_stats {
unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
+ long ipcc_score_after; /* Prospective IPCC score after load balancing */
+ unsigned long ipcc_score_before; /* IPCC score before load balancing */
#endif
};
@@ -9287,6 +9289,62 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
}
}
+static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
+ struct sched_group *sg,
+ struct lb_env *env)
+{
+ unsigned long score_on_dst_cpu, before;
+ int busy_cpus;
+ long after;
+
+ if (!sched_ipcc_enabled())
+ return;
+
+ /*
+ * IPCC scores are only useful during idle load balancing. For now,
+ * only asym_packing uses IPCC scores.
+ */
+ if (!(env->sd->flags & SD_ASYM_PACKING) ||
+ env->idle == CPU_NOT_IDLE)
+ return;
+
+ /*
+ * IPCC scores are used to break ties only between these types of
+ * groups.
+ */
+ if (sgs->group_type != group_fully_busy &&
+ sgs->group_type != group_asym_packing)
+ return;
+
+ busy_cpus = sgs->group_weight - sgs->idle_cpus;
+
+ /* No busy CPUs in the group. No tasks to move. */
+ if (!busy_cpus)
+ return;
+
+ score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
+
+ /*
+ * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
+ * and not used.
+ */
+ if (IS_ERR_VALUE(score_on_dst_cpu))
+ return;
+
+ before = sgs->sum_score;
+ after = before - sgs->min_score;
+
+ /* SMT siblings share throughput. */
+ if (busy_cpus > 1 && sg->flags & SD_SHARE_CPUCAPACITY) {
+ before /= busy_cpus;
+ /* One sibling will become idle after load balance. */
+ after /= busy_cpus - 1;
+ }
+
+ sgs->ipcc_score_after = after + score_on_dst_cpu;
+ sgs->ipcc_score_before = before;
+}
+
#else /* CONFIG_IPC_CLASSES */
static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
struct rq *rq)
@@ -9296,6 +9354,13 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
{
}
+
+static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
+ struct sched_group *sg,
+ struct lb_env *env)
+{
+}
+
#endif /* CONFIG_IPC_CLASSES */
/**
@@ -9457,6 +9522,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
+ if (!local_group)
+ update_sg_lb_stats_scores(sgs, group, env);
+
/* Computing avg_load makes sense only when group is overloaded */
if (sgs->group_type == group_overloaded)
sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
--
2.25.1
IPCC statistics are used during idle load balancing. After balancing,
one of the siblings of an SMT core will become idle and the rest of the
busy siblings will enjoy increased throughput. The IPCC statistics
provide a measure of this increased throughput. Use them to pick a
busiest group from otherwise identical fully_busy scheduling groups
(whose avg_load is equal, and zero).
Using IPCC scores to break ties with non-SMT fully_busy sched groups
is not necessary. SMT sched groups always need more help.
Add a stub sched_asym_ipcc_prefer() for !CONFIG_IPC_CLASSES.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Introduced this patch.
Changes since v1:
* N/A
---
kernel/sched/fair.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 841927b9b192..72d88270b320 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9415,6 +9415,12 @@ static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
{
}
+static bool sched_asym_ipcc_prefer(struct sg_lb_stats *a,
+ struct sg_lb_stats *b)
+{
+ return false;
+}
+
static bool sched_asym_ipcc_pick(struct sched_group *a,
struct sched_group *b,
struct sg_lb_stats *a_stats,
@@ -9698,10 +9704,21 @@ static bool update_sd_pick_busiest(struct lb_env *env,
if (sgs->avg_load == busiest->avg_load) {
/*
* SMT sched groups need more help than non-SMT groups.
- * If @sg happens to also be SMT, either choice is good.
*/
- if (sds->busiest->flags & SD_SHARE_CPUCAPACITY)
- return false;
+ if (sds->busiest->flags & SD_SHARE_CPUCAPACITY) {
+ if (!(sg->flags & SD_SHARE_CPUCAPACITY))
+ return false;
+
+ /*
+ * Between two SMT groups, use IPCC scores to pick the
+ * one that would improve throughput the most (only
+ * asym_packing uses IPCC scores for now).
+ */
+ if (sched_ipcc_enabled() &&
+ env->sd->flags & SD_ASYM_PACKING &&
+ sched_asym_ipcc_prefer(busiest, sgs))
+ return false;
+ }
}
break;
--
2.25.1
As it iterates, update_sd_pick_busiest() keeps selecting as busiest
sched groups of identical priority. Since both groups have the same
priority, either group is a good choice. The IPCC statistics provide a
measure of the throughput before and after load balance. Use them to
pick a busiest scheduling group from otherwise identical asym_packing
scheduling groups.
Pick as busiest the scheduling group that yields a higher IPCC score
after load balancing.
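As an illustration, with made-up numbers: between two otherwise identical
asym_packing groups with ipcc_score_after of 1100 and 900, pick the
former as busiest. If both scored 1100, pick the one with the lower
ipcc_score_before, as it stands to gain the most from load balancing.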
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* None
Changes since v1:
* Added a comment to clarify why sched_asym_prefer() needs a tie breaker
only in update_sd_pick_busiest(). (PeterZ)
* Renamed functions for accuracy:
sched_asym_class_prefer() >> sched_asym_ipcc_prefer()
sched_asym_class_pick() >> sched_asym_ipcc_pick()
* Reworded commit message for clarity.
---
kernel/sched/fair.c | 72 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 72 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b6165aa8a376..841927b9b192 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9345,6 +9345,60 @@ static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
sgs->ipcc_score_before = before;
}
+/**
+ * sched_asym_ipcc_prefer - Select a sched group based on its IPCC score
+ * @a: Load balancing statistics of a sched group
+ * @b: Load balancing statistics of a second sched group
+ *
+ * Returns: true if @a has a higher IPCC score than @b after load balance.
+ * False otherwise.
+ */
+static bool sched_asym_ipcc_prefer(struct sg_lb_stats *a,
+ struct sg_lb_stats *b)
+{
+ if (!sched_ipcc_enabled())
+ return false;
+
+ /* @a increases overall throughput after load balance. */
+ if (a->ipcc_score_after > b->ipcc_score_after)
+ return true;
+
+ /*
+ * If @a and @b yield the same overall throughput, pick @a if
+ * its current throughput is lower than that of @b.
+ */
+ if (a->ipcc_score_after == b->ipcc_score_after)
+ return a->ipcc_score_before < b->ipcc_score_before;
+
+ return false;
+}
+
+/**
+ * sched_asym_ipcc_pick - Select a sched group based on its IPCC score
+ * @a: A scheduling group
+ * @b: A second scheduling group
+ * @a_stats: Load balancing statistics of @a
+ * @b_stats: Load balancing statistics of @b
+ *
+ * Returns: true if @a has the same priority and @a has tasks with IPC classes
+ * that yield higher overall throughput after load balance. False otherwise.
+ */
+static bool sched_asym_ipcc_pick(struct sched_group *a,
+ struct sched_group *b,
+ struct sg_lb_stats *a_stats,
+ struct sg_lb_stats *b_stats)
+{
+ /*
+ * Only use the class-specific preference selection if both sched
+ * groups have the same priority.
+ */
+ if (arch_asym_cpu_priority(a->asym_prefer_cpu) !=
+ arch_asym_cpu_priority(b->asym_prefer_cpu))
+ return false;
+
+ return sched_asym_ipcc_prefer(a_stats, b_stats);
+}
+
#else /* CONFIG_IPC_CLASSES */
static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
struct rq *rq)
@@ -9361,6 +9415,14 @@ static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
{
}
+static bool sched_asym_ipcc_pick(struct sched_group *a,
+ struct sched_group *b,
+ struct sg_lb_stats *a_stats,
+ struct sg_lb_stats *b_stats)
+{
+ return false;
+}
+
#endif /* CONFIG_IPC_CLASSES */
/**
@@ -9596,6 +9658,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
/* Prefer to move from lowest priority CPU's work */
if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
return false;
+
+ /*
+ * Unlike other callers of sched_asym_prefer(), here both @sg
+ * and @sds::busiest have tasks running. When they have equal
+ * priority, their IPC class scores can be used to select a
+ * better busiest.
+ */
+ if (sched_asym_ipcc_pick(sds->busiest, sg, &sds->busiest_stat, sgs))
+ return false;
+
break;
case group_misfit_task:
--
2.25.1
For two runqueues of equal priority and an equal number of running tasks,
select the one whose current task would have the highest IPC class score
if placed on the destination CPU.
For now, use IPCC scores only for scheduling domains with the
SD_ASYM_PACKING flag.
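As an illustration, with made-up scores: if two runqueues both have three
running tasks, and rq::curr would score 900 on the destination CPU versus
500 on its current CPU (delta = 400) while busiest::curr only gets a
delta of 150, then rq replaces busiest as the busiest queue.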
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Only use IPCC scores to break ties if the sched domain uses
asym_packing. (Ionela)
* Handle errors of arch_get_ipcc_score(). (Ionela)
Changes since v1:
* Fixed a bug when selecting a busiest runqueue: when comparing two
runqueues with equal nr_running, we must compute the IPCC score delta
of both.
* Renamed local variables to improve the layout of the code block.
(PeterZ)
* Used the new interface names.
---
kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 64 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 72d88270b320..d3c22dc145f7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9399,6 +9399,37 @@ static bool sched_asym_ipcc_pick(struct sched_group *a,
return sched_asym_ipcc_prefer(a_stats, b_stats);
}
+/**
+ * ipcc_score_delta - Get the IPCC score delta wrt the load balance's dst_cpu
+ * @p: A task
+ * @env: Load balancing environment
+ *
+ * Returns: The IPCC score delta that @p would get if placed in the destination
+ * CPU of @env. LONG_MIN to indicate that the delta should not be used.
+ */
+static long ipcc_score_delta(struct task_struct *p, struct lb_env *env)
+{
+ unsigned long score_src, score_dst;
+ unsigned short ipcc = p->ipcc;
+
+ if (!sched_ipcc_enabled())
+ return LONG_MIN;
+
+ /* Only asym_packing uses IPCC scores at the moment. */
+ if (!(env->sd->flags & SD_ASYM_PACKING))
+ return LONG_MIN;
+
+ score_dst = arch_get_ipcc_score(ipcc, env->dst_cpu);
+ if (IS_ERR_VALUE(score_dst))
+ return LONG_MIN;
+
+ score_src = arch_get_ipcc_score(ipcc, task_cpu(p));
+ if (IS_ERR_VALUE(score_src))
+ return LONG_MIN;
+
+ return score_dst - score_src;
+}
+
#else /* CONFIG_IPC_CLASSES */
static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
struct rq *rq)
@@ -9429,6 +9460,11 @@ static bool sched_asym_ipcc_pick(struct sched_group *a,
return false;
}
+static long ipcc_score_delta(struct task_struct *p, struct lb_env *env)
+{
+ return LONG_MIN;
+}
+
#endif /* CONFIG_IPC_CLASSES */
/**
@@ -10589,6 +10625,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
{
struct rq *busiest = NULL, *rq;
unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
+ long busiest_ipcc_delta = LONG_MIN;
unsigned int busiest_nr = 0;
int i;
@@ -10705,8 +10742,35 @@ static struct rq *find_busiest_queue(struct lb_env *env,
case migrate_task:
if (busiest_nr < nr_running) {
+ struct task_struct *curr;
+
busiest_nr = nr_running;
busiest = rq;
+
+ /*
+ * Remember the IPCC score delta of busiest::curr.
+ * We may need it to break a tie with other queues
+ * with equal nr_running.
+ */
+ curr = rcu_dereference(busiest->curr);
+ busiest_ipcc_delta = ipcc_score_delta(curr, env);
+ /*
+ * If rq and busiest have the same number of running
+ * tasks and IPC classes are supported, pick rq if doing
+ * so would give rq::curr a bigger IPC boost on dst_cpu.
+ */
+ } else if (busiest_nr == nr_running) {
+ struct task_struct *curr;
+ long delta;
+
+ curr = rcu_dereference(rq->curr);
+ delta = ipcc_score_delta(curr, env);
+
+ if (busiest_ipcc_delta < delta) {
+ busiest_ipcc_delta = delta;
+ busiest_nr = nr_running;
+ busiest = rq;
+ }
}
break;
--
2.25.1
Intel Thread Director (ITD) provides hardware resources to classify
the current task. The classification reflects the type of instructions that
a task currently executes.
ITD extends the Hardware Feedback Interface table to provide performance
and energy efficiency capabilities for each of the supported classes of
tasks.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* None
Changes since v1:
* Removed dependency on CONFIG_INTEL_THREAD_DIRECTOR. Instead, depend on
CONFIG_IPC_CLASSES.
* Added DISABLE_ITD to the correct DISABLE_MASK: 14 instead of 13.
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +++++++-
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
3 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index fdb8e09234ba..8a6261a5dbbf 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -349,6 +349,7 @@
#define X86_FEATURE_HWP_EPP (14*32+10) /* HWP Energy Perf. Preference */
#define X86_FEATURE_HWP_PKG_REQ (14*32+11) /* HWP Package Level Request */
#define X86_FEATURE_HFI (14*32+19) /* Hardware Feedback Interface */
+#define X86_FEATURE_ITD (14*32+23) /* Intel Thread Director */
/* AMD SVM Feature Identification, CPUID level 0x8000000a (EDX), word 15 */
#define X86_FEATURE_NPT (15*32+ 0) /* Nested Page Table support */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5dfa4fb76f4b..f8e145a8c5dd 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -99,6 +99,12 @@
# define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
#endif
+#ifdef CONFIG_IPC_CLASSES
+# define DISABLE_ITD 0
+#else
+# define DISABLE_ITD (1 << (X86_FEATURE_ITD & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -117,7 +123,7 @@
DISABLE_CALL_DEPTH_TRACKING)
#define DISABLED_MASK12 0
#define DISABLED_MASK13 0
-#define DISABLED_MASK14 0
+#define DISABLED_MASK14 (DISABLE_ITD)
#define DISABLED_MASK15 0
#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
DISABLE_ENQCMD)
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index f6748c8bd647..7a87b823eef3 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -81,6 +81,7 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_XFD, X86_FEATURE_XSAVES },
{ X86_FEATURE_XFD, X86_FEATURE_XGETBV1 },
{ X86_FEATURE_AMX_TILE, X86_FEATURE_XFD },
+ { X86_FEATURE_ITD, X86_FEATURE_HFI },
{}
};
--
2.25.1
On Intel hybrid parts, each type of CPU has specific performance and
energy efficiency capabilities. The Intel Thread Director technology
extends the Hardware Feedback Interface (HFI) to provide performance and
energy efficiency data for advanced classes of instructions.
Add support to parse per-class capabilities.
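As an illustration, with hypothetical values: a table with two capability
fields (performance and energy efficiency, one byte each) and four
classes packs 2 * 4 = 8 bytes per logical processor. Hence, both hdr_size
and cpu_stride become DIV_ROUND_UP(8, 8) * 8 = 8 bytes, and class_stride
is 2.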
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* None
Changes since v1:
* Removed a now obsolete comment.
---
drivers/thermal/intel/intel_hfi.c | 30 ++++++++++++++++++++++++------
1 file changed, 24 insertions(+), 6 deletions(-)
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index 6e604bda2b93..2527ae3836c7 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -77,7 +77,7 @@ union cpuid6_edx {
* @ee_cap: Energy efficiency capability
*
* Capabilities of a logical processor in the HFI table. These capabilities are
- * unitless.
+ * unitless and specific to each HFI class.
*/
struct hfi_cpu_data {
u8 perf_cap;
@@ -89,7 +89,8 @@ struct hfi_cpu_data {
* @perf_updated: Hardware updated performance capabilities
* @ee_updated: Hardware updated energy efficiency capabilities
*
- * Properties of the data in an HFI table.
+ * Properties of the data in an HFI table. There is one header per HFI
+ * class.
*/
struct hfi_hdr {
u8 perf_updated;
@@ -127,16 +128,21 @@ struct hfi_instance {
/**
* struct hfi_features - Supported HFI features
+ * @nr_classes: Number of classes supported
* @nr_table_pages: Size of the HFI table in 4KB pages
* @cpu_stride: Stride size to locate the capability data of a logical
* processor within the table (i.e., row stride)
+ * @class_stride: Stride size to locate a class within the capability
+ * data of a logical processor or the HFI table header
* @hdr_size: Size of the table header
*
* Parameters and supported features that are common to all HFI instances
*/
struct hfi_features {
+ unsigned int nr_classes;
size_t nr_table_pages;
unsigned int cpu_stride;
+ unsigned int class_stride;
unsigned int hdr_size;
};
@@ -333,8 +339,8 @@ static void init_hfi_cpu_index(struct hfi_cpu_info *info)
}
/*
- * The format of the HFI table depends on the number of capabilities that the
- * hardware supports. Keep a data structure to navigate the table.
+ * The format of the HFI table depends on the number of capabilities and classes
+ * that the hardware supports. Keep a data structure to navigate the table.
*/
static void init_hfi_instance(struct hfi_instance *hfi_instance)
{
@@ -515,18 +521,30 @@ static __init int hfi_parse_features(void)
/* The number of 4KB pages required by the table */
hfi_features.nr_table_pages = edx.split.table_pages + 1;
+ /*
+ * Capability fields of an HFI class are grouped together. Classes are
+ * contiguous in memory. Hence, use the number of supported features to
+ * locate a specific class.
+ */
+ hfi_features.class_stride = nr_capabilities;
+
+ /* For now, use only one class of the HFI table */
+ hfi_features.nr_classes = 1;
+
/*
* The header contains change indications for each supported feature.
* The size of the table header is rounded up to be a multiple of 8
* bytes.
*/
- hfi_features.hdr_size = DIV_ROUND_UP(nr_capabilities, 8) * 8;
+ hfi_features.hdr_size = DIV_ROUND_UP(nr_capabilities *
+ hfi_features.nr_classes, 8) * 8;
/*
* Data of each logical processor is also rounded up to be a multiple
* of 8 bytes.
*/
- hfi_features.cpu_stride = DIV_ROUND_UP(nr_capabilities, 8) * 8;
+ hfi_features.cpu_stride = DIV_ROUND_UP(nr_capabilities *
+ hfi_features.nr_classes, 8) * 8;
return 0;
}
--
2.25.1
Use Intel Thread Director classification to update the IPC class of a
task. Implement the arch_update_ipcc() interface of the scheduler.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Removed the implementation of arch_has_ipc_classes().
Changes since v1:
* Adjusted the result the classification of Intel Thread Director to start
at class 1. Class 0 for the scheduler means that the task is
unclassified.
* Redefined union hfi_thread_feedback_char_msr to ensure all
bit-fields are packed. (PeterZ)
* Removed CONFIG_INTEL_THREAD_DIRECTOR. (PeterZ)
* Shortened the names of the functions that implement IPC classes.
* Removed argument smt_siblings_idle from intel_hfi_update_ipcc().
(PeterZ)
---
arch/x86/include/asm/topology.h | 6 ++++++
drivers/thermal/intel/intel_hfi.c | 32 +++++++++++++++++++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 458c891a8273..ffcdac3f398f 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -227,4 +227,10 @@ void init_freq_invariance_cppc(void);
#define arch_init_invariance_cppc init_freq_invariance_cppc
#endif
+#if defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL)
+void intel_hfi_update_ipcc(struct task_struct *curr);
+
+#define arch_update_ipcc intel_hfi_update_ipcc
+#endif /* defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL) */
+
#endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index b06021828892..530dcf57e06e 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -72,6 +72,17 @@ union cpuid6_edx {
u32 full;
};
+#ifdef CONFIG_IPC_CLASSES
+union hfi_thread_feedback_char_msr {
+ struct {
+ u64 classid : 8;
+ u64 __reserved : 55;
+ u64 valid : 1;
+ } split;
+ u64 full;
+};
+#endif
+
/**
* struct hfi_cpu_data - HFI capabilities per CPU
* @perf_cap: Performance capability
@@ -174,6 +185,27 @@ static struct workqueue_struct *hfi_updates_wq;
#ifdef CONFIG_IPC_CLASSES
static int __percpu *hfi_ipcc_scores;
+void intel_hfi_update_ipcc(struct task_struct *curr)
+{
+ union hfi_thread_feedback_char_msr msr;
+
+ /* We should not be here if ITD is not supported. */
+ if (!cpu_feature_enabled(X86_FEATURE_ITD)) {
+ pr_warn_once("task classification requested but not supported!");
+ return;
+ }
+
+ rdmsrl(MSR_IA32_HW_FEEDBACK_CHAR, msr.full);
+ if (!msr.split.valid)
+ return;
+
+ /*
+ * 0 is a valid classification for Intel Thread Director. A scheduler
+ * IPCC class of 0 means that the task is unclassified. Adjust.
+ */
+ curr->ipcc = msr.split.classid + 1;
+}
+
static int alloc_hfi_ipcc_scores(void)
{
if (!cpu_feature_enabled(X86_FEATURE_ITD))
--
2.25.1
The scheduler reads the IPCC scores when balancing load. These reads can
be quite frequent. Hardware can also update the HFI table frequently.
Concurrent access may cause significant lock contention, which gets
worse as the number of CPUs increases.
Instead, create separate per-CPU IPCC scores that the scheduler can read
without the HFI table lock.
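A sketch of the access pattern (the actual accessors appear in this and a
later patch): the HFI update path, already serialized by the table lock,
publishes each score with WRITE_ONCE(), and the scheduler reads it
locklessly with READ_ONCE():

	/* writer, under hfi_instance->table_lock */
	WRITE_ONCE(hfi_class[i], class_caps->perf_cap);

	/* lockless reader */
	score = READ_ONCE(scores[hfi_class]);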
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Only create these per-CPU variables when Intel Thread Director is
supported.
Changes since v1:
* Added this patch.
---
drivers/thermal/intel/intel_hfi.c | 46 +++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index 2527ae3836c7..b06021828892 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -29,6 +29,7 @@
#include <linux/kernel.h>
#include <linux/math.h>
#include <linux/mutex.h>
+#include <linux/percpu.h>
#include <linux/percpu-defs.h>
#include <linux/printk.h>
#include <linux/processor.h>
@@ -170,6 +171,43 @@ static struct workqueue_struct *hfi_updates_wq;
#define HFI_UPDATE_INTERVAL HZ
#define HFI_MAX_THERM_NOTIFY_COUNT 16
+#ifdef CONFIG_IPC_CLASSES
+static int __percpu *hfi_ipcc_scores;
+
+static int alloc_hfi_ipcc_scores(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_ITD))
+ return 0;
+
+ hfi_ipcc_scores = __alloc_percpu(sizeof(*hfi_ipcc_scores) *
+ hfi_features.nr_classes,
+ sizeof(*hfi_ipcc_scores));
+
+ return !hfi_ipcc_scores;
+}
+
+static void set_hfi_ipcc_score(void *caps, int cpu)
+{
+ int i, *hfi_class;
+
+ if (!cpu_feature_enabled(X86_FEATURE_ITD))
+ return;
+
+ hfi_class = per_cpu_ptr(hfi_ipcc_scores, cpu);
+
+ for (i = 0; i < hfi_features.nr_classes; i++) {
+ struct hfi_cpu_data *class_caps;
+
+ class_caps = caps + i * hfi_features.class_stride;
+ WRITE_ONCE(hfi_class[i], class_caps->perf_cap);
+ }
+}
+
+#else
+static int alloc_hfi_ipcc_scores(void) { return 0; }
+static void set_hfi_ipcc_score(void *caps, int cpu) { }
+#endif /* CONFIG_IPC_CLASSES */
+
static void get_hfi_caps(struct hfi_instance *hfi_instance,
struct thermal_genl_cpu_caps *cpu_caps)
{
@@ -192,6 +230,8 @@ static void get_hfi_caps(struct hfi_instance *hfi_instance,
cpu_caps[i].efficiency = caps->ee_cap << 2;
++i;
+
+ set_hfi_ipcc_score(caps, cpu);
}
raw_spin_unlock_irq(&hfi_instance->table_lock);
}
@@ -580,8 +620,14 @@ void __init intel_hfi_init(void)
if (!hfi_updates_wq)
goto err_nomem;
+ if (alloc_hfi_ipcc_scores())
+ goto err_ipcc;
+
return;
+err_ipcc:
+ destroy_workqueue(hfi_updates_wq);
+
err_nomem:
for (j = 0; j < i; ++j) {
hfi_instance = &hfi_instances[j];
--
2.25.1
A task may be unclassified if it has been recently created, spends most
of its lifetime sleeping, or if hardware has not provided a
classification.
Most tasks will eventually be classified as the scheduler's IPC class 1
(HFI class 0). This class corresponds to the capabilities in the legacy,
classless HFI table.
IPC class 1 is a reasonable choice until hardware provides an actual
classification. Meanwhile, the scheduler will place classes of tasks with
higher IPC scores on higher-performance CPUs.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* None
Changes since v1:
* Now the default class is 1.
---
drivers/thermal/intel/intel_hfi.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index fa9b4a678d92..7ea6acce7107 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -185,6 +185,19 @@ static struct workqueue_struct *hfi_updates_wq;
#ifdef CONFIG_IPC_CLASSES
static int __percpu *hfi_ipcc_scores;
+/*
+ * A task may be unclassified if it has been recently created, spends most of
+ * its lifetime sleeping, or if hardware has not provided a classification.
+ *
+ * Most tasks will eventually be classified as the scheduler's IPC class 1
+ * (HFI class 0). Meanwhile, the scheduler will place classes of tasks with higher
+ * IPC scores on higher-performance CPUs.
+ *
+ * IPC class 1 is a reasonable choice. It matches the performance capability
+ * of the legacy, classless, HFI table.
+ */
+#define HFI_UNCLASSIFIED_DEFAULT 1
+
void intel_hfi_update_ipcc(struct task_struct *curr)
{
union hfi_thread_feedback_char_msr msr;
@@ -215,7 +228,7 @@ unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
return -EINVAL;
if (ipcc == IPC_CLASS_UNCLASSIFIED)
- return -EINVAL;
+ ipcc = HFI_UNCLASSIFIED_DEFAULT;
/*
* Scheduler IPC classes start at 1. HFI classes start at 0.
--
2.25.1
Implement the arch_get_ipcc_score() interface of the scheduler. Use the
performance capabilities of the extended Hardware Feedback Interface table
as the IPC score.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* None
Changes since v1:
* Adjusted the returned HFI class (which starts at 0) to match the
scheduler IPCC class (which starts at 1). (PeterZ)
* Used the new interface names.
---
arch/x86/include/asm/topology.h | 2 ++
drivers/thermal/intel/intel_hfi.c | 27 +++++++++++++++++++++++++++
2 files changed, 29 insertions(+)
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index ffcdac3f398f..c4fcd9c3c634 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -229,8 +229,10 @@ void init_freq_invariance_cppc(void);
#if defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL)
void intel_hfi_update_ipcc(struct task_struct *curr);
+unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu);
#define arch_update_ipcc intel_hfi_update_ipcc
+#define arch_get_ipcc_score intel_hfi_get_ipcc_score
#endif /* defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL) */
#endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index 530dcf57e06e..fa9b4a678d92 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -206,6 +206,33 @@ void intel_hfi_update_ipcc(struct task_struct *curr)
curr->ipcc = msr.split.classid + 1;
}
+unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
+{
+ unsigned short hfi_class;
+ int *scores;
+
+ if (cpu < 0 || cpu >= nr_cpu_ids)
+ return -EINVAL;
+
+ if (ipcc == IPC_CLASS_UNCLASSIFIED)
+ return -EINVAL;
+
+ /*
+ * Scheduler IPC classes start at 1. HFI classes start at 0.
+ * See the note in intel_hfi_update_ipcc().
+ */
+ hfi_class = ipcc - 1;
+
+ if (hfi_class >= hfi_features.nr_classes)
+ return -EINVAL;
+
+ scores = per_cpu_ptr(hfi_ipcc_scores, cpu);
+ if (!scores)
+ return -ENODEV;
+
+ return READ_ONCE(scores[hfi_class]);
+}
+
static int alloc_hfi_ipcc_scores(void)
{
if (!cpu_feature_enabled(X86_FEATURE_ITD))
--
2.25.1
Just as tasks start life unclassified, initialize the classification
auxiliary variables.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* None
Changes since v1:
* None
---
kernel/sched/core.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9f4e9cc16df8..71b4af7ae496 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4430,6 +4430,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.vruntime = 0;
#ifdef CONFIG_IPC_CLASSES
p->ipcc = IPC_CLASS_UNCLASSIFIED;
+ p->ipcc_tmp = IPC_CLASS_UNCLASSIFIED;
+ p->ipcc_cntr = 0;
#endif
INIT_LIST_HEAD(&p->se.group_node);
--
2.25.1
Enable Intel Thread Director from the CPU hotplug callback: globally from
CPU0 and then enable the thread-classification hardware in each logical
processor individually.
Also, initialize the number of classes supported.
Let the scheduler know that it can start using IPC classes.
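Concretely, on ITD-capable parts each logical CPU sets
HW_FEEDBACK_THREAD_CONFIG_ENABLE_BIT (bit 0) in
MSR_IA32_HW_FEEDBACK_THREAD_CONFIG, and the package-level write to
MSR_IA32_HW_FEEDBACK_CONFIG sets both HW_FEEDBACK_CONFIG_HFI_ENABLE_BIT
(bit 0) and HW_FEEDBACK_CONFIG_ITD_ENABLE_BIT (bit 1).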
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* Use the new sched_enable_ipc_classes() interface to enable the use of
IPC classes in the scheduler.
Changes since v1:
* None
---
arch/x86/include/asm/msr-index.h | 2 ++
drivers/thermal/intel/intel_hfi.c | 40 +++++++++++++++++++++++++++++--
2 files changed, 40 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index ad35355ee43e..0ea25cc9c621 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1106,6 +1106,8 @@
/* Hardware Feedback Interface */
#define MSR_IA32_HW_FEEDBACK_PTR 0x17d0
#define MSR_IA32_HW_FEEDBACK_CONFIG 0x17d1
+#define MSR_IA32_HW_FEEDBACK_THREAD_CONFIG 0x17d4
+#define MSR_IA32_HW_FEEDBACK_CHAR 0x17d2
/* x2APIC locked status */
#define MSR_IA32_XAPIC_DISABLE_STATUS 0xBD
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index 7ea6acce7107..35d947f47550 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -48,6 +48,8 @@
/* Hardware Feedback Interface MSR configuration bits */
#define HW_FEEDBACK_PTR_VALID_BIT BIT(0)
#define HW_FEEDBACK_CONFIG_HFI_ENABLE_BIT BIT(0)
+#define HW_FEEDBACK_CONFIG_ITD_ENABLE_BIT BIT(1)
+#define HW_FEEDBACK_THREAD_CONFIG_ENABLE_BIT BIT(0)
/* CPUID detection and enumeration definitions for HFI */
@@ -72,6 +74,15 @@ union cpuid6_edx {
u32 full;
};
+union cpuid6_ecx {
+ struct {
+ u32 dont_care0:8;
+ u32 nr_classes:8;
+ u32 dont_care1:16;
+ } split;
+ u32 full;
+};
+
#ifdef CONFIG_IPC_CLASSES
union hfi_thread_feedback_char_msr {
struct {
@@ -506,6 +517,11 @@ void intel_hfi_online(unsigned int cpu)
init_hfi_cpu_index(info);
+ if (cpu_feature_enabled(X86_FEATURE_ITD)) {
+ msr_val = HW_FEEDBACK_THREAD_CONFIG_ENABLE_BIT;
+ wrmsrl(MSR_IA32_HW_FEEDBACK_THREAD_CONFIG, msr_val);
+ }
+
/*
* Now check if the HFI instance of the package/die of @cpu has been
* initialized (by checking its header). In such case, all we have to
@@ -561,8 +577,22 @@ void intel_hfi_online(unsigned int cpu)
*/
rdmsrl(MSR_IA32_HW_FEEDBACK_CONFIG, msr_val);
msr_val |= HW_FEEDBACK_CONFIG_HFI_ENABLE_BIT;
+
+ if (cpu_feature_enabled(X86_FEATURE_ITD))
+ msr_val |= HW_FEEDBACK_CONFIG_ITD_ENABLE_BIT;
+
wrmsrl(MSR_IA32_HW_FEEDBACK_CONFIG, msr_val);
+ /*
+ * We have all we need to support IPC classes. Task classification is
+ * now working.
+ *
+ * All class scores are zero until after the first HFI update. That is
+ * OK. The scheduler queries these scores at every load balance.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_ITD))
+ sched_enable_ipc_classes();
+
unlock:
mutex_unlock(&hfi_instance_lock);
return;
@@ -640,8 +670,14 @@ static __init int hfi_parse_features(void)
*/
hfi_features.class_stride = nr_capabilities;
- /* For now, use only one class of the HFI table */
- hfi_features.nr_classes = 1;
+ if (cpu_feature_enabled(X86_FEATURE_ITD)) {
+ union cpuid6_ecx ecx;
+
+ ecx.full = cpuid_ecx(CPUID_HFI_LEAF);
+ hfi_features.nr_classes = ecx.split.nr_classes;
+ } else {
+ hfi_features.nr_classes = 1;
+ }
/*
* The header contains change indications for each supported feature.
--
2.25.1
The unprocessed classification that hardware provides for a task may not
be usable by the scheduler: the classification may change too frequently or
architectures may want to consider extra factors. For instance, some
processors with Intel Thread Director need to consider the state of the SMT
siblings of a core.
Provide per-task helper variables that architectures can use to post-
process the classification that hardware provides.
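The three bit-fields added below are sized 9 + 9 + 14 = 32 bits, so all
of the IPC class data still fits in a single 4-byte word.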
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* None
Changes since v1:
* Used bit-fields to fit all the IPC class data in 4 bytes. (PeterZ)
* Shortened names of the helpers.
* Renamed helpers with the ipcc_ prefix.
* Reworded commit message for clarity
---
include/linux/sched.h | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 10c6abdc3465..45f28a601b3d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1535,7 +1535,17 @@ struct task_struct {
* A hardware-defined classification of task that reflects but is
* not identical to the number of instructions per cycle.
*/
- unsigned short ipcc;
+ unsigned int ipcc : 9;
+ /*
+ * A candidate classification that arch-specific implementations
+ * qualify for correctness.
+ */
+ unsigned int ipcc_tmp : 9;
+ /*
+ * Counter to filter out transient candidate classifications
+ * of a task.
+ */
+ unsigned int ipcc_cntr : 14;
#endif
/*
--
2.25.1
In Alder Lake and Raptor Lake, the result of thread classification is more
accurate when only one SMT sibling is busy. Classification results for
class 2 and 3 are always reliable.
To avoid unnecessary migrations, only update the class of a task if it has
been the same during 4 consecutive user ticks.
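For example, a task whose hardware classification reads 2, 2, 2, 2 on
four consecutive user ticks has its class updated on the fourth tick. A
reading of 2, 2, 3, 2 restarts the count at the 3, since the debounce
counter resets whenever the candidate class changes.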
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* None
Changes since v1:
* Adjusted the result of the classification of Intel Thread Director to start
at class 1. Class 0 for the scheduler means that the task is
unclassified.
* Used the new names of the IPC classes members in task_struct.
* Reworked helper functions to use sched_smt_siblings_idle() to query
the idle state of the SMT siblings of a CPU.
---
drivers/thermal/intel/intel_hfi.c | 60 ++++++++++++++++++++++++++++++-
1 file changed, 59 insertions(+), 1 deletion(-)
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index 35d947f47550..fdb53e4cabc1 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -40,6 +40,7 @@
#include <linux/workqueue.h>
#include <asm/msr.h>
+#include <asm/intel-family.h>
#include "../thermal_core.h"
#include "intel_hfi.h"
@@ -209,9 +210,64 @@ static int __percpu *hfi_ipcc_scores;
*/
#define HFI_UNCLASSIFIED_DEFAULT 1
+#define CLASS_DEBOUNCER_SKIPS 4
+
+/**
+ * debounce_and_update_class() - Process and update a task's classification
+ *
+ * @p: The task whose classification will be updated
+ * @new_ipcc: The new IPC classification
+ *
+ * Update the classification of @p with the new value that hardware provides.
+ * Only update the classification of @p if it has been the same during
+ * CLASS_DEBOUNCER_SKIPS consecutive ticks.
+ */
+static void debounce_and_update_class(struct task_struct *p, u8 new_ipcc)
+{
+ u16 debounce_skip;
+
+ /* The class of @p changed. Only restart the debounce counter. */
+ if (p->ipcc_tmp != new_ipcc) {
+ p->ipcc_cntr = 1;
+ goto out;
+ }
+
+ /*
+ * The class of @p did not change. Update it if it has been the same
+ * for CLASS_DEBOUNCER_SKIPS user ticks.
+ */
+ debounce_skip = p->ipcc_cntr + 1;
+ if (debounce_skip < CLASS_DEBOUNCER_SKIPS)
+ p->ipcc_cntr++;
+ else
+ p->ipcc = new_ipcc;
+
+out:
+ p->ipcc_tmp = new_ipcc;
+}
+
+static bool classification_is_accurate(u8 hfi_class, bool smt_siblings_idle)
+{
+ switch (boot_cpu_data.x86_model) {
+ case INTEL_FAM6_ALDERLAKE:
+ case INTEL_FAM6_ALDERLAKE_L:
+ case INTEL_FAM6_RAPTORLAKE:
+ case INTEL_FAM6_RAPTORLAKE_P:
+ case INTEL_FAM6_RAPTORLAKE_S:
+ if (hfi_class == 3 || hfi_class == 2 || smt_siblings_idle)
+ return true;
+
+ return false;
+
+ default:
+ return true;
+ }
+}
+
void intel_hfi_update_ipcc(struct task_struct *curr)
{
union hfi_thread_feedback_char_msr msr;
+ bool idle;
/* We should not be here if ITD is not supported. */
if (!cpu_feature_enabled(X86_FEATURE_ITD)) {
@@ -227,7 +283,9 @@ void intel_hfi_update_ipcc(struct task_struct *curr)
* 0 is a valid classification for Intel Thread Director. A scheduler
* IPCC class of 0 means that the task is unclassified. Adjust.
*/
- curr->ipcc = msr.split.classid + 1;
+ idle = sched_smt_siblings_idle(task_cpu(curr));
+ if (classification_is_accurate(msr.split.classid, idle))
+ debounce_and_update_class(curr, msr.split.classid + 1);
}
unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
--
2.25.1
Configure the MSR that controls the behavior of HRESET on each logical
processor.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* None
Changes since v1:
* Marked hardware_history_features as __ro_after_init instead of
__read_mostly. (PeterZ)
---
arch/x86/kernel/cpu/common.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 831a1a07d357..f3f936f7de5f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -412,6 +412,26 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
cr4_clear_bits(X86_CR4_UMIP);
}
+static u32 hardware_history_features __ro_after_init;
+
+static __always_inline void setup_hreset(struct cpuinfo_x86 *c)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_HRESET))
+ return;
+
+ /*
+ * Use on all CPUs the hardware history features that the boot
+ * CPU supports.
+ */
+ if (c == &boot_cpu_data)
+ hardware_history_features = cpuid_ebx(0x20);
+
+ if (!hardware_history_features)
+ return;
+
+ wrmsrl(MSR_IA32_HW_HRESET_ENABLE, hardware_history_features);
+}
+
/* These bits should not change their value after CPU init is finished. */
static const unsigned long cr4_pinned_mask =
X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
@@ -1848,10 +1868,11 @@ static void identify_cpu(struct cpuinfo_x86 *c)
/* Disable the PN if appropriate */
squash_the_stupid_serial_number(c);
- /* Set up SMEP/SMAP/UMIP */
+ /* Set up SMEP/SMAP/UMIP/HRESET */
setup_smep(c);
setup_smap(c);
setup_umip(c);
+ setup_hreset(c);
/* Enable FSGSBASE instructions if available. */
if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
--
2.25.1
X86 needs to know the idle state of the SMT siblings of a CPU to improve
the accuracy of IPCC classification. X86 implements support for IPC classes
in the thermal HFI driver.
Rename is_core_idle() as sched_smt_siblings_idle() and make it available
outside the scheduler code.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
is_core_idle() is no longer an inline function after this patch. To rule
out performance degradation, I compared the execution time of the inline
and non-inline versions on a 4-socket Cascade Lake system using the NUMA
stressor of stress-ng:
$ stress-ng --numa 1500 -t 10m
is_core_idle() was called ~200,000 times. I measured the value of the TSC
counter before and after calling is_core_idle() and computed the delta
value.
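A minimal sketch of the measurement (not the exact instrumentation used;
rdtsc_ordered() serializes the TSC reads):

	u64 t0, t1;
	bool idle;

	t0 = rdtsc_ordered();
	idle = is_core_idle(cpu);
	t1 = rdtsc_ordered();
	trace_printk("is_core_idle() delta: %llu\n", t1 - t0);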
I arbitrarily removed outliers (defined as any delta larger than 5000
counts). This required removing ~40 samples.
The table below summarizes the difference in execution time. All quantities
are expressed in TSC counts, except the standard deviation, expressed as a
percentage of the average.
                      Average   Median   Std(%)   Mode
TSCdelta inline        668.76      626    67.24     42
TSCdelta non-inline    677.64      624    67.67     46
All metrics are similar for the inline and non-inline cases.
---
Changes since v2:
* Brought back this previously dropped patch.
* Profiled inline vs. non-inline is_core_idle(). I found no major penalty.
* Merged is_core_idle() and sched_smt_siblings_idle() into a single
function. (Dietmar)
Changes since v1:
* Dropped this patch.
---
include/linux/sched.h | 2 ++
kernel/sched/fair.c | 21 +++++++++++++++------
2 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 45f28a601b3d..7ef9fd84e7ad 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2449,4 +2449,6 @@ static inline void sched_core_fork(struct task_struct *p) { }
extern void sched_set_stop_task(int cpu, struct task_struct *stop);
+extern bool sched_smt_siblings_idle(int cpu);
+
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3c22dc145f7..a66d86c5cb5c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1064,7 +1064,14 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
* Scheduling class queueing methods:
*/
-static inline bool is_core_idle(int cpu)
+/**
+ * sched_smt_siblings_idle - Check whether SMT siblings of a CPU are idle
+ * @cpu: The CPU to check
+ *
+ * Returns true if all the SMT siblings of @cpu are idle or @cpu does not have
+ * SMT siblings. The idle state of @cpu is not considered.
+ */
+bool sched_smt_siblings_idle(int cpu)
{
#ifdef CONFIG_SCHED_SMT
int sibling;
@@ -1767,7 +1774,7 @@ static inline int numa_idle_core(int idle_core, int cpu)
* Prefer cores instead of packing HT siblings
* and triggering future load balancing.
*/
- if (is_core_idle(cpu))
+ if (sched_smt_siblings_idle(cpu))
idle_core = cpu;
return idle_core;
@@ -9518,7 +9525,8 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
* If the destination CPU has SMT siblings, env->idle != CPU_NOT_IDLE
* is not sufficient. We need to make sure the whole core is idle.
*/
- if (sds->local->flags & SD_SHARE_CPUCAPACITY && !is_core_idle(env->dst_cpu))
+ if (sds->local->flags & SD_SHARE_CPUCAPACITY &&
+ !sched_smt_siblings_idle(env->dst_cpu))
return false;
/* Only do SMT checks if either local or candidate have SMT siblings. */
@@ -10687,7 +10695,8 @@ static struct rq *find_busiest_queue(struct lb_env *env,
sched_asym_prefer(i, env->dst_cpu) &&
nr_running == 1) {
if (env->sd->flags & SD_SHARE_CPUCAPACITY ||
- (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && is_core_idle(i)))
+ (!(env->sd->flags & SD_SHARE_CPUCAPACITY) &&
+ sched_smt_siblings_idle(i)))
continue;
}
@@ -10816,7 +10825,7 @@ asym_active_balance(struct lb_env *env)
* busy sibling.
*/
return sched_asym_prefer(env->dst_cpu, env->src_cpu) ||
- !is_core_idle(env->src_cpu);
+ !sched_smt_siblings_idle(env->src_cpu);
}
return false;
@@ -11563,7 +11572,7 @@ static void nohz_balancer_kick(struct rq *rq)
*/
if (sd->flags & SD_SHARE_CPUCAPACITY ||
(!(sd->flags & SD_SHARE_CPUCAPACITY) &&
- is_core_idle(i))) {
+ sched_smt_siblings_idle(i))) {
flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
goto unlock;
}
--
2.25.1
The HRESET instruction prevents the classification of the current task
from influencing the classification of the next task when running serially
on the same logical processor.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* None
Changes since v1:
* None
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 4 +++-
arch/x86/kernel/cpu/scattered.c | 1 +
3 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 8a6261a5dbbf..eb859a82b22a 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -309,6 +309,7 @@
#define X86_FEATURE_MSR_TSX_CTRL (11*32+20) /* "" MSR IA32_TSX_CTRL (Intel) implemented */
#define X86_FEATURE_SMBA (11*32+21) /* "" Slow Memory Bandwidth Allocation */
#define X86_FEATURE_BMEC (11*32+22) /* "" Bandwidth Monitoring Event Configuration */
+#define X86_FEATURE_HRESET (11*32+23) /* Hardware history reset instruction */
/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 0ea25cc9c621..dc96944d61a6 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1109,6 +1109,9 @@
#define MSR_IA32_HW_FEEDBACK_THREAD_CONFIG 0x17d4
#define MSR_IA32_HW_FEEDBACK_CHAR 0x17d2
+/* Hardware History Reset */
+#define MSR_IA32_HW_HRESET_ENABLE 0x17da
+
/* x2APIC locked status */
#define MSR_IA32_XAPIC_DISABLE_STATUS 0xBD
#define LEGACY_XAPIC_DISABLED BIT(0) /*
@@ -1116,5 +1119,4 @@
* disabling x2APIC will cause
* a #GP
*/
-
#endif /* _ASM_X86_MSR_INDEX_H */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 0dad49a09b7a..cb8a0e7a4fdb 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -28,6 +28,7 @@ static const struct cpuid_bit cpuid_bits[] = {
{ X86_FEATURE_EPB, CPUID_ECX, 3, 0x00000006, 0 },
{ X86_FEATURE_INTEL_PPIN, CPUID_EBX, 0, 0x00000007, 1 },
{ X86_FEATURE_RRSBA_CTRL, CPUID_EDX, 2, 0x00000007, 2 },
+ { X86_FEATURE_HRESET, CPUID_EAX, 22, 0x00000007, 1 },
{ X86_FEATURE_CQM_LLC, CPUID_EDX, 1, 0x0000000f, 0 },
{ X86_FEATURE_CQM_OCCUP_LLC, CPUID_EDX, 0, 0x0000000f, 1 },
{ X86_FEATURE_CQM_MBM_TOTAL, CPUID_EDX, 1, 0x0000000f, 1 },
--
2.25.1
Reset the classification history of the current task when switching to the
next task. Hardware will start the classification of the next task from
scratch.
Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ionela Voinescu <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v2:
* None
Changes since v1:
* Measurements of the cost of the HRESET instruction
Methodology:
I created a tight loop with interrupts and preemption disabled. I
recorded the value of the TSC counter before and after executing
HRESET or RDTSC. I repeated the measurement 100,000 times.
I performed the experiment using an Alder Lake S system. I set the
frequency of the CPUs at a fixed value.
The table below compares the cost of HRESET with RDTSC (expressed in
the elapsed TSC count). The cost of the two instructions is
comparable.
                    PCore   ECore
Frequency (GHz)       5.0     3.8
HRESET (avg)         28.5    44.7
HRESET (stdev %)      3.6     2.3
RDTSC (avg)          25.2    35.7
RDTSC (stdev %)       3.9     2.6
* Used an ALTERNATIVE macro instead of static_cpu_has() to execute HRESET
when supported. (PeterZ)
---
arch/x86/include/asm/hreset.h | 31 +++++++++++++++++++++++++++
arch/x86/kernel/cpu/common.c | 7 +++++++
arch/x86/kernel/process_32.c | 3 +++
arch/x86/kernel/process_64.c | 3 +++
4 files changed, 44 insertions(+)
create mode 100644 arch/x86/include/asm/hreset.h
diff --git a/arch/x86/include/asm/hreset.h b/arch/x86/include/asm/hreset.h
new file mode 100644
index 000000000000..d68ca2fb8642
--- /dev/null
+++ b/arch/x86/include/asm/hreset.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_HRESET_H
+#define _ASM_X86_HRESET_H
+
+/**
+ * HRESET - History reset. Available since binutils v2.36.
+ *
+ * Request the processor to reset the history of task classification on the
+ * current logical processor. The history components to be
+ * reset are specified in %eax. Only bits specified in CPUID(0x20).EBX
+ * and enabled in the IA32_HRESET_ENABLE MSR can be selected.
+ *
+ * The assembly code looks like:
+ *
+ * hreset %eax
+ *
+ * The corresponding machine code looks like:
+ *
+ * F3 0F 3A F0 ModRM Imm
+ *
+ * The value of ModRM is 0xc0 to specify %eax register addressing.
+ * The ignored immediate operand is set to 0.
+ *
+ * The instruction is documented in the Intel SDM.
+ */
+
+#define __ASM_HRESET ".byte 0xf3, 0xf, 0x3a, 0xf0, 0xc0, 0x0"
+
+void reset_hardware_history(void);
+
+#endif /* _ASM_X86_HRESET_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f3f936f7de5f..17e2068530b0 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -53,6 +53,7 @@
#include <asm/mce.h>
#include <asm/msr.h>
#include <asm/cacheinfo.h>
+#include <asm/hreset.h>
#include <asm/memtype.h>
#include <asm/microcode.h>
#include <asm/microcode_intel.h>
@@ -414,6 +415,12 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
static u32 hardware_history_features __ro_after_init;
+void reset_hardware_history(void)
+{
+ asm_inline volatile (ALTERNATIVE("", __ASM_HRESET, X86_FEATURE_HRESET)
+ : : "a" (hardware_history_features) : "memory");
+}
+
static __always_inline void setup_hreset(struct cpuinfo_x86 *c)
{
if (!cpu_feature_enabled(X86_FEATURE_HRESET))
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 470c128759ea..397a6e6f4e61 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -52,6 +52,7 @@
#include <asm/switch_to.h>
#include <asm/vm86.h>
#include <asm/resctrl.h>
+#include <asm/hreset.h>
#include <asm/proto.h>
#include "process.h"
@@ -214,6 +215,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/* Load the Intel cache allocation PQR MSR. */
resctrl_sched_in();
+ reset_hardware_history();
+
return prev_p;
}
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 4e34b3b68ebd..6176044ecc16 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -53,6 +53,7 @@
#include <asm/xen/hypervisor.h>
#include <asm/vdso.h>
#include <asm/resctrl.h>
+#include <asm/hreset.h>
#include <asm/unistd.h>
#include <asm/fsgsbase.h>
#ifdef CONFIG_IA32_EMULATION
@@ -658,6 +659,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/* Load the Intel cache allocation PQR MSR. */
resctrl_sched_in();
+ reset_hardware_history();
+
return prev_p;
}
--
2.25.1
On Mon, Feb 06, 2023 at 09:10:41PM -0800, Ricardo Neri wrote:
> Hi,
>
> This is third version of this patchset. Previous versions can be found
> here [1] and here [2]. For brevity, I did not include the cover letter
> from the original posting. You can read it here [1].
Hello! Is there any feedback for this patchset?
Thanks and BR,
Ricardo
On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
<[email protected]> wrote:
>
> On Intel hybrid parts, each type of CPU has specific performance and
> energy efficiency capabilities. The Intel Thread Director technology
> extends the Hardware Feedback Interface (HFI) to provide performance and
> energy efficiency data for advanced classes of instructions.
>
> Add support to parse per-class capabilities.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Ionela Voinescu <[email protected]>
> Cc: Joel Fernandes (Google) <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Lukasz Luba <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Ricardo Neri <[email protected]>
No objections to this patch, so
Acked-by: Rafael J. Wysocki <[email protected]>
> ---
> Changes since v2:
> * None
>
> Changes since v1:
> * Removed a now obsolete comment.
> ---
> drivers/thermal/intel/intel_hfi.c | 30 ++++++++++++++++++++++++------
> 1 file changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> index 6e604bda2b93..2527ae3836c7 100644
> --- a/drivers/thermal/intel/intel_hfi.c
> +++ b/drivers/thermal/intel/intel_hfi.c
> @@ -77,7 +77,7 @@ union cpuid6_edx {
> * @ee_cap: Energy efficiency capability
> *
> * Capabilities of a logical processor in the HFI table. These capabilities are
> - * unitless.
> + * unitless and specific to each HFI class.
> */
> struct hfi_cpu_data {
> u8 perf_cap;
> @@ -89,7 +89,8 @@ struct hfi_cpu_data {
> * @perf_updated: Hardware updated performance capabilities
> * @ee_updated: Hardware updated energy efficiency capabilities
> *
> - * Properties of the data in an HFI table.
> + * Properties of the data in an HFI table. There exists one header per each
> + * HFI class.
> */
> struct hfi_hdr {
> u8 perf_updated;
> @@ -127,16 +128,21 @@ struct hfi_instance {
>
> /**
> * struct hfi_features - Supported HFI features
> + * @nr_classes: Number of classes supported
> * @nr_table_pages: Size of the HFI table in 4KB pages
> * @cpu_stride: Stride size to locate the capability data of a logical
> * processor within the table (i.e., row stride)
> + * @class_stride: Stride size to locate a class within the capability
> + * data of a logical processor or the HFI table header
> * @hdr_size: Size of the table header
> *
> * Parameters and supported features that are common to all HFI instances
> */
> struct hfi_features {
> + unsigned int nr_classes;
> size_t nr_table_pages;
> unsigned int cpu_stride;
> + unsigned int class_stride;
> unsigned int hdr_size;
> };
>
> @@ -333,8 +339,8 @@ static void init_hfi_cpu_index(struct hfi_cpu_info *info)
> }
>
> /*
> - * The format of the HFI table depends on the number of capabilities that the
> - * hardware supports. Keep a data structure to navigate the table.
> + * The format of the HFI table depends on the number of capabilities and classes
> + * that the hardware supports. Keep a data structure to navigate the table.
> */
> static void init_hfi_instance(struct hfi_instance *hfi_instance)
> {
> @@ -515,18 +521,30 @@ static __init int hfi_parse_features(void)
> /* The number of 4KB pages required by the table */
> hfi_features.nr_table_pages = edx.split.table_pages + 1;
>
> + /*
> + * Capability fields of an HFI class are grouped together. Classes are
> + * contiguous in memory. Hence, use the number of supported features to
> + * locate a specific class.
> + */
> + hfi_features.class_stride = nr_capabilities;
> +
> + /* For now, use only one class of the HFI table */
> + hfi_features.nr_classes = 1;
> +
> /*
> * The header contains change indications for each supported feature.
> * The size of the table header is rounded up to be a multiple of 8
> * bytes.
> */
> - hfi_features.hdr_size = DIV_ROUND_UP(nr_capabilities, 8) * 8;
> + hfi_features.hdr_size = DIV_ROUND_UP(nr_capabilities *
> + hfi_features.nr_classes, 8) * 8;
>
> /*
> * Data of each logical processor is also rounded up to be a multiple
> * of 8 bytes.
> */
> - hfi_features.cpu_stride = DIV_ROUND_UP(nr_capabilities, 8) * 8;
> + hfi_features.cpu_stride = DIV_ROUND_UP(nr_capabilities *
> + hfi_features.nr_classes, 8) * 8;
>
> return 0;
> }
> --
> 2.25.1
>
On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
<[email protected]> wrote:
>
> The scheduler reads the IPCC scores when balancing load. These reads can
> be quite frequent. Hardware can also update the HFI table frequently.
> Concurrent access may cause a lot of lock contention. It gets worse as the
> number of CPUs increases.
>
> Instead, create separate per-CPU IPCC scores that the scheduler can read
> without the HFI table lock.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Ionela Voinescu <[email protected]>
> Cc: Joel Fernandes (Google) <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Lukasz Luba <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Suggested-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> Changes since v2:
> * Only create these per-CPU variables when Intel Thread Director is
> supported.
>
> Changes since v1:
> * Added this patch.
> ---
> drivers/thermal/intel/intel_hfi.c | 46 +++++++++++++++++++++++++++++++
> 1 file changed, 46 insertions(+)
>
> diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> index 2527ae3836c7..b06021828892 100644
> --- a/drivers/thermal/intel/intel_hfi.c
> +++ b/drivers/thermal/intel/intel_hfi.c
> @@ -29,6 +29,7 @@
> #include <linux/kernel.h>
> #include <linux/math.h>
> #include <linux/mutex.h>
> +#include <linux/percpu.h>
> #include <linux/percpu-defs.h>
> #include <linux/printk.h>
> #include <linux/processor.h>
> @@ -170,6 +171,43 @@ static struct workqueue_struct *hfi_updates_wq;
> #define HFI_UPDATE_INTERVAL HZ
> #define HFI_MAX_THERM_NOTIFY_COUNT 16
>
> +#ifdef CONFIG_IPC_CLASSES
> +static int __percpu *hfi_ipcc_scores;
It would be good to provide a (concise) description of this variable.
> +
> +static int alloc_hfi_ipcc_scores(void)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_ITD))
> + return 0;
> +
> + hfi_ipcc_scores = __alloc_percpu(sizeof(*hfi_ipcc_scores) *
> + hfi_features.nr_classes,
> + sizeof(*hfi_ipcc_scores));
> +
> + return !hfi_ipcc_scores;
I would do
	if (!hfi_ipcc_scores)
		return -ENOMEM;
	return 0;
Or make the function return bool.
> +}
> +
> +static void set_hfi_ipcc_score(void *caps, int cpu)
> +{
> + int i, *hfi_class;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_ITD))
> + return;
> +
> + hfi_class = per_cpu_ptr(hfi_ipcc_scores, cpu);
> +
> + for (i = 0; i < hfi_features.nr_classes; i++) {
> + struct hfi_cpu_data *class_caps;
> +
> + class_caps = caps + i * hfi_features.class_stride;
> + WRITE_ONCE(hfi_class[i], class_caps->perf_cap);
As it stands, it is unclear why WRITE_ONCE() is needed here.
> + }
> +}
> +
> +#else
> +static int alloc_hfi_ipcc_scores(void) { return 0; }
> +static void set_hfi_ipcc_score(void *caps, int cpu) { }
> +#endif /* CONFIG_IPC_CLASSES */
> +
> static void get_hfi_caps(struct hfi_instance *hfi_instance,
> struct thermal_genl_cpu_caps *cpu_caps)
> {
> @@ -192,6 +230,8 @@ static void get_hfi_caps(struct hfi_instance *hfi_instance,
> cpu_caps[i].efficiency = caps->ee_cap << 2;
>
> ++i;
> +
> + set_hfi_ipcc_score(caps, cpu);
> }
> raw_spin_unlock_irq(&hfi_instance->table_lock);
> }
> @@ -580,8 +620,14 @@ void __init intel_hfi_init(void)
> if (!hfi_updates_wq)
> goto err_nomem;
>
> + if (alloc_hfi_ipcc_scores())
> + goto err_ipcc;
> +
> return;
>
> +err_ipcc:
> + destroy_workqueue(hfi_updates_wq);
> +
> err_nomem:
> for (j = 0; j < i; ++j) {
> hfi_instance = &hfi_instances[j];
> --
On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
<[email protected]> wrote:
>
> Use Intel Thread Director classification to update the IPC class of a
> task. Implement the arch_update_ipcc() interface of the scheduler.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Ionela Voinescu <[email protected]>
> Cc: Joel Fernandes (Google) <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Lukasz Luba <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> Changes since v2:
> * Removed the implementation of arch_has_ipc_classes().
>
> Changes since v1:
> * Adjusted the result of the classification of Intel Thread Director to start
> at class 1. Class 0 for the scheduler means that the task is
> unclassified.
> * Redefined union hfi_thread_feedback_char_msr to ensure all
> bit-fields are packed. (PeterZ)
> * Removed CONFIG_INTEL_THREAD_DIRECTOR. (PeterZ)
> * Shortened the names of the functions that implement IPC classes.
> * Removed argument smt_siblings_idle from intel_hfi_update_ipcc().
> (PeterZ)
> ---
> arch/x86/include/asm/topology.h | 6 ++++++
> drivers/thermal/intel/intel_hfi.c | 32 +++++++++++++++++++++++++++++++
> 2 files changed, 38 insertions(+)
>
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index 458c891a8273..ffcdac3f398f 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -227,4 +227,10 @@ void init_freq_invariance_cppc(void);
> #define arch_init_invariance_cppc init_freq_invariance_cppc
> #endif
>
> +#if defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL)
> +void intel_hfi_update_ipcc(struct task_struct *curr);
> +
> +#define arch_update_ipcc intel_hfi_update_ipcc
> +#endif /* defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL) */
> +
> #endif /* _ASM_X86_TOPOLOGY_H */
> diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> index b06021828892..530dcf57e06e 100644
> --- a/drivers/thermal/intel/intel_hfi.c
> +++ b/drivers/thermal/intel/intel_hfi.c
> @@ -72,6 +72,17 @@ union cpuid6_edx {
> u32 full;
> };
>
> +#ifdef CONFIG_IPC_CLASSES
> +union hfi_thread_feedback_char_msr {
> + struct {
> + u64 classid : 8;
> + u64 __reserved : 55;
> + u64 valid : 1;
> + } split;
> + u64 full;
> +};
> +#endif
> +
> /**
> * struct hfi_cpu_data - HFI capabilities per CPU
> * @perf_cap: Performance capability
> @@ -174,6 +185,27 @@ static struct workqueue_struct *hfi_updates_wq;
> #ifdef CONFIG_IPC_CLASSES
> static int __percpu *hfi_ipcc_scores;
>
> +void intel_hfi_update_ipcc(struct task_struct *curr)
> +{
> + union hfi_thread_feedback_char_msr msr;
> +
> + /* We should not be here if ITD is not supported. */
> + if (!cpu_feature_enabled(X86_FEATURE_ITD)) {
> + pr_warn_once("task classification requested but not supported!");
> + return;
> + }
> +
> + rdmsrl(MSR_IA32_HW_FEEDBACK_CHAR, msr.full);
> + if (!msr.split.valid)
> + return;
> +
> + /*
> + * 0 is a valid classification for Intel Thread Director. A scheduler
> + * IPCC class of 0 means that the task is unclassified. Adjust.
> + */
> + curr->ipcc = msr.split.classid + 1;
> +}
Wouldn't it be better to return the adjusted value from this function
and let the caller store it where appropriate?
It doesn't look like it is necessary to pass the task_struct pointer to it.
> +
> static int alloc_hfi_ipcc_scores(void)
> {
> if (!cpu_feature_enabled(X86_FEATURE_ITD))
> --
On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
<[email protected]> wrote:
>
> Implement the arch_get_ipcc_score() interface of the scheduler. Use the
> performance capabilities of the extended Hardware Feedback Interface table
> as the IPC score.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Ionela Voinescu <[email protected]>
> Cc: Joel Fernandes (Google) <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Lukasz Luba <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> Changes since v2:
> * None
>
> Changes since v1:
> * Adjusted the returned HFI class (which starts at 0) to match the
> scheduler IPCC class (which starts at 1). (PeterZ)
> * Used the new interface names.
> ---
> arch/x86/include/asm/topology.h | 2 ++
> drivers/thermal/intel/intel_hfi.c | 27 +++++++++++++++++++++++++++
> 2 files changed, 29 insertions(+)
>
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index ffcdac3f398f..c4fcd9c3c634 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -229,8 +229,10 @@ void init_freq_invariance_cppc(void);
>
> #if defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL)
> void intel_hfi_update_ipcc(struct task_struct *curr);
> +unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu);
>
> #define arch_update_ipcc intel_hfi_update_ipcc
> +#define arch_get_ipcc_score intel_hfi_get_ipcc_score
> #endif /* defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL) */
>
> #endif /* _ASM_X86_TOPOLOGY_H */
> diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> index 530dcf57e06e..fa9b4a678d92 100644
> --- a/drivers/thermal/intel/intel_hfi.c
> +++ b/drivers/thermal/intel/intel_hfi.c
> @@ -206,6 +206,33 @@ void intel_hfi_update_ipcc(struct task_struct *curr)
> curr->ipcc = msr.split.classid + 1;
> }
>
> +unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
> +{
> + unsigned short hfi_class;
It looks like the variable above is only used to save a subtraction or
addition of 1 to something going forward.
> + int *scores;
> +
> + if (cpu < 0 || cpu >= nr_cpu_ids)
> + return -EINVAL;
> +
> + if (ipcc == IPC_CLASS_UNCLASSIFIED)
> + return -EINVAL;
> +
> + /*
> + * Scheduler IPC classes start at 1. HFI classes start at 0.
> + * See the note in intel_hfi_update_ipcc().
> + */
> + hfi_class = ipcc - 1;
> +
> + if (hfi_class >= hfi_features.nr_classes)
Personally, I would do
if (ipcc >= hfi_features.nr_classes + 1)
here and ->
> + return -EINVAL;
> +
> + scores = per_cpu_ptr(hfi_ipcc_scores, cpu);
> + if (!scores)
> + return -ENODEV;
> +
-> scores[ipcc - 1]
below.
> + return READ_ONCE(scores[hfi_class]);
And why does this need to use READ_ONCE()?
> +}
> +
> static int alloc_hfi_ipcc_scores(void)
> {
> if (!cpu_feature_enabled(X86_FEATURE_ITD))
> --
On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
<[email protected]> wrote:
>
> A task may be unclassified if it has been recently created, spends most of
> its lifetime sleeping, or hardware has not provided a classification.
>
> Most tasks will be eventually classified as scheduler's IPC class 1
> (HFI class 0). This class corresponds to the capabilities in the legacy,
> classless, HFI table.
>
> IPC class 1 is a reasonable choice until hardware provides an actual
> classification. Meanwhile, the scheduler will place classes of tasks with
> higher IPC scores on higher-performance CPUs.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Ionela Voinescu <[email protected]>
> Cc: Joel Fernandes (Google) <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Lukasz Luba <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Ricardo Neri <[email protected]>
Fine with me, so
Acked-by: Rafael J. Wysocki <[email protected]>
> ---
> Changes since v2:
> * None
>
> Changes since v1:
> * Now the default class is 1.
> ---
> drivers/thermal/intel/intel_hfi.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> index fa9b4a678d92..7ea6acce7107 100644
> --- a/drivers/thermal/intel/intel_hfi.c
> +++ b/drivers/thermal/intel/intel_hfi.c
> @@ -185,6 +185,19 @@ static struct workqueue_struct *hfi_updates_wq;
> #ifdef CONFIG_IPC_CLASSES
> static int __percpu *hfi_ipcc_scores;
>
> +/*
> + * A task may be unclassified if it has been recently created, spends most of
> + * its lifetime sleeping, or hardware has not provided a classification.
> + *
> + * Most tasks will be classified as scheduler's IPC class 1 (HFI class 0)
> + * eventually. Meanwhile, the scheduler will place classes of tasks with higher
> + * IPC scores on higher-performance CPUs.
> + *
> + * IPC class 1 is a reasonable choice. It matches the performance capability
> + * of the legacy, classless, HFI table.
> + */
> +#define HFI_UNCLASSIFIED_DEFAULT 1
> +
> void intel_hfi_update_ipcc(struct task_struct *curr)
> {
> union hfi_thread_feedback_char_msr msr;
> @@ -215,7 +228,7 @@ unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
> return -EINVAL;
>
> if (ipcc == IPC_CLASS_UNCLASSIFIED)
> - return -EINVAL;
> + ipcc = HFI_UNCLASSIFIED_DEFAULT;
>
> /*
> * Scheduler IPC classes start at 1. HFI classes start at 0.
> --
> 2.25.1
>
On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
<[email protected]> wrote:
>
> Enable Intel Thread Director from the CPU hotplug callback: globally from
> CPU0 and then enable the thread-classification hardware in each logical
> processor individually.
>
> Also, initialize the number of classes supported.
>
> Let the scheduler know that it can start using IPC classes.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Ionela Voinescu <[email protected]>
> Cc: Joel Fernandes (Google) <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Lukasz Luba <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Ricardo Neri <[email protected]>
For the changes in intel_hfi.c
Acked-by: Rafael J. Wysocki <[email protected]>
> ---
> Changes since v2:
> * Use the new sched_enable_ipc_classes() interface to enable the use of
> IPC classes in the scheduler.
>
> Changes since v1:
> * None
> ---
> arch/x86/include/asm/msr-index.h | 2 ++
> drivers/thermal/intel/intel_hfi.c | 40 +++++++++++++++++++++++++++++--
> 2 files changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index ad35355ee43e..0ea25cc9c621 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1106,6 +1106,8 @@
> /* Hardware Feedback Interface */
> #define MSR_IA32_HW_FEEDBACK_PTR 0x17d0
> #define MSR_IA32_HW_FEEDBACK_CONFIG 0x17d1
> +#define MSR_IA32_HW_FEEDBACK_THREAD_CONFIG 0x17d4
> +#define MSR_IA32_HW_FEEDBACK_CHAR 0x17d2
>
> /* x2APIC locked status */
> #define MSR_IA32_XAPIC_DISABLE_STATUS 0xBD
> diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> index 7ea6acce7107..35d947f47550 100644
> --- a/drivers/thermal/intel/intel_hfi.c
> +++ b/drivers/thermal/intel/intel_hfi.c
> @@ -48,6 +48,8 @@
> /* Hardware Feedback Interface MSR configuration bits */
> #define HW_FEEDBACK_PTR_VALID_BIT BIT(0)
> #define HW_FEEDBACK_CONFIG_HFI_ENABLE_BIT BIT(0)
> +#define HW_FEEDBACK_CONFIG_ITD_ENABLE_BIT BIT(1)
> +#define HW_FEEDBACK_THREAD_CONFIG_ENABLE_BIT BIT(0)
>
> /* CPUID detection and enumeration definitions for HFI */
>
> @@ -72,6 +74,15 @@ union cpuid6_edx {
> u32 full;
> };
>
> +union cpuid6_ecx {
> + struct {
> + u32 dont_care0:8;
> + u32 nr_classes:8;
> + u32 dont_care1:16;
> + } split;
> + u32 full;
> +};
> +
> #ifdef CONFIG_IPC_CLASSES
> union hfi_thread_feedback_char_msr {
> struct {
> @@ -506,6 +517,11 @@ void intel_hfi_online(unsigned int cpu)
>
> init_hfi_cpu_index(info);
>
> + if (cpu_feature_enabled(X86_FEATURE_ITD)) {
> + msr_val = HW_FEEDBACK_THREAD_CONFIG_ENABLE_BIT;
> + wrmsrl(MSR_IA32_HW_FEEDBACK_THREAD_CONFIG, msr_val);
> + }
> +
> /*
> * Now check if the HFI instance of the package/die of @cpu has been
> * initialized (by checking its header). In such case, all we have to
> @@ -561,8 +577,22 @@ void intel_hfi_online(unsigned int cpu)
> */
> rdmsrl(MSR_IA32_HW_FEEDBACK_CONFIG, msr_val);
> msr_val |= HW_FEEDBACK_CONFIG_HFI_ENABLE_BIT;
> +
> + if (cpu_feature_enabled(X86_FEATURE_ITD))
> + msr_val |= HW_FEEDBACK_CONFIG_ITD_ENABLE_BIT;
> +
> wrmsrl(MSR_IA32_HW_FEEDBACK_CONFIG, msr_val);
>
> + /*
> + * We have all we need to support IPC classes. Task classification is
> + * now working.
> + *
> + * All class scores are zero until after the first HFI update. That is
> + * OK. The scheduler queries these scores at every load balance.
> + */
> + if (cpu_feature_enabled(X86_FEATURE_ITD))
> + sched_enable_ipc_classes();
> +
> unlock:
> mutex_unlock(&hfi_instance_lock);
> return;
> @@ -640,8 +670,14 @@ static __init int hfi_parse_features(void)
> */
> hfi_features.class_stride = nr_capabilities;
>
> - /* For now, use only one class of the HFI table */
> - hfi_features.nr_classes = 1;
> + if (cpu_feature_enabled(X86_FEATURE_ITD)) {
> + union cpuid6_ecx ecx;
> +
> + ecx.full = cpuid_ecx(CPUID_HFI_LEAF);
> + hfi_features.nr_classes = ecx.split.nr_classes;
> + } else {
> + hfi_features.nr_classes = 1;
> + }
>
> /*
> * The header contains change indications for each supported feature.
> --
> 2.25.1
>
On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
<[email protected]> wrote:
>
> In Alder Lake and Raptor Lake, the result of thread classification is more
> accurate when only one SMT sibling is busy. Classification results for
> class 2 and 3 are always reliable.
>
> To avoid unnecessary migrations, only update the class of a task if it has
> been the same during 4 consecutive user ticks.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Ionela Voinescu <[email protected]>
> Cc: Joel Fernandes (Google) <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Lukasz Luba <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> Changes since v2:
> * None
>
> Changes since v1:
> * Adjusted the result of the classification of Intel Thread Director to start
> at class 1. Class 0 for the scheduler means that the task is
> unclassified.
> * Used the new names of the IPC classes members in task_struct.
> * Reworked helper functions to use sched_smt_siblings_idle() to query
> the idle state of the SMT siblings of a CPU.
> ---
> drivers/thermal/intel/intel_hfi.c | 60 ++++++++++++++++++++++++++++++-
> 1 file changed, 59 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> index 35d947f47550..fdb53e4cabc1 100644
> --- a/drivers/thermal/intel/intel_hfi.c
> +++ b/drivers/thermal/intel/intel_hfi.c
> @@ -40,6 +40,7 @@
> #include <linux/workqueue.h>
>
> #include <asm/msr.h>
> +#include <asm/intel-family.h>
>
> #include "../thermal_core.h"
> #include "intel_hfi.h"
> @@ -209,9 +210,64 @@ static int __percpu *hfi_ipcc_scores;
> */
> #define HFI_UNCLASSIFIED_DEFAULT 1
>
> +#define CLASS_DEBOUNCER_SKIPS 4
> +
> +/**
> + * debounce_and_update_class() - Process and update a task's classification
> + *
> + * @p: The task whose classification will be updated
> + * @new_ipcc: The new IPC classification
> + *
> + * Update the classification of @p with the new value that hardware provides.
> + * Only update the classification of @p if it has been the same during
> + * CLASS_DEBOUNCER_SKIPS consecutive ticks.
> + */
> +static void debounce_and_update_class(struct task_struct *p, u8 new_ipcc)
> +{
> + u16 debounce_skip;
> +
> + /* The class of @p changed. Only restart the debounce counter. */
> + if (p->ipcc_tmp != new_ipcc) {
> + p->ipcc_cntr = 1;
> + goto out;
> + }
> +
> + /*
> + * The class of @p did not change. Update it if it has been the same
> + * for CLASS_DEBOUNCER_SKIPS user ticks.
> + */
> + debounce_skip = p->ipcc_cntr + 1;
> + if (debounce_skip < CLASS_DEBOUNCER_SKIPS)
> + p->ipcc_cntr++;
> + else
> + p->ipcc = new_ipcc;
> +
> +out:
> + p->ipcc_tmp = new_ipcc;
> +}
Why does the code above belong to the Intel HFI driver? It doesn't
look like there is anything driver-specific in it.
> +
> +static bool classification_is_accurate(u8 hfi_class, bool smt_siblings_idle)
> +{
> + switch (boot_cpu_data.x86_model) {
> + case INTEL_FAM6_ALDERLAKE:
> + case INTEL_FAM6_ALDERLAKE_L:
> + case INTEL_FAM6_RAPTORLAKE:
> + case INTEL_FAM6_RAPTORLAKE_P:
> + case INTEL_FAM6_RAPTORLAKE_S:
> + if (hfi_class == 3 || hfi_class == 2 || smt_siblings_idle)
> + return true;
> +
> + return false;
> +
> + default:
> + return true;
> + }
> +}
> +
> void intel_hfi_update_ipcc(struct task_struct *curr)
> {
> union hfi_thread_feedback_char_msr msr;
> + bool idle;
>
> /* We should not be here if ITD is not supported. */
> if (!cpu_feature_enabled(X86_FEATURE_ITD)) {
> @@ -227,7 +283,9 @@ void intel_hfi_update_ipcc(struct task_struct *curr)
> * 0 is a valid classification for Intel Thread Director. A scheduler
> * IPCC class of 0 means that the task is unclassified. Adjust.
> */
> - curr->ipcc = msr.split.classid + 1;
> + idle = sched_smt_siblings_idle(task_cpu(curr));
> + if (classification_is_accurate(msr.split.classid, idle))
> + debounce_and_update_class(curr, msr.split.classid + 1);
> }
I still think that this function should just return a number, possibly
including a special "no IPCC" value. It may be passed a bool argument
indicating whether or not the SMT siblings are idle.
>
> unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
> --
On Tue, 7 Feb 2023 at 06:01, Ricardo Neri
<[email protected]> wrote:
>
> Compute the joint total (both current and prospective) IPC class score of
> a scheduling group and the local scheduling group.
>
> These IPCC statistics are used during idle load balancing. The candidate
> scheduling group will have one fewer busy CPU after load balancing. This
> observation is important for cores with SMT support.
>
> The IPCC score of scheduling groups composed of SMT siblings needs to
> consider that the siblings share CPU resources. When computing the total
> IPCC score of the scheduling group, divide score of each sibling by the
> number of busy siblings.
>
> Collect IPCC statistics for asym_packing and fully_busy scheduling groups.
IPCC statistics collect scores of current tasks, so they are
meaningful only when trying to migrate one of those running tasks.
Using such a score when pulling other tasks is just meaningless. And I
don't see how you ensure such correct use of the ipcc score.
> When picking a busiest group, they are used to break ties between otherwise
> identical groups.
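For illustration, with made-up numbers: a fully_busy SMT-2 core with
per-sibling scores 300 and 200 gives ipcc_score_before = (300 + 200) / 2
= 250. Assuming the task with the minimum score (200) is the one moved,
ipcc_score_after = 300 / 1 plus the score of that task's class on the
destination CPU.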
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Ionela Voinescu <[email protected]>
> Cc: Joel Fernandes (Google) <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Lukasz Luba <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> Changes since v2:
> * Also collect IPCC stats for fully_busy sched groups.
Why have you added the fully_busy case? It's worth explaining the
rationale, because there is a good chance of using the ipcc score of the
wrong task to take the decision.
> * Restrict use of IPCC stats to SD_ASYM_PACKING. (Ionela)
> * Handle errors of arch_get_ipcc_score(). (Ionela)
>
> Changes since v1:
> * Implemented cleanups and reworks from PeterZ. I took all his
> suggestions, except the computation of the IPC score before and after
> load balancing. We are computing not the average score, but the *total*.
> * Check for the SD_SHARE_CPUCAPACITY to compute the throughput of the SMT
> siblings of a physical core.
> * Used the new interface names.
> * Reworded commit message for clarity.
> ---
> kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 68 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d773380a95b3..b6165aa8a376 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8901,6 +8901,8 @@ struct sg_lb_stats {
> unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
> unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
> unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
> + long ipcc_score_after; /* Prospective IPCC score after load balancing */
> + unsigned long ipcc_score_before; /* IPCC score before load balancing */
> #endif
> };
>
> @@ -9287,6 +9289,62 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> }
> }
>
> +static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
> + struct sched_group *sg,
> + struct lb_env *env)
> +{
> + unsigned long score_on_dst_cpu, before;
> + int busy_cpus;
> + long after;
> +
> + if (!sched_ipcc_enabled())
> + return;
> +
> + /*
> + * IPCC scores are only useful during idle load balancing. For now,
> + * only asym_packing uses IPCC scores.
> + */
> + if (!(env->sd->flags & SD_ASYM_PACKING) ||
> + env->idle == CPU_NOT_IDLE)
> + return;
> +
> + /*
> + * IPCC scores are used to break ties only between these types of
> + * groups.
> + */
> + if (sgs->group_type != group_fully_busy &&
> + sgs->group_type != group_asym_packing)
> + return;
> +
> + busy_cpus = sgs->group_weight - sgs->idle_cpus;
> +
> + /* No busy CPUs in the group. No tasks to move. */
> + if (!busy_cpus)
> + return;
> +
> + score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
> +
> + /*
> + * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
> + * and not used.
> + */
> + if (IS_ERR_VALUE(score_on_dst_cpu))
> + return;
> +
> + before = sgs->sum_score;
> + after = before - sgs->min_score;
IIUC, you assume that you will select the CPU with the min score.
How do you ensure this? Otherwise, all your comparisons are useless.
> +
> + /* SMT siblings share throughput. */
> + if (busy_cpus > 1 && sg->flags & SD_SHARE_CPUCAPACITY) {
> + before /= busy_cpus;
> + /* One sibling will become idle after load balance. */
> + after /= busy_cpus - 1;
> + }
> +
> + sgs->ipcc_score_after = after + score_on_dst_cpu;
> + sgs->ipcc_score_before = before;
I'm not sure I understand why you are computing sum_score,
ipcc_score_before and ipcc_score_after. Why is it not sufficient to
check whether score_on_dst_cpu will be higher than the current
min_score?
> +}
> +
> #else /* CONFIG_IPC_CLASSES */
> static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> struct rq *rq)
> @@ -9296,6 +9354,13 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
> {
> }
> +
> +static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
> + struct sched_group *sg,
> + struct lb_env *env)
> +{
> +}
> +
> #endif /* CONFIG_IPC_CLASSES */
>
> /**
> @@ -9457,6 +9522,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>
> sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>
> + if (!local_group)
> + update_sg_lb_stats_scores(sgs, group, env);
> +
> /* Computing avg_load makes sense only when group is overloaded */
> if (sgs->group_type == group_overloaded)
> sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
> --
> 2.25.1
>
On Tue, 7 Feb 2023 at 06:01, Ricardo Neri
<[email protected]> wrote:
>
> For two runqueues of equal priority and an equal number of running tasks,
> select the one whose current task would have the highest IPC class score
> if placed on the destination CPU.
You have not explained why it makes sense to compare the current task's
score when we will most probably not pull this task in the end.
>
> For now, use IPCC scores only for scheduling domains with the
> SD_ASYM_PACKING flag.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Ionela Voinescu <[email protected]>
> Cc: Joel Fernandes (Google) <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Lukasz Luba <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> Changes since v2:
> * Only use IPCC scores to break ties if the sched domain uses
> asym_packing. (Ionela)
> * Handle errors of arch_get_ipcc_score(). (Ionela)
>
> Changes since v1:
> * Fixed a bug when selecting a busiest runqueue: when comparing two
> runqueues with equal nr_running, we must compute the IPCC score delta
> of both.
> * Renamed local variables to improve the layout of the code block.
> (PeterZ)
> * Used the new interface names.
> ---
> kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 64 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 72d88270b320..d3c22dc145f7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9399,6 +9399,37 @@ static bool sched_asym_ipcc_pick(struct sched_group *a,
> return sched_asym_ipcc_prefer(a_stats, b_stats);
> }
>
> +/**
> + * ipcc_score_delta - Get the IPCC score delta wrt the load balance's dst_cpu
> + * @p: A task
> + * @env: Load balancing environment
> + *
> + * Returns: The IPCC score delta that @p would get if placed in the destination
> + * CPU of @env. LONG_MIN to indicate that the delta should not be used.
> + */
> +static long ipcc_score_delta(struct task_struct *p, struct lb_env *env)
> +{
> + unsigned long score_src, score_dst;
> + unsigned short ipcc = p->ipcc;
> +
> + if (!sched_ipcc_enabled())
> + return LONG_MIN;
> +
> + /* Only asym_packing uses IPCC scores at the moment. */
> + if (!(env->sd->flags & SD_ASYM_PACKING))
> + return LONG_MIN;
> +
> + score_dst = arch_get_ipcc_score(ipcc, env->dst_cpu);
> + if (IS_ERR_VALUE(score_dst))
> + return LONG_MIN;
> +
> + score_src = arch_get_ipcc_score(ipcc, task_cpu(p));
> + if (IS_ERR_VALUE(score_src))
> + return LONG_MIN;
> +
> + return score_dst - score_src;
> +}
> +
> #else /* CONFIG_IPC_CLASSES */
> static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> struct rq *rq)
> @@ -9429,6 +9460,11 @@ static bool sched_asym_ipcc_pick(struct sched_group *a,
> return false;
> }
>
> +static long ipcc_score_delta(struct task_struct *p, struct lb_env *env)
> +{
> + return LONG_MIN;
> +}
> +
> #endif /* CONFIG_IPC_CLASSES */
>
> /**
> @@ -10589,6 +10625,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> {
> struct rq *busiest = NULL, *rq;
> unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
> + long busiest_ipcc_delta = LONG_MIN;
> unsigned int busiest_nr = 0;
> int i;
>
> @@ -10705,8 +10742,35 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>
> case migrate_task:
> if (busiest_nr < nr_running) {
> + struct task_struct *curr;
> +
> busiest_nr = nr_running;
> busiest = rq;
> +
> + /*
> + * Remember the IPCC score delta of busiest::curr.
> + * We may need it to break a tie with other queues
> + * with equal nr_running.
> + */
> + curr = rcu_dereference(busiest->curr);
> + busiest_ipcc_delta = ipcc_score_delta(curr, env);
Hmm, I don't like this at all.
Also, curr is the least probable task to be pulled, which means that
all of this is useless.
> + /*
> + * If rq and busiest have the same number of running
> + * tasks and IPC classes are supported, pick rq if doing
> + * so would give rq::curr a bigger IPC boost on dst_cpu.
> + */
> + } else if (busiest_nr == nr_running) {
> + struct task_struct *curr;
> + long delta;
> +
> + curr = rcu_dereference(rq->curr);
> + delta = ipcc_score_delta(curr, env);
> +
> + if (busiest_ipcc_delta < delta) {
> + busiest_ipcc_delta = delta;
> + busiest_nr = nr_running;
> + busiest = rq;
> + }
> }
> break;
>
> --
> 2.25.1
>
On Mon, Mar 27, 2023 at 06:50:13PM +0200, Rafael J. Wysocki wrote:
> On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> <[email protected]> wrote:
> >
> > Implement the arch_get_ipcc_score() interface of the scheduler. Use the
> > performance capabilities of the extended Hardware Feedback Interface table
> > as the IPC score.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Ionela Voinescu <[email protected]>
> > Cc: Joel Fernandes (Google) <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Lukasz Luba <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Ricardo Neri <[email protected]>
> > ---
> > Changes since v2:
> > * None
> >
> > Changes since v1:
> > * Adjusted the returned HFI class (which starts at 0) to match the
> > scheduler IPCC class (which starts at 1). (PeterZ)
> > * Used the new interface names.
> > ---
> > arch/x86/include/asm/topology.h | 2 ++
> > drivers/thermal/intel/intel_hfi.c | 27 +++++++++++++++++++++++++++
> > 2 files changed, 29 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> > index ffcdac3f398f..c4fcd9c3c634 100644
> > --- a/arch/x86/include/asm/topology.h
> > +++ b/arch/x86/include/asm/topology.h
> > @@ -229,8 +229,10 @@ void init_freq_invariance_cppc(void);
> >
> > #if defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL)
> > void intel_hfi_update_ipcc(struct task_struct *curr);
> > +unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu);
> >
> > #define arch_update_ipcc intel_hfi_update_ipcc
> > +#define arch_get_ipcc_score intel_hfi_get_ipcc_score
> > #endif /* defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL) */
> >
> > #endif /* _ASM_X86_TOPOLOGY_H */
> > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > index 530dcf57e06e..fa9b4a678d92 100644
> > --- a/drivers/thermal/intel/intel_hfi.c
> > +++ b/drivers/thermal/intel/intel_hfi.c
> > @@ -206,6 +206,33 @@ void intel_hfi_update_ipcc(struct task_struct *curr)
> > curr->ipcc = msr.split.classid + 1;
> > }
> >
> > +unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
> > +{
> > + unsigned short hfi_class;
>
> It looks like the variable above is only used to save a subtraction or
> addition of 1 to something going forward.
>
> > + int *scores;
> > +
> > + if (cpu < 0 || cpu >= nr_cpu_ids)
> > + return -EINVAL;
> > +
> > + if (ipcc == IPC_CLASS_UNCLASSIFIED)
> > + return -EINVAL;
> > +
> > + /*
> > + * Scheduler IPC classes start at 1. HFI classes start at 0.
> > > + * See note in intel_hfi_update_ipcc().
> > + */
> > + hfi_class = ipcc - 1;
> > +
> > + if (hfi_class >= hfi_features.nr_classes)
>
> Personally, I would do
>
> if (ipcc >= hfi_features.nr_classes + 1)
>
> here and ->
>
> > + return -EINVAL;
> > +
> > + scores = per_cpu_ptr(hfi_ipcc_scores, cpu);
> > + if (!scores)
> > + return -ENODEV;
> > +
>
> -> scores[ipcc - 1]
Sure, I can get rid of hfi_class.
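i.e., something like this (untested):

unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
{
        int *scores;

        if (cpu < 0 || cpu >= nr_cpu_ids)
                return -EINVAL;

        if (ipcc == IPC_CLASS_UNCLASSIFIED)
                return -EINVAL;

        /* Scheduler IPC classes start at 1. HFI classes start at 0. */
        if (ipcc >= hfi_features.nr_classes + 1)
                return -EINVAL;

        scores = per_cpu_ptr(hfi_ipcc_scores, cpu);
        if (!scores)
                return -ENODEV;

        return READ_ONCE(scores[ipcc - 1]);
}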
>
> below.
>
> > + return READ_ONCE(scores[hfi_class]);
>
> And why does this need to use READ_ONCE()?
This is the corresponding read of the WRITE_ONCE in patch 13. The CPU
handling the HFI interrupt, very likely a different CPU, updates
scores[hfi_class]. My intention is to let that write complete before
reading the score here.
>
> > +}
> > +
> > static int alloc_hfi_ipcc_scores(void)
> > {
> > if (!cpu_feature_enabled(X86_FEATURE_ITD))
> > --
On Mon, Mar 27, 2023 at 06:42:28PM +0200, Rafael J. Wysocki wrote:
> On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> <[email protected]> wrote:
> >
> > Use Intel Thread Director classification to update the IPC class of a
> > task. Implement the arch_update_ipcc() interface of the scheduler.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Ionela Voinescu <[email protected]>
> > Cc: Joel Fernandes (Google) <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Lukasz Luba <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Ricardo Neri <[email protected]>
> > ---
> > Changes since v2:
> > * Removed the implementation of arch_has_ipc_classes().
> >
> > Changes since v1:
> > * Adjusted the result of the classification of Intel Thread Director to start
> > at class 1. Class 0 for the scheduler means that the task is
> > unclassified.
> > * Redefined union hfi_thread_feedback_char_msr to ensure all
> > bit-fields are packed. (PeterZ)
> > * Removed CONFIG_INTEL_THREAD_DIRECTOR. (PeterZ)
> > * Shortened the names of the functions that implement IPC classes.
> > * Removed argument smt_siblings_idle from intel_hfi_update_ipcc().
> > (PeterZ)
> > ---
> > arch/x86/include/asm/topology.h | 6 ++++++
> > drivers/thermal/intel/intel_hfi.c | 32 +++++++++++++++++++++++++++++++
> > 2 files changed, 38 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> > index 458c891a8273..ffcdac3f398f 100644
> > --- a/arch/x86/include/asm/topology.h
> > +++ b/arch/x86/include/asm/topology.h
> > @@ -227,4 +227,10 @@ void init_freq_invariance_cppc(void);
> > #define arch_init_invariance_cppc init_freq_invariance_cppc
> > #endif
> >
> > +#if defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL)
> > +void intel_hfi_update_ipcc(struct task_struct *curr);
> > +
> > +#define arch_update_ipcc intel_hfi_update_ipcc
> > +#endif /* defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL) */
> > +
> > #endif /* _ASM_X86_TOPOLOGY_H */
> > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > index b06021828892..530dcf57e06e 100644
> > --- a/drivers/thermal/intel/intel_hfi.c
> > +++ b/drivers/thermal/intel/intel_hfi.c
> > @@ -72,6 +72,17 @@ union cpuid6_edx {
> > u32 full;
> > };
> >
> > +#ifdef CONFIG_IPC_CLASSES
> > +union hfi_thread_feedback_char_msr {
> > + struct {
> > + u64 classid : 8;
> > + u64 __reserved : 55;
> > + u64 valid : 1;
> > + } split;
> > + u64 full;
> > +};
> > +#endif
> > +
> > /**
> > * struct hfi_cpu_data - HFI capabilities per CPU
> > * @perf_cap: Performance capability
> > @@ -174,6 +185,27 @@ static struct workqueue_struct *hfi_updates_wq;
> > #ifdef CONFIG_IPC_CLASSES
> > static int __percpu *hfi_ipcc_scores;
> >
> > +void intel_hfi_update_ipcc(struct task_struct *curr)
> > +{
> > + union hfi_thread_feedback_char_msr msr;
> > +
> > + /* We should not be here if ITD is not supported. */
> > + if (!cpu_feature_enabled(X86_FEATURE_ITD)) {
> > + pr_warn_once("task classification requested but not supported!");
> > + return;
> > + }
> > +
> > + rdmsrl(MSR_IA32_HW_FEEDBACK_CHAR, msr.full);
> > + if (!msr.split.valid)
> > + return;
> > +
> > + /*
> > + * 0 is a valid classification for Intel Thread Director. A scheduler
> > + * IPCC class of 0 means that the task is unclassified. Adjust.
> > + */
> > + curr->ipcc = msr.split.classid + 1;
> > +}
>
> Wouldn't it be better to return the adjusted value from this function
> and let the caller store it where appropriate?
>
> It doesn't look like it is necessary to pass the task_struct pointer to it.
Judging from this patch alone, yes, it does not make much sense to pass a
task_struct as an argument. In patch 21, however, this function uses
various members of task_struct, which makes it more convenient to have
it as an argument, no?
>
> > +
> > static int alloc_hfi_ipcc_scores(void)
> > {
> > if (!cpu_feature_enabled(X86_FEATURE_ITD))
> > --
On Mon, Mar 27, 2023 at 06:31:03PM +0200, Rafael J. Wysocki wrote:
> On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> <[email protected]> wrote:
> >
> > On Intel hybrid parts, each type of CPU has specific performance and
> > energy efficiency capabilities. The Intel Thread Director technology
> > extends the Hardware Feedback Interface (HFI) to provide performance and
> > energy efficiency data for advanced classes of instructions.
> >
> > Add support to parse per-class capabilities.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Ionela Voinescu <[email protected]>
> > Cc: Joel Fernandes (Google) <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Lukasz Luba <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Ricardo Neri <[email protected]>
>
> No objections to this patch, so
>
> Acked-by: Rafael J. Wysocki <[email protected]>
Thank you Rafael!
On Mon, Mar 27, 2023 at 06:37:32PM +0200, Rafael J. Wysocki wrote:
> On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> <[email protected]> wrote:
> >
> > The scheduler reads the IPCC scores when balancing load. These reads can
> > be quite frequent. Hardware can also update the HFI table frequently.
> > Concurrent access may cause a lot of lock contention. It gets worse as the
> > number of CPUs increases.
> >
> > Instead, create separate per-CPU IPCC scores that the scheduler can read
> > without the HFI table lock.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Ionela Voinescu <[email protected]>
> > Cc: Joel Fernandes (Google) <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Lukasz Luba <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Suggested-by: Peter Zijlstra (Intel) <[email protected]>
> > Signed-off-by: Ricardo Neri <[email protected]>
> > ---
> > Changes since v2:
> > * Only create these per-CPU variables when Intel Thread Director is
> > supported.
> >
> > Changes since v1:
> > * Added this patch.
> > ---
> > drivers/thermal/intel/intel_hfi.c | 46 +++++++++++++++++++++++++++++++
> > 1 file changed, 46 insertions(+)
> >
> > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > index 2527ae3836c7..b06021828892 100644
> > --- a/drivers/thermal/intel/intel_hfi.c
> > +++ b/drivers/thermal/intel/intel_hfi.c
> > @@ -29,6 +29,7 @@
> > #include <linux/kernel.h>
> > #include <linux/math.h>
> > #include <linux/mutex.h>
> > +#include <linux/percpu.h>
> > #include <linux/percpu-defs.h>
> > #include <linux/printk.h>
> > #include <linux/processor.h>
> > @@ -170,6 +171,43 @@ static struct workqueue_struct *hfi_updates_wq;
> > #define HFI_UPDATE_INTERVAL HZ
> > #define HFI_MAX_THERM_NOTIFY_COUNT 16
> >
> > +#ifdef CONFIG_IPC_CLASSES
>
> It would be good to provide a (concise) description of this variable.
>
> > +static int __percpu *hfi_ipcc_scores;
Do you mean hfi_ipcc_scores or CONFIG_IPC_CLASSES?
> > +
> > +static int alloc_hfi_ipcc_scores(void)
> > +{
> > + if (!cpu_feature_enabled(X86_FEATURE_ITD))
> > + return 0;
> > +
> > + hfi_ipcc_scores = __alloc_percpu(sizeof(*hfi_ipcc_scores) *
> > + hfi_features.nr_classes,
> > + sizeof(*hfi_ipcc_scores));
> > +
> > + return !hfi_ipcc_scores;
>
> I would do
>
> if (!hfi_ipcc_scores)
> return -ENOMEM;
>
> return 0;
>
> Or make the function return bool.
Sure, I can make this function return -ENOMEM.
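i.e. (untested):

static int alloc_hfi_ipcc_scores(void)
{
        if (!cpu_feature_enabled(X86_FEATURE_ITD))
                return 0;

        hfi_ipcc_scores = __alloc_percpu(sizeof(*hfi_ipcc_scores) *
                                         hfi_features.nr_classes,
                                         sizeof(*hfi_ipcc_scores));
        if (!hfi_ipcc_scores)
                return -ENOMEM;

        return 0;
}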
>
> > +}
> > +
> > +static void set_hfi_ipcc_score(void *caps, int cpu)
> > +{
> > + int i, *hfi_class;
> > +
> > + if (!cpu_feature_enabled(X86_FEATURE_ITD))
> > + return;
> > +
> > + hfi_class = per_cpu_ptr(hfi_ipcc_scores, cpu);
> > +
> > + for (i = 0; i < hfi_features.nr_classes; i++) {
> > + struct hfi_cpu_data *class_caps;
> > +
> > + class_caps = caps + i * hfi_features.class_stride;
> > + WRITE_ONCE(hfi_class[i], class_caps->perf_cap);
>
> As it stands, it is unclear why WRITE_ONCE() is needed here.
The CPU handling the HFI interrupt will update all the per-CPU IPCC
scores. My intention is to ensure that a WRITE of a given IPCC score
is completed before another CPU READs an IPCC score. The corresponding
READ_ONCE happens in patch 15.
On Mon, Mar 27, 2023 at 06:51:33PM +0200, Rafael J. Wysocki wrote:
> On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> <[email protected]> wrote:
> >
> > A task may be unclassified if it has been recently created, spends most
> > of its lifetime sleeping, or hardware has not provided a classification.
> >
> > Most tasks will be eventually classified as scheduler's IPC class 1
> > (HFI class 0). This class corresponds to the capabilities in the legacy,
> > classless, HFI table.
> >
> > IPC class 1 is a reasonable choice until hardware provides an actual
> > classification. Meanwhile, the scheduler will place classes of tasks with
> > higher IPC scores on higher-performance CPUs.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Ionela Voinescu <[email protected]>
> > Cc: Joel Fernandes (Google) <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Lukasz Luba <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Ricardo Neri <[email protected]>
>
> Fine with me, so
>
> Acked-by: Rafael J. Wysocki <[email protected]>
Thank you Rafael!
On Mon, Mar 27, 2023 at 06:53:40PM +0200, Rafael J. Wysocki wrote:
> On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> <[email protected]> wrote:
> >
> > Enable Intel Thread Director from the CPU hotplug callback: globally from
> > CPU0 and then enable the thread-classification hardware in each logical
> > processor individually.
> >
> > Also, initialize the number of classes supported.
> >
> > Let the scheduler know that it can start using IPC classes.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Ionela Voinescu <[email protected]>
> > Cc: Joel Fernandes (Google) <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Lukasz Luba <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Ricardo Neri <[email protected]>
>
> For the changes in intel_hfi.c
>
> Acked-by: Rafael J. Wysocki <[email protected]>
Thank you Rafael!
On Mon, Mar 27, 2023 at 07:03:08PM +0200, Rafael J. Wysocki wrote:
> On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> <[email protected]> wrote:
> >
> > In Alder Lake and Raptor Lake, the result of thread classification is more
> > accurate when only one SMT sibling is busy. Classification results for
> > class 2 and 3 are always reliable.
> >
> > To avoid unnecessary migrations, only update the class of a task if it has
> > been the same during 4 consecutive user ticks.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Ionela Voinescu <[email protected]>
> > Cc: Joel Fernandes (Google) <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Lukasz Luba <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Ricardo Neri <[email protected]>
> > ---
> > Changes since v2:
> > * None
> >
> > Changes since v1:
> > > * Adjusted the result of the classification of Intel Thread Director to start
> > at class 1. Class 0 for the scheduler means that the task is
> > unclassified.
> > * Used the new names of the IPC classes members in task_struct.
> > * Reworked helper functions to use sched_smt_siblings_idle() to query
> > the idle state of the SMT siblings of a CPU.
> > ---
> > drivers/thermal/intel/intel_hfi.c | 60 ++++++++++++++++++++++++++++++-
> > 1 file changed, 59 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > index 35d947f47550..fdb53e4cabc1 100644
> > --- a/drivers/thermal/intel/intel_hfi.c
> > +++ b/drivers/thermal/intel/intel_hfi.c
> > @@ -40,6 +40,7 @@
> > #include <linux/workqueue.h>
> >
> > #include <asm/msr.h>
> > +#include <asm/intel-family.h>
> >
> > #include "../thermal_core.h"
> > #include "intel_hfi.h"
> > @@ -209,9 +210,64 @@ static int __percpu *hfi_ipcc_scores;
> > */
> > #define HFI_UNCLASSIFIED_DEFAULT 1
> >
> > +#define CLASS_DEBOUNCER_SKIPS 4
> > +
> > +/**
> > + * debounce_and_update_class() - Process and update a task's classification
> > + *
> > + * @p: The task of which the classification will be updated
> > + * @new_ipcc: The new IPC classification
> > + *
> > + * Update the classification of @p with the new value that hardware provides.
> > + * Only update the classification of @p if it has been the same during
> > + * CLASS_DEBOUNCER_SKIPS consecutive ticks.
> > + */
> > +static void debounce_and_update_class(struct task_struct *p, u8 new_ipcc)
> > +{
> > + u16 debounce_skip;
> > +
> > + /* The class of @p changed. Only restart the debounce counter. */
> > + if (p->ipcc_tmp != new_ipcc) {
> > + p->ipcc_cntr = 1;
> > + goto out;
> > + }
> > +
> > + /*
> > + * The class of @p did not change. Update it if it has been the same
> > + * for CLASS_DEBOUNCER_SKIPS user ticks.
> > + */
> > + debounce_skip = p->ipcc_cntr + 1;
> > + if (debounce_skip < CLASS_DEBOUNCER_SKIPS)
> > + p->ipcc_cntr++;
> > + else
> > + p->ipcc = new_ipcc;
> > +
> > +out:
> > + p->ipcc_tmp = new_ipcc;
> > +}
>
> Why does the code above belong to the Intel HFI driver? It doesn't
> look like there is anything driver-specific in it.
That is a good point. This post-processing is specific to the
implementation of IPCC classes using Intel Thread Director.
Maybe a new file called drivers/thermal/intel/intel_itd.c would be better?
On Wed, Mar 29, 2023 at 1:32 AM Ricardo Neri
<[email protected]> wrote:
>
> On Mon, Mar 27, 2023 at 06:37:32PM +0200, Rafael J. Wysocki wrote:
> > On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> > <[email protected]> wrote:
> > >
> > > The scheduler reads the IPCC scores when balancing load. These reads can
> > > be quite frequent. Hardware can also update the HFI table frequently.
> > > Concurrent access may cause a lot of lock contention. It gets worse as the
> > > number of CPUs increases.
> > >
> > > Instead, create separate per-CPU IPCC scores that the scheduler can read
> > > without the HFI table lock.
> > >
> > > Cc: Ben Segall <[email protected]>
> > > Cc: Daniel Bristot de Oliveira <[email protected]>
> > > Cc: Dietmar Eggemann <[email protected]>
> > > Cc: Ionela Voinescu <[email protected]>
> > > Cc: Joel Fernandes (Google) <[email protected]>
> > > Cc: Len Brown <[email protected]>
> > > Cc: Lukasz Luba <[email protected]>
> > > Cc: Mel Gorman <[email protected]>
> > > Cc: Rafael J. Wysocki <[email protected]>
> > > Cc: Srinivas Pandruvada <[email protected]>
> > > Cc: Steven Rostedt <[email protected]>
> > > Cc: Tim C. Chen <[email protected]>
> > > Cc: Valentin Schneider <[email protected]>
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Suggested-by: Peter Zijlstra (Intel) <[email protected]>
> > > Signed-off-by: Ricardo Neri <[email protected]>
> > > ---
> > > Changes since v2:
> > > * Only create these per-CPU variables when Intel Thread Director is
> > > supported.
> > >
> > > Changes since v1:
> > > * Added this patch.
> > > ---
> > > drivers/thermal/intel/intel_hfi.c | 46 +++++++++++++++++++++++++++++++
> > > 1 file changed, 46 insertions(+)
> > >
> > > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > > index 2527ae3836c7..b06021828892 100644
> > > --- a/drivers/thermal/intel/intel_hfi.c
> > > +++ b/drivers/thermal/intel/intel_hfi.c
> > > @@ -29,6 +29,7 @@
> > > #include <linux/kernel.h>
> > > #include <linux/math.h>
> > > #include <linux/mutex.h>
> > > +#include <linux/percpu.h>
> > > #include <linux/percpu-defs.h>
> > > #include <linux/printk.h>
> > > #include <linux/processor.h>
> > > @@ -170,6 +171,43 @@ static struct workqueue_struct *hfi_updates_wq;
> > > #define HFI_UPDATE_INTERVAL HZ
> > > #define HFI_MAX_THERM_NOTIFY_COUNT 16
> > >
> > > +#ifdef CONFIG_IPC_CLASSES
> >
> > It would be good to provide a (concise) description of this variable.
> >
> > > +static int __percpu *hfi_ipcc_scores;
>
> Do you mean hfi_ipcc_scores or CONFIG_IPC_CLASSES?
hfi_ipcc_scores (as the latter is not a variable).
On Wed, Mar 29, 2023 at 1:31 AM Ricardo Neri
<[email protected]> wrote:
>
> On Mon, Mar 27, 2023 at 06:42:28PM +0200, Rafael J. Wysocki wrote:
> > On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> > <[email protected]> wrote:
> > >
> > > Use Intel Thread Director classification to update the IPC class of a
> > > task. Implement the arch_update_ipcc() interface of the scheduler.
> > >
> > > Cc: Ben Segall <[email protected]>
> > > Cc: Daniel Bristot de Oliveira <[email protected]>
> > > Cc: Dietmar Eggemann <[email protected]>
> > > Cc: Ionela Voinescu <[email protected]>
> > > Cc: Joel Fernandes (Google) <[email protected]>
> > > Cc: Len Brown <[email protected]>
> > > Cc: Lukasz Luba <[email protected]>
> > > Cc: Mel Gorman <[email protected]>
> > > Cc: Rafael J. Wysocki <[email protected]>
> > > Cc: Srinivas Pandruvada <[email protected]>
> > > Cc: Steven Rostedt <[email protected]>
> > > Cc: Tim C. Chen <[email protected]>
> > > Cc: Valentin Schneider <[email protected]>
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Signed-off-by: Ricardo Neri <[email protected]>
> > > ---
> > > Changes since v2:
> > > * Removed the implementation of arch_has_ipc_classes().
> > >
> > > Changes since v1:
> > > * Adjusted the result of the classification of Intel Thread Director to start
> > > at class 1. Class 0 for the scheduler means that the task is
> > > unclassified.
> > > * Redefined union hfi_thread_feedback_char_msr to ensure all
> > > bit-fields are packed. (PeterZ)
> > > * Removed CONFIG_INTEL_THREAD_DIRECTOR. (PeterZ)
> > > * Shortened the names of the functions that implement IPC classes.
> > > * Removed argument smt_siblings_idle from intel_hfi_update_ipcc().
> > > (PeterZ)
> > > ---
> > > arch/x86/include/asm/topology.h | 6 ++++++
> > > drivers/thermal/intel/intel_hfi.c | 32 +++++++++++++++++++++++++++++++
> > > 2 files changed, 38 insertions(+)
> > >
> > > diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> > > index 458c891a8273..ffcdac3f398f 100644
> > > --- a/arch/x86/include/asm/topology.h
> > > +++ b/arch/x86/include/asm/topology.h
> > > @@ -227,4 +227,10 @@ void init_freq_invariance_cppc(void);
> > > #define arch_init_invariance_cppc init_freq_invariance_cppc
> > > #endif
> > >
> > > +#if defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL)
> > > +void intel_hfi_update_ipcc(struct task_struct *curr);
> > > +
> > > +#define arch_update_ipcc intel_hfi_update_ipcc
> > > +#endif /* defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL) */
> > > +
> > > #endif /* _ASM_X86_TOPOLOGY_H */
> > > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > > index b06021828892..530dcf57e06e 100644
> > > --- a/drivers/thermal/intel/intel_hfi.c
> > > +++ b/drivers/thermal/intel/intel_hfi.c
> > > @@ -72,6 +72,17 @@ union cpuid6_edx {
> > > u32 full;
> > > };
> > >
> > > +#ifdef CONFIG_IPC_CLASSES
> > > +union hfi_thread_feedback_char_msr {
> > > + struct {
> > > + u64 classid : 8;
> > > + u64 __reserved : 55;
> > > + u64 valid : 1;
> > > + } split;
> > > + u64 full;
> > > +};
> > > +#endif
> > > +
> > > /**
> > > * struct hfi_cpu_data - HFI capabilities per CPU
> > > * @perf_cap: Performance capability
> > > @@ -174,6 +185,27 @@ static struct workqueue_struct *hfi_updates_wq;
> > > #ifdef CONFIG_IPC_CLASSES
> > > static int __percpu *hfi_ipcc_scores;
> > >
> > > +void intel_hfi_update_ipcc(struct task_struct *curr)
> > > +{
> > > + union hfi_thread_feedback_char_msr msr;
> > > +
> > > + /* We should not be here if ITD is not supported. */
> > > + if (!cpu_feature_enabled(X86_FEATURE_ITD)) {
> > > + pr_warn_once("task classification requested but not supported!");
> > > + return;
> > > + }
> > > +
> > > + rdmsrl(MSR_IA32_HW_FEEDBACK_CHAR, msr.full);
> > > + if (!msr.split.valid)
> > > + return;
> > > +
> > > + /*
> > > + * 0 is a valid classification for Intel Thread Director. A scheduler
> > > + * IPCC class of 0 means that the task is unclassified. Adjust.
> > > + */
> > > + curr->ipcc = msr.split.classid + 1;
> > > +}
> >
> > Wouldn't it be better to return the adjusted value from this function
> > and let the caller store it where appropriate?
> >
> > It doesn't look like it is necessary to pass the task_struct pointer to it.
>
> Judging from this patch alone, yes, it does not make much sense to pass a
> task_struct as an argument. In patch 21, however, this function uses
> various members of task_struct, which makes it more convenient to have
> it as an argument, no?
I'm not convinced about this, but anyway it is better to combine the
two patches in such cases IMO.
The way it is done now confuses things from my perspective.
> >
> > > +
> > > static int alloc_hfi_ipcc_scores(void)
> > > {
> > > if (!cpu_feature_enabled(X86_FEATURE_ITD))
> > > --
On Wed, Mar 29, 2023 at 2:04 AM Ricardo Neri
<[email protected]> wrote:
>
> On Mon, Mar 27, 2023 at 07:03:08PM +0200, Rafael J. Wysocki wrote:
> > On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> > <[email protected]> wrote:
> > >
> > > In Alder Lake and Raptor Lake, the result of thread classification is more
> > > accurate when only one SMT sibling is busy. Classification results for
> > > class 2 and 3 are always reliable.
> > >
> > > To avoid unnecessary migrations, only update the class of a task if it has
> > > been the same during 4 consecutive user ticks.
> > >
> > > Cc: Ben Segall <[email protected]>
> > > Cc: Daniel Bristot de Oliveira <[email protected]>
> > > Cc: Dietmar Eggemann <[email protected]>
> > > Cc: Ionela Voinescu <[email protected]>
> > > Cc: Joel Fernandes (Google) <[email protected]>
> > > Cc: Len Brown <[email protected]>
> > > Cc: Lukasz Luba <[email protected]>
> > > Cc: Mel Gorman <[email protected]>
> > > Cc: Rafael J. Wysocki <[email protected]>
> > > Cc: Srinivas Pandruvada <[email protected]>
> > > Cc: Steven Rostedt <[email protected]>
> > > Cc: Tim C. Chen <[email protected]>
> > > Cc: Valentin Schneider <[email protected]>
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Signed-off-by: Ricardo Neri <[email protected]>
> > > ---
> > > Changes since v2:
> > > * None
> > >
> > > Changes since v1:
> > > * Adjusted the result of the classification of Intel Thread Director to start
> > > at class 1. Class 0 for the scheduler means that the task is
> > > unclassified.
> > > * Used the new names of the IPC classes members in task_struct.
> > > * Reworked helper functions to use sched_smt_siblings_idle() to query
> > > the idle state of the SMT siblings of a CPU.
> > > ---
> > > drivers/thermal/intel/intel_hfi.c | 60 ++++++++++++++++++++++++++++++-
> > > 1 file changed, 59 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > > index 35d947f47550..fdb53e4cabc1 100644
> > > --- a/drivers/thermal/intel/intel_hfi.c
> > > +++ b/drivers/thermal/intel/intel_hfi.c
> > > @@ -40,6 +40,7 @@
> > > #include <linux/workqueue.h>
> > >
> > > #include <asm/msr.h>
> > > +#include <asm/intel-family.h>
> > >
> > > #include "../thermal_core.h"
> > > #include "intel_hfi.h"
> > > @@ -209,9 +210,64 @@ static int __percpu *hfi_ipcc_scores;
> > > */
> > > #define HFI_UNCLASSIFIED_DEFAULT 1
> > >
> > > +#define CLASS_DEBOUNCER_SKIPS 4
> > > +
> > > +/**
> > > + * debounce_and_update_class() - Process and update a task's classification
> > > + *
> > > + * @p: The task of which the classification will be updated
> > > + * @new_ipcc: The new IPC classification
> > > + *
> > > + * Update the classification of @p with the new value that hardware provides.
> > > + * Only update the classification of @p if it has been the same during
> > > + * CLASS_DEBOUNCER_SKIPS consecutive ticks.
> > > + */
> > > +static void debounce_and_update_class(struct task_struct *p, u8 new_ipcc)
> > > +{
> > > + u16 debounce_skip;
> > > +
> > > + /* The class of @p changed. Only restart the debounce counter. */
> > > + if (p->ipcc_tmp != new_ipcc) {
> > > + p->ipcc_cntr = 1;
> > > + goto out;
> > > + }
> > > +
> > > + /*
> > > + * The class of @p did not change. Update it if it has been the same
> > > + * for CLASS_DEBOUNCER_SKIPS user ticks.
> > > + */
> > > + debounce_skip = p->ipcc_cntr + 1;
> > > + if (debounce_skip < CLASS_DEBOUNCER_SKIPS)
> > > + p->ipcc_cntr++;
> > > + else
> > > + p->ipcc = new_ipcc;
> > > +
> > > +out:
> > > + p->ipcc_tmp = new_ipcc;
> > > +}
> >
> > Why does the code above belong to the Intel HFI driver? It doesn't
> > look like there is anything driver-specific in it.
>
> That is a good point. This post-processing is specific to the
> implementation of IPCC classes using Intel Thread Director.
Well, the implementation-specific part is the processor model check,
whose only contribution is to say whether or not the classification is
valid. The rest appears to be fairly generic to me.
> Maybe a new file called drivers/thermal/intel/intel_itd.c would be better?
So which part of this code other than the processor model check
mentioned above is Intel-specific?
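The debouncing, for one, looks generic. A rough sketch (the function name
is made up, and CLASS_DEBOUNCER_SKIPS would have to move along with it):

void sched_debounce_ipcc(struct task_struct *p, unsigned short new_ipcc)
{
        /* The class of @p changed; only restart the debounce counter. */
        if (p->ipcc_tmp != new_ipcc)
                p->ipcc_cntr = 1;
        else if (p->ipcc_cntr + 1 < CLASS_DEBOUNCER_SKIPS)
                p->ipcc_cntr++;
        else
                p->ipcc = new_ipcc;

        p->ipcc_tmp = new_ipcc;
}

Only classification_is_accurate() and the model check would stay in the
driver.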
On Wed, Mar 29, 2023 at 1:30 AM Ricardo Neri
<[email protected]> wrote:
>
> On Mon, Mar 27, 2023 at 06:50:13PM +0200, Rafael J. Wysocki wrote:
> > On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> > <[email protected]> wrote:
> > >
> > > Implement the arch_get_ipcc_score() interface of the scheduler. Use the
> > > performance capabilities of the extended Hardware Feedback Interface table
> > > as the IPC score.
> > >
> > > Cc: Ben Segall <[email protected]>
> > > Cc: Daniel Bristot de Oliveira <[email protected]>
> > > Cc: Dietmar Eggemann <[email protected]>
> > > Cc: Ionela Voinescu <[email protected]>
> > > Cc: Joel Fernandes (Google) <[email protected]>
> > > Cc: Len Brown <[email protected]>
> > > Cc: Lukasz Luba <[email protected]>
> > > Cc: Mel Gorman <[email protected]>
> > > Cc: Rafael J. Wysocki <[email protected]>
> > > Cc: Srinivas Pandruvada <[email protected]>
> > > Cc: Steven Rostedt <[email protected]>
> > > Cc: Tim C. Chen <[email protected]>
> > > Cc: Valentin Schneider <[email protected]>
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Signed-off-by: Ricardo Neri <[email protected]>
> > > ---
> > > Changes since v2:
> > > * None
> > >
> > > Changes since v1:
> > > * Adjusted the returned HFI class (which starts at 0) to match the
> > > scheduler IPCC class (which starts at 1). (PeterZ)
> > > * Used the new interface names.
> > > ---
> > > arch/x86/include/asm/topology.h | 2 ++
> > > drivers/thermal/intel/intel_hfi.c | 27 +++++++++++++++++++++++++++
> > > 2 files changed, 29 insertions(+)
> > >
> > > diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> > > index ffcdac3f398f..c4fcd9c3c634 100644
> > > --- a/arch/x86/include/asm/topology.h
> > > +++ b/arch/x86/include/asm/topology.h
> > > @@ -229,8 +229,10 @@ void init_freq_invariance_cppc(void);
> > >
> > > #if defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL)
> > > void intel_hfi_update_ipcc(struct task_struct *curr);
> > > +unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu);
> > >
> > > #define arch_update_ipcc intel_hfi_update_ipcc
> > > +#define arch_get_ipcc_score intel_hfi_get_ipcc_score
> > > #endif /* defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL) */
> > >
> > > #endif /* _ASM_X86_TOPOLOGY_H */
> > > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > > index 530dcf57e06e..fa9b4a678d92 100644
> > > --- a/drivers/thermal/intel/intel_hfi.c
> > > +++ b/drivers/thermal/intel/intel_hfi.c
> > > @@ -206,6 +206,33 @@ void intel_hfi_update_ipcc(struct task_struct *curr)
> > > curr->ipcc = msr.split.classid + 1;
> > > }
> > >
> > > +unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
> > > +{
> > > + unsigned short hfi_class;
> >
> > It looks like the variable above is only used to save a subtraction or
> > addition of 1 to something going forward.
> >
> > > + int *scores;
> > > +
> > > + if (cpu < 0 || cpu >= nr_cpu_ids)
> > > + return -EINVAL;
> > > +
> > > + if (ipcc == IPC_CLASS_UNCLASSIFIED)
> > > + return -EINVAL;
> > > +
> > > + /*
> > > + * Scheduler IPC classes start at 1. HFI classes start at 0.
> > > > + * See note in intel_hfi_update_ipcc().
> > > + */
> > > + hfi_class = ipcc - 1;
> > > +
> > > + if (hfi_class >= hfi_features.nr_classes)
> >
> > Personally, I would do
> >
> > if (ipcc >= hfi_features.nr_classes + 1)
> >
> > here and ->
> >
> > > + return -EINVAL;
> > > +
> > > + scores = per_cpu_ptr(hfi_ipcc_scores, cpu);
> > > + if (!scores)
> > > + return -ENODEV;
> > > +
> >
> > -> scores[ipcc - 1]
>
> Sure, I can get rid of hfi_class.
>
> >
> > below.
> >
> > > + return READ_ONCE(scores[hfi_class]);
> >
> > And why does this need to use READ_ONCE()?
>
> This is the corresponding read of the WRITE_ONCE in patch 13. The CPU
> handling the HFI interrupt, very likely a different CPU, updates
> scores[hfi_class]. My intention is to let that write complete before
> reading the score here.
However, READ_ONCE()/WRITE_ONCE() only affect compiler optimizations
AFAICS. What if the CPUs running the code reorder the instructions?
In any case, IMV the reason why READ_ONCE() is used needs to be clear
to the reviewers from the patch itself (and to a casual reader of the
code from the code itself).
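For example, a comment along these lines at the read site (assuming the
concern is load/store tearing rather than ordering, and that a slightly
stale score is acceptable; both are my reading of the thread, not
something the patches state):

        /*
         * scores[] is updated from the HFI interrupt handler, most likely
         * on another CPU. READ_ONCE() pairs with the WRITE_ONCE() there to
         * avoid load/store tearing. No stronger ordering is implied; a
         * stale score merely yields a slightly suboptimal balancing
         * decision.
         */
        return READ_ONCE(scores[hfi_class]);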
> >
> > > +}
> > > +
> > > static int alloc_hfi_ipcc_scores(void)
> > > {
> > > if (!cpu_feature_enabled(X86_FEATURE_ITD))
> > > --
On Tue, Mar 28, 2023 at 12:00:58PM +0200, Vincent Guittot wrote:
> On Tue, 7 Feb 2023 at 06:01, Ricardo Neri
> <[email protected]> wrote:
> >
> > Compute the joint total (both current and prospective) IPC class score of
> > a scheduling group and the local scheduling group.
> >
> > These IPCC statistics are used during idle load balancing. The candidate
> > scheduling group will have one fewer busy CPU after load balancing. This
> > observation is important for cores with SMT support.
> >
> > The IPCC score of scheduling groups composed of SMT siblings needs to
> > consider that the siblings share CPU resources. When computing the total
> > IPCC score of the scheduling group, divide the score of each sibling by the
> > number of busy siblings.
> >
> > Collect IPCC statistics for asym_packing and fully_busy scheduling groups.
>
> IPCC statistics collect scores of current tasks, so they are
> meaningful only when trying to migrate one of those running tasks.
> Using such a score when pulling other tasks is meaningless. And I
> don't see how you ensure such correct use of the IPCC score.
Thank you very much for your feedback Vincent!
It is true that the task that is current when collecting statistics may be
different from the task that is current when we are ready to pull tasks.
Using IPCC scores for load balancing benefits large, long-running tasks
the most. For these tasks, the current task is likely to remain the same
at the two mentioned points in time.
My patchset proposes to use IPCC classes to break ties between otherwise
identical sched groups in update_sd_pick_busiest(). Its use is limited to
the asym_packing and fully_busy types. For these types, it is likely that
there will be no tasks wanting to run other than current;
need_active_balance() will return true and we will migrate the current
task.
You are correct: by only looking at the current tasks, we risk overlooking
other tasks in the queue and the statistics becoming meaningless. A fully
correct solution would need to keep track of the types of tasks in all
runqueues as they come and go. IMO, the increased complexity of such an
approach does not justify the benefit. We give the load balancer extra
information to decide between otherwise identical sched groups using the
IPCC statistics of big tasks.
>
> > When picking a busiest group, they are used to break ties between otherwise
> > identical groups.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Ionela Voinescu <[email protected]>
> > Cc: Joel Fernandes (Google) <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Lukasz Luba <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Ricardo Neri <[email protected]>
> > ---
> > Changes since v2:
> > * Also collect IPCC stats for fully_busy sched groups.
>
> Why have you added the fully_busy case? It's worth explaining the
> rationale, because there is a good chance of using the IPCC score of the
> wrong task to make the decision.
When deciding between two fully_busy sched groups, update_sd_pick_busiest()
unconditionally selects the given candidate @sg as busiest. With IPCC
scores, what is running in a scheduling group becomes important. We now
have extra information to choose between the two fully_busy groups. This
is done in patch 9.
(Also note that in [1] I tweak the logic to select fully_busy SMT cores vs
fully_busy non-SMT cores).
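The shape of the tie-break is roughly this (a sketch of the idea only;
the exact argument list of sched_asym_ipcc_pick() here is my guess, not
the actual diff from patch 9):

        case group_fully_busy:
                /*
                 * Both candidates are fully busy. Rather than always
                 * keeping @sg, prefer the group whose tasks would gain
                 * more from moving work to dst_cpu.
                 */
                return sched_asym_ipcc_pick(sg, sds->busiest, sgs,
                                            &sds->busiest_stat);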
>
> > * Restrict use of IPCC stats to SD_ASYM_PACKING. (Ionela)
> > * Handle errors of arch_get_ipcc_score(). (Ionela)
> >
> > Changes since v1:
> > * Implemented cleanups and reworks from PeterZ. I took all his
> > suggestions, except the computation of the IPC score before and after
> > load balancing. We are computing not the average score, but the *total*.
> > * Check for the SD_SHARE_CPUCAPACITY to compute the throughput of the SMT
> > siblings of a physical core.
> > * Used the new interface names.
> > * Reworded commit message for clarity.
> > ---
> > kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 68 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d773380a95b3..b6165aa8a376 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8901,6 +8901,8 @@ struct sg_lb_stats {
> > unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
> > unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
> > unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
> > + long ipcc_score_after; /* Prospective IPCC score after load balancing */
> > + unsigned long ipcc_score_before; /* IPCC score before load balancing */
> > #endif
> > };
> >
> > @@ -9287,6 +9289,62 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> > }
> > }
> >
> > +static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
> > + struct sched_group *sg,
> > + struct lb_env *env)
> > +{
> > + unsigned long score_on_dst_cpu, before;
> > + int busy_cpus;
> > + long after;
> > +
> > + if (!sched_ipcc_enabled())
> > + return;
> > +
> > + /*
> > + * IPCC scores are only useful during idle load balancing. For now,
> > + * only asym_packing uses IPCC scores.
> > + */
> > + if (!(env->sd->flags & SD_ASYM_PACKING) ||
> > + env->idle == CPU_NOT_IDLE)
> > + return;
> > +
> > + /*
> > + * IPCC scores are used to break ties only between these types of
> > + * groups.
> > + */
> > + if (sgs->group_type != group_fully_busy &&
> > + sgs->group_type != group_asym_packing)
> > + return;
> > +
> > + busy_cpus = sgs->group_weight - sgs->idle_cpus;
> > +
> > + /* No busy CPUs in the group. No tasks to move. */
> > + if (!busy_cpus)
> > + return;
> > +
> > + score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
> > +
> > + /*
> > + * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
> > + * and not used.
> > + */
> > + if (IS_ERR_VALUE(score_on_dst_cpu))
> > + return;
> > +
> > + before = sgs->sum_score;
> > + after = before - sgs->min_score;
>
> IIUC, you assume that you will select the CPU with the min score.
> How do you ensure this? Otherwise, all your comparisons are useless.
This is relevant for SMT cores. A smaller idle core will help a fully_busy
SMT core by pulling low-IPC work, leaving the whole big core to high-IPC
work. We ensure (or anticipate, if you will) this because
find_busiest_queue() will select the queue whose current task gets the
biggest IPC boost on dst_cpu. When moving from big to small cores, the IPC
boost is negative.
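For example (made-up numbers): if rq::curr's class scores 1000 on the big
CPU where it runs but only 700 on a small dst_cpu, ipcc_score_delta()
returns -300. Among queues with equal nr_running, we then keep the one
whose current task loses the least (or gains the most) from the move.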
>
> > +
> > + /* SMT siblings share throughput. */
> > + if (busy_cpus > 1 && sg->flags & SD_SHARE_CPUCAPACITY) {
> > + before /= busy_cpus;
> > + /* One sibling will become idle after load balance. */
> > + after /= busy_cpus - 1;
> > + }
> > +
> > + sgs->ipcc_score_after = after + score_on_dst_cpu;
> > + sgs->ipcc_score_before = before;
>
> I'm not sure I understand why you are computing sum_score,
> ipcc_score_before and ipcc_score_after. Why is it not sufficient to
> check whether score_on_dst_cpu will be higher than the current
> min_score?
You are right. As the source core becomes idle after load balancing, it is
sufficient to look for the highest score_on_dst_cpu. However, we must also
handle SMT cores.
If the source sched group is a fully_busy SMT core, one of the siblings
will become idle after load balance (my patchset uses IPCC scores for
asym_packing and when balancing the number of idle CPUs from fully_busy).
The remaining non-idle siblings will benefit from the extra throughput.
We calculate ipcc_score_after to find the sched group that would benefit
the most. We calculate ipcc_score_before to break ties between two groups
with identical ipcc_score_after.
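A worked example with made-up scores: a fully_busy SMT core has two busy
siblings whose current tasks score 600 and 200, so sum_score = 800 and
min_score = 200. Then before = 800 / 2 = 400 and after = (800 - 200) / 1
= 600. If the minimum-score task would score 250 on dst_cpu, then
ipcc_score_after = 600 + 250 = 850, and ipcc_score_before = 400 breaks a
tie should another group also reach 850.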
Thanks and BR,
Ricardo
[1]. https://lore.kernel.org/lkml/[email protected]/
On Tue, Mar 28, 2023 at 12:03:58PM +0200, Vincent Guittot wrote:
> On Tue, 7 Feb 2023 at 06:01, Ricardo Neri
> <[email protected]> wrote:
> >
> > For two runqueues of equal priority and an equal number of running tasks,
> > select the one whose current task would have the highest IPC class score
> > if placed on the destination CPU.
>
> You have not explained why it makes sense to compare the current task's
> score when we will most probably not pull this task in the end.
Thank you for your feedback Vincent! Please kindly refer to my reply to
your feedback in patch 7.
> >
> > For now, use IPCC scores only for scheduling domains with the
> > SD_ASYM_PACKING flag.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Ionela Voinescu <[email protected]>
> > Cc: Joel Fernandes (Google) <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Lukasz Luba <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Ricardo Neri <[email protected]>
> > ---
> > Changes since v2:
> > * Only use IPCC scores to break ties if the sched domain uses
> > asym_packing. (Ionela)
> > * Handle errors of arch_get_ipcc_score(). (Ionela)
> >
> > Changes since v1:
> > * Fixed a bug when selecting a busiest runqueue: when comparing two
> > runqueues with equal nr_running, we must compute the IPCC score delta
> > of both.
> > * Renamed local variables to improve the layout of the code block.
> > (PeterZ)
> > * Used the new interface names.
> > ---
> > kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 64 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 72d88270b320..d3c22dc145f7 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9399,6 +9399,37 @@ static bool sched_asym_ipcc_pick(struct sched_group *a,
> > return sched_asym_ipcc_prefer(a_stats, b_stats);
> > }
> >
> > +/**
> > + * ipcc_score_delta - Get the IPCC score delta wrt the load balance's dst_cpu
> > + * @p: A task
> > + * @env: Load balancing environment
> > + *
> > + * Returns: The IPCC score delta that @p would get if placed in the destination
> > + * CPU of @env. LONG_MIN to indicate that the delta should not be used.
> > + */
> > +static long ipcc_score_delta(struct task_struct *p, struct lb_env *env)
> > +{
> > + unsigned long score_src, score_dst;
> > + unsigned short ipcc = p->ipcc;
> > +
> > + if (!sched_ipcc_enabled())
> > + return LONG_MIN;
> > +
> > + /* Only asym_packing uses IPCC scores at the moment. */
> > + if (!(env->sd->flags & SD_ASYM_PACKING))
> > + return LONG_MIN;
> > +
> > + score_dst = arch_get_ipcc_score(ipcc, env->dst_cpu);
> > + if (IS_ERR_VALUE(score_dst))
> > + return LONG_MIN;
> > +
> > + score_src = arch_get_ipcc_score(ipcc, task_cpu(p));
> > + if (IS_ERR_VALUE(score_src))
> > + return LONG_MIN;
> > +
> > + return score_dst - score_src;
> > +}
> > +
> > #else /* CONFIG_IPC_CLASSES */
> > static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> > struct rq *rq)
> > @@ -9429,6 +9460,11 @@ static bool sched_asym_ipcc_pick(struct sched_group *a,
> > return false;
> > }
> >
> > +static long ipcc_score_delta(struct task_struct *p, struct lb_env *env)
> > +{
> > + return LONG_MIN;
> > +}
> > +
> > #endif /* CONFIG_IPC_CLASSES */
> >
> > /**
> > @@ -10589,6 +10625,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> > {
> > struct rq *busiest = NULL, *rq;
> > unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
> > + long busiest_ipcc_delta = LONG_MIN;
> > unsigned int busiest_nr = 0;
> > int i;
> >
> > @@ -10705,8 +10742,35 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> >
> > case migrate_task:
> > if (busiest_nr < nr_running) {
> > + struct task_struct *curr;
> > +
> > busiest_nr = nr_running;
> > busiest = rq;
> > +
> > + /*
> > + * Remember the IPCC score delta of busiest::curr.
> > + * We may need it to break a tie with other queues
> > + * with equal nr_running.
> > + */
> > + curr = rcu_dereference(busiest->curr);
> > + busiest_ipcc_delta = ipcc_score_delta(curr, env);
>
> Hmm, I don't like this at all.
>
> Also, curr is the least probable task to be pulled, which means that
> all of this is useless.
but when doing asym_packing balancing nr_running = 1, need_active_balance()
returns true and we will pull the current task, no? This is also true for
fully_busy groups with one task per CPU. These are the only two cases that
currently use IPCC scores.
If there is more than one task in the runqueue, the group will be
classified as overloaded and we will use neither IPCC scores nor active
balance.
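
For reference, the gating boils down to something like this sketch
(condensed from update_sg_lb_stats_scores() in patch 7; the helper name is
mine, not the actual code):

        /*
         * IPCC scores only break ties between asym_packing and fully_busy
         * groups, and only during idle load balancing on asym_packing domains.
         */
        static bool ipcc_stats_apply(struct lb_env *env, struct sg_lb_stats *sgs)
        {
                if (!sched_ipcc_enabled())
                        return false;

                if (!(env->sd->flags & SD_ASYM_PACKING) ||
                    env->idle == CPU_NOT_IDLE)
                        return false;

                return sgs->group_type == group_asym_packing ||
                       sgs->group_type == group_fully_busy;
        }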
Thanks and BR,
Ricardo
On Wed, Mar 29, 2023 at 02:08:30PM +0200, Rafael J. Wysocki wrote:
> On Wed, Mar 29, 2023 at 1:32 AM Ricardo Neri
> <[email protected]> wrote:
> >
> > On Mon, Mar 27, 2023 at 06:37:32PM +0200, Rafael J. Wysocki wrote:
> > > On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> > > <[email protected]> wrote:
> > > >
> > > > The scheduler reads the IPCC scores when balancing load. These reads can
> > > > be quite frequent. Hardware can also update the HFI table frequently.
> > > > Concurrent access may cause a lot of lock contention. It gets worse as the
> > > > number of CPUs increases.
> > > >
> > > > Instead, create separate per-CPU IPCC scores that the scheduler can read
> > > > without the HFI table lock.
> > > >
> > > > Cc: Ben Segall <[email protected]>
> > > > Cc: Daniel Bristot de Oliveira <[email protected]>
> > > > Cc: Dietmar Eggemann <[email protected]>
> > > > Cc: Ionela Voinescu <[email protected]>
> > > > Cc: Joel Fernandes (Google) <[email protected]>
> > > > Cc: Len Brown <[email protected]>
> > > > Cc: Lukasz Luba <[email protected]>
> > > > Cc: Mel Gorman <[email protected]>
> > > > Cc: Rafael J. Wysocki <[email protected]>
> > > > Cc: Srinivas Pandruvada <[email protected]>
> > > > Cc: Steven Rostedt <[email protected]>
> > > > Cc: Tim C. Chen <[email protected]>
> > > > Cc: Valentin Schneider <[email protected]>
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Suggested-by: Peter Zijlstra (Intel) <[email protected]>
> > > > Signed-off-by: Ricardo Neri <[email protected]>
> > > > ---
> > > > Changes since v2:
> > > > * Only create these per-CPU variables when Intel Thread Director is
> > > > supported.
> > > >
> > > > Changes since v1:
> > > > * Added this patch.
> > > > ---
> > > > drivers/thermal/intel/intel_hfi.c | 46 +++++++++++++++++++++++++++++++
> > > > 1 file changed, 46 insertions(+)
> > > >
> > > > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > > > index 2527ae3836c7..b06021828892 100644
> > > > --- a/drivers/thermal/intel/intel_hfi.c
> > > > +++ b/drivers/thermal/intel/intel_hfi.c
> > > > @@ -29,6 +29,7 @@
> > > > #include <linux/kernel.h>
> > > > #include <linux/math.h>
> > > > #include <linux/mutex.h>
> > > > +#include <linux/percpu.h>
> > > > #include <linux/percpu-defs.h>
> > > > #include <linux/printk.h>
> > > > #include <linux/processor.h>
> > > > @@ -170,6 +171,43 @@ static struct workqueue_struct *hfi_updates_wq;
> > > > #define HFI_UPDATE_INTERVAL HZ
> > > > #define HFI_MAX_THERM_NOTIFY_COUNT 16
> > > >
> > > > +#ifdef CONFIG_IPC_CLASSES
> > >
> > > It would be good to provide a (concise) description of this variable.
> > >
> > > > +static int __percpu *hfi_ipcc_scores;
> >
> > Do you mean hfi_ipcc_scores or CONFIG_IPC_CLASSES?
>
> hfi_ipcc_scores (as the latter is not a variable).
I thought so. Thank you for clarifying.
On Wed, Mar 29, 2023 at 02:21:57PM +0200, Rafael J. Wysocki wrote:
> On Wed, Mar 29, 2023 at 2:04 AM Ricardo Neri
> <[email protected]> wrote:
> >
> > On Mon, Mar 27, 2023 at 07:03:08PM +0200, Rafael J. Wysocki wrote:
> > > On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> > > <[email protected]> wrote:
> > > >
> > > > In Alder Lake and Raptor Lake, the result of thread classification is more
> > > > accurate when only one SMT sibling is busy. Classification results for
> > > > class 2 and 3 are always reliable.
> > > >
> > > > To avoid unnecessary migrations, only update the class of a task if it has
> > > > been the same during 4 consecutive user ticks.
> > > >
> > > > Cc: Ben Segall <[email protected]>
> > > > Cc: Daniel Bristot de Oliveira <[email protected]>
> > > > Cc: Dietmar Eggemann <[email protected]>
> > > > Cc: Ionela Voinescu <[email protected]>
> > > > Cc: Joel Fernandes (Google) <[email protected]>
> > > > Cc: Len Brown <[email protected]>
> > > > Cc: Lukasz Luba <[email protected]>
> > > > Cc: Mel Gorman <[email protected]>
> > > > Cc: Rafael J. Wysocki <[email protected]>
> > > > Cc: Srinivas Pandruvada <[email protected]>
> > > > Cc: Steven Rostedt <[email protected]>
> > > > Cc: Tim C. Chen <[email protected]>
> > > > Cc: Valentin Schneider <[email protected]>
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Signed-off-by: Ricardo Neri <[email protected]>
> > > > ---
> > > > Changes since v2:
> > > > * None
> > > >
> > > > Changes since v1:
> > > > * Adjusted the result of the classification of Intel Thread Director to start
> > > > at class 1. Class 0 for the scheduler means that the task is
> > > > unclassified.
> > > > * Used the new names of the IPC classes members in task_struct.
> > > > * Reworked helper functions to use sched_smt_siblings_idle() to query
> > > > the idle state of the SMT siblings of a CPU.
> > > > ---
> > > > drivers/thermal/intel/intel_hfi.c | 60 ++++++++++++++++++++++++++++++-
> > > > 1 file changed, 59 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > > > index 35d947f47550..fdb53e4cabc1 100644
> > > > --- a/drivers/thermal/intel/intel_hfi.c
> > > > +++ b/drivers/thermal/intel/intel_hfi.c
> > > > @@ -40,6 +40,7 @@
> > > > #include <linux/workqueue.h>
> > > >
> > > > #include <asm/msr.h>
> > > > +#include <asm/intel-family.h>
> > > >
> > > > #include "../thermal_core.h"
> > > > #include "intel_hfi.h"
> > > > @@ -209,9 +210,64 @@ static int __percpu *hfi_ipcc_scores;
> > > > */
> > > > #define HFI_UNCLASSIFIED_DEFAULT 1
> > > >
> > > > +#define CLASS_DEBOUNCER_SKIPS 4
> > > > +
> > > > +/**
> > > > + * debounce_and_update_class() - Process and update a task's classification
> > > > + *
> > > > + * @p: The task of which the classification will be updated
> > > > + * @new_ipcc: The new IPC classification
> > > > + *
> > > > + * Update the classification of @p with the new value that hardware provides.
> > > > + * Only update the classification of @p if it has been the same during
> > > > + * CLASS_DEBOUNCER_SKIPS consecutive ticks.
> > > > + */
> > > > +static void debounce_and_update_class(struct task_struct *p, u8 new_ipcc)
> > > > +{
> > > > + u16 debounce_skip;
> > > > +
> > > > + /* The class of @p changed. Only restart the debounce counter. */
> > > > + if (p->ipcc_tmp != new_ipcc) {
> > > > + p->ipcc_cntr = 1;
> > > > + goto out;
> > > > + }
> > > > +
> > > > + /*
> > > > + * The class of @p did not change. Update it if it has been the same
> > > > + * for CLASS_DEBOUNCER_SKIPS user ticks.
> > > > + */
> > > > + debounce_skip = p->ipcc_cntr + 1;
> > > > + if (debounce_skip < CLASS_DEBOUNCER_SKIPS)
> > > > + p->ipcc_cntr++;
> > > > + else
> > > > + p->ipcc = new_ipcc;
> > > > +
> > > > +out:
> > > > + p->ipcc_tmp = new_ipcc;
> > > > +}
> > >
> > > Why does the code above belong to the Intel HFI driver? It doesn't
> > > look like there is anything driver-specific in it.
> >
> > That is a good point. This post-processing is specific to the
> > implementation of IPCC classes using Intel Thread Director.
>
> Well, the implementation-specific part is the processor model check
> whose only contribution is to say whether or not the classification is
> valid. The rest appears to be fairly generic to me.
I meant to say that we use Intel Thread Director and the HFI driver to
implement the interfaces defined in patch 2. Other architectures may
implement those interfaces differently.
For Intel, we may even need different filters and debouncers for different
models.
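
As a sketch, a model-specific filter could look like this (all names here
are hypothetical, not part of the posted patches):

        /*
         * Alder Lake / Raptor Lake: classes 2 and 3 are always reliable;
         * trust other classes only when the SMT siblings are idle.
         */
        static bool adl_rpl_class_valid(u8 classid, bool smt_siblings_idle)
        {
                return classid == 2 || classid == 3 || smt_siblings_idle;
        }

        /* Dispatch per model, e.g. via x86_match_cpu(): */
        static const struct x86_cpu_id hfi_class_filter_ids[] = {
                X86_MATCH_INTEL_FAM6_MODEL(ALDERLAKE, &adl_rpl_class_valid),
                X86_MATCH_INTEL_FAM6_MODEL(RAPTORLAKE, &adl_rpl_class_valid),
                {}
        };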
>
> > Maybe a new file called drivers/thermal/intel/intel_itd.c would be better?
>
> So which part of this code other than the processor model check
> mentioned above is Intel-specific?
debounce_and_update_class() is needed for Intel processors, other
architectures may not need it or have a different solution.
On Wed, Mar 29, 2023 at 02:13:29PM +0200, Rafael J. Wysocki wrote:
> On Wed, Mar 29, 2023 at 1:31 AM Ricardo Neri
> <[email protected]> wrote:
> >
> > On Mon, Mar 27, 2023 at 06:42:28PM +0200, Rafael J. Wysocki wrote:
> > > On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> > > <[email protected]> wrote:
> > > >
> > > > Use Intel Thread Director classification to update the IPC class of a
> > > > task. Implement the arch_update_ipcc() interface of the scheduler.
> > > >
> > > > Cc: Ben Segall <[email protected]>
> > > > Cc: Daniel Bristot de Oliveira <[email protected]>
> > > > Cc: Dietmar Eggemann <[email protected]>
> > > > Cc: Ionela Voinescu <[email protected]>
> > > > Cc: Joel Fernandes (Google) <[email protected]>
> > > > Cc: Len Brown <[email protected]>
> > > > Cc: Lukasz Luba <[email protected]>
> > > > Cc: Mel Gorman <[email protected]>
> > > > Cc: Rafael J. Wysocki <[email protected]>
> > > > Cc: Srinivas Pandruvada <[email protected]>
> > > > Cc: Steven Rostedt <[email protected]>
> > > > Cc: Tim C. Chen <[email protected]>
> > > > Cc: Valentin Schneider <[email protected]>
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Signed-off-by: Ricardo Neri <[email protected]>
> > > > ---
> > > > Changes since v2:
> > > > * Removed the implementation of arch_has_ipc_classes().
> > > >
> > > > Changes since v1:
> > > > * Adjusted the result of the classification of Intel Thread Director to start
> > > > at class 1. Class 0 for the scheduler means that the task is
> > > > unclassified.
> > > > * Redefined union hfi_thread_feedback_char_msr to ensure all
> > > > bit-fields are packed. (PeterZ)
> > > > * Removed CONFIG_INTEL_THREAD_DIRECTOR. (PeterZ)
> > > > * Shortened the names of the functions that implement IPC classes.
> > > > * Removed argument smt_siblings_idle from intel_hfi_update_ipcc().
> > > > (PeterZ)
> > > > ---
> > > > arch/x86/include/asm/topology.h | 6 ++++++
> > > > drivers/thermal/intel/intel_hfi.c | 32 +++++++++++++++++++++++++++++++
> > > > 2 files changed, 38 insertions(+)
> > > >
> > > > diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> > > > index 458c891a8273..ffcdac3f398f 100644
> > > > --- a/arch/x86/include/asm/topology.h
> > > > +++ b/arch/x86/include/asm/topology.h
> > > > @@ -227,4 +227,10 @@ void init_freq_invariance_cppc(void);
> > > > #define arch_init_invariance_cppc init_freq_invariance_cppc
> > > > #endif
> > > >
> > > > +#if defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL)
> > > > +void intel_hfi_update_ipcc(struct task_struct *curr);
> > > > +
> > > > +#define arch_update_ipcc intel_hfi_update_ipcc
> > > > +#endif /* defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL) */
> > > > +
> > > > #endif /* _ASM_X86_TOPOLOGY_H */
> > > > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > > > index b06021828892..530dcf57e06e 100644
> > > > --- a/drivers/thermal/intel/intel_hfi.c
> > > > +++ b/drivers/thermal/intel/intel_hfi.c
> > > > @@ -72,6 +72,17 @@ union cpuid6_edx {
> > > > u32 full;
> > > > };
> > > >
> > > > +#ifdef CONFIG_IPC_CLASSES
> > > > +union hfi_thread_feedback_char_msr {
> > > > + struct {
> > > > + u64 classid : 8;
> > > > + u64 __reserved : 55;
> > > > + u64 valid : 1;
> > > > + } split;
> > > > + u64 full;
> > > > +};
> > > > +#endif
> > > > +
> > > > /**
> > > > * struct hfi_cpu_data - HFI capabilities per CPU
> > > > * @perf_cap: Performance capability
> > > > @@ -174,6 +185,27 @@ static struct workqueue_struct *hfi_updates_wq;
> > > > #ifdef CONFIG_IPC_CLASSES
> > > > static int __percpu *hfi_ipcc_scores;
> > > >
> > > > +void intel_hfi_update_ipcc(struct task_struct *curr)
> > > > +{
> > > > + union hfi_thread_feedback_char_msr msr;
> > > > +
> > > > + /* We should not be here if ITD is not supported. */
> > > > + if (!cpu_feature_enabled(X86_FEATURE_ITD)) {
> > > > + pr_warn_once("task classification requested but not supported!");
> > > > + return;
> > > > + }
> > > > +
> > > > + rdmsrl(MSR_IA32_HW_FEEDBACK_CHAR, msr.full);
> > > > + if (!msr.split.valid)
> > > > + return;
> > > > +
> > > > + /*
> > > > + * 0 is a valid classification for Intel Thread Director. A scheduler
> > > > + * IPCC class of 0 means that the task is unclassified. Adjust.
> > > > + */
> > > > + curr->ipcc = msr.split.classid + 1;
> > > > +}
> > >
> > > Wouldn't it be better to return the adjusted value from this function
> > > and let the caller store it where appropriate?
> > >
> > > It doesn't look like it is necessary to pass the task_struct pointer to it.
> >
> > Judging from this patch alone, yes, it does not make much sense to pass a
> > task_struct as argument. In patch 21, however, this function uses various
> > members of task_struct, which makes it more convenient to have it as an argument,
> > no?
>
> I'm not convinced about this, but anyway it is better to combine the
> two patches in such cases IMO.
>
> The way it is done now confuses things from my perspective.
Right, I structured the patchset to have the additions to task_struct in
separate patches. This made the changes to intel_hfi.c less clear.
Would it be acceptable if I kept void intel_hfi_update_ipcc(struct
task_struct *curr) and added a static function u32 intel_hfi_get_ipcc(void)
to return the hardware classification?
Otherwise, I would need to add three different accessors for task_struct
so that the HFI driver can retrieve the auxiliary members from patch 21.
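
For clarity, the split I have in mind would look roughly like this sketch
(reusing the code from this patch; untested):

        static u32 intel_hfi_get_ipcc(void)
        {
                union hfi_thread_feedback_char_msr msr;

                rdmsrl(MSR_IA32_HW_FEEDBACK_CHAR, msr.full);
                if (!msr.split.valid)
                        return IPC_CLASS_UNCLASSIFIED;

                /* HFI classes start at 0; scheduler IPCC classes start at 1. */
                return msr.split.classid + 1;
        }

        void intel_hfi_update_ipcc(struct task_struct *curr)
        {
                u32 ipcc = intel_hfi_get_ipcc();

                if (ipcc != IPC_CLASS_UNCLASSIFIED)
                        curr->ipcc = ipcc;
        }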
Thanks and BR,
Ricardo
On Wed, Mar 29, 2023 at 02:17:01PM +0200, Rafael J. Wysocki wrote:
> On Wed, Mar 29, 2023 at 1:30 AM Ricardo Neri
> <[email protected]> wrote:
> >
> > On Mon, Mar 27, 2023 at 06:50:13PM +0200, Rafael J. Wysocki wrote:
> > > On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> > > <[email protected]> wrote:
> > > >
> > > > Implement the arch_get_ipcc_score() interface of the scheduler. Use the
> > > > performance capabilities of the extended Hardware Feedback Interface table
> > > > as the IPC score.
> > > >
> > > > Cc: Ben Segall <[email protected]>
> > > > Cc: Daniel Bristot de Oliveira <[email protected]>
> > > > Cc: Dietmar Eggemann <[email protected]>
> > > > Cc: Ionela Voinescu <[email protected]>
> > > > Cc: Joel Fernandes (Google) <[email protected]>
> > > > Cc: Len Brown <[email protected]>
> > > > Cc: Lukasz Luba <[email protected]>
> > > > Cc: Mel Gorman <[email protected]>
> > > > Cc: Rafael J. Wysocki <[email protected]>
> > > > Cc: Srinivas Pandruvada <[email protected]>
> > > > Cc: Steven Rostedt <[email protected]>
> > > > Cc: Tim C. Chen <[email protected]>
> > > > Cc: Valentin Schneider <[email protected]>
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Signed-off-by: Ricardo Neri <[email protected]>
> > > > ---
> > > > Changes since v2:
> > > > * None
> > > >
> > > > Changes since v1:
> > > > * Adjusted the returned HFI class (which starts at 0) to match the
> > > > scheduler IPCC class (which starts at 1). (PeterZ)
> > > > * Used the new interface names.
> > > > ---
> > > > arch/x86/include/asm/topology.h | 2 ++
> > > > drivers/thermal/intel/intel_hfi.c | 27 +++++++++++++++++++++++++++
> > > > 2 files changed, 29 insertions(+)
> > > >
> > > > diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> > > > index ffcdac3f398f..c4fcd9c3c634 100644
> > > > --- a/arch/x86/include/asm/topology.h
> > > > +++ b/arch/x86/include/asm/topology.h
> > > > @@ -229,8 +229,10 @@ void init_freq_invariance_cppc(void);
> > > >
> > > > #if defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL)
> > > > void intel_hfi_update_ipcc(struct task_struct *curr);
> > > > +unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu);
> > > >
> > > > #define arch_update_ipcc intel_hfi_update_ipcc
> > > > +#define arch_get_ipcc_score intel_hfi_get_ipcc_score
> > > > #endif /* defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL) */
> > > >
> > > > #endif /* _ASM_X86_TOPOLOGY_H */
> > > > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > > > index 530dcf57e06e..fa9b4a678d92 100644
> > > > --- a/drivers/thermal/intel/intel_hfi.c
> > > > +++ b/drivers/thermal/intel/intel_hfi.c
> > > > @@ -206,6 +206,33 @@ void intel_hfi_update_ipcc(struct task_struct *curr)
> > > > curr->ipcc = msr.split.classid + 1;
> > > > }
> > > >
> > > > +unsigned long intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
> > > > +{
> > > > + unsigned short hfi_class;
> > >
> > > It looks like the variable above is only used to save a subtraction or
> > > addition of 1 to something going forward.
> > >
> > > > + int *scores;
> > > > +
> > > > + if (cpu < 0 || cpu >= nr_cpu_ids)
> > > > + return -EINVAL;
> > > > +
> > > > + if (ipcc == IPC_CLASS_UNCLASSIFIED)
> > > > + return -EINVAL;
> > > > +
> > > > + /*
> > > > + * Scheduler IPC classes start at 1. HFI classes start at 0.
> > > > + * See note intel_hfi_update_ipcc().
> > > > + */
> > > > + hfi_class = ipcc - 1;
> > > > +
> > > > + if (hfi_class >= hfi_features.nr_classes)
> > >
> > > Personally, I would do
> > >
> > > if (ipcc >= hfi_features.nr_classes + 1)
> > >
> > > here and ->
> > >
> > > > + return -EINVAL;
> > > > +
> > > > + scores = per_cpu_ptr(hfi_ipcc_scores, cpu);
> > > > + if (!scores)
> > > > + return -ENODEV;
> > > > +
> > >
> > > -> scores[ipcc - 1]
> >
> > Sure, I can get rid of hfi_class.
> >
> > >
> > > below.
> > >
> > > > + return READ_ONCE(scores[hfi_class]);
> > >
> > > And why does this need to use READ_ONCE()?
> >
> > This is the corresponding read of the WRITE_ONCE in patch 13. The CPU
> > handling the HFI interrupt, very likely a different CPU, updates
> > scores[hfi_class]. My intention is to let that write complete before
> > reading the score here.
>
> However, READ_ONCE()/WRITE_ONCE() only affect compiler optimizations
> AFAICS. What if the CPUs running the code reorder the instructions?
True, this implementation may not be complete and may need a memory
barrier. I will look at it more closely.
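
One possible shape, assuming acquire/release ordering is sufficient here
(a sketch, not the final patch):

        /* HFI interrupt path: publish the updated score. */
        static void set_hfi_ipcc_score(int *scores, unsigned short hfi_class,
                                       int score)
        {
                /* Pairs with smp_load_acquire() in intel_hfi_get_ipcc_score(). */
                smp_store_release(&scores[hfi_class], score);
        }

        /* Scheduler path, called from intel_hfi_get_ipcc_score(). */
        static int read_hfi_ipcc_score(int *scores, unsigned short hfi_class)
        {
                /* Pairs with smp_store_release() in the HFI update path. */
                return smp_load_acquire(&scores[hfi_class]);
        }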
>
> In any case, IMV the reason why READ_ONCE() is used needs to be clear
> to the reviewers from the patch itself (and to a casual reader of the
> code from the code itself).
Sure. I will make sure this is clear in the commit message and/or
inline comments.
Thanks and BR,
Ricardo
On Thu, 30 Mar 2023 at 03:56, Ricardo Neri
<[email protected]> wrote:
>
> On Tue, Mar 28, 2023 at 12:00:58PM +0200, Vincent Guittot wrote:
> > On Tue, 7 Feb 2023 at 06:01, Ricardo Neri
> > <[email protected]> wrote:
> > >
> > > Compute the joint total (both current and prospective) IPC class score of
> > > a scheduling group and the local scheduling group.
> > >
> > > These IPCC statistics are used during idle load balancing. The candidate
> > > scheduling group will have one fewer busy CPU after load balancing. This
> > > observation is important for cores with SMT support.
> > >
> > > The IPCC score of scheduling groups composed of SMT siblings needs to
> > > consider that the siblings share CPU resources. When computing the total
> > > IPCC score of the scheduling group, divide score of each sibling by the
> > > number of busy siblings.
> > >
> > > Collect IPCC statistics for asym_packing and fully_busy scheduling groups.
> >
> > IPCC statistics collect scores of current tasks, so they are
> > meaningful only when trying to migrate one of those running tasks.
> > Using such a score when pulling other tasks is just meaningless. And I
> > don't see how you ensure the correct use of the IPCC score.
>
> Thank you very much for your feedback Vincent!
>
> It is true that the task that is current when collecting statistics may be
> different from the task that is current when we are ready to pluck tasks.
>
> Using IPCC scores for load balancing benefits large, long-running tasks
> the most. For these tasks, the current task is likely to remain the same
> at the two mentioned points in time.
My point was mainly that the currently running task is the last one to
be pulled, and that happens only when no other task could be pulled.
>
> My patchset proposes to use IPCC clases to break ties between otherwise
> identical sched groups in update_sd_pick_busiest(). Its use is limited to
> asym_packing and fully_busy types. For these types, it is likely that there
> will not be tasks wanting to run other than current. need_active_balance()
> will return true and we will migrate the current task.
I disagree with your assumption above: the asym_packing and fully_busy
types do not imply anything about the number of running tasks.
>
> You are correct, by only looking at the current tasks we risk overlooking
> other tasks in the queue and the statistics becoming meaningless. A fully
> correct solution would need to keep track of the types of tasks in
> all runqueues as they come and go. IMO, the increased complexity of such
> approach does not justify the benefit. We give the load balancer extra
> information to decide between otherwise identical sched groups using the
> IPCC statistics of big tasks.
Because IPCC scores are meaningful only when there is a single running
task and during active migration, you should collect them only in that
situation.
>
> >
> > > When picking a busiest group, they are used to break ties between otherwise
> > > identical groups.
> > >
> > > Cc: Ben Segall <[email protected]>
> > > Cc: Daniel Bristot de Oliveira <[email protected]>
> > > Cc: Dietmar Eggemann <[email protected]>
> > > Cc: Ionela Voinescu <[email protected]>
> > > Cc: Joel Fernandes (Google) <[email protected]>
> > > Cc: Len Brown <[email protected]>
> > > Cc: Lukasz Luba <[email protected]>
> > > Cc: Mel Gorman <[email protected]>
> > > Cc: Rafael J. Wysocki <[email protected]>
> > > Cc: Srinivas Pandruvada <[email protected]>
> > > Cc: Steven Rostedt <[email protected]>
> > > Cc: Tim C. Chen <[email protected]>
> > > Cc: Valentin Schneider <[email protected]>
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Signed-off-by: Ricardo Neri <[email protected]>
> > > ---
> > > Changes since v2:
> > > * Also collect IPCC stats for fully_busy sched groups.
> >
> > Why have you added the fully_busy case? It's worth explaining the
> > rationale, because there is a high chance of using the IPCC score of
> > the wrong task to make the decision.
>
> When deciding between two fully_busy sched groups, update_sd_pick_busiest()
> unconditionally selects the given candidate @sg as busiest. With IPCC
> scores, what is running a scheduling group becomes important. We now have
> extra information to select either of the fully_busy groups. This is done
> in patch 9.
>
> (Also note that in [1] I tweak the logic to select fully_busy SMT cores vs
> fully_busy non-SMT cores).
>
> >
> > > * Restrict use of IPCC stats to SD_ASYM_PACKING. (Ionela)
> > > * Handle errors of arch_get_ipcc_score(). (Ionela)
> > >
> > > Changes since v1:
> > > * Implemented cleanups and reworks from PeterZ. I took all his
> > > suggestions, except the computation of the IPC score before and after
> > > load balancing. We are computing not the average score, but the *total*.
> > > * Check for the SD_SHARE_CPUCAPACITY to compute the throughput of the SMT
> > > siblings of a physical core.
> > > * Used the new interface names.
> > > * Reworded commit message for clarity.
> > > ---
> > > kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 68 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index d773380a95b3..b6165aa8a376 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8901,6 +8901,8 @@ struct sg_lb_stats {
> > > unsigned long min_score; /* Min(score(rq->curr->ipcc)) */
> > > unsigned short min_ipcc; /* Class of the task with the minimum IPCC score in the rq */
> > > unsigned long sum_score; /* Sum(score(rq->curr->ipcc)) */
> > > + long ipcc_score_after; /* Prospective IPCC score after load balancing */
> > > + unsigned long ipcc_score_before; /* IPCC score before load balancing */
> > > #endif
> > > };
> > >
> > > @@ -9287,6 +9289,62 @@ static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> > > }
> > > }
> > >
> > > +static void update_sg_lb_stats_scores(struct sg_lb_stats *sgs,
> > > + struct sched_group *sg,
> > > + struct lb_env *env)
> > > +{
> > > + unsigned long score_on_dst_cpu, before;
> > > + int busy_cpus;
> > > + long after;
> > > +
> > > + if (!sched_ipcc_enabled())
> > > + return;
> > > +
> > > + /*
> > > + * IPCC scores are only useful during idle load balancing. For now,
> > > + * only asym_packing uses IPCC scores.
> > > + */
> > > + if (!(env->sd->flags & SD_ASYM_PACKING) ||
> > > + env->idle == CPU_NOT_IDLE)
> > > + return;
> > > +
> > > + /*
> > > + * IPCC scores are used to break ties only between these types of
> > > + * groups.
> > > + */
> > > + if (sgs->group_type != group_fully_busy &&
> > > + sgs->group_type != group_asym_packing)
> > > + return;
> > > +
> > > + busy_cpus = sgs->group_weight - sgs->idle_cpus;
> > > +
> > > + /* No busy CPUs in the group. No tasks to move. */
> > > + if (!busy_cpus)
> > > + return;
> > > +
> > > + score_on_dst_cpu = arch_get_ipcc_score(sgs->min_ipcc, env->dst_cpu);
> > > +
> > > + /*
> > > + * Do not use IPC scores. sgs::ipcc_score_{after, before} will be zero
> > > + * and not used.
> > > + */
> > > + if (IS_ERR_VALUE(score_on_dst_cpu))
> > > + return;
> > > +
> > > + before = sgs->sum_score;
> > > + after = before - sgs->min_score;
> >
> > IIUC, you assume that you will select the CPU with the min score.
> > How do you ensure this? Otherwise all your comparisons are useless.
>
> This is relevant for SMT cores. A smaller idle core will help a fully_busy
> SMT core by pulling low-IPC work, leaving the full core for high-IPC
> work.
>
> We ensure (or anticipate if you will) this because find_busiest_queue()
> will select the queue whose current task gets the biggest IPC boost. When
> done from big to small cores the IPC boost is negative.
>
> >
> > > +
> > > + /* SMT siblings share throughput. */
> > > + if (busy_cpus > 1 && sg->flags & SD_SHARE_CPUCAPACITY) {
> > > + before /= busy_cpus;
> > > + /* One sibling will become idle after load balance. */
> > > + after /= busy_cpus - 1;
> > > + }
> > > +
> > > + sgs->ipcc_score_after = after + score_on_dst_cpu;
> > > + sgs->ipcc_score_before = before;
> >
> > I'm not sure I understand why you are computing sum_score,
> > ipcc_score_before and ipcc_score_after.
> > Why is it not sufficient to check if score_on_dst_cpu will be higher
> > than the current min_score?
>
> You are right. As the source core becomes idle after load balancing, it is
> sufficient to look for the highest score_on_dst_cpu. However, we must also
> handle SMT cores.
>
> If the source sched group is a fully_busy SMT core, one of the siblings
> will become idle after load balance (my patchset uses IPCC scores for
> asym_packing and when balancing the number of idle CPUs from fully_busy).
> The remaining non-idle siblings will benefit from the extra throughput.
>
> We calculate ipcc_score_after to find the sched group that would benefit
> the most. We calculate ipcc_score_before to break ties of two groups of
> identical ipcc_score_after.
>
> Thanks and BR,
> Ricardo
>
> [1]. https://lore.kernel.org/lkml/[email protected]/
On Thu, 30 Mar 2023 at 04:03, Ricardo Neri
<[email protected]> wrote:
>
> On Tue, Mar 28, 2023 at 12:03:58PM +0200, Vincent Guittot wrote:
> > On Tue, 7 Feb 2023 at 06:01, Ricardo Neri
> > <[email protected]> wrote:
> > >
> > > For two runqueues of equal priority and equal number of running of tasks,
> > > select the one whose current task would have the highest IPC class score
> > > if placed on the destination CPU.
> >
> > You failed to explain why it makes sense to compare the current task score
> > whereas we will most probably not pull this task in the end.
>
> Thank you for your feedback Vincent! Please kindly refer to my reply to
> your feedback in patch 7.
>
> > >
> > > For now, use IPCC scores only for scheduling domains with the
> > > SD_ASYM_PACKING flag.
> > >
> > > Cc: Ben Segall <[email protected]>
> > > Cc: Daniel Bristot de Oliveira <[email protected]>
> > > Cc: Dietmar Eggemann <[email protected]>
> > > Cc: Ionela Voinescu <[email protected]>
> > > Cc: Joel Fernandes (Google) <[email protected]>
> > > Cc: Len Brown <[email protected]>
> > > Cc: Lukasz Luba <[email protected]>
> > > Cc: Mel Gorman <[email protected]>
> > > Cc: Rafael J. Wysocki <[email protected]>
> > > Cc: Srinivas Pandruvada <[email protected]>
> > > Cc: Steven Rostedt <[email protected]>
> > > Cc: Tim C. Chen <[email protected]>
> > > Cc: Valentin Schneider <[email protected]>
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Signed-off-by: Ricardo Neri <[email protected]>
> > > ---
> > > Changes since v2:
> > > * Only use IPCC scores to break ties if the sched domain uses
> > > asym_packing. (Ionela)
> > > * Handle errors of arch_get_ipcc_score(). (Ionela)
> > >
> > > Changes since v1:
> > > * Fixed a bug when selecting a busiest runqueue: when comparing two
> > > runqueues with equal nr_running, we must compute the IPCC score delta
> > > of both.
> > > * Renamed local variables to improve the layout of the code block.
> > > (PeterZ)
> > > * Used the new interface names.
> > > ---
> > > kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 64 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 72d88270b320..d3c22dc145f7 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -9399,6 +9399,37 @@ static bool sched_asym_ipcc_pick(struct sched_group *a,
> > > return sched_asym_ipcc_prefer(a_stats, b_stats);
> > > }
> > >
> > > +/**
> > > + * ipcc_score_delta - Get the IPCC score delta wrt the load balance's dst_cpu
> > > + * @p: A task
> > > + * @env: Load balancing environment
> > > + *
> > > + * Returns: The IPCC score delta that @p would get if placed in the destination
> > > + * CPU of @env. LONG_MIN to indicate that the delta should not be used.
> > > + */
> > > +static long ipcc_score_delta(struct task_struct *p, struct lb_env *env)
> > > +{
> > > + unsigned long score_src, score_dst;
> > > + unsigned short ipcc = p->ipcc;
> > > +
> > > + if (!sched_ipcc_enabled())
> > > + return LONG_MIN;
> > > +
> > > + /* Only asym_packing uses IPCC scores at the moment. */
> > > + if (!(env->sd->flags & SD_ASYM_PACKING))
> > > + return LONG_MIN;
> > > +
> > > + score_dst = arch_get_ipcc_score(ipcc, env->dst_cpu);
> > > + if (IS_ERR_VALUE(score_dst))
> > > + return LONG_MIN;
> > > +
> > > + score_src = arch_get_ipcc_score(ipcc, task_cpu(p));
> > > + if (IS_ERR_VALUE(score_src))
> > > + return LONG_MIN;
> > > +
> > > + return score_dst - score_src;
> > > +}
> > > +
> > > #else /* CONFIG_IPC_CLASSES */
> > > static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> > > struct rq *rq)
> > > @@ -9429,6 +9460,11 @@ static bool sched_asym_ipcc_pick(struct sched_group *a,
> > > return false;
> > > }
> > >
> > > +static long ipcc_score_delta(struct task_struct *p, struct lb_env *env)
> > > +{
> > > + return LONG_MIN;
> > > +}
> > > +
> > > #endif /* CONFIG_IPC_CLASSES */
> > >
> > > /**
> > > @@ -10589,6 +10625,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> > > {
> > > struct rq *busiest = NULL, *rq;
> > > unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
> > > + long busiest_ipcc_delta = LONG_MIN;
> > > unsigned int busiest_nr = 0;
> > > int i;
> > >
> > > @@ -10705,8 +10742,35 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> > >
> > > case migrate_task:
> > > if (busiest_nr < nr_running) {
> > > + struct task_struct *curr;
> > > +
> > > busiest_nr = nr_running;
> > > busiest = rq;
> > > +
> > > + /*
> > > + * Remember the IPCC score delta of busiest::curr.
> > > + * We may need it to break a tie with other queues
> > > + * with equal nr_running.
> > > + */
> > > + curr = rcu_dereference(busiest->curr);
> > > + busiest_ipcc_delta = ipcc_score_delta(curr, env);
> >
> > Hmm, i don't like this at all
> >
> > Also, curr is the least probable task to be pulled which means that
> > all this is useless
>
> but when doing asym_packing balancing nr_running = 1, need_active_balance()
> returns true and we will pull the current task, no? This is also true for
> fully_busy groups with one task per CPU. These are the only two cases that
> currently use IPCC scores.
Hmm, that is certainly not true for fully_busy, and I don't see anything
about asym_packing mandating that nr_running = 1.
You should have a look at misfit tasks, which seem to better fit your
situation of one task that doesn't fit its CPU, instead of adding such a
condition.
>
> If there are more than one tasks in the runqueue, the group will be
> classified as overloaded and we will not use IPCC scores nor active
> balance.
>
> Thanks and BR,
> Ricardo
On Thu, Mar 30, 2023 at 4:31 AM Ricardo Neri
<[email protected]> wrote:
>
> On Wed, Mar 29, 2023 at 02:21:57PM +0200, Rafael J. Wysocki wrote:
> > On Wed, Mar 29, 2023 at 2:04 AM Ricardo Neri
> > <[email protected]> wrote:
> > >
> > > On Mon, Mar 27, 2023 at 07:03:08PM +0200, Rafael J. Wysocki wrote:
> > > > On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> > > > <[email protected]> wrote:
> > > > >
> > > > > In Alder Lake and Raptor Lake, the result of thread classification is more
> > > > > accurate when only one SMT sibling is busy. Classification results for
> > > > > class 2 and 3 are always reliable.
> > > > >
> > > > > To avoid unnecessary migrations, only update the class of a task if it has
> > > > > been the same during 4 consecutive user ticks.
> > > > >
> > > > > Cc: Ben Segall <[email protected]>
> > > > > Cc: Daniel Bristot de Oliveira <[email protected]>
> > > > > Cc: Dietmar Eggemann <[email protected]>
> > > > > Cc: Ionela Voinescu <[email protected]>
> > > > > Cc: Joel Fernandes (Google) <[email protected]>
> > > > > Cc: Len Brown <[email protected]>
> > > > > Cc: Lukasz Luba <[email protected]>
> > > > > Cc: Mel Gorman <[email protected]>
> > > > > Cc: Rafael J. Wysocki <[email protected]>
> > > > > Cc: Srinivas Pandruvada <[email protected]>
> > > > > Cc: Steven Rostedt <[email protected]>
> > > > > Cc: Tim C. Chen <[email protected]>
> > > > > Cc: Valentin Schneider <[email protected]>
> > > > > Cc: [email protected]
> > > > > Cc: [email protected]
> > > > > Cc: [email protected]
> > > > > Signed-off-by: Ricardo Neri <[email protected]>
> > > > > ---
> > > > > Changes since v2:
> > > > > * None
> > > > >
> > > > > Changes since v1:
> > > > > * Adjusted the result of the classification of Intel Thread Director to start
> > > > > at class 1. Class 0 for the scheduler means that the task is
> > > > > unclassified.
> > > > > * Used the new names of the IPC classes members in task_struct.
> > > > > * Reworked helper functions to use sched_smt_siblings_idle() to query
> > > > > the idle state of the SMT siblings of a CPU.
> > > > > ---
> > > > > drivers/thermal/intel/intel_hfi.c | 60 ++++++++++++++++++++++++++++++-
> > > > > 1 file changed, 59 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > > > > index 35d947f47550..fdb53e4cabc1 100644
> > > > > --- a/drivers/thermal/intel/intel_hfi.c
> > > > > +++ b/drivers/thermal/intel/intel_hfi.c
> > > > > @@ -40,6 +40,7 @@
> > > > > #include <linux/workqueue.h>
> > > > >
> > > > > #include <asm/msr.h>
> > > > > +#include <asm/intel-family.h>
> > > > >
> > > > > #include "../thermal_core.h"
> > > > > #include "intel_hfi.h"
> > > > > @@ -209,9 +210,64 @@ static int __percpu *hfi_ipcc_scores;
> > > > > */
> > > > > #define HFI_UNCLASSIFIED_DEFAULT 1
> > > > >
> > > > > +#define CLASS_DEBOUNCER_SKIPS 4
> > > > > +
> > > > > +/**
> > > > > + * debounce_and_update_class() - Process and update a task's classification
> > > > > + *
> > > > > + * @p: The task of which the classification will be updated
> > > > > + * @new_ipcc: The new IPC classification
> > > > > + *
> > > > > + * Update the classification of @p with the new value that hardware provides.
> > > > > + * Only update the classification of @p if it has been the same during
> > > > > + * CLASS_DEBOUNCER_SKIPS consecutive ticks.
> > > > > + */
> > > > > +static void debounce_and_update_class(struct task_struct *p, u8 new_ipcc)
> > > > > +{
> > > > > + u16 debounce_skip;
> > > > > +
> > > > > + /* The class of @p changed. Only restart the debounce counter. */
> > > > > + if (p->ipcc_tmp != new_ipcc) {
> > > > > + p->ipcc_cntr = 1;
> > > > > + goto out;
> > > > > + }
> > > > > +
> > > > > + /*
> > > > > + * The class of @p did not change. Update it if it has been the same
> > > > > + * for CLASS_DEBOUNCER_SKIPS user ticks.
> > > > > + */
> > > > > + debounce_skip = p->ipcc_cntr + 1;
> > > > > + if (debounce_skip < CLASS_DEBOUNCER_SKIPS)
> > > > > + p->ipcc_cntr++;
> > > > > + else
> > > > > + p->ipcc = new_ipcc;
> > > > > +
> > > > > +out:
> > > > > + p->ipcc_tmp = new_ipcc;
> > > > > +}
> > > >
> > > > Why does the code above belong to the Intel HFI driver? It doesn't
> > > > look like there is anything driver-specific in it.
> > >
> > > That is a good point. This post-processing is specific to the
> > > implementation of IPCC classes using Intel Thread Director.
> >
> > Well, the implementation-specific part is the processor model check
> > whose only contribution is to say whether or not the classification is
> > valid. The rest appears to be fairly generic to me.
>
> I meant to say that we use Intel Thread Director and the HFI driver to
> implement the interfaces defined in patch 2. Other architectures may
> implement those interfaces differently.
>
> For Intel, we may even need different filters and debouncers for different
> models.
>
> >
> > > Maybe a new file called drivers/thermal/intel/intel_itd.c would be better?
> >
> > So which part of this code other than the processor model check
> > mentioned above is Intel-specific?
>
> debounce_and_update_class() is needed for Intel processors, other
> architectures may not need it or have a different solution.
IMV, one general problem with this approach is that it is making a
more-or-less random thermal driver operate on task structure
internals, while drivers/thermal/ is not a usual place to look for CPU
scheduler code.
I'm not sure why it has to be done this way and none of the above
explains that IIUC.
Is it really the role of the thermal HFI driver to implement a task
classification algorithm? I'm not convinced about that. Personally,
I would probably introduce some proper arch code doing that and using
input from the HFI driver.
On Fri, Mar 31, 2023 at 07:08:55PM +0200, Rafael J. Wysocki wrote:
> On Thu, Mar 30, 2023 at 4:31 AM Ricardo Neri
> <[email protected]> wrote:
> >
> > On Wed, Mar 29, 2023 at 02:21:57PM +0200, Rafael J. Wysocki wrote:
> > > On Wed, Mar 29, 2023 at 2:04 AM Ricardo Neri
> > > <[email protected]> wrote:
> > > >
> > > > On Mon, Mar 27, 2023 at 07:03:08PM +0200, Rafael J. Wysocki wrote:
> > > > > On Tue, Feb 7, 2023 at 6:02 AM Ricardo Neri
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > In Alder Lake and Raptor Lake, the result of thread classification is more
> > > > > > accurate when only one SMT sibling is busy. Classification results for
> > > > > > class 2 and 3 are always reliable.
> > > > > >
> > > > > > To avoid unnecessary migrations, only update the class of a task if it has
> > > > > > been the same during 4 consecutive user ticks.
> > > > > >
> > > > > > Cc: Ben Segall <[email protected]>
> > > > > > Cc: Daniel Bristot de Oliveira <[email protected]>
> > > > > > Cc: Dietmar Eggemann <[email protected]>
> > > > > > Cc: Ionela Voinescu <[email protected]>
> > > > > > Cc: Joel Fernandes (Google) <[email protected]>
> > > > > > Cc: Len Brown <[email protected]>
> > > > > > Cc: Lukasz Luba <[email protected]>
> > > > > > Cc: Mel Gorman <[email protected]>
> > > > > > Cc: Rafael J. Wysocki <[email protected]>
> > > > > > Cc: Srinivas Pandruvada <[email protected]>
> > > > > > Cc: Steven Rostedt <[email protected]>
> > > > > > Cc: Tim C. Chen <[email protected]>
> > > > > > Cc: Valentin Schneider <[email protected]>
> > > > > > Cc: [email protected]
> > > > > > Cc: [email protected]
> > > > > > Cc: [email protected]
> > > > > > Signed-off-by: Ricardo Neri <[email protected]>
> > > > > > ---
> > > > > > Changes since v2:
> > > > > > * None
> > > > > >
> > > > > > Changes since v1:
> > > > > > * Adjusted the result of the classification of Intel Thread Director to start
> > > > > > at class 1. Class 0 for the scheduler means that the task is
> > > > > > unclassified.
> > > > > > * Used the new names of the IPC classes members in task_struct.
> > > > > > * Reworked helper functions to use sched_smt_siblings_idle() to query
> > > > > > the idle state of the SMT siblings of a CPU.
> > > > > > ---
> > > > > > drivers/thermal/intel/intel_hfi.c | 60 ++++++++++++++++++++++++++++++-
> > > > > > 1 file changed, 59 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
> > > > > > index 35d947f47550..fdb53e4cabc1 100644
> > > > > > --- a/drivers/thermal/intel/intel_hfi.c
> > > > > > +++ b/drivers/thermal/intel/intel_hfi.c
> > > > > > @@ -40,6 +40,7 @@
> > > > > > #include <linux/workqueue.h>
> > > > > >
> > > > > > #include <asm/msr.h>
> > > > > > +#include <asm/intel-family.h>
> > > > > >
> > > > > > #include "../thermal_core.h"
> > > > > > #include "intel_hfi.h"
> > > > > > @@ -209,9 +210,64 @@ static int __percpu *hfi_ipcc_scores;
> > > > > > */
> > > > > > #define HFI_UNCLASSIFIED_DEFAULT 1
> > > > > >
> > > > > > +#define CLASS_DEBOUNCER_SKIPS 4
> > > > > > +
> > > > > > +/**
> > > > > > + * debounce_and_update_class() - Process and update a task's classification
> > > > > > + *
> > > > > > + * @p: The task of which the classification will be updated
> > > > > > + * @new_ipcc: The new IPC classification
> > > > > > + *
> > > > > > + * Update the classification of @p with the new value that hardware provides.
> > > > > > + * Only update the classification of @p if it has been the same during
> > > > > > + * CLASS_DEBOUNCER_SKIPS consecutive ticks.
> > > > > > + */
> > > > > > +static void debounce_and_update_class(struct task_struct *p, u8 new_ipcc)
> > > > > > +{
> > > > > > + u16 debounce_skip;
> > > > > > +
> > > > > > + /* The class of @p changed. Only restart the debounce counter. */
> > > > > > + if (p->ipcc_tmp != new_ipcc) {
> > > > > > + p->ipcc_cntr = 1;
> > > > > > + goto out;
> > > > > > + }
> > > > > > +
> > > > > > + /*
> > > > > > + * The class of @p did not change. Update it if it has been the same
> > > > > > + * for CLASS_DEBOUNCER_SKIPS user ticks.
> > > > > > + */
> > > > > > + debounce_skip = p->ipcc_cntr + 1;
> > > > > > + if (debounce_skip < CLASS_DEBOUNCER_SKIPS)
> > > > > > + p->ipcc_cntr++;
> > > > > > + else
> > > > > > + p->ipcc = new_ipcc;
> > > > > > +
> > > > > > +out:
> > > > > > + p->ipcc_tmp = new_ipcc;
> > > > > > +}
> > > > >
> > > > > Why does the code above belong to the Intel HFI driver? It doesn't
> > > > > look like there is anything driver-specific in it.
> > > >
> > > > That is a good point. This post-processing is specific to the
> > > > implementation of IPCC classes using Intel Thread Director.
> > >
> > > Well, the implementation-specific part is the processor model check
> > > whose only contribution is to say whether or not the classification is
> > > valid. The rest appears to be fairly generic to me.
> >
> > I meant to say that we use Intel Thread Director and the HFI driver to
> > implement the interfaces defined in patch 2. Other architectures may
> > implement those interfaces differently.
> >
> > For Intel, we may even need different filters and debouncers for different
> > models.
> >
> > >
> > > > Maybe a new file called drivers/thermal/intel/intel_itd.c would be better?
> > >
> > > So which part of this code other than the processor model check
> > > mentioned above is Intel-specific?
> >
> > debounce_and_update_class() is needed for Intel processors, other
> > architectures may not need it or have a different solution.
>
> IMV, one general problem with this approach is that it is making a
> more-or-less random thermal driver operate on task structure
> internals, while drivers/thermal/ is not a usual place to look for CPU
> scheduler code.
Fair point.
>
> I'm not sure why it has to be done this way and none of the above
> explains that IIUC.
>
> Is it really the role of the thermal HFI driver to implement a task
> classification algorithm? I'm not convinced about that.
Arguably, Intel Thread Director, an extension of HFI, provides the
classification.
> Personally,
> I would probably introduce some proper arch code doing that and using
> input from the HFI driver.
This makes sense to me. I will work on such updates.
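
A rough sketch of that split (function names are hypothetical):

        /* drivers/thermal/intel/intel_hfi.c: report the raw class only. */
        int intel_hfi_read_classid(void)
        {
                union hfi_thread_feedback_char_msr msr;

                rdmsrl(MSR_IA32_HW_FEEDBACK_CHAR, msr.full);
                return msr.split.valid ? (int)msr.split.classid : -EINVAL;
        }

        /* arch/x86: classification policy (debounce, model checks). */
        void arch_update_ipcc_from_hfi(struct task_struct *curr)
        {
                int classid = intel_hfi_read_classid();

                if (classid < 0)
                        return;

                /* Scheduler class 0 means unclassified; adjust. */
                debounce_and_update_class(curr, classid + 1);
        }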
Thanks and BR,
Ricardo
On Fri, Mar 31, 2023 at 02:20:11PM +0200, Vincent Guittot wrote:
> On Thu, 30 Mar 2023 at 03:56, Ricardo Neri
> <[email protected]> wrote:
> >
> > On Tue, Mar 28, 2023 at 12:00:58PM +0200, Vincent Guittot wrote:
> > > On Tue, 7 Feb 2023 at 06:01, Ricardo Neri
> > > <[email protected]> wrote:
> > > >
> > > > Compute the joint total (both current and prospective) IPC class score of
> > > > a scheduling group and the local scheduling group.
> > > >
> > > > These IPCC statistics are used during idle load balancing. The candidate
> > > > scheduling group will have one fewer busy CPU after load balancing. This
> > > > observation is important for cores with SMT support.
> > > >
> > > > The IPCC score of scheduling groups composed of SMT siblings needs to
> > > > consider that the siblings share CPU resources. When computing the total
> > > > IPCC score of the scheduling group, divide score of each sibling by the
> > > > number of busy siblings.
> > > >
> > > > Collect IPCC statistics for asym_packing and fully_busy scheduling groups.
> > >
> > > IPCC statistics collect scores of current tasks, so they are
> > > meaningful only when trying to migrate one of those running tasks.
> > > Using such a score when pulling other tasks is just meaningless. And I
> > > don't see how you ensure the correct use of the IPCC score.
> >
> > Thank you very much for your feedback Vincent!
> >
> > It is true that the task that is current when collecting statistics may be
> > different from the task that is current when we are ready to pluck tasks.
> >
> > Using IPCC scores for load balancing benefits large, long-running tasks
> > the most. For these tasks, the current task is likely to remain the same
> > at the two mentioned points in time.
>
> My point was mainly that the currently running task is the last one to
> be pulled, and that happens only when no other task could be pulled.
(Thanks again for your feedback, Vincent. I am sorry for the late reply. I
needed some more time to think about it.)
Good point! It is smarter to compare and pull from the same end of the
queue, the back, rather than comparing curr while pulling from the back.
We are more likely to break the tie correctly without adding too much
complexity.
Here is an incremental patch with the update. I'll include this change in
my next version.
@@ -9281,24 +9281,42 @@ static void init_rq_ipcc_stats(struct sg_lb_stats *sgs)
 	sgs->min_score = ULONG_MAX;
 }
 
+static int rq_last_task_ipcc(int dst_cpu, struct rq *rq, unsigned short *ipcc)
+{
+	struct list_head *tasks = &rq->cfs_tasks;
+	struct task_struct *p;
+	struct rq_flags rf;
+	int ret = -EINVAL;
+
+	rq_lock_irqsave(rq, &rf);
+	if (list_empty(tasks))
+		goto out;
+
+	p = list_last_entry(tasks, struct task_struct, se.group_node);
+	if (p->flags & PF_EXITING || is_idle_task(p) ||
+	    !cpumask_test_cpu(dst_cpu, p->cpus_ptr))
+		goto out;
+
+	ret = 0;
+	*ipcc = p->ipcc;
+out:
+	rq_unlock_irqrestore(rq, &rf);
+	return ret;
+}
+
 /* Called only if cpu_of(@rq) is not idle and has tasks running. */
 static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
 				    struct rq *rq)
 {
-	struct task_struct *curr;
 	unsigned short ipcc;
 	unsigned long score;
 
 	if (!sched_ipcc_enabled())
 		return;
 
-	curr = rcu_dereference(rq->curr);
-	if (!curr || (curr->flags & PF_EXITING) || is_idle_task(curr) ||
-	    task_is_realtime(curr) ||
-	    !cpumask_test_cpu(dst_cpu, curr->cpus_ptr))
+	if (rq_last_task_ipcc(dst_cpu, rq, &ipcc))
 		return;
 
-	ipcc = curr->ipcc;
 	score = arch_get_ipcc_score(ipcc, cpu_of(rq));
>
> >
> > My patchset proposes to use IPCC clases to break ties between otherwise
> > identical sched groups in update_sd_pick_busiest(). Its use is limited to
> > asym_packing and fully_busy types. For these types, it is likely that there
> > will not be tasks wanting to run other than current. need_active_balance()
> > will return true and we will migrate the current task.
>
> I disagree with your assumption above: the asym_packing and fully_busy
> types do not imply anything about the number of running tasks.
Agreed. What I stated was not correct.
>
> >
> > You are correct, by only looking at the current tasks we risk overlooking
> > other tasks in the queue and the statistics becoming meaningless. A fully
> > correct solution would need to keep track of the types of tasks in
> > all runqueues as they come and go. IMO, the increased complexity of such
> > approach does not justify the benefit. We give the load balancer extra
> > information to decide between otherwise identical sched groups using the
> > IPCC statistics of big tasks.
>
> Because IPCC scores are meaningful only when there is a single running
> task and during active migration, you should collect them only in that
> situation.
I think that if we compute the IPCC statistics using the tasks at the back
of the runqueue, then IPCC statistics remain meaningful for nr_running >= 1.
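
For reference, here is a small worked example of the SMT adjustment in the
ipcc_score_{before, after} computation of patch 7 (numbers invented for
illustration):

        /*
         * A fully_busy SMT core with two busy siblings whose tasks score
         * 600 and 200 (min_score = 200), SD_SHARE_CPUCAPACITY set:
         *
         *   sum_score         = 600 + 200 = 800
         *   ipcc_score_before = 800 / 2   = 400
         *   after             = (800 - 200) / (2 - 1) = 600
         *   ipcc_score_after  = 600 + score_on_dst_cpu
         *
         * Moving the min-score task away leaves the whole core to the
         * remaining sibling, so the per-CPU score rises from 400 to 600.
         */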
On Fri, Mar 31, 2023 at 02:23:58PM +0200, Vincent Guittot wrote:
> On Thu, 30 Mar 2023 at 04:03, Ricardo Neri
> <[email protected]> wrote:
> >
> > On Tue, Mar 28, 2023 at 12:03:58PM +0200, Vincent Guittot wrote:
> > > On Tue, 7 Feb 2023 at 06:01, Ricardo Neri
> > > <[email protected]> wrote:
> > > >
> > > > For two runqueues of equal priority and equal number of running of tasks,
> > > > select the one whose current task would have the highest IPC class score
> > > > if placed on the destination CPU.
> > >
> > > You failed to explain why it makes sense to compare the current task score
> > > whereas we will most probably not pull this task in the end.
> >
> > Thank you for your feedback Vincent! Please kindly refer to my reply to
> > your feedback in patch 7.
> >
> > > >
> > > > For now, use IPCC scores only for scheduling domains with the
> > > > SD_ASYM_PACKING flag.
> > > >
> > > > Cc: Ben Segall <[email protected]>
> > > > Cc: Daniel Bristot de Oliveira <[email protected]>
> > > > Cc: Dietmar Eggemann <[email protected]>
> > > > Cc: Ionela Voinescu <[email protected]>
> > > > Cc: Joel Fernandes (Google) <[email protected]>
> > > > Cc: Len Brown <[email protected]>
> > > > Cc: Lukasz Luba <[email protected]>
> > > > Cc: Mel Gorman <[email protected]>
> > > > Cc: Rafael J. Wysocki <[email protected]>
> > > > Cc: Srinivas Pandruvada <[email protected]>
> > > > Cc: Steven Rostedt <[email protected]>
> > > > Cc: Tim C. Chen <[email protected]>
> > > > Cc: Valentin Schneider <[email protected]>
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Cc: [email protected]
> > > > Signed-off-by: Ricardo Neri <[email protected]>
> > > > ---
> > > > Changes since v2:
> > > > * Only use IPCC scores to break ties if the sched domain uses
> > > > asym_packing. (Ionela)
> > > > * Handle errors of arch_get_ipcc_score(). (Ionela)
> > > >
> > > > Changes since v1:
> > > > * Fixed a bug when selecting a busiest runqueue: when comparing two
> > > > runqueues with equal nr_running, we must compute the IPCC score delta
> > > > of both.
> > > > * Renamed local variables to improve the layout of the code block.
> > > > (PeterZ)
> > > > * Used the new interface names.
> > > > ---
> > > > kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 64 insertions(+)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 72d88270b320..d3c22dc145f7 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -9399,6 +9399,37 @@ static bool sched_asym_ipcc_pick(struct sched_group *a,
> > > > return sched_asym_ipcc_prefer(a_stats, b_stats);
> > > > }
> > > >
> > > > +/**
> > > > + * ipcc_score_delta - Get the IPCC score delta wrt the load balance's dst_cpu
> > > > + * @p: A task
> > > > + * @env: Load balancing environment
> > > > + *
> > > > + * Returns: The IPCC score delta that @p would get if placed in the destination
> > > > + * CPU of @env. LONG_MIN to indicate that the delta should not be used.
> > > > + */
> > > > +static long ipcc_score_delta(struct task_struct *p, struct lb_env *env)
> > > > +{
> > > > + unsigned long score_src, score_dst;
> > > > + unsigned short ipcc = p->ipcc;
> > > > +
> > > > + if (!sched_ipcc_enabled())
> > > > + return LONG_MIN;
> > > > +
> > > > + /* Only asym_packing uses IPCC scores at the moment. */
> > > > + if (!(env->sd->flags & SD_ASYM_PACKING))
> > > > + return LONG_MIN;
> > > > +
> > > > + score_dst = arch_get_ipcc_score(ipcc, env->dst_cpu);
> > > > + if (IS_ERR_VALUE(score_dst))
> > > > + return LONG_MIN;
> > > > +
> > > > + score_src = arch_get_ipcc_score(ipcc, task_cpu(p));
> > > > + if (IS_ERR_VALUE(score_src))
> > > > + return LONG_MIN;
> > > > +
> > > > + return score_dst - score_src;
> > > > +}
> > > > +
> > > > #else /* CONFIG_IPC_CLASSES */
> > > > static void update_sg_lb_ipcc_stats(int dst_cpu, struct sg_lb_stats *sgs,
> > > > struct rq *rq)
> > > > @@ -9429,6 +9460,11 @@ static bool sched_asym_ipcc_pick(struct sched_group *a,
> > > > return false;
> > > > }
> > > >
> > > > +static long ipcc_score_delta(struct task_struct *p, struct lb_env *env)
> > > > +{
> > > > + return LONG_MIN;
> > > > +}
> > > > +
> > > > #endif /* CONFIG_IPC_CLASSES */
> > > >
> > > > /**
> > > > @@ -10589,6 +10625,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> > > > {
> > > > struct rq *busiest = NULL, *rq;
> > > > unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
> > > > + long busiest_ipcc_delta = LONG_MIN;
> > > > unsigned int busiest_nr = 0;
> > > > int i;
> > > >
> > > > @@ -10705,8 +10742,35 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> > > >
> > > > case migrate_task:
> > > > if (busiest_nr < nr_running) {
> > > > + struct task_struct *curr;
> > > > +
> > > > busiest_nr = nr_running;
> > > > busiest = rq;
> > > > +
> > > > + /*
> > > > + * Remember the IPCC score delta of busiest::curr.
> > > > + * We may need it to break a tie with other queues
> > > > + * with equal nr_running.
> > > > + */
> > > > + curr = rcu_dereference(busiest->curr);
> > > > + busiest_ipcc_delta = ipcc_score_delta(curr, env);
> > >
> > > Hmm, i don't like this at all
> > >
> > > Also, curr is the least probable task to be pulled which means that
> > > all this is useless
> >
> > but when doing asym_packing balancing nr_running = 1, need_active_balance()
> > returns true and we will pull the current task, no? This is also true for
> > fully_busy groups with one task per CPU. These are the only two cases that
> > currently use IPCC scores.
>
> Hmm, that is certainly not true for fully_busy, and I don't see anything
> about asym_packing mandating that nr_running = 1.
I meant to say that if nr_running = 1, we would pull the running task
because there is nothing else to pull.
Just as when identifying the busiest group, we can instead break the tie
using the task at the back of the runqueue. We will start with this task
when migrating tasks.
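
Concretely, the tie-break in find_busiest_queue() could then look roughly
like this sketch (it assumes ipcc_score_delta() is reworked to take an
IPCC class and a source CPU, and it reuses the rq_last_task_ipcc() helper
from the incremental patch in my other reply):

        case migrate_task:
                if (busiest_nr < nr_running) {
                        unsigned short ipcc;

                        busiest_nr = nr_running;
                        busiest = rq;

                        /* Tie-break with the task we would pull first. */
                        if (!rq_last_task_ipcc(env->dst_cpu, rq, &ipcc))
                                busiest_ipcc_delta =
                                        ipcc_score_delta(ipcc, cpu_of(rq), env);
                }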
>
> You should have a look at misfit tasks, which seem to better fit your
> situation of one task that doesn't fit its CPU, instead of adding such a
> condition.
Thank you for the suggestion! I did take a look. When dealing with misfit
tasks, we identify one task that is too big for a small CPU. The
destination CPU is not needed when updating the misfit status.
On the other hand, identifying an "IPCC-misfit" task (similar to
update_misfit_status()) would require knowing the destination CPU, of which
there may be more than one type. We could compute the IPCC-misfit status
for all prospective destination CPU types, but that may introduce a non-
trivial overhead and would be too complex for a tie breaker, IMO.
Unlike migrate_misfit, we would not take immediate action but use IPCC
scores to improve the selection of the busiest runqueue.