2022-11-28 14:40:18

by Ricardo Neri

Subject: [PATCH v2 00/22] sched: Introduce IPC classes for load balance

Hi,

This is the v2 of the patchset. Since it did not receive strong objections
on the design, I took the liberty of promoting the series from RFC to
PATCH :)

The problem statement and design do not change in this version. Thus, I did
not repeat the cover letter. It can be retrieved here [1].

This series depends on my other patches to use identical asym_packing CPU
priorities on all the SMT siblings of a physical core on x86 [2].

These patches apply cleanly on top of [2]. For convenience, these patches
and [2] can be found here:

https://github.com/ricardon/tip.git rneri/ipc_classes_v2

Thanks and BR,
Ricardo

Changes since v1 (sorted by significance):
* Renamed task_struct::class as task_struct::ipcc. (Joel)
* Use task_struct::ipcc = 0 for unclassified tasks. (PeterZ)
* Renamed CONFIG_SCHED_TASK_CLASSES as CONFIG_IPC_CLASSES. (PeterZ, Joel)
* Dropped patch to take a spin lock to read the HFI table from the
scheduler and from the HFI enabling code.
* Implemented per-CPU variables to store the IPCC scores of each class.
These can be read without holding a lock. (PeterZ).
* Dropped patch to expose is_core_idle() outside the scheduler. It is
now exposed as part of [2].
* Implemented cleanups and reworks from PeterZ when collecting IPCC
statistics. I took all his suggestions, except the computation of the
total IPC score of two physical cores.
* Quantified the cost of HRESET.
* Use an ALTERNATIVE macro instead of static_cpu_has() to execute HRESET
when supported. (PeterZ)
* Fixed a bug when selecting a busiest runqueue: when comparing two
runqueues with equal nr_running, we must compute the IPCC score delta
of both runqueues.
* Fixed the bit number DISABLE_ITD to the correct DISABLE_MASK: 14 instead
of 13.
* Redefined union hfi_thread_feedback_char_msr to ensure all
bit-fields are packed. (PeterZ)
* Use bit-fields to fit all the ipcc members of task_struct in 4 bytes.
(PeterZ)
* Shortened the names of the IPCC interfaces (PeterZ):
sched_task_classes_enabled >> sched_ipcc_enabled
arch_has_task_classes >> arch_has_ipc_classes
arch_update_task_class >> arch_update_ipcc
arch_get_task_class_score >> arch_get_ipcc_score
* Removed smt_siblings_idle argument from arch_update_ipcc(). (PeterZ)
* Added a comment to clarify why sched_asym_prefer() needs a tie breaker
only in update_sd_pick_busiest(). (PeterZ)
* Renamed functions for accuracy:
sched_asym_class_prefer() >> sched_asym_ipcc_prefer()
sched_asym_class_pick() >> sched_asym_ipcc_pick()
* Renamed local variables to improve the layout of the code block I added
in find_busiest_queue(). (PeterZ)
* Removed proposed CONFIG_INTEL_THREAD_DIRECTOR Kconfig option.
* Mark hardware_history_features as __ro_after_init instead of
__read_mostly. (PeterZ)

[1]. https://lore.kernel.org/lkml/[email protected]/
[2]. https://lore.kernel.org/lkml/[email protected]/

Ricardo Neri (22):
sched/task_struct: Introduce IPC classes of tasks
sched: Add interfaces for IPC classes
sched/core: Initialize the IPC class of a new task
sched/core: Add user_tick as argument to scheduler_tick()
sched/core: Update the IPC class of the current task
sched/fair: Collect load-balancing stats for IPC classes
sched/fair: Compute IPC class scores for load balancing
sched/fair: Use IPC class to pick the busiest group
sched/fair: Use IPC class score to select a busiest runqueue
thermal: intel: hfi: Introduce Intel Thread Director classes
thermal: intel: hfi: Store per-CPU IPCC scores
x86/cpufeatures: Add the Intel Thread Director feature definitions
thermal: intel: hfi: Update the IPC class of the current task
thermal: intel: hfi: Report the IPC class score of a CPU
thermal: intel: hfi: Define a default class for unclassified tasks
thermal: intel: hfi: Enable the Intel Thread Director
sched/task_struct: Add helpers for IPC classification
sched/core: Initialize helpers of task classification
thermal: intel: hfi: Implement model-specific checks for task
classification
x86/cpufeatures: Add feature bit for HRESET
x86/hreset: Configure history reset
x86/process: Reset hardware history in context switch

arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/hreset.h | 30 +++
arch/x86/include/asm/msr-index.h | 6 +-
arch/x86/include/asm/topology.h | 10 +
arch/x86/kernel/cpu/common.c | 30 ++-
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
arch/x86/kernel/cpu/scattered.c | 1 +
arch/x86/kernel/process_32.c | 3 +
arch/x86/kernel/process_64.c | 3 +
drivers/thermal/intel/intel_hfi.c | 229 ++++++++++++++++++++++-
include/linux/sched.h | 22 ++-
init/Kconfig | 12 ++
kernel/sched/core.c | 10 +-
kernel/sched/fair.c | 229 ++++++++++++++++++++++-
kernel/sched/sched.h | 60 ++++++
kernel/sched/topology.c | 8 +
kernel/time/timer.c | 2 +-
18 files changed, 653 insertions(+), 13 deletions(-)
create mode 100644 arch/x86/include/asm/hreset.h

--
2.25.1


2022-11-28 14:40:33

by Ricardo Neri

Subject: [PATCH v2 13/22] thermal: intel: hfi: Update the IPC class of the current task

Use Intel Thread Director classification to update the IPC class of a
task. Implement the needed scheduler interfaces.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v1:
* Adjusted the result of the classification of Intel Thread Director to start
at class 1. Class 0 for the scheduler means that the task is
unclassified.
* Redefined union hfi_thread_feedback_char_msr to ensure all
bit-fields are packed. (PeterZ)
* Removed CONFIG_INTEL_THREAD_DIRECTOR. (PeterZ)
* Shortened the names of the functions that implement IPC classes.
* Removed argument smt_siblings_idle from intel_hfi_update_ipcc().
(PeterZ)
---
arch/x86/include/asm/topology.h | 8 +++++++
drivers/thermal/intel/intel_hfi.c | 37 +++++++++++++++++++++++++++++++
2 files changed, 45 insertions(+)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 458c891a8273..cf46a3aea283 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -227,4 +227,12 @@ void init_freq_invariance_cppc(void);
#define arch_init_invariance_cppc init_freq_invariance_cppc
#endif

+#if defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL)
+int intel_hfi_has_ipc_classes(void);
+void intel_hfi_update_ipcc(struct task_struct *curr);
+
+#define arch_has_ipc_classes intel_hfi_has_ipc_classes
+#define arch_update_ipcc intel_hfi_update_ipcc
+#endif /* defined(CONFIG_IPC_CLASSES) && defined(CONFIG_INTEL_HFI_THERMAL) */
+
#endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index 56dba967849c..f85394b532a7 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -74,6 +74,17 @@ union cpuid6_edx {
u32 full;
};

+#ifdef CONFIG_IPC_CLASSES
+union hfi_thread_feedback_char_msr {
+ struct {
+ u64 classid : 8;
+ u64 __reserved : 55;
+ u64 valid : 1;
+ } split;
+ u64 full;
+};
+#endif
+
/**
* struct hfi_cpu_data - HFI capabilities per CPU
* @perf_cap: Performance capability
@@ -176,6 +187,32 @@ static struct workqueue_struct *hfi_updates_wq;
#ifdef CONFIG_IPC_CLASSES
static int __percpu *hfi_ipcc_scores;

+int intel_hfi_has_ipc_classes(void)
+{
+ return cpu_feature_enabled(X86_FEATURE_ITD);
+}
+
+void intel_hfi_update_ipcc(struct task_struct *curr)
+{
+ union hfi_thread_feedback_char_msr msr;
+
+ /* We should not be here if ITD is not supported. */
+ if (!cpu_feature_enabled(X86_FEATURE_ITD)) {
+ pr_warn_once("task classification requested but not supported!");
+ return;
+ }
+
+ rdmsrl(MSR_IA32_HW_FEEDBACK_CHAR, msr.full);
+ if (!msr.split.valid)
+ return;
+
+ /*
+ * 0 is a valid classification for Intel Thread Director. A scheduler
+ * IPCC class of 0 means that the task is unclassified. Adjust.
+ */
+ curr->ipcc = msr.split.classid + 1;
+}
+
static int alloc_hfi_ipcc_scores(void)
{
hfi_ipcc_scores = __alloc_percpu(sizeof(*hfi_ipcc_scores) *
--
2.25.1

2022-11-28 14:40:39

by Ricardo Neri

Subject: [PATCH v2 02/22] sched: Add interfaces for IPC classes

Add the interfaces that architectures shall implement to convey the data
to support IPC classes.

arch_update_ipcc() updates the IPC classification of the current task as
given by hardware.

arch_get_ipcc_score() provides a performance score for a given IPC class
when placed on a specific CPU. Higher scores indicate higher performance.

The number of classes and the score of each class of task are determined
by hardware.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v1:
* Shortened the names of the IPCC interfaces (PeterZ):
sched_task_classes_enabled >> sched_ipcc_enabled
arch_has_task_classes >> arch_has_ipc_classes
arch_update_task_class >> arch_update_ipcc
arch_get_task_class_score >> arch_get_ipcc_score
* Removed smt_siblings_idle argument from arch_update_ipcc(). (PeterZ)
---
kernel/sched/sched.h | 60 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/topology.c | 8 ++++++
2 files changed, 68 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b1d338a740e5..75e22baa2622 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2531,6 +2531,66 @@ void arch_scale_freq_tick(void)
}
#endif

+#ifdef CONFIG_IPC_CLASSES
+DECLARE_STATIC_KEY_FALSE(sched_ipcc);
+
+static inline bool sched_ipcc_enabled(void)
+{
+ return static_branch_unlikely(&sched_ipcc);
+}
+
+#ifndef arch_has_ipc_classes
+/**
+ * arch_has_ipc_classes() - Check whether hardware supports IPC classes of tasks
+ *
+ * Returns: true if IPC classes of tasks are supported.
+ */
+static __always_inline
+bool arch_has_ipc_classes(void)
+{
+ return false;
+}
+#endif
+
+#ifndef arch_update_ipcc
+/**
+ * arch_update_ipcc() - Update the IPC class of the current task
+ * @curr: The current task
+ *
+ * Request that the IPC classification of @curr is updated.
+ *
+ * Returns: none
+ */
+static __always_inline
+void arch_update_ipcc(struct task_struct *curr)
+{
+}
+#endif
+
+#ifndef arch_get_ipcc_score
+/**
+ * arch_get_ipcc_score() - Get the IPC score of a class of task
+ * @ipcc: The IPC class
+ * @cpu: A CPU number
+ *
+ * Returns the performance score of an IPC class when running on @cpu.
+ * Error when either @ipcc or @cpu is invalid.
+ */
+static __always_inline
+int arch_get_ipcc_score(unsigned short ipcc, int cpu)
+{
+ return 1;
+}
+#endif
+#else /* CONFIG_IPC_CLASSES */
+
+#define arch_get_ipcc_score(ipcc, cpu) (-EINVAL)
+#define arch_update_ipcc(curr)
+
+static inline bool sched_ipcc_enabled(void) { return false; }
+
+#endif /* CONFIG_IPC_CLASSES */
+
#ifndef arch_scale_freq_capacity
/**
* arch_scale_freq_capacity - get the frequency scale factor of a given CPU.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8154ef590b9f..eb1654b64df7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -669,6 +669,9 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
+#ifdef CONFIG_IPC_CLASSES
+DEFINE_STATIC_KEY_FALSE(sched_ipcc);
+#endif

static void update_top_cache_domain(int cpu)
{
@@ -2388,6 +2391,11 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
if (has_asym)
static_branch_inc_cpuslocked(&sched_asym_cpucapacity);

+#ifdef CONFIG_IPC_CLASSES
+ if (arch_has_ipc_classes())
+ static_branch_enable_cpuslocked(&sched_ipcc);
+#endif
+
if (rq && sched_debug_verbose) {
pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
--
2.25.1

2022-11-28 14:40:48

by Ricardo Neri

Subject: [PATCH v2 11/22] thermal: intel: hfi: Store per-CPU IPCC scores

The scheduler reads the IPCC scores when balancing load. These reads can
be quite frequent. Hardware can also update the HFI table frequently.
Concurrent access may cause a lot of contention. It gets worse as the
number of CPUs increases.

Instead, create separate per-CPU IPCC scores that the scheduler can read
without the HFI table lock.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v1:
* Added this patch.
---
drivers/thermal/intel/intel_hfi.c | 38 +++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)

diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index df4dc50e19fb..56dba967849c 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -29,6 +29,7 @@
#include <linux/kernel.h>
#include <linux/math.h>
#include <linux/mutex.h>
+#include <linux/percpu.h>
#include <linux/percpu-defs.h>
#include <linux/printk.h>
#include <linux/processor.h>
@@ -172,6 +173,35 @@ static struct workqueue_struct *hfi_updates_wq;
#define HFI_UPDATE_INTERVAL HZ
#define HFI_MAX_THERM_NOTIFY_COUNT 16

+#ifdef CONFIG_IPC_CLASSES
+static int __percpu *hfi_ipcc_scores;
+
+static int alloc_hfi_ipcc_scores(void)
+{
+ hfi_ipcc_scores = __alloc_percpu(sizeof(*hfi_ipcc_scores) *
+ hfi_features.nr_classes,
+ sizeof(*hfi_ipcc_scores));
+
+ return !hfi_ipcc_scores;
+}
+
+static void set_hfi_ipcc_score(void *caps, int cpu)
+{
+ int i, *hfi_class = per_cpu_ptr(hfi_ipcc_scores, cpu);
+
+ for (i = 0; i < hfi_features.nr_classes; i++) {
+ struct hfi_cpu_data *class_caps;
+
+ class_caps = caps + i * hfi_features.class_stride;
+ WRITE_ONCE(hfi_class[i], class_caps->perf_cap);
+ }
+}
+
+#else
+static int alloc_hfi_ipcc_scores(void) { return 0; }
+static void set_hfi_ipcc_score(void *caps, int cpu) { }
+#endif /* CONFIG_IPC_CLASSES */
+
static void get_hfi_caps(struct hfi_instance *hfi_instance,
struct thermal_genl_cpu_caps *cpu_caps)
{
@@ -194,6 +224,8 @@ static void get_hfi_caps(struct hfi_instance *hfi_instance,
cpu_caps[i].efficiency = caps->ee_cap << 2;

++i;
+
+ set_hfi_ipcc_score(caps, cpu);
}
raw_spin_unlock_irq(&hfi_instance->table_lock);
}
@@ -572,8 +604,14 @@ void __init intel_hfi_init(void)
if (!hfi_updates_wq)
goto err_nomem;

+ if (alloc_hfi_ipcc_scores())
+ goto err_ipcc;
+
return;

+err_ipcc:
+ destroy_workqueue(hfi_updates_wq);
+
err_nomem:
for (j = 0; j < i; ++j) {
hfi_instance = &hfi_instances[j];
--
2.25.1

2022-11-28 14:43:04

by Ricardo Neri

Subject: [PATCH v2 07/22] sched/fair: Compute IPC class scores for load balancing

Compute the total IPC class score, both current and prospective, of a
scheduling group and of the local scheduling group.

These IPCC statistics are used during asym_packing load balancing. It
implies that the candidate sched group will have one fewer busy CPU after
load balancing. This observation is important for physical cores with
SMT support.

The IPCC score of scheduling groups composed of SMT siblings needs to
consider that the siblings share CPU resources. When computing the total
IPCC score of the scheduling group, divide score from each sibilng by
the number of busy siblings.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v1:
* Implemented cleanups and reworks from PeterZ. I took all his
suggestions, except the computation of the IPC score before and after
load balancing. We are computing not the average score, but the *total*.
* Check for the SD_SHARE_CPUCAPACITY to compute the throughput of the SMT
siblings of a physical core.
* Used the new interface names.
* Reworded commit message for clarity.
---
kernel/sched/fair.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3a1d6c50a19b..e333f9623b3a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8766,6 +8766,10 @@ struct sg_lb_stats {
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
#endif
+#ifdef CONFIG_IPC_CLASSES
+ long ipcc_score_after; /* Prospective IPCC score after load balancing */
+ long ipcc_score_before; /* IPCC score before load balancing */
+#endif
};

/*
@@ -9140,6 +9144,38 @@ static void update_sg_lb_ipcc_stats(struct sg_lb_ipcc_stats *sgcs,
}
}

+static void update_sg_lb_stats_scores(struct sg_lb_ipcc_stats *sgcs,
+ struct sg_lb_stats *sgs,
+ struct sched_group *sg,
+ int dst_cpu)
+{
+ int busy_cpus, score_on_dst_cpu;
+ long before, after;
+
+ if (!sched_ipcc_enabled())
+ return;
+
+ busy_cpus = sgs->group_weight - sgs->idle_cpus;
+ /* No busy CPUs in the group. No tasks to move. */
+ if (!busy_cpus)
+ return;
+
+ score_on_dst_cpu = arch_get_ipcc_score(sgcs->min_ipcc, dst_cpu);
+
+ before = sgcs->sum_score;
+ after = before - sgcs->min_score;
+
+ /* SMT siblings share throughput. */
+ if (busy_cpus > 1 && sg->flags & SD_SHARE_CPUCAPACITY) {
+ before /= busy_cpus;
+ /* One sibling will become idle after load balance. */
+ after /= busy_cpus - 1;
+ }
+
+ sgs->ipcc_score_after = after + score_on_dst_cpu;
+ sgs->ipcc_score_before = before;
+}
+
#else /* CONFIG_IPC_CLASSES */
static void update_sg_lb_ipcc_stats(struct sg_lb_ipcc_stats *sgcs,
struct rq *rq)
@@ -9149,6 +9185,14 @@ static void update_sg_lb_ipcc_stats(struct sg_lb_ipcc_stats *sgcs,
static void init_rq_ipcc_stats(struct sg_lb_ipcc_stats *class_sgs)
{
}
+
+static void update_sg_lb_stats_scores(struct sg_lb_ipcc_stats *sgcs,
+ struct sg_lb_stats *sgs,
+ struct sched_group *sg,
+ int dst_cpu)
+{
+}
+
#endif /* CONFIG_IPC_CLASSES */

/**
@@ -9329,6 +9373,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
sched_asym(env, sds, sgs, group)) {
+ update_sg_lb_stats_scores(&sgcs, sgs, group, env->dst_cpu);
sgs->group_asym_packing = 1;
}

--
2.25.1

2022-11-28 14:43:38

by Ricardo Neri

Subject: [PATCH v2 06/22] sched/fair: Collect load-balancing stats for IPC classes

When selecting a busiest scheduling group, the IPC class of the current
task can be used to select between two scheduling groups of equal
asym_packing priority and number of running tasks.

Compute a new IPC class performance score for a scheduling group. It
is the sum of the performance of the current tasks of all the runqueues.

Also, keep track of the task with the lowest IPC class score on the
scheduling group.

These two metrics will be used during idle load balancing to compute the
current and the prospective task-class performance of a scheduling
group.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v1:
* Implemented cleanups and reworks from PeterZ. Thanks!
* Used the new interface names.
---
kernel/sched/fair.c | 55 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 55 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 224107278471..3a1d6c50a19b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9100,6 +9100,57 @@ group_type group_classify(unsigned int imbalance_pct,
return group_has_spare;
}

+struct sg_lb_ipcc_stats {
+ int min_score; /* Min(score(rq->curr->ipcc)) */
+ int min_ipcc; /* Min(rq->curr->ipcc) */
+ long sum_score; /* Sum(score(rq->curr->ipcc)) */
+};
+
+#ifdef CONFIG_IPC_CLASSES
+static void init_rq_ipcc_stats(struct sg_lb_ipcc_stats *sgcs)
+{
+ *sgcs = (struct sg_lb_ipcc_stats) {
+ .min_score = INT_MAX,
+ };
+}
+
+/** Called only if cpu_of(@rq) is not idle and has tasks running. */
+static void update_sg_lb_ipcc_stats(struct sg_lb_ipcc_stats *sgcs,
+ struct rq *rq)
+{
+ struct task_struct *curr;
+ unsigned short ipcc;
+ int score;
+
+ if (!sched_ipcc_enabled())
+ return;
+
+ curr = rcu_dereference(rq->curr);
+ if (!curr || (curr->flags & PF_EXITING) || is_idle_task(curr))
+ return;
+
+ ipcc = curr->ipcc;
+ score = arch_get_ipcc_score(ipcc, cpu_of(rq));
+
+ sgcs->sum_score += score;
+
+ if (score < sgcs->min_score) {
+ sgcs->min_score = score;
+ sgcs->min_ipcc = ipcc;
+ }
+}
+
+#else /* CONFIG_IPC_CLASSES */
+static void update_sg_lb_ipcc_stats(struct sg_lb_ipcc_stats *sgcs,
+ struct rq *rq)
+{
+}
+
+static void init_rq_ipcc_stats(struct sg_lb_ipcc_stats *class_sgs)
+{
+}
+#endif /* CONFIG_IPC_CLASSES */
+
/**
* asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
* @dst_cpu: Destination CPU of the load balancing
@@ -9212,9 +9263,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
struct sg_lb_stats *sgs,
int *sg_status)
{
+ struct sg_lb_ipcc_stats sgcs;
int i, nr_running, local_group;

memset(sgs, 0, sizeof(*sgs));
+ init_rq_ipcc_stats(&sgcs);

local_group = group == sds->local;

@@ -9264,6 +9317,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (sgs->group_misfit_task_load < load)
sgs->group_misfit_task_load = load;
}
+
+ update_sg_lb_ipcc_stats(&sgcs, rq);
}

sgs->group_capacity = group->sgc->capacity;
--
2.25.1

2022-11-28 14:44:17

by Ricardo Neri

Subject: [PATCH v2 20/22] x86/cpufeatures: Add feature bit for HRESET

The HRESET instruction prevents the classification of the current task
from influencing the classification of the next task when running serially
on the same logical processor.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v1:
* None
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 4 +++-
arch/x86/kernel/cpu/scattered.c | 1 +
3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 80b2beafc81e..281a7c861b8d 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -308,6 +308,7 @@
#define X86_FEATURE_CALL_DEPTH (11*32+19) /* "" Call depth tracking for RSB stuffing */

#define X86_FEATURE_MSR_TSX_CTRL (11*32+20) /* "" MSR IA32_TSX_CTRL (Intel) implemented */
+#define X86_FEATURE_HRESET (11*32+21) /* Hardware history reset instruction */

/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 96303330223b..7a3ff73164bd 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1078,6 +1078,9 @@
#define MSR_IA32_HW_FEEDBACK_THREAD_CONFIG 0x17d4
#define MSR_IA32_HW_FEEDBACK_CHAR 0x17d2

+/* Hardware History Reset */
+#define MSR_IA32_HW_HRESET_ENABLE 0x17da
+
/* x2APIC locked status */
#define MSR_IA32_XAPIC_DISABLE_STATUS 0xBD
#define LEGACY_XAPIC_DISABLED BIT(0) /*
@@ -1085,5 +1088,4 @@
* disabling x2APIC will cause
* a #GP
*/
-
#endif /* _ASM_X86_MSR_INDEX_H */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index f53944fb8f7f..66bc5713644d 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -28,6 +28,7 @@ static const struct cpuid_bit cpuid_bits[] = {
{ X86_FEATURE_EPB, CPUID_ECX, 3, 0x00000006, 0 },
{ X86_FEATURE_INTEL_PPIN, CPUID_EBX, 0, 0x00000007, 1 },
{ X86_FEATURE_RRSBA_CTRL, CPUID_EDX, 2, 0x00000007, 2 },
+ { X86_FEATURE_HRESET, CPUID_EAX, 22, 0x00000007, 1 },
{ X86_FEATURE_CQM_LLC, CPUID_EDX, 1, 0x0000000f, 0 },
{ X86_FEATURE_CQM_OCCUP_LLC, CPUID_EDX, 0, 0x0000000f, 1 },
{ X86_FEATURE_CQM_MBM_TOTAL, CPUID_EDX, 1, 0x0000000f, 1 },
--
2.25.1

2022-11-28 14:44:30

by Ricardo Neri

Subject: [PATCH v2 10/22] thermal: intel: hfi: Introduce Intel Thread Director classes

On Intel hybrid parts, each type of CPU has specific performance and
energy efficiency capabilities. The Intel Thread Director technology
extends the Hardware Feedback Interface (HFI) to provide performance and
energy efficiency data for advanced classes of instructions.

Add support to parse per-class capabilities.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v1:
* Removed a now obsolete comment.
---
drivers/thermal/intel/intel_hfi.c | 30 ++++++++++++++++++++++++------
1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index a0640f762dc5..df4dc50e19fb 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -79,7 +79,7 @@ union cpuid6_edx {
* @ee_cap: Energy efficiency capability
*
* Capabilities of a logical processor in the HFI table. These capabilities are
- * unitless.
+ * unitless and specific to each HFI class.
*/
struct hfi_cpu_data {
u8 perf_cap;
@@ -91,7 +91,8 @@ struct hfi_cpu_data {
* @perf_updated: Hardware updated performance capabilities
* @ee_updated: Hardware updated energy efficiency capabilities
*
- * Properties of the data in an HFI table.
+ * Properties of the data in an HFI table. There exists one header per
+ * HFI class.
*/
struct hfi_hdr {
u8 perf_updated;
@@ -129,16 +130,21 @@ struct hfi_instance {

/**
* struct hfi_features - Supported HFI features
+ * @nr_classes: Number of classes supported
* @nr_table_pages: Size of the HFI table in 4KB pages
* @cpu_stride: Stride size to locate the capability data of a logical
* processor within the table (i.e., row stride)
+ * @class_stride: Stride size to locate a class within the capability
+ * data of a logical processor or the HFI table header
* @hdr_size: Size of the table header
*
* Parameters and supported features that are common to all HFI instances
*/
struct hfi_features {
+ unsigned int nr_classes;
unsigned int nr_table_pages;
unsigned int cpu_stride;
+ unsigned int class_stride;
unsigned int hdr_size;
};

@@ -325,8 +331,8 @@ static void init_hfi_cpu_index(struct hfi_cpu_info *info)
}

/*
- * The format of the HFI table depends on the number of capabilities that the
- * hardware supports. Keep a data structure to navigate the table.
+ * The format of the HFI table depends on the number of capabilities and classes
+ * that the hardware supports. Keep a data structure to navigate the table.
*/
static void init_hfi_instance(struct hfi_instance *hfi_instance)
{
@@ -507,18 +513,30 @@ static __init int hfi_parse_features(void)
/* The number of 4KB pages required by the table */
hfi_features.nr_table_pages = edx.split.table_pages + 1;

+ /*
+ * Capability fields of an HFI class are grouped together. Classes are
+ * contiguous in memory. Hence, use the number of supported features to
+ * locate a specific class.
+ */
+ hfi_features.class_stride = nr_capabilities;
+
+ /* For now, use only one class of the HFI table */
+ hfi_features.nr_classes = 1;
+
/*
* The header contains change indications for each supported feature.
* The size of the table header is rounded up to be a multiple of 8
* bytes.
*/
- hfi_features.hdr_size = DIV_ROUND_UP(nr_capabilities, 8) * 8;
+ hfi_features.hdr_size = DIV_ROUND_UP(nr_capabilities *
+ hfi_features.nr_classes, 8) * 8;

/*
* Data of each logical processor is also rounded up to be a multiple
* of 8 bytes.
*/
- hfi_features.cpu_stride = DIV_ROUND_UP(nr_capabilities, 8) * 8;
+ hfi_features.cpu_stride = DIV_ROUND_UP(nr_capabilities *
+ hfi_features.nr_classes, 8) * 8;

return 0;
}
--
2.25.1

2022-11-28 14:44:35

by Ricardo Neri

Subject: [PATCH v2 22/22] x86/process: Reset hardware history in context switch

Reset the classification history of the current task when switching to the
next task. Hardware will start the classification of the next task from
scratch.

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v1:
* Measurements of the cost of the HRESET instruction

Methodology:
I created a tight loop with interrupts and preemption disabled. I
recorded the value of the TSC counter before and after executing
HRESET or RDTSC. I repeated the measurement 100,000 times.
I performed the experiment using an Alder Lake S system. I set the
frequency of the CPUs at a fixed value.

The table below compares the cost of HRESET with RDTSC (expressed in
the elapsed TSC count). The cost of the two instructions is
comparable.

PCore ECore
Frequency (GHz) 5.0 3.8
HRESET (avg) 28.5 44.7
HRESET (stdev %) 3.6 2.3
RDTSC (avg) 25.2 35.7
RDTSC (stdev %) 3.9 2.6

* Used an ALTERNATIVE macro instead of static_cpu_has() to execute HRESET
when supported. (PeterZ)
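The measurement methodology above can be roughly approximated in userspace on x86 (a sketch only: unlike the kernel experiment it does not disable interrupts or preemption, nor pin the CPU frequency, and `IPCC_TSC_SAMPLES` is a name made up for this example):

```c
#include <stdint.h>
#include <x86intrin.h>

#define IPCC_TSC_SAMPLES 100000

/*
 * Time back-to-back RDTSC reads and report the average delta in TSC
 * counts. This corresponds to the RDTSC rows of the table above; the
 * kernel experiment replaced the second read with HRESET to measure
 * that instruction instead.
 */
static double measure_rdtsc_cost(void)
{
	uint64_t total = 0;

	for (int i = 0; i < IPCC_TSC_SAMPLES; i++) {
		uint64_t t0 = __rdtsc();
		uint64_t t1 = __rdtsc();

		total += t1 - t0;
	}
	return (double)total / IPCC_TSC_SAMPLES;
}
```

The averages such a loop prints will be noisier than the kernel numbers above, since nothing prevents preemption between the two reads.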
---
arch/x86/include/asm/hreset.h | 31 +++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/common.c | 7 +++++++
arch/x86/kernel/process_32.c | 3 +++
arch/x86/kernel/process_64.c | 3 +++
4 files changed, 44 insertions(+)
create mode 100644 arch/x86/include/asm/hreset.h

diff --git a/arch/x86/include/asm/hreset.h b/arch/x86/include/asm/hreset.h
new file mode 100644
index 000000000000..d68ca2fb8642
--- /dev/null
+++ b/arch/x86/include/asm/hreset.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_HRESET_H
+#define _ASM_X86_HRESET_H
+
+/**
+ * HRESET - History reset. Available since binutils v2.36.
+ *
+ * Request the processor to reset the history of task classification on the
+ * current logical processor. The history components to be
+ * reset are specified in %eax. Only bits specified in CPUID(0x20).EBX
+ * and enabled in the IA32_HRESET_ENABLE MSR can be selected.
+ *
+ * The assembly code looks like:
+ *
+ * hreset %eax
+ *
+ * The corresponding machine code looks like:
+ *
+ * F3 0F 3A F0 ModRM Imm
+ *
+ * The value of ModRM is 0xc0 to specify %eax register addressing.
+ * The ignored immediate operand is set to 0.
+ *
+ * The instruction is documented in the Intel SDM.
+ */
+
+#define __ASM_HRESET ".byte 0xf3, 0xf, 0x3a, 0xf0, 0xc0, 0x0"
+
+void reset_hardware_history(void);
+
+#endif /* _ASM_X86_HRESET_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f8630da2a6dd..6c2b9768698e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -53,6 +53,7 @@
#include <asm/mce.h>
#include <asm/msr.h>
#include <asm/cacheinfo.h>
+#include <asm/hreset.h>
#include <asm/memtype.h>
#include <asm/microcode.h>
#include <asm/microcode_intel.h>
@@ -414,6 +415,12 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)

static u32 hardware_history_features __ro_after_init;

+void reset_hardware_history(void)
+{
+ asm_inline volatile (ALTERNATIVE("", __ASM_HRESET, X86_FEATURE_HRESET)
+ : : "a" (hardware_history_features) : "memory");
+}
+
static __always_inline void setup_hreset(struct cpuinfo_x86 *c)
{
if (!cpu_feature_enabled(X86_FEATURE_HRESET))
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 470c128759ea..397a6e6f4e61 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -52,6 +52,7 @@
#include <asm/switch_to.h>
#include <asm/vm86.h>
#include <asm/resctrl.h>
+#include <asm/hreset.h>
#include <asm/proto.h>

#include "process.h"
@@ -214,6 +215,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/* Load the Intel cache allocation PQR MSR. */
resctrl_sched_in();

+ reset_hardware_history();
+
return prev_p;
}

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 084ec467dbb1..ac9b3d44c1bd 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -53,6 +53,7 @@
#include <asm/xen/hypervisor.h>
#include <asm/vdso.h>
#include <asm/resctrl.h>
+#include <asm/hreset.h>
#include <asm/unistd.h>
#include <asm/fsgsbase.h>
#ifdef CONFIG_IA32_EMULATION
@@ -658,6 +659,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/* Load the Intel cache allocation PQR MSR. */
resctrl_sched_in();

+ reset_hardware_history();
+
return prev_p;
}

--
2.25.1

2022-11-28 14:45:10

by Ricardo Neri

Subject: [PATCH v2 19/22] thermal: intel: hfi: Implement model-specific checks for task classification

In Alderlake and Raptorlake, the result of thread classification is more
accurate when only one SMT sibling is busy. Classification results for
classes 2 and 3 are always reliable.

To avoid unnecessary migrations, only update the class of a task if it has
been the same during 4 consecutive ticks.
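The debounce rule can be modeled outside the kernel as follows (an illustrative sketch: `struct task_ipcc` is a stand-in for the relevant task_struct fields, not the kernel code itself):

```c
#define CLASS_DEBOUNCER_SKIPS 4

/* Stand-in for the ipcc bit-fields carried in task_struct. */
struct task_ipcc {
	unsigned short ipcc;		/* committed class; 0 = unclassified */
	unsigned short ipcc_tmp;	/* most recently observed class */
	unsigned short ipcc_cntr;	/* consecutive ticks ipcc_tmp was seen */
};

/*
 * Commit @new_ipcc only after hardware has reported it for
 * CLASS_DEBOUNCER_SKIPS consecutive ticks.
 */
static void debounce_and_update_class(struct task_ipcc *p,
				      unsigned char new_ipcc)
{
	if (p->ipcc_tmp != new_ipcc) {
		p->ipcc_cntr = 1;	/* class changed: restart counting */
	} else if (p->ipcc_cntr + 1 < CLASS_DEBOUNCER_SKIPS) {
		p->ipcc_cntr++;		/* same class, not stable yet */
	} else {
		p->ipcc = new_ipcc;	/* stable for 4 ticks: commit */
	}
	p->ipcc_tmp = new_ipcc;
}
```

With these rules a task whose observed class flips back and forth never has its committed class updated, which is what suppresses the unnecessary migrations.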

Cc: Ben Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Srinivas Pandruvada <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tim C. Chen <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v1:
* Adjusted the result of the classification of Intel Thread Director to
start at class 1. Class 0 for the scheduler means that the task is
unclassified.
* Used the new names of the IPC classes members in task_struct.
* Reworked helper functions to use sched_smt_siblings_idle() to query
the idle state of the SMT siblings of a CPU.
---
drivers/thermal/intel/intel_hfi.c | 60 ++++++++++++++++++++++++++++++-
1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/drivers/thermal/intel/intel_hfi.c b/drivers/thermal/intel/intel_hfi.c
index 8287bfd7d6b6..a9ae09036909 100644
--- a/drivers/thermal/intel/intel_hfi.c
+++ b/drivers/thermal/intel/intel_hfi.c
@@ -40,6 +40,7 @@
#include <linux/workqueue.h>

#include <asm/msr.h>
+#include <asm/intel-family.h>

#include "../thermal_core.h"
#include "intel_hfi.h"
@@ -216,9 +217,64 @@ int intel_hfi_has_ipc_classes(void)
return cpu_feature_enabled(X86_FEATURE_ITD);
}

+#define CLASS_DEBOUNCER_SKIPS 4
+
+/**
+ * debounce_and_update_class() - Process and update a task's classification
+ *
+ * @p: The task of which the classification will be updated
+ * @new_ipcc: The new IPC classification
+ *
+ * Update the classification of @p with the new value that hardware provides.
+ * Only update the classification of @p if it has been the same during
+ * CLASS_DEBOUNCER_SKIPS consecutive ticks.
+ */
+static void debounce_and_update_class(struct task_struct *p, u8 new_ipcc)
+{
+ u16 debounce_skip;
+
+ /* The class of @p changed, only restart the debounce counter. */
+ if (p->ipcc_tmp != new_ipcc) {
+ p->ipcc_cntr = 1;
+ goto out;
+ }
+
+ /*
+ * The class of @p did not change. Update it if it has been the same
+ * for CLASS_DEBOUNCER_SKIPS user ticks.
+ */
+ debounce_skip = p->ipcc_cntr + 1;
+ if (debounce_skip < CLASS_DEBOUNCER_SKIPS)
+ p->ipcc_cntr++;
+ else
+ p->ipcc = new_ipcc;
+
+out:
+ p->ipcc_tmp = new_ipcc;
+}
+
+static bool classification_is_accurate(u8 hfi_class, bool smt_siblings_idle)
+{
+ switch (boot_cpu_data.x86_model) {
+ case INTEL_FAM6_ALDERLAKE:
+ case INTEL_FAM6_ALDERLAKE_L:
+ case INTEL_FAM6_RAPTORLAKE:
+ case INTEL_FAM6_RAPTORLAKE_P:
+ case INTEL_FAM6_RAPTORLAKE_S:
+ if (hfi_class == 3 || hfi_class == 2 || smt_siblings_idle)
+ return true;
+
+ return false;
+
+ default:
+ return true;
+ }
+}
+
void intel_hfi_update_ipcc(struct task_struct *curr)
{
union hfi_thread_feedback_char_msr msr;
+ bool idle;

/* We should not be here if ITD is not supported. */
if (!cpu_feature_enabled(X86_FEATURE_ITD)) {
@@ -234,7 +290,9 @@ void intel_hfi_update_ipcc(struct task_struct *curr)
* 0 is a valid classification for Intel Thread Director. A scheduler
* IPCC class of 0 means that the task is unclassified. Adjust.
*/
- curr->ipcc = msr.split.classid + 1;
+ idle = sched_smt_siblings_idle(task_cpu(curr));
+ if (classification_is_accurate(msr.split.classid, idle))
+ debounce_and_update_class(curr, msr.split.classid + 1);
}

int intel_hfi_get_ipcc_score(unsigned short ipcc, int cpu)
--
2.25.1

2022-12-07 17:38:59

by Dietmar Eggemann

Subject: Re: [PATCH v2 06/22] sched/fair: Collect load-balancing stats for IPC classes

On 28/11/2022 14:20, Ricardo Neri wrote:

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 224107278471..3a1d6c50a19b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9100,6 +9100,57 @@ group_type group_classify(unsigned int imbalance_pct,
> return group_has_spare;
> }
>
> +struct sg_lb_ipcc_stats {
> + int min_score; /* Min(score(rq->curr->ipcc)) */
> + int min_ipcc; /* Min(rq->curr->ipcc) */
> + long sum_score; /* Sum(score(rq->curr->ipcc)) */
> +};

Wouldn't it be cleaner to put `min_score`, `min_ipcc` and `sum_score`
into `struct sg_lb_stats` next to `ipcc_score_{after, before}` under the
same #ifdef CONFIG_IPC_CLASSES?

Looks like those IPCC stats would only be needed in the specific
condition under which update_sg_lb_stats_scores() is called?

> +#ifdef CONFIG_IPC_CLASSES
> +static void init_rq_ipcc_stats(struct sg_lb_ipcc_stats *sgcs)
> +{
> + *sgcs = (struct sg_lb_ipcc_stats) {
> + .min_score = INT_MAX,
> + };
> +}
> +
> +/** Called only if cpu_of(@rq) is not idle and has tasks running. */
> +static void update_sg_lb_ipcc_stats(struct sg_lb_ipcc_stats *sgcs,
> + struct rq *rq)
> +{
> + struct task_struct *curr;
> + unsigned short ipcc;
> + int score;
> +
> + if (!sched_ipcc_enabled())
> + return;
> +
> + curr = rcu_dereference(rq->curr);
> + if (!curr || (curr->flags & PF_EXITING) || is_idle_task(curr))

So the Idle task is excluded but RT, DL, (Stopper) tasks are not. Looks
weird if non-CFS tasks could influence CFS load-balancing.
AFAICS, RT and DL tasks could have p->ipcc != IPC_CLASS_UNCLASSIFIED?

[...]

2022-12-08 09:01:43

by Ionela Voinescu

Subject: Re: [PATCH v2 02/22] sched: Add interfaces for IPC classes

Hi,

On Monday 28 Nov 2022 at 05:20:40 (-0800), Ricardo Neri wrote:
[..]
> +#ifndef arch_has_ipc_classes
> +/**
> + * arch_has_ipc_classes() - Check whether hardware supports IPC classes of tasks
> + *
> + * Returns: true if IPC classes of tasks are supported.
> + */
> +static __always_inline
> +bool arch_has_ipc_classes(void)
> +{
> + return false;
> +}
> +#endif
> +
> +#ifndef arch_update_ipcc
> +/**
> + * arch_update_ipcc() - Update the IPC class of the current task
> + * @curr: The current task
> + *
> + * Request that the IPC classification of @curr is updated.
> + *
> + * Returns: none
> + */
> +static __always_inline
> +void arch_update_ipcc(struct task_struct *curr)
> +{
> +}
> +#endif
> +
> +#ifndef arch_get_ipcc_score
> +/**
> + * arch_get_ipcc_score() - Get the IPC score of a class of task
> + * @ipcc: The IPC class
> + * @cpu: A CPU number
> + *
> + * Returns the performance score of an IPC class when running on @cpu.
> + * Error when either @ipcc or @cpu is invalid.
> + */
> +static __always_inline
> +int arch_get_ipcc_score(unsigned short ipcc, int cpu)
> +{
> + return 1;
> +}
> +#endif

The interface looks mostly alright but this arch_get_ipcc_score() leaves
unclear what are the characteristics of the returned value.

Does it have a meaning as an absolute value or is it a value on an
abstract scale? If it should be interpreted as instructions per cycle,
if I wanted to have a proper comparison between the ability of two CPUs
to handle this class of tasks then I would need to take into consideration
the maximum frequency of each CPU. If it's a performance value on an
abstract scale (more likely), similar to capacity, then it might be good
to better define this abstract scale. That would help with the default
implementation where possibly the best choice for a return value would
be the maximum value on the scale, suggesting equal/maximum performance
for different CPUs handling the class of tasks.

I suppose you avoided returning 0 for the default implementation as you
intend that to mean the inability of the CPU to handle that class of
tasks? It would be good to document this.

> +#else /* CONFIG_IPC_CLASSES */
> +
> +#define arch_get_ipcc_score(ipcc, cpu) (-EINVAL)
> +#define arch_update_ipcc(curr)
> +
> +static inline bool sched_ipcc_enabled(void) { return false; }
> +
> +#endif /* CONFIG_IPC_CLASSES */
> +
> #ifndef arch_scale_freq_capacity
> /**
> * arch_scale_freq_capacity - get the frequency scale factor of a given CPU.
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8154ef590b9f..eb1654b64df7 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -669,6 +669,9 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
> +#ifdef CONFIG_IPC_CLASSES
> +DEFINE_STATIC_KEY_FALSE(sched_ipcc);
> +#endif
>
> static void update_top_cache_domain(int cpu)
> {
> @@ -2388,6 +2391,11 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> if (has_asym)
> static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
>
> +#ifdef CONFIG_IPC_CLASSES
> + if (arch_has_ipc_classes())
> + static_branch_enable_cpuslocked(&sched_ipcc);
> +#endif

Wouldn't this be better placed directly in sched_init_smp()?
It's not gated by, and does not need, any sched domain information.

Hope it helps,
Ionela.

> +
> if (rq && sched_debug_verbose) {
> pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
> cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
> --
> 2.25.1
>
>

2022-12-08 09:32:55

by Ionela Voinescu

Subject: Re: [PATCH v2 06/22] sched/fair: Collect load-balancing stats for IPC classes

Hi,

On Monday 28 Nov 2022 at 05:20:44 (-0800), Ricardo Neri wrote:
[..]
>
> +struct sg_lb_ipcc_stats {
> + int min_score; /* Min(score(rq->curr->ipcc)) */
> + int min_ipcc; /* Min(rq->curr->ipcc) */

Nit: this is not the minimum IPCC between the current tasks of all
runqueues, but the IPCC specific to the task with the minimum score.

Possibly there's not much to be done about the variable name, but the
comment can be made more clear.

Thanks,
Ionela.

> + long sum_score; /* Sum(score(rq->curr->ipcc)) */
> +};
[..]
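The relationship Ionela points out (min_ipcc is the class of the task that holds the minimum score, not a minimum taken over classes) can be sketched like this (names are illustrative, not the kernel code):

```c
#include <limits.h>

struct sg_ipcc_stats {
	int min_score;			/* lowest per-task score seen so far */
	unsigned short min_ipcc;	/* class of the task holding min_score */
	long sum_score;			/* sum of scores across runqueues */
};

static void init_stats(struct sg_ipcc_stats *s)
{
	s->min_score = INT_MAX;
	s->min_ipcc = 0;
	s->sum_score = 0;
}

/* Fold in one runqueue whose current task has class @ipcc and score @score. */
static void update_stats(struct sg_ipcc_stats *s, unsigned short ipcc,
			 int score)
{
	s->sum_score += score;
	if (score < s->min_score) {
		s->min_score = score;
		s->min_ipcc = ipcc;	/* the class travels with its score */
	}
}
```

That is, min_ipcc is updated only when a new minimum score is found, which is why the comment, rather than the variable name, needs the clarification.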

2022-12-12 21:39:19

by Ricardo Neri

Subject: Re: [PATCH v2 06/22] sched/fair: Collect load-balancing stats for IPC classes

On Wed, Dec 07, 2022 at 06:00:32PM +0100, Dietmar Eggemann wrote:
> On 28/11/2022 14:20, Ricardo Neri wrote:
>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 224107278471..3a1d6c50a19b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9100,6 +9100,57 @@ group_type group_classify(unsigned int imbalance_pct,
> > return group_has_spare;
> > }
> >
> > +struct sg_lb_ipcc_stats {
> > + int min_score; /* Min(score(rq->curr->ipcc)) */
> > + int min_ipcc; /* Min(rq->curr->ipcc) */
> > + long sum_score; /* Sum(score(rq->curr->ipcc)) */
> > +};
>
> Wouldn't it be cleaner to put `min_score`, `min_ipcc` and `sum_score`
> into `struct sg_lb_stats` next to `ipcc_score_{after, before}` under the
> same #ifdef CONFIG_IPC_CLASSES?

Yes, that is a good observation. I initially wanted to hide these
intermediate values and only expose the end result ipcc_score_{after, before}
in struct sg_lb_stats. I agree, it would look cleaner as you suggest.
>
> Looks like those IPCC stats would only be needed in the specific
> condition under which update_sg_lb_stats_scores() is called?

True.

>
> > +#ifdef CONFIG_IPC_CLASSES
> > +static void init_rq_ipcc_stats(struct sg_lb_ipcc_stats *sgcs)
> > +{
> > + *sgcs = (struct sg_lb_ipcc_stats) {
> > + .min_score = INT_MAX,
> > + };
> > +}
> > +
> > +/** Called only if cpu_of(@rq) is not idle and has tasks running. */
> > +static void update_sg_lb_ipcc_stats(struct sg_lb_ipcc_stats *sgcs,
> > + struct rq *rq)
> > +{
> > + struct task_struct *curr;
> > + unsigned short ipcc;
> > + int score;
> > +
> > + if (!sched_ipcc_enabled())
> > + return;
> > +
> > + curr = rcu_dereference(rq->curr);
> > + if (!curr || (curr->flags & PF_EXITING) || is_idle_task(curr))
>
> So the Idle task is excluded but RT, DL, (Stopper) tasks are not. Looks
> weird if non-CFS tasks could influence CFS load-balancing.
> AFAICS, RT and DL tasks could have p->ipcc != IPC_CLASS_UNCLASSIFIED?

Agreed. Perhaps I can also check for !(task_is_realtime()), which
seems to cover all these cases.
>
> [...]

2022-12-14 00:38:25

by Ricardo Neri

Subject: Re: [PATCH v2 06/22] sched/fair: Collect load-balancing stats for IPC classes

On Thu, Dec 08, 2022 at 08:50:23AM +0000, Ionela Voinescu wrote:
> Hi,
>
> On Monday 28 Nov 2022 at 05:20:44 (-0800), Ricardo Neri wrote:
> [..]
> >
> > +struct sg_lb_ipcc_stats {
> > + int min_score; /* Min(score(rq->curr->ipcc)) */
> > + int min_ipcc; /* Min(rq->curr->ipcc) */
>
> Nit: this is not the minimum IPCC between the current tasks of all
> runqueues, but the IPCC specific to the task with the minimum score.

> Possibly there's not much to be done about the variable name, but the
> comment can be made more clear.

Very true. I will make the change.

Thanks and BR,
Ricardo

2022-12-14 00:44:49

by Ricardo Neri

Subject: Re: [PATCH v2 02/22] sched: Add interfaces for IPC classes

On Thu, Dec 08, 2022 at 08:48:46AM +0000, Ionela Voinescu wrote:
> Hi,
>
> On Monday 28 Nov 2022 at 05:20:40 (-0800), Ricardo Neri wrote:
> [..]
> > +#ifndef arch_has_ipc_classes
> > +/**
> > + * arch_has_ipc_classes() - Check whether hardware supports IPC classes of tasks
> > + *
> > > + * Returns: true if IPC classes of tasks are supported.
> > + */
> > +static __always_inline
> > +bool arch_has_ipc_classes(void)
> > +{
> > + return false;
> > +}
> > +#endif
> > +
> > +#ifndef arch_update_ipcc
> > +/**
> > + * arch_update_ipcc() - Update the IPC class of the current task
> > + * @curr: The current task
> > + *
> > + * Request that the IPC classification of @curr is updated.
> > + *
> > + * Returns: none
> > + */
> > +static __always_inline
> > +void arch_update_ipcc(struct task_struct *curr)
> > +{
> > +}
> > +#endif
> > +
> > +#ifndef arch_get_ipcc_score
> > +/**
> > + * arch_get_ipcc_score() - Get the IPC score of a class of task
> > + * @ipcc: The IPC class
> > + * @cpu: A CPU number
> > + *
> > + * Returns the performance score of an IPC class when running on @cpu.
> > > + * Error when either @ipcc or @cpu is invalid.
> > + */
> > +static __always_inline
> > +int arch_get_ipcc_score(unsigned short ipcc, int cpu)
> > +{
> > + return 1;
> > +}
> > +#endif

Thank you very much for your feedback Ionela!

>
> The interface looks mostly alright but this arch_get_ipcc_score() leaves
> unclear what are the characteristics of the returned value.

Fair point. I mean for the return value to be defined by architectures;
but yes, architectures need to know how to implement this function.

>
> Does it have a meaning as an absolute value or is it a value on an
> abstract scale? If it should be interpreted as instructions per cycle,
> if I wanted to have a proper comparison between the ability of two CPUs
> to handle this class of tasks then I would need to take into consideration
> the maximum frequency of each CPU.

Do you mean when calling arch_get_ipcc_score()? If yes, then I agree, IPC
class may not be the only factor, but the criteria to use the return value
is up to the caller.

In asym_packing it is assumed that higher-priority CPUs are preferred.
When balancing load, IPC class scores are used to select between otherwise
identical runqueues. This should also be the case for migrate_misfit: we
know already that the tasks being considered do not fit on their current
CPU.

We would need to think what to do with other type of balancing, if at all.

That said, arch_get_ipcc_score() should only return a metric of the
instructions-per-*cycle*, independent of frequency, no?

> If it's a performance value on an
> abstract scale (more likely), similar cu capacity, then it might be good
> to better define this abstract scale. That would help with the default
> implementation where possibly the best choice for a return value would
> be the maximum value on the scale, suggesting equal/maximum performance
> for different CPUs handling the class of tasks.

I guess something like:

#define SCHED_IPCC_DEFAULT_SCALE 1024

?

I think I am fine with this value being the default. I also think that it
is up to architectures whether to scale all IPC class scores from the
best-performing class on the best-performing CPU. Doing so would introduce
overhead, especially if hardware updates the IPC class scores multiple
times during runtime.

>
> I suppose you avoided returning 0 for the default implementation as you
> intend that to mean the inability of the CPU to handle that class of
> tasks? It would be good to document this.

I meant this to be minimum possible IPC class score for any CPU: any
CPU should be able to handle any IPC class. If not implemented, all CPUs
handle all IPC classes equally.

>
> > +#else /* CONFIG_IPC_CLASSES */
> > +
> > +#define arch_get_ipcc_score(ipcc, cpu) (-EINVAL)
> > +#define arch_update_ipcc(curr)
> > +
> > +static inline bool sched_ipcc_enabled(void) { return false; }
> > +
> > +#endif /* CONFIG_IPC_CLASSES */
> > +
> > #ifndef arch_scale_freq_capacity
> > /**
> > * arch_scale_freq_capacity - get the frequency scale factor of a given CPU.
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 8154ef590b9f..eb1654b64df7 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -669,6 +669,9 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> > DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
> > +#ifdef CONFIG_IPC_CLASSES
> > +DEFINE_STATIC_KEY_FALSE(sched_ipcc);
> > +#endif
> >
> > static void update_top_cache_domain(int cpu)
> > {
> > @@ -2388,6 +2391,11 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > if (has_asym)
> > static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
> >
> > +#ifdef CONFIG_IPC_CLASSES
> > + if (arch_has_ipc_classes())
> > + static_branch_enable_cpuslocked(&sched_ipcc);
> > +#endif
>
> Wouldn't this be better placed directly in sched_init_smp()?
> It's not gated by, and does not need, any sched domain information.

Very true. I will take your suggestion.

>
> Hope it helps,

It does help significantly. Thanks again for your feedback.

Thanks and BR,
Ricardo

2022-12-14 07:40:57

by Lukasz Luba

Subject: Re: [PATCH v2 02/22] sched: Add interfaces for IPC classes

Hi Ricardo,

I have some generic comment for the design of those interfaces.

On 11/28/22 13:20, Ricardo Neri wrote:
> Add the interfaces that architectures shall implement to convey the data
> to support IPC classes.
>
> arch_update_ipcc() updates the IPC classification of the current task as
> given by hardware.
>
> arch_get_ipcc_score() provides a performance score for a given IPC class
> when placed on a specific CPU. Higher scores indicate higher performance.
>
> The number of classes and the score of each class of task are determined
> by hardware.
>
> Cc: Ben Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Joel Fernandes (Google) <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Tim C. Chen <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Ricardo Neri <[email protected]>
> ---
> Changes since v1:
> * Shortened the names of the IPCC interfaces (PeterZ):
> sched_task_classes_enabled >> sched_ipcc_enabled
> arch_has_task_classes >> arch_has_ipc_classes
> arch_update_task_class >> arch_update_ipcc
> arch_get_task_class_score >> arch_get_ipcc_score
> * Removed smt_siblings_idle argument from arch_update_ipcc(). (PeterZ)
> ---
> kernel/sched/sched.h | 60 +++++++++++++++++++++++++++++++++++++++++
> kernel/sched/topology.c | 8 ++++++
> 2 files changed, 68 insertions(+)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b1d338a740e5..75e22baa2622 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2531,6 +2531,66 @@ void arch_scale_freq_tick(void)
> }
> #endif
>
> +#ifdef CONFIG_IPC_CLASSES
> +DECLARE_STATIC_KEY_FALSE(sched_ipcc);
> +
> +static inline bool sched_ipcc_enabled(void)
> +{
> + return static_branch_unlikely(&sched_ipcc);
> +}
> +
> +#ifndef arch_has_ipc_classes
> +/**
> + * arch_has_ipc_classes() - Check whether hardware supports IPC classes of tasks
> + *
> + * Returns: true if IPC classes of tasks are supported.
> + */
> +static __always_inline
> +bool arch_has_ipc_classes(void)
> +{
> + return false;
> +}
> +#endif
> +
> +#ifndef arch_update_ipcc
> +/**
> + * arch_update_ipcc() - Update the IPC class of the current task
> + * @curr: The current task
> + *
> + * Request that the IPC classification of @curr is updated.
> + *
> + * Returns: none
> + */
> +static __always_inline
> +void arch_update_ipcc(struct task_struct *curr)
> +{
> +}
> +#endif
> +
> +#ifndef arch_get_ipcc_score
> +/**
> + * arch_get_ipcc_score() - Get the IPC score of a class of task
> + * @ipcc: The IPC class
> + * @cpu: A CPU number
> + *
> + * Returns the performance score of an IPC class when running on @cpu.
> + * Error when either @ipcc or @cpu is invalid.
> + */
> +static __always_inline
> +int arch_get_ipcc_score(unsigned short ipcc, int cpu)
> +{
> + return 1;
> +}
> +#endif

Those interfaces are quite simple, probably work really OK with
your HW/FW. If any other architecture is going to re-use them
in future, we might face some issue. Let me explain why.

These kernel functions start to be used very early in boot.
Your HW/FW is probably instantly ready to work from the very
beginning during boot. What is some other HW needs some
preparation code, like setup communication channel to FW or enable
needed clocks/events/etc.

What I would like to see is a similar mechanism to the one in schedutil.
Schedutil governor has to wait till cpufreq initialize the cpu freq
driver and policy objects (which sometimes takes ~2-3 sec). After that
cpufreq fwk starts the governor which populates this hook [1].
It's based on RCU mechanism with function pointer that can be then
called from the task scheduler when everything is ready to work.

If we (Arm) are going to use your proposed interfaces, we might need
different mechanisms because the platform likely would be ready only after
our SCMI FW channels and cpufreq are set up.

Would it be possible to address such a need now, or would I have to
change that interface code later?

Regards,
Lukasz

[1]
https://elixir.bootlin.com/linux/latest/source/kernel/sched/cpufreq.c#L29
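The schedutil-style late registration referred to above boils down to a function pointer that stays NULL until the platform driver is ready. A minimal userspace model of that pattern follows (in the kernel the pointer would be per-CPU and RCU-protected; all names here are made up for illustration):

```c
#include <stddef.h>

/* Hook the scheduler would call; NULL until a platform driver registers. */
typedef int (*ipcc_score_fn)(unsigned short ipcc, int cpu);

static ipcc_score_fn ipcc_hook;

static void ipcc_register_hook(ipcc_score_fn fn)
{
	ipcc_hook = fn;	/* rcu_assign_pointer() in a kernel version */
}

/* Fall back to a neutral score while no driver has registered yet. */
static int get_ipcc_score(unsigned short ipcc, int cpu)
{
	ipcc_score_fn fn = ipcc_hook;	/* rcu_dereference() in the kernel */

	return fn ? fn(ipcc, cpu) : 1;
}

/* Example driver callback, registered once the FW channel is up. */
static int example_driver_score(unsigned short ipcc, int cpu)
{
	(void)cpu;
	return 100 + ipcc;
}
```

Until registration the caller sees the default score, so a platform whose FW channel comes up seconds into boot would simply behave as if IPC classes were unsupported until then.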

2022-12-14 23:22:37

by Ionela Voinescu

Subject: Re: [PATCH v2 02/22] sched: Add interfaces for IPC classes

Hi,

On Tuesday 13 Dec 2022 at 16:31:28 (-0800), Ricardo Neri wrote:
> On Thu, Dec 08, 2022 at 08:48:46AM +0000, Ionela Voinescu wrote:
> > Hi,
> >
> > On Monday 28 Nov 2022 at 05:20:40 (-0800), Ricardo Neri wrote:
> > [..]
> > > +#ifndef arch_has_ipc_classes
> > > +/**
> > > + * arch_has_ipc_classes() - Check whether hardware supports IPC classes of tasks
> > > + *
> > > + * Returns: true if IPC classes of tasks are supported.
> > > + */
> > > +static __always_inline
> > > +bool arch_has_ipc_classes(void)
> > > +{
> > > + return false;
> > > +}
> > > +#endif
> > > +
> > > +#ifndef arch_update_ipcc
> > > +/**
> > > + * arch_update_ipcc() - Update the IPC class of the current task
> > > + * @curr: The current task
> > > + *
> > > + * Request that the IPC classification of @curr is updated.
> > > + *
> > > + * Returns: none
> > > + */
> > > +static __always_inline
> > > +void arch_update_ipcc(struct task_struct *curr)
> > > +{
> > > +}
> > > +#endif
> > > +
> > > +#ifndef arch_get_ipcc_score
> > > +/**
> > > + * arch_get_ipcc_score() - Get the IPC score of a class of task
> > > + * @ipcc: The IPC class
> > > + * @cpu: A CPU number
> > > + *
> > > + * Returns the performance score of an IPC class when running on @cpu.
> > > + * Error when either @ipcc or @cpu is invalid.
> > > + */
> > > +static __always_inline
> > > +int arch_get_ipcc_score(unsigned short ipcc, int cpu)
> > > +{
> > > + return 1;
> > > +}
> > > +#endif
>
> Thank you very much for your feedback Ionela!
>
> >
> > The interface looks mostly alright but this arch_get_ipcc_score() leaves
> > unclear what are the characteristics of the returned value.
>
> Fair point. I mean for the return value to be defined by architectures;
> but yes, architectures need to know how to implement this function.
>
> >
> > Does it have a meaning as an absolute value or is it a value on an
> > abstract scale? If it should be interpreted as instructions per cycle,
> > if I wanted to have a proper comparison between the ability of two CPUs
> > to handle this class of tasks then I would need to take into consideration
> > the maximum frequency of each CPU.
>
> Do you mean when calling arch_get_ipcc_score()? If yes, then I agree, IPC
> class may not be the only factor, but the criteria to use the return value
> is up to the caller.
>

Yes, but if different architectures give different meanings to this score
(scale, relative difference between two values, etc) while the policies
are common (uses of arch_get_ipcc_score() in common scheduler paths)
then the outcome can be vastly different.

If the "criteria to use the returned value is up to the caller", then
the caller of arch_get_ipcc_score() should always be architecture
specific code, which currently is not (see 09/22).

> In asym_packing it is assumed that higher-priority CPUs are preferred.
> When balancing load, IPC class scores are used to select between otherwise
> identical runqueues. This should also be the case for migrate_misfit: we
> know already that the tasks being considered do not fit on their current
> CPU.
>
> We would need to think what to do with other type of balancing, if at all.
>
> That said, arch_get_ipcc_score() should only return a metric of the
> instructions-per-*cycle*, independent of frequency, no?
>

Yes, performance on an abstract scale is preferred here. We would not
want to have to scale the score by frequency :). It was just an example
showing that the description of arch_get_ipcc_score() should be clarified.
Another possible clarification: is it expected that the scores scale
linearly with performance (does double the score mean double the
performance?).

> > If it's a performance value on an
> abstract scale (more likely), similar to capacity, then it might be good
> > to better define this abstract scale. That would help with the default
> > implementation where possibly the best choice for a return value would
> > be the maximum value on the scale, suggesting equal/maximum performance
> > for different CPUs handling the class of tasks.
>
> I guess something like:
>
> #define SCHED_IPCC_DEFAULT_SCALE 1024
>
> ?
>
> I think I am fine with this value being the default. I also think that it
> is up to architectures whether to scale all IPC class scores from the
> best-performing class on the best-performing CPU. Doing so would introduce
> overhead, especially if hardware updates the IPC class scores multiple
> times during runtime.
>

Yes, it's a very good point. Initially I thought that one would need to
rescale the values anyway for them to make sense relative to each other,
but I now realise that would not be needed.

Therefore, you are right, to avoid this extra work it's best to leave
the range of possible score values up to the implementer and not force
something like [0 - 1024].

But again, this raises the point that if one architecture decides to
return its scores on a scale [0 - 1024] and possibly use these scores to
scale utilization/alter capacity for example, this cannot be generic
policy as not all architectures are guaranteed to use this scale for its
scores.

So leaving the score unrestricted makes it more difficult to have
generic policies across architectures that use them.
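One way to reconcile arch-defined raw scores with generic policies would be for each architecture to rescale its scores into a common range before handing them to the scheduler. A hypothetical sketch (`SCHED_IPCC_SCALE` and the helper are inventions of this example, not part of the series):

```c
#define SCHED_IPCC_SCALE 1024	/* hypothetical common reference scale */

/*
 * Map a raw, arch-specific score into [0, SCHED_IPCC_SCALE], given the
 * best raw score the architecture can report. Assuming the raw scale is
 * itself linear in performance, doubling the scaled score then means
 * doubling the expected performance on any architecture.
 */
static int ipcc_score_normalize(int raw, int raw_max)
{
	if (raw_max <= 0)
		return 0;
	if (raw >= raw_max)
		return SCHED_IPCC_SCALE;
	return raw * SCHED_IPCC_SCALE / raw_max;
}
```

As discussed above, doing this rescaling on every hardware update is exactly the overhead Ricardo wants to avoid, so it is a trade-off rather than a recommendation.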

> >
> > I suppose you avoided returning 0 for the default implementation as you
> > intend that to mean the inability of the CPU to handle that class of
> > tasks? It would be good to document this.
>
> I meant this to be minimum possible IPC class score for any CPU: any
> CPU should be able to handle any IPC class. If not implemented, all CPUs
> handle all IPC classes equally.
>

Ah, I see. In this case you might as well return 0 in the default
implementation of arch_get_ipcc_score(). I know it does not matter much
what gets returned there, but returning a meaningless "1" is strange to
me :).

Thanks,
Ionela.

2022-12-16 22:10:12

by Ricardo Neri

[permalink] [raw]
Subject: Re: [PATCH v2 02/22] sched: Add interfaces for IPC classes

On Wed, Dec 14, 2022 at 07:36:44AM +0000, Lukasz Luba wrote:
> Hi Ricardo,
>
> I have some generic comment for the design of those interfaces.
>
> On 11/28/22 13:20, Ricardo Neri wrote:
> > Add the interfaces that architectures shall implement to convey the data
> > to support IPC classes.
> >
> > arch_update_ipcc() updates the IPC classification of the current task as
> > given by hardware.
> >
> > arch_get_ipcc_score() provides a performance score for a given IPC class
> > when placed on a specific CPU. Higher scores indicate higher performance.
> >
> > The number of classes and the score of each class of task are determined
> > by hardware.
> >
> > Cc: Ben Segall <[email protected]>
> > Cc: Daniel Bristot de Oliveira <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Joel Fernandes (Google) <[email protected]>
> > Cc: Len Brown <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Srinivas Pandruvada <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Tim C. Chen <[email protected]>
> > Cc: Valentin Schneider <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Ricardo Neri <[email protected]>
> > ---
> > Changes since v1:
> > * Shortened the names of the IPCC interfaces (PeterZ):
> > sched_task_classes_enabled >> sched_ipcc_enabled
> > arch_has_task_classes >> arch_has_ipc_classes
> > arch_update_task_class >> arch_update_ipcc
> > arch_get_task_class_score >> arch_get_ipcc_score
> > * Removed smt_siblings_idle argument from arch_update_ipcc(). (PeterZ)
> > ---
> > kernel/sched/sched.h | 60 +++++++++++++++++++++++++++++++++++++++++
> > kernel/sched/topology.c | 8 ++++++
> > 2 files changed, 68 insertions(+)
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index b1d338a740e5..75e22baa2622 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2531,6 +2531,66 @@ void arch_scale_freq_tick(void)
> > }
> > #endif
> > +#ifdef CONFIG_IPC_CLASSES
> > +DECLARE_STATIC_KEY_FALSE(sched_ipcc);
> > +
> > +static inline bool sched_ipcc_enabled(void)
> > +{
> > + return static_branch_unlikely(&sched_ipcc);
> > +}
> > +
> > +#ifndef arch_has_ipc_classes
> > +/**
> > + * arch_has_ipc_classes() - Check whether hardware supports IPC classes of tasks
> > + *
> > + * Returns: true if IPC classes of tasks are supported.
> > + */
> > +static __always_inline
> > +bool arch_has_ipc_classes(void)
> > +{
> > + return false;
> > +}
> > +#endif
> > +
> > +#ifndef arch_update_ipcc
> > +/**
> > + * arch_update_ipcc() - Update the IPC class of the current task
> > + * @curr: The current task
> > + *
> > + * Request that the IPC classification of @curr is updated.
> > + *
> > + * Returns: none
> > + */
> > +static __always_inline
> > +void arch_update_ipcc(struct task_struct *curr)
> > +{
> > +}
> > +#endif
> > +
> > +#ifndef arch_get_ipcc_score
> > +/**
> > + * arch_get_ipcc_score() - Get the IPC score of a class of task
> > + * @ipcc: The IPC class
> > + * @cpu: A CPU number
> > + *
> > + * Returns the performance score of an IPC class when running on @cpu.
> > + * Error when either @ipcc or @cpu are invalid.
> > + */
> > +static __always_inline
> > +int arch_get_ipcc_score(unsigned short ipcc, int cpu)
> > +{
> > + return 1;
> > +}
> > +#endif
>
> Those interfaces are quite simple and probably work really well with
> your HW/FW. If any other architecture is going to re-use them
> in future, we might face some issues. Let me explain why.
>
> These kernel functions start to be used very early in boot.
> Your HW/FW is probably instantly ready to work from the very
> beginning of boot. What if some other HW needs some
> preparation code, like setting up a communication channel to FW or
> enabling needed clocks/events/etc.?
>
> What I would like to see is a mechanism similar to the one in schedutil.
> The schedutil governor has to wait till cpufreq initializes the cpufreq
> driver and policy objects (which sometimes takes ~2-3 sec). After that,
> the cpufreq framework starts the governor, which populates this hook [1].
> It's based on an RCU mechanism with a function pointer that can then be
> called from the task scheduler when everything is ready to work.
>
> If we (Arm) are going to use your proposed interfaces, we might need a
> different mechanism because the platform likely would be ready only after
> our SCMI FW channels and cpufreq are set up.
>
> Would it be possible to address such a need now, or would I have to
> change that interface code later?
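The late-registration mechanism described above could be sketched roughly as follows. This is a userspace toy (plain assignment stands in for rcu_assign_pointer()/rcu_dereference(), and all names are hypothetical) only to illustrate the shape of such a hook:

```c
/* Rough sketch of a cpufreq_update_util-style late registration: the
 * scheduler calls through a pointer that stays NULL until the platform
 * (e.g. SCMI channels, cpufreq) is ready. In the kernel this would use
 * rcu_assign_pointer()/rcu_dereference(); plain assignment keeps the
 * sketch runnable. All names are hypothetical. */
#include <assert.h>
#include <stddef.h>

struct update_ipcc_data {
	void (*func)(struct update_ipcc_data *data);
	int invoked;
};

static struct update_ipcc_data *ipcc_hook;	/* NULL until platform ready */

/* A driver calls this once its firmware channel is up. */
static void ipcc_add_update_hook(struct update_ipcc_data *data,
				 void (*func)(struct update_ipcc_data *))
{
	data->func = func;
	ipcc_hook = data;	/* kernel: rcu_assign_pointer() */
}

/* Called from hot scheduler paths; a cheap no-op before registration. */
static void ipcc_update(void)
{
	struct update_ipcc_data *data = ipcc_hook; /* kernel: rcu_dereference() */

	if (data)
		data->func(data);
}

static void my_hook(struct update_ipcc_data *data)
{
	data->invoked++;
}
```

The scheduler side stays identical before and after registration; only the driver decides when the hook becomes live.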

Thank you very much for your feedback, Lukasz!

I took a look at the cpufreq implementation you refer to. I can certainly
try to accommodate your requirements. Before jumping into it, I have a few
questions.

I see that cpufreq_update_util() only does something when the per-CPU
pointers cpufreq_update_util_data become non-NULL. I use a static key for
the same purpose. Is this not usable for you?

Indeed, arch_has_ipc_classes() implies that it has to return true very
early after boot when called, as per Ionela's suggestion, from
sched_init_smp(). I can convert this interface to an
arch_enable_ipc_classes() that drivers or preparation code can call when
ready. Would this be acceptable?

Do you think that a hook per CPU would be needed? If unsure, perhaps this
can be left for future work.

Thanks and BR,
Ricardo
>
> [1]
> https://elixir.bootlin.com/linux/latest/source/kernel/sched/cpufreq.c#L29
>

2022-12-20 00:13:17

by Ricardo Neri

[permalink] [raw]
Subject: Re: [PATCH v2 02/22] sched: Add interfaces for IPC classes

On Wed, Dec 14, 2022 at 11:15:56PM +0000, Ionela Voinescu wrote:
> Hi,
>
> On Tuesday 13 Dec 2022 at 16:31:28 (-0800), Ricardo Neri wrote:
> > On Thu, Dec 08, 2022 at 08:48:46AM +0000, Ionela Voinescu wrote:
> > > Hi,
> > >
> > > On Monday 28 Nov 2022 at 05:20:40 (-0800), Ricardo Neri wrote:
> > > [..]
> > > > +#ifndef arch_has_ipc_classes
> > > > +/**
> > > > + * arch_has_ipc_classes() - Check whether hardware supports IPC classes of tasks
> > > > + *
> > > > + * Returns: true if IPC classes of tasks are supported.
> > > > + */
> > > > +static __always_inline
> > > > +bool arch_has_ipc_classes(void)
> > > > +{
> > > > + return false;
> > > > +}
> > > > +#endif
> > > > +
> > > > +#ifndef arch_update_ipcc
> > > > +/**
> > > > + * arch_update_ipcc() - Update the IPC class of the current task
> > > > + * @curr: The current task
> > > > + *
> > > > + * Request that the IPC classification of @curr is updated.
> > > > + *
> > > > + * Returns: none
> > > > + */
> > > > +static __always_inline
> > > > +void arch_update_ipcc(struct task_struct *curr)
> > > > +{
> > > > +}
> > > > +#endif
> > > > +
> > > > +#ifndef arch_get_ipcc_score
> > > > +/**
> > > > + * arch_get_ipcc_score() - Get the IPC score of a class of task
> > > > + * @ipcc: The IPC class
> > > > + * @cpu: A CPU number
> > > > + *
> > > > + * Returns the performance score of an IPC class when running on @cpu.
> > > > + * Error when either @ipcc or @cpu are invalid.
> > > > + */
> > > > +static __always_inline
> > > > +int arch_get_ipcc_score(unsigned short ipcc, int cpu)
> > > > +{
> > > > + return 1;
> > > > +}
> > > > +#endif
> >
> > Thank you very much for your feedback, Ionela!
> >
> > >
> > > The interface looks mostly alright but this arch_get_ipcc_score() leaves
> > > unclear what are the characteristics of the returned value.
> >
> > Fair point. I mean for the return value to be defined by architectures;
> > but yes, architectures need to know how to implement this function.
> >
> > >
> > > Does it have a meaning as an absolute value or is it a value on an
> > > abstract scale? If it should be interpreted as instructions per cycle,
> > > if I wanted to have a proper comparison between the ability of two CPUs
> > > to handle this class of tasks then I would need to take into consideration
> > > the maximum frequency of each CPU.
> >
> > Do you mean when calling arch_get_ipcc_score()? If yes, then I agree, IPC
> > class may not be the only factor, but the criteria for using the return
> > value are up to the caller.
> >
>
> Yes, but if different architectures give different meanings to this score
> (scale, relative difference between two values, etc) while the policies
> are common (uses of arch_get_ipcc_score() in common scheduler paths)
> then the outcome can be vastly different.

One more reason to leave to the caller the handling of the returned value.

>
> If the "criteria to use the returned value is up to the caller", then
> the caller of arch_get_ipcc_score() should always be architecture
> specific code, which currently is not (see 09/22).

Agreed. I now get your point. I'll change my patch accordingly.

>
> > In asym_packing it is assumed that higher-priority CPUs are preferred.
> > When balancing load, IPC class scores are used to select between otherwise
> > identical runqueues. This should also be the case for migrate_misfit: we
> > know already that the tasks being considered do not fit on their current
> > CPU.
> >
> > We would need to think what to do with other type of balancing, if at all.
> >
> > That said, arch_get_ipcc_score() should only return a metric of the
> > instructions-per-*cycle*, independent of frequency, no?
> >
>
> Yes, performance on an abstract scale is preferred here. We would not
> want to have to scale the score by frequency :). It was just an example
> showing that the description of arch_get_ipcc_score() should be clarified.
> Another possible clarification: is it expected that the scores scale
> linearly with performance (does double the score mean double the
> performance?).

Indeed this seems sensible.

>
> > > If it's a performance value on an
> > > abstract scale (more likely), similar to capacity, then it might be good
> > > to better define this abstract scale. That would help with the default
> > > implementation where possibly the best choice for a return value would
> > > be the maximum value on the scale, suggesting equal/maximum performance
> > > for different CPUs handling the class of tasks.
> >
> > I guess something like:
> >
> > #define SCHED_IPCC_DEFAULT_SCALE 1024
> >
> > ?
> >
> > I think I am fine with this value being the default. I also think that it
> > is up to architectures whether to scale all IPC class scores against the
> > best-performing class on the best-performing CPU. Doing so would introduce
> > overhead, especially if hardware updates the IPC class scores multiple
> > times during runtime.
> >
>
> Yes, it's a very good point. Initially I thought that one would need to
> rescale the values anyway for them to make sense relative to each other,
> but I now realise that would not be needed.
>
> Therefore, you are right, to avoid this extra work it's best to leave
> the range of possible score values up to the implementer and not force
> something like [0 - 1024].
>
> But again, this raises the point that if one architecture decides to
> return its scores on a scale [0 - 1024] and possibly use these scores to
> scale utilization/alter capacity, for example, this cannot be a generic
> policy, as not all architectures are guaranteed to use this scale for
> their scores.

Very good point.

>
> So leaving the score unrestricted makes it more difficult to have
> generic policies across architectures that use them.
>

In asym_packing we select the CPU of higher priority, regardless of how big
the priority delta is. IPC classes extend the same mechanism. (We do have
a throughput calculation, but it does not require the IPC class score to be
scaled.)

So yes, IPC classes need to be scaled when combined with another metric.

Another addition to the documentation of the interface? :)
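If the scores do ever need to be combined with another metric, a normalization step along these lines would be required first. The helper below is a hypothetical sketch; SCHED_IPCC_SCALE and ipcc_score_scaled() are illustrative names, not part of this series:

```c
/* Hypothetical sketch: to combine an IPCC score with another metric
 * (e.g. capacity), the raw per-architecture score would first be
 * normalized to a known range, here [0 - 1024] as with other scheduler
 * scales. */
#include <assert.h>

#define SCHED_IPCC_SCALE 1024UL

/*
 * Normalize @score against the best score seen on this system, so that
 * scores from any architecture land on the same [0 - 1024] scale.
 */
static unsigned long ipcc_score_scaled(unsigned long score,
				       unsigned long max_score)
{
	if (!max_score)
		return 0;
	return score * SCHED_IPCC_SCALE / max_score;
}
```

Rescaling like this only happens in policies that mix metrics; comparisons within one system can keep using the raw scores.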

>
> > >
> > > I suppose you avoided returning 0 for the default implementation as you
> > > intend that to mean the inability of the CPU to handle that class of
> > > tasks? It would be good to document this.
> >
> > I meant this to be minimum possible IPC class score for any CPU: any
> > CPU should be able to handle any IPC class. If not implemented, all CPUs
> > handle all IPC classes equally.
> >
>
> Ah, I see. In this case you might as well return 0 in the default
> implementation of arch_get_ipcc_score(). I know it does not matter much
> what gets returned there, but returning a meaningless "1" is strange to
> me :).

Yes, the value does not really matter to my use case, as long as it is the
same for all CPUs. I can use 1024, as with other scheduler metrics.

Thanks and BR,
Ricardo