For a more accurate (i.e. frequency- and cpu-invariant) load-tracking
the task scheduler needs a frequency-scaling and on a heterogeneous
system a cpu-scaling correction factor.
This patch-set implements a Frequency Invariance Engine (FIE)
(topology_get_freq_scale()) in drivers/base/arch_topology.c to provide
a frequency-scaling correction factor.
During the v1 [1] review Viresh Kumar pointed out that such a FIE based
on cpufreq transition notifier will not work with future cpufreq
policies supporting fast frequency switching.
To include support for fast frequency switching policies the FIE
implementation has been changed. Whenever there is a frequency change
cpufreq now calls the arch specific function arch_set_freq_scale() which
has to be implemented by the architecture. In case the arch does not
specify this function FIE support is compiled out of cpufreq.
The advantage is that this would support fast frequency switching since
it does not rely on cpufreq transition (or policy) notifier anymore.
The Cpu Invariance Engine (CIE) (topology_get_cpu_scale()) providing a
cpu-scaling correction factor was already introduced by the "Fix issues
and factorize arm/arm64 capacity information code" patch-set [2].
This patch-set also enables the frequency- and cpu-invariant accounting
support. Enabling here means to associate (wire) the task scheduler
function name arch_scale_freq_capacity and arch_scale_cpu_capacity with
the FIE and CIE function names from drivers/base/arch_topology.c. This
replaces the task scheduler's default FIE and CIE in
kernel/sched/sched.h.
There is an additional patch [10/10] in v2 which allows inlining of the
FIE and CIE into the appropriate task scheduler functions.
+------------------------------+ +------------------------------+
| | | |
| cpufreq: | | arch: |
| | | |
| arch_set_freq_scale() +-----------> topology_set_freq_scale() |
| | | |
+------------------------------+ | |
| |
+------------------------------+ | |
| | | |
| task scheduler: | | |
| | | |
| arch_scale_freq_capacity() +-----------> topology_get_freq_scale() |
| | | |
| arch_scale_cpu_capacity() +-----------> topology_get_cpu_scale() |
| | | |
+------------------------------+ +------------------------------+
Patch high level description:
[ 01/10] Fix to free cpumask cpus_to_visit
[ 02/10] Let cpufreq provide current and max supported frequency for
the set of related cpus
[ 03/10] Frequency Invariance Engine (FIE)
[ 04/10] Connect cpufreq input data to FIE on arm
[05,06/10] Enable frequency- and cpu-invariant accounting support on
arm
[ 07/10] Connect cpufreq input data to FIE on arm64
[08,09/10] Enable frequency- and cpu-invariant accounting support on
arm64
[ 10/10] Allow CIE and FIE inlining
Changes v1->v2:
- Rebase on top of next-20170630
- Add fixup patch to free cpumask cpus_to_visit [01/10]
- Propose solution to support fast frequency switching [02-04,07/10]
- Add patch to allow CIE and FIE inlining [10/10]
The patch-set is based on top of linux-next/master (tag: next-20170630)
and it is also available from:
git://linux-arm.org/linux-de.git upstream/freq_and_cpu_inv_v2
It has been tested on TC2 (arm) and JUNO (arm64) by running a ramp-up
rt-app task pinned to a cpu with the ondemand cpufreq governor and
checking the load-tracking signals of this task.
[1] https://marc.info/?l=linux-kernel&m=149690865010019&w=2
[2] https://marc.info/?l=linux-kernel&m=149625018223002&w=2
Dietmar Eggemann (10):
drivers base/arch_topology: free cpumask cpus_to_visit
cpufreq: provide data for frequency-invariant load-tracking support
drivers base/arch_topology: frequency-invariant load-tracking support
arm: wire cpufreq input data for frequency-invariant accounting up to
the arch
arm: wire frequency-invariant accounting support up to the task
scheduler
arm: wire cpu-invariant accounting support up to the task scheduler
arm64: wire cpufreq input data for frequency-invariant accounting up
to the arch
arm64: wire frequency-invariant accounting support up to the task
scheduler
arm64: wire cpu-invariant accounting support up to the task scheduler
drivers base/arch_topology: inline cpu- and frequency-invariant
accounting
arch/arm/include/asm/topology.h | 11 +++++++++++
arch/arm/kernel/topology.c | 1 -
arch/arm64/include/asm/topology.h | 11 +++++++++++
arch/arm64/kernel/topology.c | 1 -
drivers/base/arch_topology.c | 30 ++++++++++++++++++++++++------
drivers/cpufreq/cpufreq.c | 26 ++++++++++++++++++++++++++
include/linux/arch_topology.h | 20 +++++++++++++++++++-
7 files changed, 91 insertions(+), 9 deletions(-)
--
2.11.0
Free cpumask cpus_to_visit in case registering
init_cpu_capacity_notifier has failed or the parsing of the cpu
capacity-dmips-mhz property is done. The cpumask cpus_to_visit is
only used inside the notifier call init_cpu_capacity_callback.
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Juri Lelli <[email protected]>
Reported-by: Vincent Guittot <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Acked-by: Vincent Guittot <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Reviewed-by: Juri Lelli <[email protected]>
---
drivers/base/arch_topology.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index d1c33a85059e..f4832c662762 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -206,6 +206,8 @@ static struct notifier_block init_cpu_capacity_notifier = {
static int __init register_cpufreq_notifier(void)
{
+ int ret;
+
/*
* on ACPI-based systems we need to use the default cpu capacity
* until we have the necessary code to parse the cpu capacity, so
@@ -221,13 +223,19 @@ static int __init register_cpufreq_notifier(void)
cpumask_copy(cpus_to_visit, cpu_possible_mask);
- return cpufreq_register_notifier(&init_cpu_capacity_notifier,
- CPUFREQ_POLICY_NOTIFIER);
+ ret = cpufreq_register_notifier(&init_cpu_capacity_notifier,
+ CPUFREQ_POLICY_NOTIFIER);
+
+ if (ret)
+ free_cpumask_var(cpus_to_visit);
+
+ return ret;
}
core_initcall(register_cpufreq_notifier);
static void parsing_done_workfn(struct work_struct *work)
{
+ free_cpumask_var(cpus_to_visit);
cpufreq_unregister_notifier(&init_cpu_capacity_notifier,
CPUFREQ_POLICY_NOTIFIER);
}
--
2.11.0
A frequency-invariant load-tracking solution based on cpufreq transition
notifier will not work for future fast frequency switching policies.
That is why a different solution is presented with this patch.
Let cpufreq call the function arch_set_freq_scale() to pass the current
frequency, the max supported frequency and the cpumask of the related
cpus to a consumer (an arch) which defines arch_set_freq_scale().
The consumer has to associate arch_set_freq_scale with the name of its
own implementation foo_set_freq_scale() to overwrite the empty standard
definition in drivers/cpufreq/cpufreq.c.
An arch could do this in one of its arch-specific header files
(e.g. arch/$ARCH/include/asm/topology.h) which gets included in
drivers/cpufreq/cpufreq.c.
In case arch_set_freq_scale() is not defined (and because of the
pr_debug() drivers/cpufreq/cpufreq.c is not compiled with -DDEBUG) the
function cpufreq_set_freq_scale() gets compiled out.
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
drivers/cpufreq/cpufreq.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 9bf97a366029..a04c5886a5ce 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -347,6 +347,28 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
}
}
+/*********************************************************************
+ * FREQUENCY INVARIANT CPU CAPACITY SUPPORT *
+ *********************************************************************/
+
+#ifndef arch_set_freq_scale
+static void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
+ unsigned long max_freq)
+{}
+#endif
+
+static void cpufreq_set_freq_scale(struct cpufreq_policy *policy,
+ struct cpufreq_freqs *freqs)
+{
+ unsigned long cur_freq = freqs ? freqs->new : policy->cur;
+ unsigned long max_freq = policy->cpuinfo.max_freq;
+
+ pr_debug("cpus %*pbl cur/cur max freq %lu/%lu kHz\n",
+ cpumask_pr_args(policy->related_cpus), cur_freq, max_freq);
+
+ arch_set_freq_scale(policy->related_cpus, cur_freq, max_freq);
+}
+
/**
* cpufreq_notify_transition - call notifier chain and adjust_jiffies
* on frequency transition.
@@ -405,6 +427,8 @@ void cpufreq_freq_transition_begin(struct cpufreq_policy *policy,
spin_unlock(&policy->transition_lock);
+ cpufreq_set_freq_scale(policy, freqs);
+
cpufreq_notify_transition(policy, freqs, CPUFREQ_PRECHANGE);
}
EXPORT_SYMBOL_GPL(cpufreq_freq_transition_begin);
@@ -2203,6 +2227,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
CPUFREQ_NOTIFY, new_policy);
+ cpufreq_set_freq_scale(new_policy, NULL);
+
policy->min = new_policy->min;
policy->max = new_policy->max;
--
2.11.0
Commit dfbca41f3479 ("sched: Optimize freq invariant accounting")
changed the wiring which now has to be done by associating
arch_scale_freq_capacity with the actual implementation provided
by the architecture.
Define arch_scale_freq_capacity to use the arch_topology "driver"
function topology_get_freq_scale() for the task scheduler's
frequency-invariant accounting instead of the default
arch_scale_freq_capacity() in kernel/sched/sched.h.
Cc: Russell King <[email protected]>
Cc: Juri Lelli <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Acked-by: Vincent Guittot <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Reviewed-by: Juri Lelli <[email protected]>
---
arch/arm/include/asm/topology.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index ca05d1b90411..57aebfa03e24 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -29,6 +29,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu);
/* Subscribe for input data for frequency-invariant load-tracking */
#define arch_set_freq_scale topology_set_freq_scale
+/* Replace task scheduler's default frequency-invariant accounting */
+#define arch_scale_freq_capacity topology_get_freq_scale
+
#else
static inline void init_cpu_topology(void) { }
--
2.11.0
Define arch_set_freq_scale to be the arch_topology "driver" function
topology_set_freq_scale() to let FIE work correctly.
Cc: Russell King <[email protected]>
Cc: Juri Lelli <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm/include/asm/topology.h | 5 +++++
arch/arm/kernel/topology.c | 1 -
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 370f7a732900..ca05d1b90411 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -24,6 +24,11 @@ void init_cpu_topology(void);
void store_cpu_topology(unsigned int cpuid);
const struct cpumask *cpu_coregroup_mask(int cpu);
+#include <linux/arch_topology.h>
+
+/* Subscribe for input data for frequency-invariant load-tracking */
+#define arch_set_freq_scale topology_set_freq_scale
+
#else
static inline void init_cpu_topology(void) { }
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index bf949a763dbe..2c47a76c67b0 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -11,7 +11,6 @@
* for more details.
*/
-#include <linux/arch_topology.h>
#include <linux/cpu.h>
#include <linux/cpufreq.h>
#include <linux/cpumask.h>
--
2.11.0
To speed up the cpu- and frequency-invariant accounting of the task
scheduler make sure that the CIE (topology_get_cpu_scale()) and FIE
(topology_get_freq_scale() get completely inlined into the task
scheduler consumer functions (e.g. __update_load_avg_se()).
This patch-set changes the interface for CIE and FIE from:
drivers/base/arch_topology.c:
static DEFINE_PER_CPU(unsigned long, item);
unsigned long topology_get_item_scale(...)
{
return per_cpu(item, cpu)
}
include/linux/arch_topology.h:
unsigned long topology_get_item_scale(...);
to:
drivers/base/arch_topology.c:
DEFINE_PER_CPU(unsigned long, item);
include/linux/arch_topology.h:
DECLARE_PER_CPU(unsigned long, item);
static inline
unsigned long topology_get_item_scale(...)
{
return per_cpu(item, cpu)
}
An uplift in performance could be detected running the kernel with the
following test patch on top (on JUNO R0 (arm64)):
@@ -2812,10 +2812,18 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
unsigned long scale_freq, scale_cpu;
u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
u64 periods;
+ u64 t1, t2;
+
+ t1 = sched_clock_cpu(cpu);
scale_freq = arch_scale_freq_capacity(NULL, cpu);
scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+ t2 = sched_clock_cpu(cpu);
+
+ trace_printk("cpu=%d t1=%llu t2=%llu diff=%llu\n",
+ cpu, t1, t2, t2 - t1);
+
delta += sa->period_contrib;
periods = delta / 1024; /* A period is * 1024us * (~1ms) */
The following test results (3 test runs each) have been obtained by
tracing this trace printk (diff=x) for Cortex A-53 (LITTLE) and Cortex
A-57 (big) cpus w/ (inline) and w/o (non-inline) this patch.
mean max min
A-57 inline:
119.6 300 60
96.8 280 60
110.2 660 60
A-57 non-inline:
142.8 460 80
157.6 680 80
153.4 720 80
A-53 inline:
141.6 360 100
118.8 500 100
148.6 380 100
A-53 non-inline:
293 840 120
253.2 840 120
299.6 1060 140
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Juri Lelli <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
drivers/base/arch_topology.c | 14 ++------------
include/linux/arch_topology.h | 15 +++++++++++++--
2 files changed, 15 insertions(+), 14 deletions(-)
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 63fb3f945d21..b4481cff14bf 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -22,12 +22,7 @@
#include <linux/string.h>
#include <linux/sched/topology.h>
-static DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
-
-unsigned long topology_get_freq_scale(struct sched_domain *sd, int cpu)
-{
- return per_cpu(freq_scale, cpu);
-}
+DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
void topology_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
unsigned long max_freq)
@@ -43,12 +38,7 @@ void topology_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
static DEFINE_MUTEX(cpu_scale_mutex);
-static DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
-
-unsigned long topology_get_cpu_scale(struct sched_domain *sd, int cpu)
-{
- return per_cpu(cpu_scale, cpu);
-}
+DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
{
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 168104d2d2cf..361e85a30151 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -11,12 +11,23 @@ void topology_normalize_cpu_scale(void);
struct device_node;
int topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu);
+DECLARE_PER_CPU(unsigned long, cpu_scale);
+DECLARE_PER_CPU(unsigned long, freq_scale);
+
struct sched_domain;
-unsigned long topology_get_cpu_scale(struct sched_domain *sd, int cpu);
+static inline
+unsigned long topology_get_cpu_scale(struct sched_domain *sd, int cpu)
+{
+ return per_cpu(cpu_scale, cpu);
+}
void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity);
-unsigned long topology_get_freq_scale(struct sched_domain *sd, int cpu);
+static inline
+unsigned long topology_get_freq_scale(struct sched_domain *sd, int cpu)
+{
+ return per_cpu(freq_scale, cpu);
+}
void topology_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
unsigned long max_freq);
--
2.11.0
Commit dfbca41f3479 ("sched: Optimize freq invariant accounting")
changed the wiring which now has to be done by associating
arch_scale_freq_capacity with the actual implementation provided
by the architecture.
Define arch_scale_freq_capacity to use the arch_topology "driver"
function topology_get_freq_scale() for the task scheduler's
frequency-invariant accounting instead of the default
arch_scale_freq_capacity() in kernel/sched/sched.h.
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Juri Lelli <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Acked-by: Vincent Guittot <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Reviewed-by: Juri Lelli <[email protected]>
---
arch/arm64/include/asm/topology.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index d9cbb289e295..4fd598b432e6 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -37,6 +37,9 @@ int pcibus_to_node(struct pci_bus *bus);
/* Subscribe for input data for frequency-invariant load-tracking */
#define arch_set_freq_scale topology_set_freq_scale
+/* Replace task scheduler's default frequency-invariant accounting */
+#define arch_scale_freq_capacity topology_get_freq_scale
+
#include <asm-generic/topology.h>
#endif /* _ASM_ARM_TOPOLOGY_H */
--
2.11.0
Commit 8cd5601c5060 ("sched/fair: Convert arch_scale_cpu_capacity() from
weak function to #define") changed the wiring which now has to be done
by associating arch_scale_cpu_capacity with the actual implementation
provided by the architecture.
Define arch_scale_cpu_capacity to use the arch_topology "driver"
function topology_get_cpu_scale() for the task scheduler's cpu-invariant
accounting instead of the default arch_scale_cpu_capacity() in
kernel/sched/sched.h.
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Juri Lelli <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Acked-by: Vincent Guittot <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Reviewed-by: Juri Lelli <[email protected]>
---
arch/arm64/include/asm/topology.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index 4fd598b432e6..0dc81860cc0d 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -40,6 +40,9 @@ int pcibus_to_node(struct pci_bus *bus);
/* Replace task scheduler's default frequency-invariant accounting */
#define arch_scale_freq_capacity topology_get_freq_scale
+/* Replace task scheduler's default cpu-invariant accounting */
+#define arch_scale_cpu_capacity topology_get_cpu_scale
+
#include <asm-generic/topology.h>
#endif /* _ASM_ARM_TOPOLOGY_H */
--
2.11.0
Commit 8cd5601c5060 ("sched/fair: Convert arch_scale_cpu_capacity() from
weak function to #define") changed the wiring which now has to be done
by associating arch_scale_cpu_capacity with the actual implementation
provided by the architecture.
Define arch_scale_cpu_capacity to use the arch_topology "driver"
function topology_get_cpu_scale() for the task scheduler's cpu-invariant
accounting instead of the default arch_scale_cpu_capacity() in
kernel/sched/sched.h.
Cc: Russell King <[email protected]>
Cc: Juri Lelli <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Acked-by: Vincent Guittot <[email protected]>
Tested-by: Juri Lelli <[email protected]>
Reviewed-by: Juri Lelli <[email protected]>
---
arch/arm/include/asm/topology.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 57aebfa03e24..26d491c65955 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -32,6 +32,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu);
/* Replace task scheduler's default frequency-invariant accounting */
#define arch_scale_freq_capacity topology_get_freq_scale
+/* Replace task scheduler's default cpu-invariant accounting */
+#define arch_scale_cpu_capacity topology_get_cpu_scale
+
#else
static inline void init_cpu_topology(void) { }
--
2.11.0
Define arch_set_freq_scale to be the arch_topology "driver" function
topology_set_freq_scale() to let FIE work correctly.
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Juri Lelli <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
arch/arm64/include/asm/topology.h | 5 +++++
arch/arm64/kernel/topology.c | 1 -
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index 8b57339823e9..d9cbb289e295 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -32,6 +32,11 @@ int pcibus_to_node(struct pci_bus *bus);
#endif /* CONFIG_NUMA */
+#include <linux/arch_topology.h>
+
+/* Subscribe for input data for frequency-invariant load-tracking */
+#define arch_set_freq_scale topology_set_freq_scale
+
#include <asm-generic/topology.h>
#endif /* _ASM_ARM_TOPOLOGY_H */
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 79244c75eaec..6cbb6315e493 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -11,7 +11,6 @@
* for more details.
*/
-#include <linux/arch_topology.h>
#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/init.h>
--
2.11.0
Implements an arch-specific frequency-scaling function
topology_get_freq_scale() which provides the following frequency
scaling factor:
current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_supported_freq(cpu)
One possible consumer of this is the Per-Entity Load Tracking (PELT)
mechanism of the task scheduler.
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Juri Lelli <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
drivers/base/arch_topology.c | 20 ++++++++++++++++++++
include/linux/arch_topology.h | 7 +++++++
2 files changed, 27 insertions(+)
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index f4832c662762..63fb3f945d21 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -22,6 +22,26 @@
#include <linux/string.h>
#include <linux/sched/topology.h>
+static DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
+
+unsigned long topology_get_freq_scale(struct sched_domain *sd, int cpu)
+{
+ return per_cpu(freq_scale, cpu);
+}
+
+void topology_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
+ unsigned long max_freq)
+{
+ unsigned long scale;
+ int i;
+
+ scale = (cur_freq << SCHED_CAPACITY_SHIFT) / max_freq;
+
+ for_each_cpu(i, cpus)
+ per_cpu(freq_scale, i) = scale;
+}
+
+
static DEFINE_MUTEX(cpu_scale_mutex);
static DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 9af3c174c03a..168104d2d2cf 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -4,6 +4,8 @@
#ifndef _LINUX_ARCH_TOPOLOGY_H_
#define _LINUX_ARCH_TOPOLOGY_H_
+#include <linux/cpumask.h>
+
void topology_normalize_cpu_scale(void);
struct device_node;
@@ -14,4 +16,9 @@ unsigned long topology_get_cpu_scale(struct sched_domain *sd, int cpu);
void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity);
+unsigned long topology_get_freq_scale(struct sched_domain *sd, int cpu);
+
+void topology_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
+ unsigned long max_freq);
+
#endif /* _LINUX_ARCH_TOPOLOGY_H_ */
--
2.11.0
On 06-07-17, 10:49, Dietmar Eggemann wrote:
> Free cpumask cpus_to_visit in case registering
> init_cpu_capacity_notifier has failed or the parsing of the cpu
> capacity-dmips-mhz property is done. The cpumask cpus_to_visit is
> only used inside the notifier call init_cpu_capacity_callback.
>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Reported-by: Vincent Guittot <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> Acked-by: Vincent Guittot <[email protected]>
> Tested-by: Juri Lelli <[email protected]>
> Reviewed-by: Juri Lelli <[email protected]>
> ---
> drivers/base/arch_topology.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index d1c33a85059e..f4832c662762 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -206,6 +206,8 @@ static struct notifier_block init_cpu_capacity_notifier = {
>
> static int __init register_cpufreq_notifier(void)
> {
> + int ret;
> +
> /*
> * on ACPI-based systems we need to use the default cpu capacity
> * until we have the necessary code to parse the cpu capacity, so
> @@ -221,13 +223,19 @@ static int __init register_cpufreq_notifier(void)
>
> cpumask_copy(cpus_to_visit, cpu_possible_mask);
>
> - return cpufreq_register_notifier(&init_cpu_capacity_notifier,
> - CPUFREQ_POLICY_NOTIFIER);
> + ret = cpufreq_register_notifier(&init_cpu_capacity_notifier,
> + CPUFREQ_POLICY_NOTIFIER);
> +
> + if (ret)
> + free_cpumask_var(cpus_to_visit);
> +
> + return ret;
> }
> core_initcall(register_cpufreq_notifier);
>
> static void parsing_done_workfn(struct work_struct *work)
> {
> + free_cpumask_var(cpus_to_visit);
> cpufreq_unregister_notifier(&init_cpu_capacity_notifier,
> CPUFREQ_POLICY_NOTIFIER);
As a general rule (and good coding practice), it is better to free resources
only after the users are gone. And so we should have changed the order here.
i.e. Unregister the notifier first and then free the cpumask.
And because of that we may end up crashing the kernel here.
Here is an example:
Consider that init_cpu_capacity_callback() is getting called concurrently on big
and LITTLE CPUs.
CPU0 (big) CPU4 (LITTLE)
if (cap_parsing_failed || cap_parsing_done)
return 0;
cap_parsing_done = true;
schedule_work(&parsing_done_work);
parsing_done_workfn(work)
-> free_cpumask_var(cpus_to_visit);
-> cpufreq_unregister_notifier()
switch (val) {
...
/* Touch cpus_to_visit and crash */
My assumption here is that the same notifier head can get called in parallel on
two CPUs as all I see there is a down_read() in __blocking_notifier_call_chain()
which shouldn't block parallel calls.
Maybe I am wrong :(
--
viresh
On 06-07-17, 10:49, Dietmar Eggemann wrote:
> A frequency-invariant load-tracking solution based on cpufreq transition
> notifier will not work for future fast frequency switching policies.
> That is why a different solution is presented with this patch.
>
> Let cpufreq call the function arch_set_freq_scale() to pass the current
> frequency, the max supported frequency and the cpumask of the related
> cpus to a consumer (an arch) which defines arch_set_freq_scale().
>
> The consumer has to associate arch_set_freq_scale with the name of its
> own implementation foo_set_freq_scale() to overwrite the empty standard
> definition in drivers/cpufreq/cpufreq.c.
> An arch could do this in one of its arch-specific header files
> (e.g. arch/$ARCH/include/asm/topology.h) which gets included in
> drivers/cpufreq/cpufreq.c.
>
> In case arch_set_freq_scale() is not defined (and because of the
> pr_debug() drivers/cpufreq/cpufreq.c is not compiled with -DDEBUG)
The line within () needs to be improved to convey a clear message.
> the
> function cpufreq_set_freq_scale() gets compiled out.
>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> drivers/cpufreq/cpufreq.c | 26 ++++++++++++++++++++++++++
> 1 file changed, 26 insertions(+)
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 9bf97a366029..a04c5886a5ce 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -347,6 +347,28 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
> }
> }
>
> +/*********************************************************************
> + * FREQUENCY INVARIANT CPU CAPACITY SUPPORT *
> + *********************************************************************/
> +
> +#ifndef arch_set_freq_scale
> +static void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
> + unsigned long max_freq)
> +{}
> +#endif
> +
> +static void cpufreq_set_freq_scale(struct cpufreq_policy *policy,
> + struct cpufreq_freqs *freqs)
> +{
> + unsigned long cur_freq = freqs ? freqs->new : policy->cur;
> + unsigned long max_freq = policy->cpuinfo.max_freq;
> +
> + pr_debug("cpus %*pbl cur/cur max freq %lu/%lu kHz\n",
> + cpumask_pr_args(policy->related_cpus), cur_freq, max_freq);
> +
> + arch_set_freq_scale(policy->related_cpus, cur_freq, max_freq);
I am not sure why all these are required to be sent here and will come back to
it later on after going through other patches.
> +}
> +
> /**
> * cpufreq_notify_transition - call notifier chain and adjust_jiffies
> * on frequency transition.
> @@ -405,6 +427,8 @@ void cpufreq_freq_transition_begin(struct cpufreq_policy *policy,
>
> spin_unlock(&policy->transition_lock);
>
> + cpufreq_set_freq_scale(policy, freqs);
> +
Why do this before even changing the frequency ? We may fail while changing it.
IMHO, you should call this routine whenever we update policy->cur and that
happens regularly in __cpufreq_notify_transition() and few other places..
> cpufreq_notify_transition(policy, freqs, CPUFREQ_PRECHANGE);
> }
> EXPORT_SYMBOL_GPL(cpufreq_freq_transition_begin);
> @@ -2203,6 +2227,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
> blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
> CPUFREQ_NOTIFY, new_policy);
>
> + cpufreq_set_freq_scale(new_policy, NULL);
Why added it here ? To get it initialized ? If yes, then we should do that in
cpufreq_online() where we first initialize policy->cur.
Apart from this, you also need to update this in the schedutil governor (if you
haven't done that in this series later) as that also updates policy->cur in the
fast path.
--
viresh
On 06-07-17, 10:49, Dietmar Eggemann wrote:
> Define arch_set_freq_scale to be the arch_topology "driver" function
> topology_set_freq_scale() to let FIE work correctly.
>
> Cc: Russell King <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> arch/arm/include/asm/topology.h | 5 +++++
> arch/arm/kernel/topology.c | 1 -
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> index 370f7a732900..ca05d1b90411 100644
> --- a/arch/arm/include/asm/topology.h
> +++ b/arch/arm/include/asm/topology.h
> @@ -24,6 +24,11 @@ void init_cpu_topology(void);
> void store_cpu_topology(unsigned int cpuid);
> const struct cpumask *cpu_coregroup_mask(int cpu);
>
> +#include <linux/arch_topology.h>
> +
> +/* Subscribe for input data for frequency-invariant load-tracking */
> +#define arch_set_freq_scale topology_set_freq_scale
> +
> #else
>
> static inline void init_cpu_topology(void) { }
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index bf949a763dbe..2c47a76c67b0 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -11,7 +11,6 @@
> * for more details.
> */
>
> -#include <linux/arch_topology.h>
Why is this diff part of this patch ?
--
viresh
On 06-07-17, 10:49, Dietmar Eggemann wrote:
> Implements an arch-specific frequency-scaling function
> topology_get_freq_scale() which provides the following frequency
> scaling factor:
>
> current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_supported_freq(cpu)
>
> One possible consumer of this is the Per-Entity Load Tracking (PELT)
> mechanism of the task scheduler.
>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> drivers/base/arch_topology.c | 20 ++++++++++++++++++++
> include/linux/arch_topology.h | 7 +++++++
> 2 files changed, 27 insertions(+)
>
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index f4832c662762..63fb3f945d21 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -22,6 +22,26 @@
> #include <linux/string.h>
> #include <linux/sched/topology.h>
>
> +static DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
> +
> +unsigned long topology_get_freq_scale(struct sched_domain *sd, int cpu)
> +{
> + return per_cpu(freq_scale, cpu);
> +}
> +
> +void topology_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
> + unsigned long max_freq)
> +{
> + unsigned long scale;
> + int i;
> +
> + scale = (cur_freq << SCHED_CAPACITY_SHIFT) / max_freq;
> +
> + for_each_cpu(i, cpus)
> + per_cpu(freq_scale, i) = scale;
> +}
> +
> +
> static DEFINE_MUTEX(cpu_scale_mutex);
> static DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
>
> diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
> index 9af3c174c03a..168104d2d2cf 100644
> --- a/include/linux/arch_topology.h
> +++ b/include/linux/arch_topology.h
> @@ -4,6 +4,8 @@
> #ifndef _LINUX_ARCH_TOPOLOGY_H_
> #define _LINUX_ARCH_TOPOLOGY_H_
>
> +#include <linux/cpumask.h>
> +
You don't need a full include here, instead following will work pretty well.
struct cpumask;
> void topology_normalize_cpu_scale(void);
>
> struct device_node;
> @@ -14,4 +16,9 @@ unsigned long topology_get_cpu_scale(struct sched_domain *sd, int cpu);
>
> void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity);
>
> +unsigned long topology_get_freq_scale(struct sched_domain *sd, int cpu);
> +
> +void topology_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
> + unsigned long max_freq);
> +
> #endif /* _LINUX_ARCH_TOPOLOGY_H_ */
Apart from that:
Reviewed-by: Viresh Kumar <[email protected]>
--
viresh
On 06-07-17, 10:49, Dietmar Eggemann wrote:
> Commit dfbca41f3479 ("sched: Optimize freq invariant accounting")
> changed the wiring which now has to be done by associating
> arch_scale_freq_capacity with the actual implementation provided
> by the architecture.
>
> Define arch_scale_freq_capacity to use the arch_topology "driver"
> function topology_get_freq_scale() for the task scheduler's
> frequency-invariant accounting instead of the default
> arch_scale_freq_capacity() in kernel/sched/sched.h.
>
> Cc: Russell King <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> Acked-by: Vincent Guittot <[email protected]>
> Tested-by: Juri Lelli <[email protected]>
> Reviewed-by: Juri Lelli <[email protected]>
> ---
> arch/arm/include/asm/topology.h | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> index ca05d1b90411..57aebfa03e24 100644
> --- a/arch/arm/include/asm/topology.h
> +++ b/arch/arm/include/asm/topology.h
> @@ -29,6 +29,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu);
> /* Subscribe for input data for frequency-invariant load-tracking */
> #define arch_set_freq_scale topology_set_freq_scale
>
> +/* Replace task scheduler's default frequency-invariant accounting */
> +#define arch_scale_freq_capacity topology_get_freq_scale
> +
> #else
>
> static inline void init_cpu_topology(void) { }
Reviewed-by: Viresh Kumar <[email protected]>
--
viresh
On 06-07-17, 10:49, Dietmar Eggemann wrote:
> Commit 8cd5601c5060 ("sched/fair: Convert arch_scale_cpu_capacity() from
> weak function to #define") changed the wiring which now has to be done
> by associating arch_scale_cpu_capacity with the actual implementation
> provided by the architecture.
>
> Define arch_scale_cpu_capacity to use the arch_topology "driver"
> function topology_get_cpu_scale() for the task scheduler's cpu-invariant
> accounting instead of the default arch_scale_cpu_capacity() in
> kernel/sched/sched.h.
>
> Cc: Russell King <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> Acked-by: Vincent Guittot <[email protected]>
> Tested-by: Juri Lelli <[email protected]>
> Reviewed-by: Juri Lelli <[email protected]>
> ---
> arch/arm/include/asm/topology.h | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> index 57aebfa03e24..26d491c65955 100644
> --- a/arch/arm/include/asm/topology.h
> +++ b/arch/arm/include/asm/topology.h
> @@ -32,6 +32,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu);
> /* Replace task scheduler's default frequency-invariant accounting */
> #define arch_scale_freq_capacity topology_get_freq_scale
>
> +/* Replace task scheduler's default cpu-invariant accounting */
> +#define arch_scale_cpu_capacity topology_get_cpu_scale
> +
> #else
>
> static inline void init_cpu_topology(void) { }
Reviewed-by: Viresh Kumar <[email protected]>
--
viresh
On 06-07-17, 10:49, Dietmar Eggemann wrote:
> Define arch_set_freq_scale to be the arch_topology "driver" function
> topology_set_freq_scale() to let FIE work correctly.
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> arch/arm64/include/asm/topology.h | 5 +++++
> arch/arm64/kernel/topology.c | 1 -
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
> index 8b57339823e9..d9cbb289e295 100644
> --- a/arch/arm64/include/asm/topology.h
> +++ b/arch/arm64/include/asm/topology.h
> @@ -32,6 +32,11 @@ int pcibus_to_node(struct pci_bus *bus);
>
> #endif /* CONFIG_NUMA */
>
> +#include <linux/arch_topology.h>
> +
> +/* Subscribe for input data for frequency-invariant load-tracking */
> +#define arch_set_freq_scale topology_set_freq_scale
> +
> #include <asm-generic/topology.h>
Reviewed-by: Viresh Kumar <[email protected]>
>
> #endif /* _ASM_ARM_TOPOLOGY_H */
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index 79244c75eaec..6cbb6315e493 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -11,7 +11,6 @@
> * for more details.
> */
>
> -#include <linux/arch_topology.h>
??
--
viresh
On 06-07-17, 10:49, Dietmar Eggemann wrote:
> Commit dfbca41f3479 ("sched: Optimize freq invariant accounting")
> changed the wiring which now has to be done by associating
> arch_scale_freq_capacity with the actual implementation provided
> by the architecture.
>
> Define arch_scale_freq_capacity to use the arch_topology "driver"
> function topology_get_freq_scale() for the task scheduler's
> frequency-invariant accounting instead of the default
> arch_scale_freq_capacity() in kernel/sched/sched.h.
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> Acked-by: Catalin Marinas <[email protected]>
> Acked-by: Vincent Guittot <[email protected]>
> Tested-by: Juri Lelli <[email protected]>
> Reviewed-by: Juri Lelli <[email protected]>
> ---
> arch/arm64/include/asm/topology.h | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
> index d9cbb289e295..4fd598b432e6 100644
> --- a/arch/arm64/include/asm/topology.h
> +++ b/arch/arm64/include/asm/topology.h
> @@ -37,6 +37,9 @@ int pcibus_to_node(struct pci_bus *bus);
> /* Subscribe for input data for frequency-invariant load-tracking */
> #define arch_set_freq_scale topology_set_freq_scale
>
> +/* Replace task scheduler's default frequency-invariant accounting */
> +#define arch_scale_freq_capacity topology_get_freq_scale
> +
> #include <asm-generic/topology.h>
>
> #endif /* _ASM_ARM_TOPOLOGY_H */
Reviewed-by: Viresh Kumar <[email protected]>
--
viresh
On 06-07-17, 10:49, Dietmar Eggemann wrote:
> Commit 8cd5601c5060 ("sched/fair: Convert arch_scale_cpu_capacity() from
> weak function to #define") changed the wiring which now has to be done
> by associating arch_scale_cpu_capacity with the actual implementation
> provided by the architecture.
>
> Define arch_scale_cpu_capacity to use the arch_topology "driver"
> function topology_get_cpu_scale() for the task scheduler's cpu-invariant
> accounting instead of the default arch_scale_cpu_capacity() in
> kernel/sched/sched.h.
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> Acked-by: Catalin Marinas <[email protected]>
> Acked-by: Vincent Guittot <[email protected]>
> Tested-by: Juri Lelli <[email protected]>
> Reviewed-by: Juri Lelli <[email protected]>
> ---
> arch/arm64/include/asm/topology.h | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
> index 4fd598b432e6..0dc81860cc0d 100644
> --- a/arch/arm64/include/asm/topology.h
> +++ b/arch/arm64/include/asm/topology.h
> @@ -40,6 +40,9 @@ int pcibus_to_node(struct pci_bus *bus);
> /* Replace task scheduler's default frequency-invariant accounting */
> #define arch_scale_freq_capacity topology_get_freq_scale
>
> +/* Replace task scheduler's default cpu-invariant accounting */
> +#define arch_scale_cpu_capacity topology_get_cpu_scale
> +
> #include <asm-generic/topology.h>
>
> #endif /* _ASM_ARM_TOPOLOGY_H */
Reviewed-by: Viresh Kumar <[email protected]>
--
viresh
Sure this patch looks pretty useful, but ...
On 06-07-17, 10:49, Dietmar Eggemann wrote:
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 63fb3f945d21..b4481cff14bf 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -22,12 +22,7 @@
> #include <linux/string.h>
> #include <linux/sched/topology.h>
>
> -static DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
> -
> -unsigned long topology_get_freq_scale(struct sched_domain *sd, int cpu)
> -{
> - return per_cpu(freq_scale, cpu);
> -}
> +DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
... you just undo what you did earlier in this series, and that is somewhat
discouraged.
What about making this as the first patch of the series and move only the below
part to the header. And then you can add the above part to the right place in
the first attempt itself?
But maybe this is all okay :)
--
viresh
Hi Viresh,
On 06/07/17 15:52, Viresh Kumar wrote:
> On 06-07-17, 10:49, Dietmar Eggemann wrote:
[...]
> > static void parsing_done_workfn(struct work_struct *work)
> > {
> > + free_cpumask_var(cpus_to_visit);
> > cpufreq_unregister_notifier(&init_cpu_capacity_notifier,
> > CPUFREQ_POLICY_NOTIFIER);
>
> As a general rule (and good coding practice), it is better to free resources
> only after the users are gone. And so we should have changed the order here.
> i.e. Unregister the notifier first and then free the cpumask.
>
> And because of that we may end up crashing the kernel here.
>
> Here is an example:
>
> Consider that init_cpu_capacity_callback() is getting called concurrently on big
> and LITTLE CPUs.
>
>
> CPU0 (big) CPU4 (LITTLE)
>
> if (cap_parsing_failed || cap_parsing_done)
> return 0;
>
But, in this case the policy notifier for LITTLE cluster has not been
executed yet, so the domain's CPUs have not yet been cleared out from
cpus_to_visit. CPU0 won't see the mask as empty then, right?
> cap_parsing_done = true;
> schedule_work(&parsing_done_work);
>
> parsing_done_workfn(work)
> -> free_cpumask_var(cpus_to_visit);
> -> cpufreq_unregister_notifier()
>
>
> switch (val) {
> ...
> /* Touch cpus_to_visit and crash */
>
>
> My assumption here is that the same notifier head can get called in parallel on
> two CPUs as all I see there is a down_read() in __blocking_notifier_call_chain()
> which shouldn't block parallel calls.
>
If that's the case I'm wondering however if we need explicit
synchronization though. Otherwise both threads can read the mask as
full, clear only their bits and not schedule the workfn?
But, can the policies be concurrently initialized? Or is the
initialization process serialized or the different domains?
Thanks,
- Juri
On 06-07-17, 11:59, Juri Lelli wrote:
> On 06/07/17 15:52, Viresh Kumar wrote:
> > CPU0 (big) CPU4 (LITTLE)
> >
> > if (cap_parsing_failed || cap_parsing_done)
> > return 0;
> >
>
> But, in this case the policy notifier for LITTLE cluster has not been
> executed yet,
Not necessarily. The cpufreq notifier with CPUFREQ_NOTIFY event can get called
again and again (as soon as the policy is changed, for example min/max changed
from sysfs). And so it is possible that the LITTLE cpus are already cleared from
the mask.
> so the domain's CPUs have not yet been cleared out from
> cpus_to_visit. CPU0 won't see the mask as empty then, right?
And so it can.
> > cap_parsing_done = true;
> > schedule_work(&parsing_done_work);
> >
> > parsing_done_workfn(work)
> > -> free_cpumask_var(cpus_to_visit);
> > -> cpufreq_unregister_notifier()
> >
> >
> > switch (val) {
> > ...
> > /* Touch cpus_to_visit and crash */
> >
> >
> > My assumption here is that the same notifier head can get called in parallel on
> > two CPUs as all I see there is a down_read() in __blocking_notifier_call_chain()
> > which shouldn't block parallel calls.
> >
>
> If that's the case I'm wondering however if we need explicit
> synchronization though. Otherwise both threads can read the mask as
> full, clear only their bits and not schedule the workfn?
Maybe not as the policies are created one by one only, not concurrently.
> But, can the policies be concurrently initialized? Or is the
> initialization process serialized or the different domains?
There can be complex cases here. For example consider this.
Only the little CPUs are brought online at boot. Their policy is set and they
are cleared from the cpus_to_visit mask. Now we try to bring any big CPU online
and at the same time try changing min/max from sysfs for the LITTLE CPU policy.
The notifier may get called concurrently here I believe and cause the problem I
mentioned earlier.
--
viresh
On Thursday, July 06, 2017 04:10:27 PM Viresh Kumar wrote:
> On 06-07-17, 10:49, Dietmar Eggemann wrote:
> > A frequency-invariant load-tracking solution based on cpufreq transition
> > notifier will not work for future fast frequency switching policies.
> > That is why a different solution is presented with this patch.
> >
> > Let cpufreq call the function arch_set_freq_scale() to pass the current
> > frequency, the max supported frequency and the cpumask of the related
> > cpus to a consumer (an arch) which defines arch_set_freq_scale().
> >
> > The consumer has to associate arch_set_freq_scale with the name of its
> > own implementation foo_set_freq_scale() to overwrite the empty standard
> > definition in drivers/cpufreq/cpufreq.c.
> > An arch could do this in one of its arch-specific header files
> > (e.g. arch/$ARCH/include/asm/topology.h) which gets included in
> > drivers/cpufreq/cpufreq.c.
> >
> > In case arch_set_freq_scale() is not defined (and because of the
> > pr_debug() drivers/cpufreq/cpufreq.c is not compiled with -DDEBUG)
>
> The line within () needs to be improved to convey a clear message.
>
> > the
> > function cpufreq_set_freq_scale() gets compiled out.
> >
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Viresh Kumar <[email protected]>
> > Signed-off-by: Dietmar Eggemann <[email protected]>
> > ---
> > drivers/cpufreq/cpufreq.c | 26 ++++++++++++++++++++++++++
> > 1 file changed, 26 insertions(+)
> >
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index 9bf97a366029..a04c5886a5ce 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -347,6 +347,28 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
> > }
> > }
> >
> > +/*********************************************************************
> > + * FREQUENCY INVARIANT CPU CAPACITY SUPPORT *
> > + *********************************************************************/
> > +
> > +#ifndef arch_set_freq_scale
> > +static void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
> > + unsigned long max_freq)
> > +{}
> > +#endif
> > +
> > +static void cpufreq_set_freq_scale(struct cpufreq_policy *policy,
> > + struct cpufreq_freqs *freqs)
> > +{
> > + unsigned long cur_freq = freqs ? freqs->new : policy->cur;
> > + unsigned long max_freq = policy->cpuinfo.max_freq;
> > +
> > + pr_debug("cpus %*pbl cur/cur max freq %lu/%lu kHz\n",
> > + cpumask_pr_args(policy->related_cpus), cur_freq, max_freq);
> > +
> > + arch_set_freq_scale(policy->related_cpus, cur_freq, max_freq);
>
> I am not sure why all these are required to be sent here and will come back to
> it later on after going through other patches.
>
> > +}
> > +
> > /**
> > * cpufreq_notify_transition - call notifier chain and adjust_jiffies
> > * on frequency transition.
> > @@ -405,6 +427,8 @@ void cpufreq_freq_transition_begin(struct cpufreq_policy *policy,
> >
> > spin_unlock(&policy->transition_lock);
> >
> > + cpufreq_set_freq_scale(policy, freqs);
> > +
>
> Why do this before even changing the frequency ? We may fail while changing it.
>
> IMHO, you should call this routine whenever we update policy->cur and that
> happens regularly in __cpufreq_notify_transition() and few other places..
There seems to be a general problem with doing this in the core with respect to
things like intel_pstate that use their own governor callbacks and don't invoke
cpufreq_freq_transition_begin() then.
Thanks,
Rafael
On 06/07/17 12:15, Viresh Kumar wrote:
> On 06-07-17, 11:59, Juri Lelli wrote:
>> On 06/07/17 15:52, Viresh Kumar wrote:
[...]
>>
>> If that's the case I'm wondering however if we need explicit
>> synchronization though. Otherwise both threads can read the mask as
>> full, clear only their bits and not schedule the workfn?
>
> Maybe not as the policies are created one by one only, not concurrently.
>
>> But, can the policies be concurrently initialized? Or is the
>> initialization process serialized or the different domains?
>
> There can be complex cases here. For example consider this.
>
> Only the little CPUs are brought online at boot. Their policy is set and they
> are cleared from the cpus_to_visit mask. Now we try to bring any big CPU online
> and at the same time try changing min/max from sysfs for the LITTLE CPU policy.
>
> The notifier may get called concurrently here I believe and cause the problem I
> mentioned earlier.
>
I chatted with Juri and your proposed fix to do the unregister before
the free makes sense to us. Thanks for spotting this! I will change in
the next version.
On 06/07/17 11:40, Viresh Kumar wrote:
> On 06-07-17, 10:49, Dietmar Eggemann wrote:
[...]
>> In case arch_set_freq_scale() is not defined (and because of the
>> pr_debug() drivers/cpufreq/cpufreq.c is not compiled with -DDEBUG)
>
> The line within () needs to be improved to convey a clear message.
Probably not needed anymore. See below.
[...]
>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>> index 9bf97a366029..a04c5886a5ce 100644
>> --- a/drivers/cpufreq/cpufreq.c
>> +++ b/drivers/cpufreq/cpufreq.c
>> @@ -347,6 +347,28 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
>> }
>> }
>>
>> +/*********************************************************************
>> + * FREQUENCY INVARIANT CPU CAPACITY SUPPORT *
>> + *********************************************************************/
>> +
>> +#ifndef arch_set_freq_scale
>> +static void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
>> + unsigned long max_freq)
>> +{}
>> +#endif
>> +
>> +static void cpufreq_set_freq_scale(struct cpufreq_policy *policy,
>> + struct cpufreq_freqs *freqs)
>> +{
>> + unsigned long cur_freq = freqs ? freqs->new : policy->cur;
>> + unsigned long max_freq = policy->cpuinfo.max_freq;
>> +
>> + pr_debug("cpus %*pbl cur/cur max freq %lu/%lu kHz\n",
>> + cpumask_pr_args(policy->related_cpus), cur_freq, max_freq);
>> +
>> + arch_set_freq_scale(policy->related_cpus, cur_freq, max_freq);
>
> I am not sure why all these are required to be sent here and will come back to
> it later on after going through other patches.
See below.
>> +}
>> +
>> /**
>> * cpufreq_notify_transition - call notifier chain and adjust_jiffies
>> * on frequency transition.
>> @@ -405,6 +427,8 @@ void cpufreq_freq_transition_begin(struct cpufreq_policy *policy,
>>
>> spin_unlock(&policy->transition_lock);
>>
>> + cpufreq_set_freq_scale(policy, freqs);
>> +
>
> Why do this before even changing the frequency ? We may fail while changing it.
>
> IMHO, you should call this routine whenever we update policy->cur and that
> happens regularly in __cpufreq_notify_transition() and few other places..
See below.
>> cpufreq_notify_transition(policy, freqs, CPUFREQ_PRECHANGE);
>> }
>> EXPORT_SYMBOL_GPL(cpufreq_freq_transition_begin);
>> @@ -2203,6 +2227,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
>> blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
>> CPUFREQ_NOTIFY, new_policy);
>>
>> + cpufreq_set_freq_scale(new_policy, NULL);
>
> Why added it here ? To get it initialized ? If yes, then we should do that in
> cpufreq_online() where we first initialize policy->cur.
I agree. This can go away. Initialization is not really needed here. We initialize
the scale values to SCHED_CAPACITY_SCALE at boot-time.
> Apart from this, you also need to update this in the schedutil governor (if you
> haven't done that in this series later) as that also updates policy->cur in the
> fast path.
So what about I call arch_set_freq_scale() in __cpufreq_notify_transition() in the
CPUFREQ_POSTCHANGE case for slow-switching and in cpufreq_driver_fast_switch() for
fast-switching?
Like this: (only slow-path slightly tested):
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 9bf97a366029..77c4d5e7a598 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -301,6 +301,12 @@ static void adjust_jiffies(unsigned long val, struct cpufreq_freqs *ci)
#endif
}
+#ifndef arch_set_freq_scale
+static void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
+ unsigned long max_freq)
+{}
+#endif
+
static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, unsigned int state)
{
@@ -343,6 +349,8 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
CPUFREQ_POSTCHANGE, freqs);
if (likely(policy) && likely(policy->cpu == freqs->cpu))
policy->cur = freqs->new;
+ arch_set_freq_scale(policy->related_cpus, policy->cur,
+ policy->cpuinfo.max_freq);
break;
}
}
@@ -1824,9 +1832,18 @@ EXPORT_SYMBOL(cpufreq_unregister_notifier);
unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
unsigned int target_freq)
{
+ unsigned int ret;
+
target_freq = clamp_val(target_freq, policy->min, policy->max);
- return cpufreq_driver->fast_switch(policy, target_freq);
+ ret = cpufreq_driver->fast_switch(policy, target_freq);
+
+ if (ret != CPUFREQ_ENTRY_INVALID) {
+ arch_set_freq_scale(policy->related_cpus, target_freq,
+ policy->cpuinfo.max_freq);
+ }
+
+ return ret;
}
EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch);
On Fri, Jul 7, 2017 at 6:01 PM, Dietmar Eggemann
<[email protected]> wrote:
> On 06/07/17 11:40, Viresh Kumar wrote:
>> On 06-07-17, 10:49, Dietmar Eggemann wrote:
>
> [...]
>
>>> In case arch_set_freq_scale() is not defined (and because of the
>>> pr_debug() drivers/cpufreq/cpufreq.c is not compiled with -DDEBUG)
>>
>> The line within () needs to be improved to convey a clear message.
>
> Probably not needed anymore. See below.
>
> [...]
>
>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>>> index 9bf97a366029..a04c5886a5ce 100644
>>> --- a/drivers/cpufreq/cpufreq.c
>>> +++ b/drivers/cpufreq/cpufreq.c
>>> @@ -347,6 +347,28 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
>>> }
>>> }
>>>
>>> +/*********************************************************************
>>> + * FREQUENCY INVARIANT CPU CAPACITY SUPPORT *
>>> + *********************************************************************/
>>> +
>>> +#ifndef arch_set_freq_scale
>>> +static void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
>>> + unsigned long max_freq)
>>> +{}
>>> +#endif
>>> +
>>> +static void cpufreq_set_freq_scale(struct cpufreq_policy *policy,
>>> + struct cpufreq_freqs *freqs)
>>> +{
>>> + unsigned long cur_freq = freqs ? freqs->new : policy->cur;
>>> + unsigned long max_freq = policy->cpuinfo.max_freq;
>>> +
>>> + pr_debug("cpus %*pbl cur/cur max freq %lu/%lu kHz\n",
>>> + cpumask_pr_args(policy->related_cpus), cur_freq, max_freq);
>>> +
>>> + arch_set_freq_scale(policy->related_cpus, cur_freq, max_freq);
>>
>> I am not sure why all these are required to be sent here and will come back to
>> it later on after going through other patches.
>
> See below.
>
>>> +}
>>> +
>>> /**
>>> * cpufreq_notify_transition - call notifier chain and adjust_jiffies
>>> * on frequency transition.
>>> @@ -405,6 +427,8 @@ void cpufreq_freq_transition_begin(struct cpufreq_policy *policy,
>>>
>>> spin_unlock(&policy->transition_lock);
>>>
>>> + cpufreq_set_freq_scale(policy, freqs);
>>> +
>>
>> Why do this before even changing the frequency ? We may fail while changing it.
>>
>> IMHO, you should call this routine whenever we update policy->cur and that
>> happens regularly in __cpufreq_notify_transition() and few other places..
>
> See below.
>
>>> cpufreq_notify_transition(policy, freqs, CPUFREQ_PRECHANGE);
>>> }
>>> EXPORT_SYMBOL_GPL(cpufreq_freq_transition_begin);
>>> @@ -2203,6 +2227,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
>>> blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
>>> CPUFREQ_NOTIFY, new_policy);
>>>
>>> + cpufreq_set_freq_scale(new_policy, NULL);
>>
>> Why added it here ? To get it initialized ? If yes, then we should do that in
>> cpufreq_online() where we first initialize policy->cur.
>
> I agree. This can go away. Initialization is not really needed here. We initialize
> the scale values to SCHED_CAPACITY_SCALE at boot-time.
>
>> Apart from this, you also need to update this in the schedutil governor (if you
>> haven't done that in this series later) as that also updates policy->cur in the
>> fast path.
>
> So what about I call arch_set_freq_scale() in __cpufreq_notify_transition() in the
> CPUFREQ_POSTCHANGE case for slow-switching and in cpufreq_driver_fast_switch() for
> fast-switching?
Why don't you do this in drivers instead of in the core?
Ultimately, the driver knows what frequency it has requested, so why
can't it call arch_set_freq_scale()?
Thanks,
Rafael
On 06/07/17 11:45, Viresh Kumar wrote:
> On 06-07-17, 10:49, Dietmar Eggemann wrote:
[...]
>> diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
>> index 9af3c174c03a..168104d2d2cf 100644
>> --- a/include/linux/arch_topology.h
>> +++ b/include/linux/arch_topology.h
>> @@ -4,6 +4,8 @@
>> #ifndef _LINUX_ARCH_TOPOLOGY_H_
>> #define _LINUX_ARCH_TOPOLOGY_H_
>>
>> +#include <linux/cpumask.h>
>> +
>
> You don't need a full include here, instead following will work pretty well.
>
> struct cpumask;
True. Forward declaration is sufficient here. Will change it.
[...]
> Apart from that:
>
> Reviewed-by: Viresh Kumar <[email protected]>
Thanks!
On 07/07/17 17:18, Rafael J. Wysocki wrote:
> On Fri, Jul 7, 2017 at 6:01 PM, Dietmar Eggemann
> <[email protected]> wrote:
>> On 06/07/17 11:40, Viresh Kumar wrote:
>>> On 06-07-17, 10:49, Dietmar Eggemann wrote:
[...]
>> So what about I call arch_set_freq_scale() in __cpufreq_notify_transition() in the
>> CPUFREQ_POSTCHANGE case for slow-switching and in cpufreq_driver_fast_switch() for
>> fast-switching?
>
> Why don't you do this in drivers instead of in the core?
>
> Ultimately, the driver knows what frequency it has requested, so why
> can't it call arch_set_freq_scale()?
That's correct but for arm/arm64 we have a lot of different cpufreq
drivers to deal with. And doing this call to arch_set_freq_scale() once
in the cpufreq core will cover them all.
[...]
On Friday, July 07, 2017 06:06:30 PM Dietmar Eggemann wrote:
> On 07/07/17 17:18, Rafael J. Wysocki wrote:
> > On Fri, Jul 7, 2017 at 6:01 PM, Dietmar Eggemann
> > <[email protected]> wrote:
> >> On 06/07/17 11:40, Viresh Kumar wrote:
> >>> On 06-07-17, 10:49, Dietmar Eggemann wrote:
>
> [...]
>
> >> So what about I call arch_set_freq_scale() in __cpufreq_notify_transition() in the
> >> CPUFREQ_POSTCHANGE case for slow-switching and in cpufreq_driver_fast_switch() for
> >> fast-switching?
> >
> > Why don't you do this in drivers instead of in the core?
> >
> > Ultimately, the driver knows what frequency it has requested, so why
> > can't it call arch_set_freq_scale()?
>
> That's correct but for arm/arm64 we have a lot of different cpufreq
> drivers to deal with. And doing this call to arch_set_freq_scale() once
> in the cpufreq core will cover them all.
>
> [...]
I'm sort of wondering how many is "a lot" really. For instance, do you really
want all of the existing ARM platforms to use the new stuff even though
it may regress things there in principle?
Anyway, if everyone agrees that doing it in the core is the way to go (Peter?),
why don't you introduce a __weak function for setting policy->cur and
override it from your arch so as to call arch_set_freq_scale() from there?
Thanks,
Rafael
On 07-07-17, 17:01, Dietmar Eggemann wrote:
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 9bf97a366029..77c4d5e7a598 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -301,6 +301,12 @@ static void adjust_jiffies(unsigned long val, struct cpufreq_freqs *ci)
> #endif
> }
>
> +#ifndef arch_set_freq_scale
> +static void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
> + unsigned long max_freq)
> +{}
> +#endif
> +
> static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
> struct cpufreq_freqs *freqs, unsigned int state)
> {
> @@ -343,6 +349,8 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
> CPUFREQ_POSTCHANGE, freqs);
> if (likely(policy) && likely(policy->cpu == freqs->cpu))
> policy->cur = freqs->new;
> + arch_set_freq_scale(policy->related_cpus, policy->cur,
> + policy->cpuinfo.max_freq);
This function gets called for-each-cpu-in-policy, so you don't need the first
argument. And the topology code already has max_freq, so not sure if you need
the last parameter as well, specially for the ARM solution.
--
viresh
On 08-07-17, 14:09, Rafael J. Wysocki wrote:
> I'm sort of wondering how many is "a lot" really. For instance, do you really
> want all of the existing ARM platforms to use the new stuff even though
> it may regress things there in principle?
That's a valid question and we must (maybe we already have) have a policy for
such changes. I thought that such changes (which are so closely bound to the
scheduler) must be at least done at the architecture level and not really at
platform level. And so doing it widely (like done in this patch) maybe the right
thing to do.
> Anyway, if everyone agrees that doing it in the core is the way to go (Peter?),
> why don't you introduce a __weak function for setting policy->cur and
> override it from your arch so as to call arch_set_freq_scale() from there?
I agree. I wanted to suggest that earlier but somehow forgot to mention this.
--
viresh
On Sat, Jul 08, 2017 at 02:09:37PM +0200, Rafael J. Wysocki wrote:
> Anyway, if everyone agrees that doing it in the core is the way to go (Peter?),
> why don't you introduce a __weak function for setting policy->cur and
> override it from your arch so as to call arch_set_freq_scale() from there?
>
So I'm terminally backlogged and my recent break didn't help any with
that.
I'm at a total loss as to what is proposed here and why we need it. I
tried reading both the Changelog and patch but came up empty.
On 10-07-17, 11:30, Peter Zijlstra wrote:
> On Sat, Jul 08, 2017 at 02:09:37PM +0200, Rafael J. Wysocki wrote:
> > Anyway, if everyone agrees that doing it in the core is the way to go (Peter?),
> > why don't you introduce a __weak function for setting policy->cur and
> > override it from your arch so as to call arch_set_freq_scale() from there?
> >
>
> So I'm terminally backlogged and my recent break didn't help any with
> that.
>
> I'm at a total loss as to what is proposed here and why we need it. I
> tried reading both the Changelog and patch but came up empty.
Dietmar is proposing the implementation of arch_set_freq_scale() for ARM (32/64)
platforms here with following equation in drivers/base/arch_topology.c:
scale = (cur_freq << SCHED_CAPACITY_SHIFT) / max_freq
The only variable part here is "cur_freq" and he is looking for sane ways to get
that value in the arch_topology.c file, so he can use that in the above
equation. He tried to use cpufreq transition notifiers earlier but they block us
from using fast switching.
What he is proposing now is a function:
void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
unsigned long max_freq);
which has to be called by someone after the frequency of the CPU is changed.
Dietmar proposed that this be called by cpufreq core and Rafael was wondering if
the cpufreq drivers should call it. Dietmar's argument is that it will be used
for the entire ARM architecture this way and wouldn't lead to redundant core
across drivers.
Hope I didn't confuse you more with this :)
--
viresh
On 10/07/17 10:42, Viresh Kumar wrote:
> On 10-07-17, 11:30, Peter Zijlstra wrote:
>> On Sat, Jul 08, 2017 at 02:09:37PM +0200, Rafael J. Wysocki wrote:
>>> Anyway, if everyone agrees that doing it in the core is the way to go (Peter?),
>>> why don't you introduce a __weak function for setting policy->cur and
>>> override it from your arch so as to call arch_set_freq_scale() from there?
>>>
>>
>> So I'm terminally backlogged and my recent break didn't help any with
>> that.
>>
>> I'm at a total loss as to what is proposed here and why we need it. I
>> tried reading both the Changelog and patch but came up empty.
>
> Dietmar is proposing the implementation of arch_set_freq_scale() for ARM (32/64)
> platforms here with following equation in drivers/base/arch_topology.c:
>
> scale = (cur_freq << SCHED_CAPACITY_SHIFT) / max_freq
>
> The only variable part here is "cur_freq" and he is looking for sane ways to get
> that value in the arch_topology.c file, so he can use that in the above
> equation. He tried to use cpufreq transition notifiers earlier but they block us
> from using fast switching.
>
> What he is proposing now is a function:
>
> void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
> unsigned long max_freq);
>
> which has to be called by someone after the frequency of the CPU is changed.
>
> Dietmar proposed that this be called by cpufreq core and Rafael was wondering if
> the cpufreq drivers should call it. Dietmar's argument is that it will be used
> for the entire ARM architecture this way and wouldn't lead to redundant core
> across drivers.
>
> Hope I didn't confuse you more with this :)
>
Perfect summary, thanks Viresh!
This is required for architectures (like arm/arm64) which do not have
any other way to know about the current CPU frequency.
X86 can do the frequency invariance support based on APERF/MPERF already
today so it does not need the support from cpufreq.
On 08/07/17 13:09, Rafael J. Wysocki wrote:
> On Friday, July 07, 2017 06:06:30 PM Dietmar Eggemann wrote:
>> On 07/07/17 17:18, Rafael J. Wysocki wrote:
>>> On Fri, Jul 7, 2017 at 6:01 PM, Dietmar Eggemann
>>> <[email protected]> wrote:
>>>> On 06/07/17 11:40, Viresh Kumar wrote:
>>>>> On 06-07-17, 10:49, Dietmar Eggemann wrote:
>>
>> [...]
>>
>>>> So what about I call arch_set_freq_scale() in __cpufreq_notify_transition() in the
>>>> CPUFREQ_POSTCHANGE case for slow-switching and in cpufreq_driver_fast_switch() for
>>>> fast-switching?
>>>
>>> Why don't you do this in drivers instead of in the core?
>>>
>>> Ultimately, the driver knows what frequency it has requested, so why
>>> can't it call arch_set_freq_scale()?
>>
>> That's correct but for arm/arm64 we have a lot of different cpufreq
>> drivers to deal with. And doing this call to arch_set_freq_scale() once
>> in the cpufreq core will cover them all.
>>
>> [...]
>
> I'm sort of wondering how many is "a lot" really. For instance, do you really
> want all of the existing ARM platforms to use the new stuff even though
> it may regress things there in principle?
Yeah, in mainline we probably only care about a couple of them, I know
about cpufreq-dt.c, mt8173-cpufreq.c and arm_big_little.c.
But a lot of development in arm64 still happens outside mainline and we
would have to inform people to provision their cpufreq drivers with this
functionality.
With a solution in cpufreq.c we could just implement the functionality
in the arch which we then connect to the call in cpufreq.c.
> Anyway, if everyone agrees that doing it in the core is the way to go (Peter?),
> why don't you introduce a __weak function for setting policy->cur and
> override it from your arch so as to call arch_set_freq_scale() from there?
Yes, I will change this. The #define approach is not really necessary
here since we're not in the scheduler hot-path and inlining is not
really required here.
Thanks,
-- Dietmar
[...]
On Monday, July 10, 2017 12:24:43 PM Viresh Kumar wrote:
> On 08-07-17, 14:09, Rafael J. Wysocki wrote:
> > I'm sort of wondering how many is "a lot" really. For instance, do you really
> > want all of the existing ARM platforms to use the new stuff even though
> > it may regress things there in principle?
>
> That's a valid question and we must (maybe we already have) have a policy for
> such changes.
I don't think it is a matter of policy, as it tends to vary from case to case.
What really matters is the reason to make the changes in this particular case.
If the reason if to help new systems to work better in the first place and the
old ones are affected just by the way, it may be better to avoid affecting them.
On the other hand, if the reason is to improve things for all the new and old
systems altogether, then sure let's do it this way.
> I thought that such changes (which are so closely bound to the
> scheduler) must be at least done at the architecture level and not really at
> platform level. And so doing it widely (like done in this patch) maybe the right
> thing to do.
This particular change is about a new feature, so making it in the core is OK
in two cases IMO: (a) when you actively want everyone to be affected by it and
(b) when the effect of it on the old systems should not be noticeable.
Thanks,
Rafael
On 06/07/17 11:42, Viresh Kumar wrote:
> On 06-07-17, 10:49, Dietmar Eggemann wrote:
>> Define arch_set_freq_scale to be the arch_topology "driver" function
>> topology_set_freq_scale() to let FIE work correctly.
>>
>> Cc: Russell King <[email protected]>
>> Cc: Juri Lelli <[email protected]>
>> Signed-off-by: Dietmar Eggemann <[email protected]>
>> ---
>> arch/arm/include/asm/topology.h | 5 +++++
>> arch/arm/kernel/topology.c | 1 -
>> 2 files changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
>> index 370f7a732900..ca05d1b90411 100644
>> --- a/arch/arm/include/asm/topology.h
>> +++ b/arch/arm/include/asm/topology.h
>> @@ -24,6 +24,11 @@ void init_cpu_topology(void);
>> void store_cpu_topology(unsigned int cpuid);
>> const struct cpumask *cpu_coregroup_mask(int cpu);
>>
>> +#include <linux/arch_topology.h>
>> +
>> +/* Subscribe for input data for frequency-invariant load-tracking */
>> +#define arch_set_freq_scale topology_set_freq_scale
>> +
>> #else
>>
>> static inline void init_cpu_topology(void) { }
>> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
>> index bf949a763dbe..2c47a76c67b0 100644
>> --- a/arch/arm/kernel/topology.c
>> +++ b/arch/arm/kernel/topology.c
>> @@ -11,7 +11,6 @@
>> * for more details.
>> */
>>
>> -#include <linux/arch_topology.h>
>
> Why is this diff part of this patch ?
Since 'arch/$ARCH/include/asm/topology.h' now includes
'include/linux/arch_topology.h' and 'arch/$ARCH/kernel/topology.c'
already includes 'arch/$ARCH/include/asm/topology.h' I thought it's a
good idea to get rid of this include here.
On 06/07/17 11:57, Viresh Kumar wrote:
> Sure this patch looks pretty useful, but ...
>
> On 06-07-17, 10:49, Dietmar Eggemann wrote:
>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
>> index 63fb3f945d21..b4481cff14bf 100644
>> --- a/drivers/base/arch_topology.c
>> +++ b/drivers/base/arch_topology.c
>> @@ -22,12 +22,7 @@
>> #include <linux/string.h>
>> #include <linux/sched/topology.h>
>>
>> -static DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
>> -
>> -unsigned long topology_get_freq_scale(struct sched_domain *sd, int cpu)
>> -{
>> - return per_cpu(freq_scale, cpu);
>> -}
>> +DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
>
> ... you just undo what you did earlier in this series, and that is somewhat
> discouraged.
>
> What about making this as the first patch of the series and move only the below
> part to the header. And then you can add the above part to the right place in
> the first attempt itself?
>
> But maybe this is all okay :)
I just wanted to show people what we gain in completely inlining FIE and
CIE on ARM64 in the scheduler hot-path. But yes, with the next version I
want to fold this inlining into the actual FIE/CIE patch.
On 10-07-17, 13:02, Dietmar Eggemann wrote:
> Yes, I will change this. The #define approach is not really necessary
> here since we're not in the scheduler hot-path and inlining is not
> really required here.
It would be part of scheduler hot-path for the fast-switching case, isn't it ?
(I am not arguing against using weak functions, just wanted to correct above
statement).
--
viresh
On 10-07-17, 16:13, Dietmar Eggemann wrote:
> Since 'arch/$ARCH/include/asm/topology.h' now includes
> 'include/linux/arch_topology.h' and 'arch/$ARCH/kernel/topology.c'
> already includes 'arch/$ARCH/include/asm/topology.h' I thought it's a
> good idea to get rid of this include here.
Ahh, makes sense.
--
viresh
On 10-07-17, 14:46, Rafael J. Wysocki wrote:
> This particular change is about a new feature, so making it in the core is OK
> in two cases IMO: (a) when you actively want everyone to be affected by it and
IMO this change should be done for the whole ARM architecture. And if some
regression happens due to this, then we come back and solve it.
> (b) when the effect of it on the old systems should not be noticeable.
I am not sure about the effects of this on performance really.
@Dietmar: Any inputs for that ?
--
viresh
On 11/07/17 07:01, Viresh Kumar wrote:
> On 10-07-17, 13:02, Dietmar Eggemann wrote:
>> Yes, I will change this. The #define approach is not really necessary
>> here since we're not in the scheduler hot-path and inlining is not
>> really required here.
>
> It would be part of scheduler hot-path for the fast-switching case, isn't it ?
> (I am not arguing against using weak functions, just wanted to correct above
> statement).
Yes you're right here.
But in the meantime we're convinced that cpufreq_driver_fast_switch() is
not the right place to call arch_set_freq_scale() since for (future)
arm/arm64 fast-switch driver, the return value of
cpufreq_driver->fast_switch() does not give us the information that the
frequency value did actually change.
So we probably have to do this soemwhere in the cpufreq driver(s) to
support fast-switching until we have aperf/mperf like counters.
On Tuesday, July 11, 2017 04:06:01 PM Dietmar Eggemann wrote:
> On 11/07/17 07:01, Viresh Kumar wrote:
> > On 10-07-17, 13:02, Dietmar Eggemann wrote:
> >> Yes, I will change this. The #define approach is not really necessary
> >> here since we're not in the scheduler hot-path and inlining is not
> >> really required here.
> >
> > It would be part of scheduler hot-path for the fast-switching case, isn't it ?
> > (I am not arguing against using weak functions, just wanted to correct above
> > statement).
>
> Yes you're right here.
>
> But in the meantime we're convinced that cpufreq_driver_fast_switch() is
> not the right place to call arch_set_freq_scale() since for (future)
> arm/arm64 fast-switch driver, the return value of
> cpufreq_driver->fast_switch() does not give us the information that the
> frequency value did actually change.
>
> So we probably have to do this soemwhere in the cpufreq driver(s) to
> support fast-switching until we have aperf/mperf like counters.
If that's the case, I'd say call arch_set_freq_scale() from drivers in all
cases or it will get *really* confusing.
Thanks,
Rafael
On 11/07/17 15:59, Rafael J. Wysocki wrote:
> On Tuesday, July 11, 2017 04:06:01 PM Dietmar Eggemann wrote:
>> On 11/07/17 07:01, Viresh Kumar wrote:
>>> On 10-07-17, 13:02, Dietmar Eggemann wrote:
>>>> Yes, I will change this. The #define approach is not really necessary
>>>> here since we're not in the scheduler hot-path and inlining is not
>>>> really required here.
>>>
>>> It would be part of scheduler hot-path for the fast-switching case, isn't it ?
>>> (I am not arguing against using weak functions, just wanted to correct above
>>> statement).
>>
>> Yes you're right here.
>>
>> But in the meantime we're convinced that cpufreq_driver_fast_switch() is
>> not the right place to call arch_set_freq_scale() since for (future)
>> arm/arm64 fast-switch driver, the return value of
>> cpufreq_driver->fast_switch() does not give us the information that the
>> frequency value did actually change.
>>
>> So we probably have to do this soemwhere in the cpufreq driver(s) to
>> support fast-switching until we have aperf/mperf like counters.
>
> If that's the case, I'd say call arch_set_freq_scale() from drivers in all
> cases or it will get *really* confusing.
Agreed, we should do it for slow-switching drivers from within the
driver as well in this case.
On 11/07/17 07:39, Viresh Kumar wrote:
> On 10-07-17, 14:46, Rafael J. Wysocki wrote:
>> This particular change is about a new feature, so making it in the core is OK
>> in two cases IMO: (a) when you actively want everyone to be affected by it and
>
> IMO this change should be done for the whole ARM architecture. And if some
> regression happens due to this, then we come back and solve it.
>
>> (b) when the effect of it on the old systems should not be noticeable.
>
> I am not sure about the effects of this on performance really.
>
> @Dietmar: Any inputs for that ?
Like I said in the other email, since for (future)
arm/arm64 fast-switch driver, the return value of
cpufreq_driver->fast_switch() does not give us the information that the
frequency value did actually change, we have to implement
arch_set_freq_scale() in the driver.
This means that we probably only implement this in the subset of drivers
which will be used in platforms on which we want to have
frequency-invariant load-tracking.
A future aperf/mperf like counter FIE solution can give us arch-wide
support when those counters are available.
On 11-07-17, 16:06, Dietmar Eggemann wrote:
> But in the meantime we're convinced that cpufreq_driver_fast_switch() is
> not the right place to call arch_set_freq_scale() since for (future)
> arm/arm64 fast-switch driver, the return value of
> cpufreq_driver->fast_switch() does not give us the information that the
> frequency value did actually change.
Yeah, I saw your discussion with Peter on #linux-rt IRC and TBH I wasn't aware
that we are going to do fast switching that way. Just trying to get
understanding of that idea a bit..
So we will do fast switching from scheduler's point of view, i.e. we wouldn't
schedule a kthread to change the frequency. But the real hardware still can't do
that without sleeping, like if we have I2C somewhere in between. AFAIU, we will
still have some kind of *software* bottom half to do that work, isn't it? And it
wouldn't be that we have pushed some instructions to the hardware, which it can
do a bit later.
For example, the regulator may be accessed via I2C and we need to program that
before changing the clock. So, it will be done by some software code only.
And now I am wondering on why that would be any better than the kthread in
schedutil. Sorry, I haven't understood the idea completely yet :(
--
viresh
On Wed, Jul 12, 2017 at 09:39:17AM +0530, Viresh Kumar wrote:
> Yeah, I saw your discussion with Peter on #linux-rt IRC and TBH I wasn't aware
> that we are going to do fast switching that way. Just trying to get
> understanding of that idea a bit..
>
> So we will do fast switching from scheduler's point of view, i.e. we wouldn't
> schedule a kthread to change the frequency. But the real hardware still can't do
> that without sleeping, like if we have I2C somewhere in between. AFAIU, we will
> still have some kind of *software* bottom half to do that work, isn't it? And it
> wouldn't be that we have pushed some instructions to the hardware, which it can
> do a bit later.
>
> For example, the regulator may be accessed via I2C and we need to program that
> before changing the clock. So, it will be done by some software code only.
>
> And now I am wondering on why that would be any better than the kthread in
> schedutil. Sorry, I haven't understood the idea completely yet :(
So the problem with the thread is two-fold; one the one hand we like the
scheduler to directly set frequency, but then we need to schedule a task
to change the frequency, which will change the frequency and around we
go.
On the other hand, there's very nasty issues with PI. This thread would
have very high priority (otherwise the SCHED_DEADLINE stuff won't work)
but that then means this thread needs to boost the owner of the i2c
mutex. And that then creates a massive bandwidth accounting hole.
The advantage of using an interrupt driven state machine is that all
those issues go away.
But yes, whichever way around you turn things, its crap. But given the
hardware its the best we can do.
On 12-07-17, 10:31, Peter Zijlstra wrote:
> So the problem with the thread is two-fold; one the one hand we like the
> scheduler to directly set frequency, but then we need to schedule a task
> to change the frequency, which will change the frequency and around we
> go.
>
> On the other hand, there's very nasty issues with PI. This thread would
> have very high priority (otherwise the SCHED_DEADLINE stuff won't work)
> but that then means this thread needs to boost the owner of the i2c
> mutex. And that then creates a massive bandwidth accounting hole.
>
>
> The advantage of using an interrupt driven state machine is that all
> those issues go away.
>
> But yes, whichever way around you turn things, its crap. But given the
> hardware its the best we can do.
Thanks for the explanation Peter.
IIUC, it will take more time to change the frequency eventually with
the interrupt-driven state machine as there may be multiple bottom
halves involved here, for supply, clk, etc, which would run at normal
priorities now. And those were boosted currently due to the high
priority sugov thread. And we are fine with that (from performance
point of view) ?
Coming back to where we started from (where should we call
arch_set_freq_scale() from ?).
I think we would still need some kind of synchronization between
cpufreq core and the cpufreq drivers to make sure we don't start
another freq change before the previous one is complete. Otherwise
the cpufreq drivers would be required to have similar support with
proper locking in place.
And if the core is going to get notified about successful freq changes
(which it should IMHO), then it may still be better to call
arch_set_freq_scale() from the core itself and not from individual
drivers.
--
viresh
On Wed, Jul 12, 2017 at 02:57:55PM +0530, Viresh Kumar wrote:
> On 12-07-17, 10:31, Peter Zijlstra wrote:
> > So the problem with the thread is two-fold; one the one hand we like the
> > scheduler to directly set frequency, but then we need to schedule a task
> > to change the frequency, which will change the frequency and around we
> > go.
> >
> > On the other hand, there's very nasty issues with PI. This thread would
> > have very high priority (otherwise the SCHED_DEADLINE stuff won't work)
> > but that then means this thread needs to boost the owner of the i2c
> > mutex. And that then creates a massive bandwidth accounting hole.
> >
> >
> > The advantage of using an interrupt driven state machine is that all
> > those issues go away.
> >
> > But yes, whichever way around you turn things, its crap. But given the
> > hardware its the best we can do.
>
> Thanks for the explanation Peter.
>
> IIUC, it will take more time to change the frequency eventually with
> the interrupt-driven state machine as there may be multiple bottom
> halves involved here, for supply, clk, etc, which would run at normal
> priorities now. And those were boosted currently due to the high
> priority sugov thread. And we are fine with that (from performance
> point of view) ?
I'm not sure what you mean; bottom halves as in softirq? From what I can
tell an i2c bus does clk_prepare_enable() on registration and from that
point on clk_enable() is usable from atomic contexts. But afaict clk
stuff doesn't do interrupts at all.
(with a note that I absolutely hate the clk locking)
I think the interrupt driven thing can actually be faster than the
'regular' task waiting on the mutex. The regulator message can be
locklessly queued (it only ever makes sense to have 1 such message
pending, any later one will invalidate a prior one).
Then the i2c interrupt can detect the availability of this pending
message and splice it into the transfer queue at an opportune moment.
(of course, the current i2c bits don't support any of that)
> Coming back to where we started from (where should we call
> arch_set_freq_scale() from ?).
The drivers.. the core cpufreq doesn't know when (if any) transition is
completed.
> I think we would still need some kind of synchronization between
> cpufreq core and the cpufreq drivers to make sure we don't start
> another freq change before the previous one is complete. Otherwise
> the cpufreq drivers would be required to have similar support with
> proper locking in place.
Not sure what you mean; also not sure why. On x86 we never know, cannot
know. So why would this stuff be any different.
> And if the core is going to get notified about successful freq changes
> (which it should IMHO), then it may still be better to call
> arch_set_freq_scale() from the core itself and not from individual
> drivers.
I would not involve the core. All we want from the core is a unified
interface towards requesting DVFS changes. Everything that happens after
is not its business.
On Wednesday, July 12, 2017 01:14:26 PM Peter Zijlstra wrote:
> On Wed, Jul 12, 2017 at 02:57:55PM +0530, Viresh Kumar wrote:
> > On 12-07-17, 10:31, Peter Zijlstra wrote:
> > > So the problem with the thread is two-fold; one the one hand we like the
> > > scheduler to directly set frequency, but then we need to schedule a task
> > > to change the frequency, which will change the frequency and around we
> > > go.
> > >
> > > On the other hand, there's very nasty issues with PI. This thread would
> > > have very high priority (otherwise the SCHED_DEADLINE stuff won't work)
> > > but that then means this thread needs to boost the owner of the i2c
> > > mutex. And that then creates a massive bandwidth accounting hole.
> > >
> > >
> > > The advantage of using an interrupt driven state machine is that all
> > > those issues go away.
> > >
> > > But yes, whichever way around you turn things, its crap. But given the
> > > hardware its the best we can do.
> >
> > Thanks for the explanation Peter.
> >
> > IIUC, it will take more time to change the frequency eventually with
> > the interrupt-driven state machine as there may be multiple bottom
> > halves involved here, for supply, clk, etc, which would run at normal
> > priorities now. And those were boosted currently due to the high
> > priority sugov thread. And we are fine with that (from performance
> > point of view) ?
>
> I'm not sure what you mean; bottom halves as in softirq? From what I can
> tell an i2c bus does clk_prepare_enable() on registration and from that
> point on clk_enable() is usable from atomic contexts. But afaict clk
> stuff doesn't do interrupts at all.
>
> (with a note that I absolutely hate the clk locking)
>
> I think the interrupt driven thing can actually be faster than the
> 'regular' task waiting on the mutex. The regulator message can be
> locklessly queued (it only ever makes sense to have 1 such message
> pending, any later one will invalidate a prior one).
>
> Then the i2c interrupt can detect the availability of this pending
> message and splice it into the transfer queue at an opportune moment.
>
> (of course, the current i2c bits don't support any of that)
I *guess* the concern is that in the new model there is no control over the
time of requesting the frequency change and when the change actually
happens.
IIUC the whole point of making the governor thread an RT or DL one is to
ensure that the change will happen as soon as technically possible, because if
it doesn't happen in a specific time frame, it can very well be skipped entirely.
> > Coming back to where we started from (where should we call
> > arch_set_freq_scale() from ?).
>
> The drivers.. the core cpufreq doesn't know when (if any) transition is
> completed.
Right.
> > I think we would still need some kind of synchronization between
> > cpufreq core and the cpufreq drivers to make sure we don't start
> > another freq change before the previous one is complete. Otherwise
> > the cpufreq drivers would be required to have similar support with
> > proper locking in place.
>
> Not sure what you mean; also not sure why. On x86 we never know, cannot
> know. So why would this stuff be any different.
We seem to be in violent agreement on this. :-)
Thanks,
Rafael
On Thu, Jul 13, 2017 at 01:13:43AM +0200, Rafael J. Wysocki wrote:
> I *guess* the concern is that in the new model there is no control over the
> time of requesting the frequency change and when the change actually
> happens.
There isn't in any case. If some brilliant hardware designer shares the
regulator's I2C bus with another device that requires 'big' transfers
(say a DSP) we're hosed whichever way around.
> IIUC the whole point of making the governor thread an RT or DL one is to
> ensure that the change will happen as soon as technically possible, because if
> it doesn't happen in a specific time frame, it can very well be skipped entirely.
So I think the interrupt driven thing can actually do better than the
rt-mutex based one, and I cannot think of a case where it would do
worse.
The I2C completion interrupt can splice in the pending request at the
earliest opportunity, even though the mutex owner might still think it
owns things.
And if that is not possible, it'll still be as fast as waiting for the
mutex to be released (or ever so slightly faster, because it will not
quiesce the bus like it would otherwise at the end of a transfer).
On 13-07-17, 01:13, Rafael J. Wysocki wrote:
> On Wednesday, July 12, 2017 01:14:26 PM Peter Zijlstra wrote:
> > On Wed, Jul 12, 2017 at 02:57:55PM +0530, Viresh Kumar wrote:
> > > IIUC, it will take more time to change the frequency eventually with
> > > the interrupt-driven state machine as there may be multiple bottom
> > > halves involved here, for supply, clk, etc, which would run at normal
> > > priorities now. And those were boosted currently due to the high
> > > priority sugov thread. And we are fine with that (from performance
> > > point of view) ?
> >
> > I'm not sure what you mean; bottom halves as in softirq?
Workqueues or normal threads actually. Maybe I am completely wrong,
but this is how I believe things are going to be:
Configuration: Both regulator and clk registers accessible over I2C
bus.
Scheduler calls schedutil, which eventually calls cpufreq driver (w/o
kthread). The cpufreq backend driver will queue a async request with
callback (with regulator core) to update regulator's constraints
(which can sleep as we need to talk over I2C). The callback will be
called once regulator is programmed. And we return right after
submitting the request with regulator core.
Now, I2C transfer will finish (i.e. regulator programmed) and the
driver specific callback will get called. It will try to change the
frequency now and wait (sleep) until it finishes. I hope the regulator
core wouldn't call the driver callback from interrupt context but some
sort of bottom half, maybe workqueue (That's what I was referring to
earlier).
And finally the clk is programmed and the state machine finished.
> > From what I can
> > tell an i2c bus does clk_prepare_enable() on registration and from that
> > point on clk_enable() is usable from atomic contexts.
That assumes that we can access registers of the I2C controller
atomically without sleeping. Not sure how many ARM platforms have I2C
controller connected over a slow bus though.
> > But afaict clk
> > stuff doesn't do interrupts at all.
The clk stuff may not need it if the clock controllers registers can
be accessed atomically. But if (as in my example) the clk controller
is also over the I2C bus, then the interrupt will be provided from I2C
bus and clk routines would return only after transfer is done.
> > (with a note that I absolutely hate the clk locking)
Yeah, that's a beast :)
> > I think the interrupt driven thing can actually be faster than the
> > 'regular' task waiting on the mutex. The regulator message can be
> > locklessly queued (it only ever makes sense to have 1 such message
> > pending, any later one will invalidate a prior one).
> >
> > Then the i2c interrupt can detect the availability of this pending
> > message and splice it into the transfer queue at an opportune moment.
> >
> > (of course, the current i2c bits don't support any of that)
>
> I *guess* the concern is that in the new model there is no control over the
> time of requesting the frequency change and when the change actually
> happens.
Right.
> IIUC the whole point of making the governor thread an RT or DL one is to
> ensure that the change will happen as soon as technically possible, because if
> it doesn't happen in a specific time frame, it can very well be skipped entirely.
Yes, or actually we can have more trouble (below..).
> > > Coming back to where we started from (where should we call
> > > arch_set_freq_scale() from ?).
> >
> > The drivers.. the core cpufreq doesn't know when (if any) transition is
> > completed.
>
> Right.
>
> > > I think we would still need some kind of synchronization between
> > > cpufreq core and the cpufreq drivers to make sure we don't start
> > > another freq change before the previous one is complete. Otherwise
> > > the cpufreq drivers would be required to have similar support with
> > > proper locking in place.
> >
> > Not sure what you mean; also not sure why. On x86 we never know, cannot
> > know. So why would this stuff be any different.
So as per the above example, the software on ARM requires to program
multiple hardware components (clk, regulator, power-domain, etc) for
changing CPUs frequency. And this has to be non-racy, otherwise we
might have programmed regulator, domain etc, but right before we
change the frequency another request may land which may try to program
the regulator again. We would be screwed badly here.
Its not a problem on X86 because (I believe) most of this is done by
the hardware for you. And you guys don't have to worry for that.
We already take care of these synchronization issues in slow switching
path (cpufreq_freq_transition_begin()), where we guarantee that a new
freq change request doesn't start before the previous one is over.
--
viresh
On Thu, Jul 13, 2017 at 02:18:04PM +0530, Viresh Kumar wrote:
> Workqueues or normal threads actually. Maybe I am completely wrong,
> but this is how I believe things are going to be:
>
> Configuration: Both regulator and clk registers accessible over I2C
> bus.
>
> Scheduler calls schedutil, which eventually calls cpufreq driver (w/o
> kthread). The cpufreq backend driver will queue a async request with
> callback (with regulator core) to update regulator's constraints
> (which can sleep as we need to talk over I2C). The callback will be
> called once regulator is programmed. And we return right after
> submitting the request with regulator core.
>
> Now, I2C transfer will finish (i.e. regulator programmed) and the
> driver specific callback will get called. It will try to change the
> frequency now and wait (sleep) until it finishes. I hope the regulator
> core wouldn't call the driver callback from interrupt context but some
> sort of bottom half, maybe workqueue (That's what I was referring to
> earlier).
>
> And finally the clk is programmed and the state machine finished.
>
> > > From what I can
> > > tell an i2c bus does clk_prepare_enable() on registration and from that
> > > point on clk_enable() is usable from atomic contexts.
>
> That assumes that we can access registers of the I2C controller
> atomically without sleeping. Not sure how many ARM platforms have I2C
> controller connected over a slow bus though.
>
> > > But afaict clk
> > > stuff doesn't do interrupts at all.
>
> The clk stuff may not need it if the clock controllers registers can
> be accessed atomically. But if (as in my example) the clk controller
> is also over the I2C bus, then the interrupt will be provided from I2C
> bus and clk routines would return only after transfer is done.
So no, that's not the idea at all.
The one thing we assume is that the I2C bus does indeed have interrupts
(which isn't a given per se, but lacking that we're completely hosed).
For interrupt enabled I2C busses, we must be able to write to their
registers from atomic context, otherwise the interrupts can't very well
work.
With that, you can drive the entire thing from the IRQ.
I don't believe anyone is quite as silly as to put the regulator on a
nested I2C; in which case I think you still _could_ do, but I'll just
class that with failure of using a polling I2C bus for your regulator.
As to the serialization, your driver can easily keep track of that and
not allow a new transition to start while a previous is still in
critical transit.
On 11/07/17 16:21, Dietmar Eggemann wrote:
> On 11/07/17 07:39, Viresh Kumar wrote:
>> On 10-07-17, 14:46, Rafael J. Wysocki wrote:
>>> This particular change is about a new feature, so making it in the core is OK
>>> in two cases IMO: (a) when you actively want everyone to be affected by it and
>>
>> IMO this change should be done for the whole ARM architecture. And if some
>> regression happens due to this, then we come back and solve it.
>>
>>> (b) when the effect of it on the old systems should not be noticeable.
>>
>> I am not sure about the effects of this on performance really.
>>
>> @Dietmar: Any inputs for that ?
>
> Like I said in the other email, since for (future)
> arm/arm64 fast-switch driver, the return value of
> cpufreq_driver->fast_switch() does not give us the information that the
> frequency value did actually change, we have to implement
I was under the impression that we strictly don't care about that
information when I started exploring the fast_switch with the standard
firmware interface on ARM platforms(until if and when ARM provides an
instruction to achieve that).
If f/w failed to change the frequency, will that be not corrected in the
next sample or instance. I would like to know the impact of absence of
such notifications.
> arch_set_freq_scale() in the driver.
> This means that we probably only implement this in the subset of drivers
> which will be used in platforms on which we want to have
> frequency-invariant load-tracking.
>
> A future aperf/mperf like counter FIE solution can give us arch-wide
> support when those counters are available.
>
Agreed.
--
Regards,
Sudeep
On 12/07/17 05:09, Viresh Kumar wrote:
> On 11-07-17, 16:06, Dietmar Eggemann wrote:
>> But in the meantime we're convinced that cpufreq_driver_fast_switch() is
>> not the right place to call arch_set_freq_scale() since for (future)
>> arm/arm64 fast-switch driver, the return value of
>> cpufreq_driver->fast_switch() does not give us the information that the
>> frequency value did actually change.
>
> Yeah, I saw your discussion with Peter on #linux-rt IRC and TBH I wasn't aware
> that we are going to do fast switching that way. Just trying to get
> understanding of that idea a bit..
>
> So we will do fast switching from scheduler's point of view, i.e. we wouldn't
> schedule a kthread to change the frequency. But the real hardware still can't do
> that without sleeping, like if we have I2C somewhere in between. AFAIU, we will
> still have some kind of *software* bottom half to do that work, isn't it? And it
> wouldn't be that we have pushed some instructions to the hardware, which it can
> do a bit later.
>
No the platforms we are considering are only where a standard firmware
interface is provided and the firmware deals with all those I2C/PMIC crap.
> For example, the regulator may be accessed via I2C and we need to program that
> before changing the clock. So, it will be done by some software code only.
>
Software but just not Linux OSPM but some firmware(remote processors
presumably, can't imagine on the same processor though)
--
Regards,
Sudeep
On 12/07/17 10:27, Viresh Kumar wrote:
> On 12-07-17, 10:31, Peter Zijlstra wrote:
>> So the problem with the thread is two-fold; one the one hand we like the
>> scheduler to directly set frequency, but then we need to schedule a task
>> to change the frequency, which will change the frequency and around we
>> go.
>>
>> On the other hand, there's very nasty issues with PI. This thread would
>> have very high priority (otherwise the SCHED_DEADLINE stuff won't work)
>> but that then means this thread needs to boost the owner of the i2c
>> mutex. And that then creates a massive bandwidth accounting hole.
>>
>>
>> The advantage of using an interrupt driven state machine is that all
>> those issues go away.
>>
>> But yes, whichever way around you turn things, its crap. But given the
>> hardware its the best we can do.
>
> Thanks for the explanation Peter.
>
> IIUC, it will take more time to change the frequency eventually with
> the interrupt-driven state machine as there may be multiple bottom
> halves involved here, for supply, clk, etc, which would run at normal
> priorities now. And those were boosted currently due to the high
> priority sugov thread. And we are fine with that (from performance
> point of view) ?
>
> Coming back to where we started from (where should we call
> arch_set_freq_scale() from ?).
>
> I think we would still need some kind of synchronization between
> cpufreq core and the cpufreq drivers to make sure we don't start
> another freq change before the previous one is complete. Otherwise
> the cpufreq drivers would be required to have similar support with
> proper locking in place.
>
Good point, but with firmware interface we are considering fro
fast-switch, the firmware can override the previous request if it's not
yet started. So I assume that's fine and expected ?
> And if the core is going to get notified about successful freq changes
> (which it should IMHO),
Is that mandatory for even fast-switching ?
--
Regards,
Sudeep
On 13/07/17 13:40, Sudeep Holla wrote:
>
>
> On 11/07/17 16:21, Dietmar Eggemann wrote:
>> On 11/07/17 07:39, Viresh Kumar wrote:
>>> On 10-07-17, 14:46, Rafael J. Wysocki wrote:
[...]
>> Like I said in the other email, since for (future)
>> arm/arm64 fast-switch driver, the return value of
>> cpufreq_driver->fast_switch() does not give us the information that the
>> frequency value did actually change, we have to implement
>
> I was under the impression that we strictly don't care about that
> information when I started exploring the fast_switch with the standard
> firmware interface on ARM platforms(until if and when ARM provides an
> instruction to achieve that).
>
> If f/w failed to change the frequency, will that be not corrected in the
> next sample or instance. I would like to know the impact of absence of
> such notifications.
In the meantime we agreed that we have to invoke frequency invariance
from within the cpufreq driver.
For a fast-switch driver I would have to put the call to
arch_set_freq_scale() somewhere where I know that the frequency has been
set.
Without a notification (from the firmware) that the frequency has been
set, I would have to call arch_set_freq_scale() somewhere in the
driver::fast_switch() call assuming that the frequency has been actually
set.
[...]
On 12/07/17 12:14, Peter Zijlstra wrote:
> On Wed, Jul 12, 2017 at 02:57:55PM +0530, Viresh Kumar wrote:
>> On 12-07-17, 10:31, Peter Zijlstra wrote:
>>> So the problem with the thread is two-fold; one the one hand we like the
>>> scheduler to directly set frequency, but then we need to schedule a task
>>> to change the frequency, which will change the frequency and around we
>>> go.
>>>
>>> On the other hand, there's very nasty issues with PI. This thread would
>>> have very high priority (otherwise the SCHED_DEADLINE stuff won't work)
>>> but that then means this thread needs to boost the owner of the i2c
>>> mutex. And that then creates a massive bandwidth accounting hole.
>>>
>>>
>>> The advantage of using an interrupt driven state machine is that all
>>> those issues go away.
>>>
>>> But yes, whichever way around you turn things, its crap. But given the
>>> hardware its the best we can do.
>>
>> Thanks for the explanation Peter.
>>
>> IIUC, it will take more time to change the frequency eventually with
>> the interrupt-driven state machine as there may be multiple bottom
>> halves involved here, for supply, clk, etc, which would run at normal
>> priorities now. And those were boosted currently due to the high
>> priority sugov thread. And we are fine with that (from performance
>> point of view) ?
>
> I'm not sure what you mean; bottom halves as in softirq? From what I can
> tell an i2c bus does clk_prepare_enable() on registration and from that
> point on clk_enable() is usable from atomic contexts. But afaict clk
> stuff doesn't do interrupts at all.
>
> (with a note that I absolutely hate the clk locking)
>
Agreed. Juri pointed out this as a blocker a while ago and when we
started implementing the new and shiny ARM SCMI specification, I dropped
the whole clock layer interaction for the CPUFreq driver. However, I
still have to deal with some mailbox locking(still experimenting currently)
> I think the interrupt driven thing can actually be faster than the
> 'regular' task waiting on the mutex. The regulator message can be
> locklessly queued (it only ever makes sense to have 1 such message
> pending, any later one will invalidate a prior one).
>
Ah OK, I just asked the same in the other thread, you have already
answered me. Good we can ignore.
> Then the i2c interrupt can detect the availability of this pending
> message and splice it into the transfer queue at an opportune moment.
>
> (of course, the current i2c bits don't support any of that)
>
>> Coming back to where we started from (where should we call
>> arch_set_freq_scale() from ?).
>
> The drivers.. the core cpufreq doesn't know when (if any) transition is
> completed.
>
>> I think we would still need some kind of synchronization between
>> cpufreq core and the cpufreq drivers to make sure we don't start
>> another freq change before the previous one is complete. Otherwise
>> the cpufreq drivers would be required to have similar support with
>> proper locking in place.
>
> Not sure what you mean; also not sure why. On x86 we never know, cannot
> know. So why would this stuff be any different.
>
Good, I was under the same assumption that it's okay to override the old
request with new.
>> And if the core is going to get notified about successful freq changes
>> (which it should IMHO), then it may still be better to call
>> arch_set_freq_scale() from the core itself and not from individual
>> drivers.
>
> I would not involve the core. All we want from the core is a unified
> interface towards requesting DVFS changes. Everything that happens after
> is not its business.
>
The question is whether we *need* to know the completion of frequency
transition. What is the impact of absence of it ? I am considering
platforms which may take up to a ms or more to do the actual transition
in the firmware.
--
Regards,
Sudeep
On 13/07/17 14:08, Dietmar Eggemann wrote:
> On 13/07/17 13:40, Sudeep Holla wrote:
>>
>>
>> On 11/07/17 16:21, Dietmar Eggemann wrote:
>>> On 11/07/17 07:39, Viresh Kumar wrote:
>>>> On 10-07-17, 14:46, Rafael J. Wysocki wrote:
>
> [...]
>
>>> Like I said in the other email, since for (future)
>>> arm/arm64 fast-switch driver, the return value of
>>> cpufreq_driver->fast_switch() does not give us the information that the
>>> frequency value did actually change, we have to implement
>>
>> I was under the impression that we strictly don't care about that
>> information when I started exploring the fast_switch with the standard
>> firmware interface on ARM platforms(until if and when ARM provides an
>> instruction to achieve that).
>>
>> If f/w failed to change the frequency, will that be not corrected in the
>> next sample or instance. I would like to know the impact of absence of
>> such notifications.
>
> In the meantime we agreed that we have to invoke frequency invariance
> from within the cpufreq driver.
>
> For a fast-switch driver I would have to put the call to
> arch_set_freq_scale() somewhere where I know that the frequency has been
> set.
>
> Without a notification (from the firmware) that the frequency has been
> set, I would have to call arch_set_freq_scale() somewhere in the
> driver::fast_switch() call assuming that the frequency has been actually
> set.
>
Yes, that's what I was thinking too.
--
Regards,
Sudeep
On Thu, Jul 13, 2017 at 03:04:09PM +0100, Sudeep Holla wrote:
> The question is whether we *need* to know the completion of frequency
> transition. What is the impact of absence of it ? I am considering
> platforms which may take up to a ms or more to do the actual transition
> in the firmware.
So on x86 we can recover from not knowing by means of the APERF/MPERF
thing, which gives us the average effective frequency over the last
period.
If you lack that you need something to update the actual effective
frequency.
Changing the effective frequency at request time might confuse things --
esp. if the request might not be honoured at all or can take a
significant time to complete. Not to mention that _IF_ you rely on the
effective frequency to set other clocks things can come unstuck.
So unless you go the whole distance and do APERF/MPERF like things, I
think it would be very good to have a notification of completion (and
possibly a read-back of the effective frequency that is now set).
On 13/07/17 15:42, Peter Zijlstra wrote:
> On Thu, Jul 13, 2017 at 03:04:09PM +0100, Sudeep Holla wrote:
>
>> The question is whether we *need* to know the completion of frequency
>> transition. What is the impact of absence of it ? I am considering
>> platforms which may take up to a ms or more to do the actual transition
>> in the firmware.
>
> So on x86 we can recover from not knowing by means of the APERF/MPERF
> thing, which gives us the average effective frequency over the last
> period.
>
Understood, and I agree we *must* head in that direction.
> If you lack that you need something to update the actual effective
> frequency.
>
> Changing the effective frequency at request time might confuse things --
> esp. if the request might not be honoured at all or can take a
> significant time to complete. Not to mention that _IF_ you rely on the
> effective frequency to set other clocks things can come unstuck.
>
> So unless you go the whole distance and do APERF/MPERF like things, I
> think it would be very good to have a notification of completion (and
> possibly a read-back of the effective frequency that is now set).
>
Yes I agree, but sadly/unfortunately the new SCMI specification I keep
referring makes these notification optional. We even have statistics
collected by the firmware to get the effective frequency, but again
optional and may get updated at slower rate than what we would expect as
they are typically running on slower processors. But IMO it's still
worth exploring the feasibility of fast switch on system with such
standard interface.
I completely agree with you on the old system which Linux has to deal
with I2C and other slow paths. I was under the assumption that we had
already eliminated the use of fast switch on such systems. When Dietmar
referred fast switching, I think he was referring to only systems with
std. firmware interface.
--
Regards,
Sudeep