Introducing arm64 specific version of arch_freq_get_on_cpu, cashing on
existing implementation for FIE and AMUv1 support: the frequency scale
factor, updated on each sched tick, serves as a base for retrieving
the frequency for a given CPU, representing an average frequency
reported between the ticks - thus its accuracy is limited.
The changes have been rather lightly (due to some limitations) tested on
an FVP model. Note that some small discrepancies have been observed while
testing (on the model) and this is currently being investigated, though it
should not have any significant impact on the overall results.
Relevant discussions:
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/7eozim2xnepacnnkzxlbx34hib4otycnbn4dqymfziqou5lw5u@5xzpv3t7sxo3/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/lkml/[email protected]/T/#m4e74cb5a0aaa353c60fedc6cfb95ab7a6e381e3c
v4:
- dropping seqcount
- fixing identifying active cpu within given policy
- skipping full dynticks cpus when retrieving the freq
- bringing back plugging in arch_freq_get_on_cpu into cpuinfo_cur_freq
v3:
- dropping changes to cpufreq_verify_current_freq
- pulling in changes from Ionela initializing capacity_freq_ref to 0
(thanks for that!) and applying suggestions made by her during last review:
- switching to arch_scale_freq_capacity and arch_scale_freq_ref when
reversing freq scale factor computation
- swapping shift with multiplication
- adding time limit for considering last scale update as valid
- updating frequency scale factor upon entering idle
v2:
- Splitting the patches
- Adding comment for full dyntick mode
- Plugging arch_freq_get_on_cpu into cpufreq_verify_current_freq instead
of in show_cpuinfo_cur_freq to allow the framework to stay more in sync
with potential freq changes
Beata Michalska (3):
arm64: Provide an AMU-based version of arch_freq_get_on_cpu
arm64: Update AMU-based frequency scale factor on entering idle
cpufreq: Use arch specific feedback for cpuinfo_cur_freq
Ionela Voinescu (1):
arch_topology: init capacity_freq_ref to 0
arch/arm64/kernel/topology.c | 125 ++++++++++++++++++++++++++++++++---
drivers/base/arch_topology.c | 8 ++-
drivers/cpufreq/cpufreq.c | 4 +-
3 files changed, 123 insertions(+), 14 deletions(-)
--
2.25.1
With the Frequency Invariance Engine (FIE) being already wired up with
sched tick and making use of relevant (core counter and constant
counter) AMU counters, getting the current frequency for a given CPU,
can be achieved by utilizing the frequency scale factor which reflects
an average CPU frequency for the last tick period length.
The solution is partially based on APERF/MPERF implementation of
arch_freq_get_on_cpu.
Suggested-by: Ionela Voinescu <[email protected]>
Signed-off-by: Beata Michalska <[email protected]>
---
arch/arm64/kernel/topology.c | 112 +++++++++++++++++++++++++++++++----
1 file changed, 102 insertions(+), 10 deletions(-)
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 1a2c72f3e7f8..b03fe8617721 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -17,6 +17,7 @@
#include <linux/cpufreq.h>
#include <linux/init.h>
#include <linux/percpu.h>
+#include <linux/sched/isolation.h>
#include <asm/cpu.h>
#include <asm/cputype.h>
@@ -88,18 +89,28 @@ int __init parse_acpi_topology(void)
* initialized.
*/
static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale) = 1UL << (2 * SCHED_CAPACITY_SHIFT);
-static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
-static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
static cpumask_var_t amu_fie_cpus;
+struct amu_cntr_sample {
+ u64 arch_const_cycles_prev;
+ u64 arch_core_cycles_prev;
+ unsigned long last_update;
+};
+
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct amu_cntr_sample, cpu_amu_samples);
+
void update_freq_counters_refs(void)
{
- this_cpu_write(arch_core_cycles_prev, read_corecnt());
- this_cpu_write(arch_const_cycles_prev, read_constcnt());
+ struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
+
+ amu_sample->arch_core_cycles_prev = read_corecnt();
+ amu_sample->arch_const_cycles_prev = read_constcnt();
}
static inline bool freq_counters_valid(int cpu)
{
+ struct amu_cntr_sample *amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
+
if ((cpu >= nr_cpu_ids) || !cpumask_test_cpu(cpu, cpu_present_mask))
return false;
@@ -108,8 +119,8 @@ static inline bool freq_counters_valid(int cpu)
return false;
}
- if (unlikely(!per_cpu(arch_const_cycles_prev, cpu) ||
- !per_cpu(arch_core_cycles_prev, cpu))) {
+ if (unlikely(!amu_sample->arch_const_cycles_prev ||
+ !amu_sample->arch_core_cycles_prev)) {
pr_debug("CPU%d: cycle counters are not enabled.\n", cpu);
return false;
}
@@ -152,17 +163,22 @@ void freq_inv_set_max_ratio(int cpu, u64 max_rate)
static void amu_scale_freq_tick(void)
{
+ struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
u64 prev_core_cnt, prev_const_cnt;
u64 core_cnt, const_cnt, scale;
- prev_const_cnt = this_cpu_read(arch_const_cycles_prev);
- prev_core_cnt = this_cpu_read(arch_core_cycles_prev);
+ prev_const_cnt = amu_sample->arch_const_cycles_prev;
+ prev_core_cnt = amu_sample->arch_core_cycles_prev;
update_freq_counters_refs();
- const_cnt = this_cpu_read(arch_const_cycles_prev);
- core_cnt = this_cpu_read(arch_core_cycles_prev);
+ const_cnt = amu_sample->arch_const_cycles_prev;
+ core_cnt = amu_sample->arch_core_cycles_prev;
+ /*
+ * This should not happen unless the AMUs have been reset and the
+ * counter values have not been restored - unlikely
+ */
if (unlikely(core_cnt <= prev_core_cnt ||
const_cnt <= prev_const_cnt))
return;
@@ -182,6 +198,8 @@ static void amu_scale_freq_tick(void)
scale = min_t(unsigned long, scale, SCHED_CAPACITY_SCALE);
this_cpu_write(arch_freq_scale, (unsigned long)scale);
+
+ amu_sample->last_update = jiffies;
}
static struct scale_freq_data amu_sfd = {
@@ -189,6 +207,80 @@ static struct scale_freq_data amu_sfd = {
.set_freq_scale = amu_scale_freq_tick,
};
+#define AMU_SAMPLE_EXP_MS 20
+
+unsigned int arch_freq_get_on_cpu(int cpu)
+{
+ struct amu_cntr_sample *amu_sample;
+ cpumask_var_t ref_cpumask = NULL;
+ unsigned long last_update;
+ unsigned int freq;
+ u64 scale;
+
+ if (!cpumask_test_cpu(cpu, amu_fie_cpus) || !arch_scale_freq_ref(cpu))
+ return 0;
+retry:
+ amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
+
+ last_update = amu_sample->last_update;
+
+ /*
+ * For those CPUs that are in full dynticks mode,
+ * and those that have not seen tick for a while
+ * try an alternative source for the counters (and thus freq scale),
+ * if available, for given policy:
+ * this boils down to identifying an active cpu within the same freq
+ * domain, if any.
+ */
+ if (!housekeeping_cpu(cpu, HK_TYPE_TICK) ||
+ time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
+ struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
+ int ref_cpu = cpu;
+
+ if (!policy_is_shared(policy)) {
+ cpufreq_cpu_put(policy);
+ return 0;
+ }
+
+ if (!ref_cpumask) {
+ if (!zalloc_cpumask_var(&ref_cpumask, GFP_KERNEL)) {
+ cpufreq_cpu_put(policy);
+ return 0;
+ }
+
+ cpumask_copy(ref_cpumask, policy->cpus);
+ }
+
+ cpufreq_cpu_put(policy);
+
+ do {
+ cpumask_clear_cpu(ref_cpu, ref_cpumask);
+ ref_cpu = cpumask_first(ref_cpumask);
+
+ } while (ref_cpu < nr_cpu_ids && idle_cpu(ref_cpu));
+
+ if (ref_cpu >= nr_cpu_ids) {
+ /* No alternative to pull info from */
+ free_cpumask_var(ref_cpumask);
+ return 0;
+ }
+ cpu = ref_cpu;
+ goto retry;
+ }
+ /*
+ * Reversed computation to the one used to determine
+ * the arch_freq_scale value
+ * (see amu_scale_freq_tick for details)
+ */
+ scale = arch_scale_freq_capacity(cpu);
+ freq = scale * arch_scale_freq_ref(cpu);
+ freq >>= SCHED_CAPACITY_SHIFT;
+
+ free_cpumask_var(ref_cpumask);
+
+ return freq;
+}
+
static void amu_fie_setup(const struct cpumask *cpus)
{
int cpu;
--
2.25.1
Some architectures provide a way to determine an average frequency over
a certain period of time based on available performance monitors (AMU on
ARM or APERF/MPERf on x86). With those at hand, enroll arch_freq_get_on_cpu
into cpuinfo_cur_freq policy sysfs attribute handler, which is expected to
represent the current frequency of a given CPU, as obtained by the hardware.
This is the type of feedback that counters do provide.
Signed-off-by: Beata Michalska <[email protected]>
---
drivers/cpufreq/cpufreq.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 66e10a19d76a..603533b2608f 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -795,8 +795,10 @@ store_one(scaling_max_freq, max);
static ssize_t show_cpuinfo_cur_freq(struct cpufreq_policy *policy,
char *buf)
{
- unsigned int cur_freq = __cpufreq_get(policy);
+ unsigned int cur_freq = arch_freq_get_on_cpu(policy->cpu);
+ if (!cur_freq)
+ cur_freq = __cpufreq_get(policy);
if (cur_freq)
return sprintf(buf, "%u\n", cur_freq);
--
2.25.1
From: Ionela Voinescu <[email protected]>
It's useful to have capacity_freq_ref initialized to 0 for users of
arch_scale_freq_ref() to detect when capacity_freq_ref was not
yet set.
The only scenario affected by this change in the init value is when a
cpufreq driver is never loaded. As a result, the only setter of a
cpu scale factor remains the call of topology_normalize_cpu_scale()
from parse_dt_topology(). There we cannot use the value 0 of
capacity_freq_ref so we have to compensate for its uninitialized state.
Signed-off-by: Ionela Voinescu <[email protected]>
Signed-off-by: Beata Michalska <[email protected]>
---
drivers/base/arch_topology.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 024b78a0cfc1..7d4c92cd2bad 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -27,7 +27,7 @@
static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
static struct cpumask scale_freq_counters_mask;
static bool scale_freq_invariant;
-DEFINE_PER_CPU(unsigned long, capacity_freq_ref) = 1;
+DEFINE_PER_CPU(unsigned long, capacity_freq_ref) = 0;
EXPORT_PER_CPU_SYMBOL_GPL(capacity_freq_ref);
static bool supports_scale_freq_counters(const struct cpumask *cpus)
@@ -292,13 +292,15 @@ void topology_normalize_cpu_scale(void)
capacity_scale = 1;
for_each_possible_cpu(cpu) {
- capacity = raw_capacity[cpu] * per_cpu(capacity_freq_ref, cpu);
+ capacity = raw_capacity[cpu] *
+ (per_cpu(capacity_freq_ref, cpu) ?: 1);
capacity_scale = max(capacity, capacity_scale);
}
pr_debug("cpu_capacity: capacity_scale=%llu\n", capacity_scale);
for_each_possible_cpu(cpu) {
- capacity = raw_capacity[cpu] * per_cpu(capacity_freq_ref, cpu);
+ capacity = raw_capacity[cpu] *
+ (per_cpu(capacity_freq_ref, cpu) ?: 1);
capacity = div64_u64(capacity << SCHED_CAPACITY_SHIFT,
capacity_scale);
topology_set_cpu_scale(cpu, capacity);
--
2.25.1
Now that the frequency scale factor has been activated for retrieving
current frequency on a given CPU, trigger its update upon entering
idle. This will, to an extent, allow querying last known frequency
in a non-invasive way. It will also improve the frequency scale factor
accuracy when a CPU entering idle did not receive a tick for a while.
As a consequence, for idle cores, the reported frequency will be the
last one observed before entering the idle state.
Suggested-by: Vanshidhar Konda <[email protected]>
Signed-off-by: Beata Michalska <[email protected]>
---
arch/arm64/kernel/topology.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index b03fe8617721..f204f6489f98 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -207,6 +207,19 @@ static struct scale_freq_data amu_sfd = {
.set_freq_scale = amu_scale_freq_tick,
};
+void arch_cpu_idle_enter(void)
+{
+ unsigned int cpu = smp_processor_id();
+
+ if (!cpumask_test_cpu(cpu, amu_fie_cpus))
+ return;
+
+ /* Kick in AMU update but only if one has not happened already */
+ if (housekeeping_cpu(cpu, HK_TYPE_TICK) &&
+ time_is_before_jiffies(per_cpu(cpu_amu_samples.last_update, cpu)))
+ amu_scale_freq_tick();
+}
+
#define AMU_SAMPLE_EXP_MS 20
unsigned int arch_freq_get_on_cpu(int cpu)
@@ -232,8 +245,8 @@ unsigned int arch_freq_get_on_cpu(int cpu)
* this boils down to identifying an active cpu within the same freq
* domain, if any.
*/
- if (!housekeeping_cpu(cpu, HK_TYPE_TICK) ||
- time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
+ if (!housekeeping_cpu(cpu, HK_TYPE_TICK) || (!idle_cpu(cpu) &&
+ time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS)))) {
struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
int ref_cpu = cpu;
--
2.25.1
On Fri, 5 Apr 2024 at 15:33, Beata Michalska <[email protected]> wrote:
>
> From: Ionela Voinescu <[email protected]>
>
> It's useful to have capacity_freq_ref initialized to 0 for users of
> arch_scale_freq_ref() to detect when capacity_freq_ref was not
> yet set.
>
> The only scenario affected by this change in the init value is when a
> cpufreq driver is never loaded. As a result, the only setter of a
> cpu scale factor remains the call of topology_normalize_cpu_scale()
> from parse_dt_topology(). There we cannot use the value 0 of
> capacity_freq_ref so we have to compensate for its uninitialized state.
>
> Signed-off-by: Ionela Voinescu <[email protected]>
> Signed-off-by: Beata Michalska <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
> ---
> drivers/base/arch_topology.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 024b78a0cfc1..7d4c92cd2bad 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -27,7 +27,7 @@
> static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
> static struct cpumask scale_freq_counters_mask;
> static bool scale_freq_invariant;
> -DEFINE_PER_CPU(unsigned long, capacity_freq_ref) = 1;
> +DEFINE_PER_CPU(unsigned long, capacity_freq_ref) = 0;
> EXPORT_PER_CPU_SYMBOL_GPL(capacity_freq_ref);
>
> static bool supports_scale_freq_counters(const struct cpumask *cpus)
> @@ -292,13 +292,15 @@ void topology_normalize_cpu_scale(void)
>
> capacity_scale = 1;
> for_each_possible_cpu(cpu) {
> - capacity = raw_capacity[cpu] * per_cpu(capacity_freq_ref, cpu);
> + capacity = raw_capacity[cpu] *
> + (per_cpu(capacity_freq_ref, cpu) ?: 1);
> capacity_scale = max(capacity, capacity_scale);
> }
>
> pr_debug("cpu_capacity: capacity_scale=%llu\n", capacity_scale);
> for_each_possible_cpu(cpu) {
> - capacity = raw_capacity[cpu] * per_cpu(capacity_freq_ref, cpu);
> + capacity = raw_capacity[cpu] *
> + (per_cpu(capacity_freq_ref, cpu) ?: 1);
> capacity = div64_u64(capacity << SCHED_CAPACITY_SHIFT,
> capacity_scale);
> topology_set_cpu_scale(cpu, capacity);
> --
> 2.25.1
>
Hi Beata,
This patch is giving abort due to 'amu_fie_cpus' mask which gets
allocated later in 'init_amu_fie()'.
] smp: Bringing up secondary CPUs ...
] Unable to handle kernel read from unreadable memory at virtual
address 0000000000000000
.......
] Call trace:
] arch_cpu_idle_enter+0x30/0xe0
] do_idle+0xb8/0x2e0
] cpu_startup_entry+0x3c/0x50
] rest_init+0x108/0x128
] start_kernel+0x7a4/0xa50
] __primary_switched+0x80/0x90
] Code: d53cd042 b8626800 f943c821 53067c02 (f8627821)
] ---[ end trace 0000000000000000 ]---
] Kernel panic - not syncing: Oops: Fatal exception
Added cpumask_available() check before access to fix.
+++ b/arch/arm64/kernel/topology.c
@@ -211,9 +211,13 @@ void arch_cpu_idle_enter(void)
{
unsigned int cpu = smp_processor_id();
- if (!cpumask_test_cpu(cpu, amu_fie_cpus))
+ if (cpumask_available(amu_fie_cpus) &&
+ !cpumask_test_cpu(cpu, amu_fie_cpus))
return;
Thank you,
Sumit Gupta
On 05/04/24 19:03, Beata Michalska wrote:
> External email: Use caution opening links or attachments
>
>
> Now that the frequency scale factor has been activated for retrieving
> current frequency on a given CPU, trigger its update upon entering
> idle. This will, to an extent, allow querying last known frequency
> in a non-invasive way. It will also improve the frequency scale factor
> accuracy when a CPU entering idle did not receive a tick for a while.
> As a consequence, for idle cores, the reported frequency will be the
> last one observed before entering the idle state.
>
> Suggested-by: Vanshidhar Konda <[email protected]>
> Signed-off-by: Beata Michalska <[email protected]>
> ---
> arch/arm64/kernel/topology.c | 17 +++++++++++++++--
> 1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index b03fe8617721..f204f6489f98 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -207,6 +207,19 @@ static struct scale_freq_data amu_sfd = {
> .set_freq_scale = amu_scale_freq_tick,
> };
>
> +void arch_cpu_idle_enter(void)
> +{
> + unsigned int cpu = smp_processor_id();
> +
> + if (!cpumask_test_cpu(cpu, amu_fie_cpus))
> + return;
> +
> + /* Kick in AMU update but only if one has not happened already */
> + if (housekeeping_cpu(cpu, HK_TYPE_TICK) &&
> + time_is_before_jiffies(per_cpu(cpu_amu_samples.last_update, cpu)))
> + amu_scale_freq_tick();
> +}
> +
> #define AMU_SAMPLE_EXP_MS 20
>
> unsigned int arch_freq_get_on_cpu(int cpu)
> @@ -232,8 +245,8 @@ unsigned int arch_freq_get_on_cpu(int cpu)
> * this boils down to identifying an active cpu within the same freq
> * domain, if any.
> */
> - if (!housekeeping_cpu(cpu, HK_TYPE_TICK) ||
> - time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
> + if (!housekeeping_cpu(cpu, HK_TYPE_TICK) || (!idle_cpu(cpu) &&
> + time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS)))) {
> struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> int ref_cpu = cpu;
>
> --
> 2.25.1
>
On Thu, Apr 11, 2024 at 12:27:00AM +0530, Sumit Gupta wrote:
Hi Sumit,
> Hi Beata,
>
> This patch is giving abort due to 'amu_fie_cpus' mask which gets allocated
> later in 'init_amu_fie()'.
>
> ] smp: Bringing up secondary CPUs ...
> ] Unable to handle kernel read from unreadable memory at virtual
> address 0000000000000000
> .......
> ] Call trace:
> ] arch_cpu_idle_enter+0x30/0xe0
> ] do_idle+0xb8/0x2e0
> ] cpu_startup_entry+0x3c/0x50
> ] rest_init+0x108/0x128
> ] start_kernel+0x7a4/0xa50
> ] __primary_switched+0x80/0x90
> ] Code: d53cd042 b8626800 f943c821 53067c02 (f8627821)
> ] ---[ end trace 0000000000000000 ]---
> ] Kernel panic - not syncing: Oops: Fatal exception
>
> Added cpumask_available() check before access to fix.
>
> +++ b/arch/arm64/kernel/topology.c
> @@ -211,9 +211,13 @@ void arch_cpu_idle_enter(void)
> {
> unsigned int cpu = smp_processor_id();
>
> - if (!cpumask_test_cpu(cpu, amu_fie_cpus))
> + if (cpumask_available(amu_fie_cpus) &&
> + !cpumask_test_cpu(cpu, amu_fie_cpus))
> return;
>
That's actually a gruesome mistake on my side. Thanks for catching that one.
On that note: arch_freq_get_on_cpu should be also protected case amu fie init
fails for some reason. Will be sending an update.
Thanks again.
---
BR
Beata
> Thank you,
> Sumit Gupta
>
> On 05/04/24 19:03, Beata Michalska wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > Now that the frequency scale factor has been activated for retrieving
> > current frequency on a given CPU, trigger its update upon entering
> > idle. This will, to an extent, allow querying last known frequency
> > in a non-invasive way. It will also improve the frequency scale factor
> > accuracy when a CPU entering idle did not receive a tick for a while.
> > As a consequence, for idle cores, the reported frequency will be the
> > last one observed before entering the idle state.
> >
> > Suggested-by: Vanshidhar Konda <[email protected]>
> > Signed-off-by: Beata Michalska <[email protected]>
> > ---
> > arch/arm64/kernel/topology.c | 17 +++++++++++++++--
> > 1 file changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> > index b03fe8617721..f204f6489f98 100644
> > --- a/arch/arm64/kernel/topology.c
> > +++ b/arch/arm64/kernel/topology.c
> > @@ -207,6 +207,19 @@ static struct scale_freq_data amu_sfd = {
> > .set_freq_scale = amu_scale_freq_tick,
> > };
> >
> > +void arch_cpu_idle_enter(void)
> > +{
> > + unsigned int cpu = smp_processor_id();
> > +
> > + if (!cpumask_test_cpu(cpu, amu_fie_cpus))
> > + return;
> > +
> > + /* Kick in AMU update but only if one has not happened already */
> > + if (housekeeping_cpu(cpu, HK_TYPE_TICK) &&
> > + time_is_before_jiffies(per_cpu(cpu_amu_samples.last_update, cpu)))
> > + amu_scale_freq_tick();
> > +}
> > +
> > #define AMU_SAMPLE_EXP_MS 20
> >
> > unsigned int arch_freq_get_on_cpu(int cpu)
> > @@ -232,8 +245,8 @@ unsigned int arch_freq_get_on_cpu(int cpu)
> > * this boils down to identifying an active cpu within the same freq
> > * domain, if any.
> > */
> > - if (!housekeeping_cpu(cpu, HK_TYPE_TICK) ||
> > - time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
> > + if (!housekeeping_cpu(cpu, HK_TYPE_TICK) || (!idle_cpu(cpu) &&
> > + time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS)))) {
> > struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> > int ref_cpu = cpu;
> >
> > --
> > 2.25.1
> >
On Fri, Apr 05, 2024 at 02:33:19PM +0100, Beata Michalska wrote:
>Some architectures provide a way to determine an average frequency over
>a certain period of time based on available performance monitors (AMU on
>ARM or APERF/MPERf on x86). With those at hand, enroll arch_freq_get_on_cpu
>into cpuinfo_cur_freq policy sysfs attribute handler, which is expected to
>represent the current frequency of a given CPU, as obtained by the hardware.
>This is the type of feedback that counters do provide.
>
--- snip ---
While testing this patch series on AmpereOne system, I simulated CPU
frequency throttling when the system is under power or thermal
constraints.
In this scenario, based on the user guilde, I expect scaling_cur_freq
is the frequency the kernel requests from the hardware; cpuinfo_cur_freq
is the actual frequency that the hardware is able to run at during the
power or thermal constraints.
The AmpereOne system I'm testing on has the following configuration:
- Max frequency is 3000000
- Support for AMU registers
- ACPI CPPC feedback counters use PCC register space
- Fedora 39 with 6.7.5 kernel
- Fedora 39 with 6.9.0-rc3 + this patch series
With 6.7.5 kernel:
Core scaling_cur_freq cpuinfo_cur_freq
---- ---------------- ----------------
0 3000000 2593000
1 3000000 2613000
2 3000000 2625000
3 3000000 2632000
With 6.9.0-rc3 + this patch series:
Core scaling_cur_freq cpuinfo_cur_freq
---- ---------------- ----------------
0 2671875 2671875
1 2589632 2589632
2 2648437 2648437
3 2698242 2698242
In the second case we can't identify that the CPU frequency is
being throttled by the hardware. I noticed this behavior with
or without this patch.
Thanks,
Vanshi
On Mon, Apr 15, 2024 at 09:23:10PM -0700, Vanshidhar Konda wrote:
> On Fri, Apr 05, 2024 at 02:33:19PM +0100, Beata Michalska wrote:
> > Some architectures provide a way to determine an average frequency over
> > a certain period of time based on available performance monitors (AMU on
> > ARM or APERF/MPERf on x86). With those at hand, enroll arch_freq_get_on_cpu
> > into cpuinfo_cur_freq policy sysfs attribute handler, which is expected to
> > represent the current frequency of a given CPU, as obtained by the hardware.
> > This is the type of feedback that counters do provide.
> >
>
> --- snip ---
>
> While testing this patch series on AmpereOne system, I simulated CPU
> frequency throttling when the system is under power or thermal
> constraints.
>
> In this scenario, based on the user guilde, I expect scaling_cur_freq
> is the frequency the kernel requests from the hardware; cpuinfo_cur_freq
> is the actual frequency that the hardware is able to run at during the
> power or thermal constraints.
There has been a discussion on scaling_cur_freq vs cpuinfo_cur_freq [1].
The guidelines you are referring here (assuming you mean [2]) are kinda
out-of-sync already as scaling_cur_freq has been wired earlier to use arch
specific feedback. As there was no Arm dedicated implementation of
arch_freq_get_on_cpu, this went kinda unnoticed.
The conclusion of the above mentioned discussion (though rather unstated
explicitly) was to keep the current behaviour of scaling_cur_freq and align
both across different archs: so with the patches, both attributes will provide
hw feedback on current frequency, when available.
Note that if we are to adhere to the docs cpuinfo_cur_freq is the place to use
the counters really.
That change was also requested through [3]
Adding @Viresh in case there was any shift in the tides ....
>
> The AmpereOne system I'm testing on has the following configuration:
> - Max frequency is 3000000
> - Support for AMU registers
> - ACPI CPPC feedback counters use PCC register space
> - Fedora 39 with 6.7.5 kernel
> - Fedora 39 with 6.9.0-rc3 + this patch series
>
> With 6.7.5 kernel:
> Core scaling_cur_freq cpuinfo_cur_freq
> ---- ---------------- ----------------
> 0 3000000 2593000
> 1 3000000 2613000
> 2 3000000 2625000
> 3 3000000 2632000
>
So if I got it right from the info you have provided the numbers above are
obtained without applying the patches. In that case, scaling_cur_freq will
use policy->cur (in your case) showing last frequency set, not necessarily
the actual freq, whereas cpuinfo_cur_freq uses __cpufreq_get and AMU counters.
> With 6.9.0-rc3 + this patch series:
> Core scaling_cur_freq cpuinfo_cur_freq
> ---- ---------------- ----------------
> 0 2671875 2671875
> 1 2589632 2589632
> 2 2648437 2648437
> 3 2698242 2698242
>
With the patches applied both scaling_cur_freq and cpuinfo_cur_freq will use AMU
counters, or fie scale factor obtained based on AMU counters to be more precise:
both should now show similar/same frequency (as discussed in [1])
I'd say due to existing implementation for scaling_cur_freq (which we cannot
change at this point) this is unavoidable.
> In the second case we can't identify that the CPU frequency is
> being throttled by the hardware. I noticed this behavior with
> or without this patch.
>
I am not entirely sure comparing the two should be a way to go about throttling
(whether w/ or w/o the changes).
It would probably be best to refer to thermal sysfs and get a hold of cur_state
which should indicate current throttle state:
/sys/class/thermal/thermal_zone[0-*]/cdev[0-*]/cur_state
with values above '0' implying ongoing throttling.
The appropriate thermal_zone can be identified through 'type' attribute.
Thank you for giving those patches a spin.
---
BR
Beata
---
[1] https://lore.kernel.org/all/20230609043922.eyyqutbwlofqaddz@vireshk-i7/
[2] https://elixir.bootlin.com/linux/latest/source/Documentation/admin-guide/pm/cpufreq.rst#L197
[3] https://lore.kernel.org/lkml/[email protected]/
---
> Thanks,
> Vanshi
On Tue, Apr 16, 2024 at 05:46:18PM +0200, Beata Michalska wrote:
>On Mon, Apr 15, 2024 at 09:23:10PM -0700, Vanshidhar Konda wrote:
>> On Fri, Apr 05, 2024 at 02:33:19PM +0100, Beata Michalska wrote:
>> > Some architectures provide a way to determine an average frequency over
>> > a certain period of time based on available performance monitors (AMU on
>> > ARM or APERF/MPERf on x86). With those at hand, enroll arch_freq_get_on_cpu
>> > into cpuinfo_cur_freq policy sysfs attribute handler, which is expected to
>> > represent the current frequency of a given CPU, as obtained by the hardware.
>> > This is the type of feedback that counters do provide.
>> >
>>
>> --- snip ---
>>
>> While testing this patch series on AmpereOne system, I simulated CPU
>> frequency throttling when the system is under power or thermal
>> constraints.
>>
>> In this scenario, based on the user guilde, I expect scaling_cur_freq
>> is the frequency the kernel requests from the hardware; cpuinfo_cur_freq
>> is the actual frequency that the hardware is able to run at during the
>> power or thermal constraints.
>There has been a discussion on scaling_cur_freq vs cpuinfo_cur_freq [1].
>The guidelines you are referring here (assuming you mean [2]) are kinda
>out-of-sync already as scaling_cur_freq has been wired earlier to use arch
>specific feedback. As there was no Arm dedicated implementation of
>arch_freq_get_on_cpu, this went kinda unnoticed.
>The conclusion of the above mentioned discussion (though rather unstated
>explicitly) was to keep the current behaviour of scaling_cur_freq and align
>both across different archs: so with the patches, both attributes will provide
>hw feedback on current frequency, when available.
>Note that if we are to adhere to the docs cpuinfo_cur_freq is the place to use
>the counters really.
>
>That change was also requested through [3]
>
>Adding @Viresh in case there was any shift in the tides ....
>
Thank you for the pointer to the discussion in [1]. It makes sense to
bring arm64 behavior in line with x86. The question about whether
modifying the behavior of scaling_cur_freq was a good idea did not get
any response.
>>
>> The AmpereOne system I'm testing on has the following configuration:
>> - Max frequency is 3000000
>> - Support for AMU registers
>> - ACPI CPPC feedback counters use PCC register space
>> - Fedora 39 with 6.7.5 kernel
>> - Fedora 39 with 6.9.0-rc3 + this patch series
>>
>> With 6.7.5 kernel:
>> Core scaling_cur_freq cpuinfo_cur_freq
>> ---- ---------------- ----------------
>> 0 3000000 2593000
>> 1 3000000 2613000
>> 2 3000000 2625000
>> 3 3000000 2632000
>>
>So if I got it right from the info you have provided the numbers above are
>obtained without applying the patches. In that case, scaling_cur_freq will
>use policy->cur (in your case) showing last frequency set, not necessarily
>the actual freq, whereas cpuinfo_cur_freq uses __cpufreq_get and AMU counters.
>
>
>> With 6.9.0-rc3 + this patch series:
>> Core scaling_cur_freq cpuinfo_cur_freq
>> ---- ---------------- ----------------
>> 0 2671875 2671875
>> 1 2589632 2589632
>> 2 2648437 2648437
>> 3 2698242 2698242
>>
>With the patches applied both scaling_cur_freq and cpuinfo_cur_freq will use AMU
>counters, or fie scale factor obtained based on AMU counters to be more precise:
>both should now show similar/same frequency (as discussed in [1])
>I'd say due to existing implementation for scaling_cur_freq (which we cannot
>change at this point) this is unavoidable.
>
>> In the second case we can't identify that the CPU frequency is
>> being throttled by the hardware. I noticed this behavior with
>> or without this patch.
>>
>I am not entirely sure comparing the two should be a way to go about throttling
>(whether w/ or w/o the changes).
>It would probably be best to refer to thermal sysfs and get a hold of cur_state
Throttling could happen due to non-thermal reasons. Or a system may not
even support thermal zones. So on those systems we wouldn't be able to
identify/debug the behavior of the hardware providing less than maximum
performance. The discussion around scaling_cur_freq should probably be
re-visited in a targeted manner I think.
I'll test v5 of the series on AmpereOne and report back on that thread.
Thanks,
Vanshi
>which should indicate current throttle state:
>
> /sys/class/thermal/thermal_zone[0-*]/cdev[0-*]/cur_state
>
>with values above '0' implying ongoing throttling.
>
>The appropriate thermal_zone can be identified through 'type' attribute.
>
>
>Thank you for giving those patches a spin.
>
>---
>BR
>Beata
>---
>[1] https://lore.kernel.org/all/20230609043922.eyyqutbwlofqaddz@vireshk-i7/
>[2] https://elixir.bootlin.com/linux/latest/source/Documentation/admin-guide/pm/cpufreq.rst#L197
>[3] https://lore.kernel.org/lkml/[email protected]/
>---
>
>
>> Thanks,
>> Vanshi
On Wed, Apr 17, 2024 at 02:38:58PM -0700, Vanshidhar Konda wrote:
> On Tue, Apr 16, 2024 at 05:46:18PM +0200, Beata Michalska wrote:
> > On Mon, Apr 15, 2024 at 09:23:10PM -0700, Vanshidhar Konda wrote:
> > > On Fri, Apr 05, 2024 at 02:33:19PM +0100, Beata Michalska wrote:
> > > > Some architectures provide a way to determine an average frequency over
> > > > a certain period of time based on available performance monitors (AMU on
> > > > ARM or APERF/MPERf on x86). With those at hand, enroll arch_freq_get_on_cpu
> > > > into cpuinfo_cur_freq policy sysfs attribute handler, which is expected to
> > > > represent the current frequency of a given CPU, as obtained by the hardware.
> > > > This is the type of feedback that counters do provide.
> > > >
> > >
> > > --- snip ---
> > >
> > > While testing this patch series on AmpereOne system, I simulated CPU
> > > frequency throttling when the system is under power or thermal
> > > constraints.
> > >
> > > In this scenario, based on the user guilde, I expect scaling_cur_freq
> > > is the frequency the kernel requests from the hardware; cpuinfo_cur_freq
> > > is the actual frequency that the hardware is able to run at during the
> > > power or thermal constraints.
> > There has been a discussion on scaling_cur_freq vs cpuinfo_cur_freq [1].
> > The guidelines you are referring here (assuming you mean [2]) are kinda
> > out-of-sync already as scaling_cur_freq has been wired earlier to use arch
> > specific feedback. As there was no Arm dedicated implementation of
> > arch_freq_get_on_cpu, this went kinda unnoticed.
> > The conclusion of the above mentioned discussion (though rather unstated
> > explicitly) was to keep the current behaviour of scaling_cur_freq and align
> > both across different archs: so with the patches, both attributes will provide
> > hw feedback on current frequency, when available.
> > Note that if we are to adhere to the docs cpuinfo_cur_freq is the place to use
> > the counters really.
> >
> > That change was also requested through [3]
> >
> > Adding @Viresh in case there was any shift in the tides ....
> >
>
> Thank you for the pointer to the discussion in [1]. It makes sense to
> bring arm64 behavior in line with x86. The question about whether
> modifying the behavior of scaling_cur_freq was a good idea did not get
> any response.
>
> > >
> > > The AmpereOne system I'm testing on has the following configuration:
> > > - Max frequency is 3000000
> > > - Support for AMU registers
> > > - ACPI CPPC feedback counters use PCC register space
> > > - Fedora 39 with 6.7.5 kernel
> > > - Fedora 39 with 6.9.0-rc3 + this patch series
> > >
> > > With 6.7.5 kernel:
> > > Core scaling_cur_freq cpuinfo_cur_freq
> > > ---- ---------------- ----------------
> > > 0 3000000 2593000
> > > 1 3000000 2613000
> > > 2 3000000 2625000
> > > 3 3000000 2632000
> > >
> > So if I got it right from the info you have provided the numbers above are
> > obtained without applying the patches. In that case, scaling_cur_freq will
> > use policy->cur (in your case) showing last frequency set, not necessarily
> > the actual freq, whereas cpuinfo_cur_freq uses __cpufreq_get and AMU counters.
> >
> >
> > > With 6.9.0-rc3 + this patch series:
> > > Core scaling_cur_freq cpuinfo_cur_freq
> > > ---- ---------------- ----------------
> > > 0 2671875 2671875
> > > 1 2589632 2589632
> > > 2 2648437 2648437
> > > 3 2698242 2698242
> > >
> > With the patches applied both scaling_cur_freq and cpuinfo_cur_freq will use AMU
> > counters, or fie scale factor obtained based on AMU counters to be more precise:
> > both should now show similar/same frequency (as discussed in [1])
> > I'd say due to existing implementation for scaling_cur_freq (which we cannot
> > change at this point) this is unavoidable.
> >
> > > In the second case we can't identify that the CPU frequency is
> > > being throttled by the hardware. I noticed this behavior with
> > > or without this patch.
> > >
> > I am not entirely sure comparing the two should be a way to go about throttling
> > (whether w/ or w/o the changes).
> > It would probably be best to refer to thermal sysfs and get a hold of cur_state
>
> Throttling could happen due to non-thermal reasons. Or a system may not
> even support thermal zones. So on those systems we wouldn't be able to
> identify/debug the behavior of the hardware providing less than maximum
> performance. The discussion around scaling_cur_freq should probably be
> re-visited in a targeted manner I think.
>
@Viresh:
It seems that we might need to revisit the discussion we've had around
scaling_cur_freq and cpuinfo_cur_freq and the use of arch_freq_get_on_cpu.
As Vanshi has raised, having both utilizing arch specific feedback for
getting current frequency is bit problematic and might be confusing at best.
As arch_freq_get_on_cpu is already used by show_scaling_cur_freq there are not
many options we are left with, if we want to kee all archs aligned:
we can either try to rework show_scaling_cur_freq and it's use of
arch_freq_get_on_cpu, and move it to cpuinfo_cur_freq, which would align with
relevant docs, though that will not work for x86, or we keep it only there and
skip updating cpuinfo_cur_freq, going against the guidelines. Other options,
purely theoretical, would involve making arch_freq_get_on_cpu aware of type of
the info requested (hw vs sw) or adding yet another arch-specific implementation,
and those are not really appealing alternatives to say at least.
What's your opinion on this one ?
---
BR
Beata
> I'll test v5 of the series on AmpereOne and report back on that thread.
>
> Thanks,
> Vanshi
>
> > which should indicate current throttle state:
> >
> > /sys/class/thermal/thermal_zone[0-*]/cdev[0-*]/cur_state
> >
> > with values above '0' implying ongoing throttling.
> >
> > The appropriate thermal_zone can be identified through 'type' attribute.
> >
> >
> > Thank you for giving those patches a spin.
> >
> > ---
> > BR
> > Beata
> > ---
> > [1] https://lore.kernel.org/all/20230609043922.eyyqutbwlofqaddz@vireshk-i7/
> > [2] https://elixir.bootlin.com/linux/latest/source/Documentation/admin-guide/pm/cpufreq.rst#L197
> > [3] https://lore.kernel.org/lkml/[email protected]/
> > ---
> >
> >
> > > Thanks,
> > > Vanshi
On 26-04-24, 12:45, Beata Michalska wrote:
> It seems that we might need to revisit the discussion we've had around
> scaling_cur_freq and cpuinfo_cur_freq and the use of arch_freq_get_on_cpu.
> As Vanshi has raised, having both utilizing arch specific feedback for
> getting current frequency is bit problematic and might be confusing at best.
> As arch_freq_get_on_cpu is already used by show_scaling_cur_freq there are not
> many options we are left with, if we want to kee all archs aligned:
> we can either try to rework show_scaling_cur_freq and it's use of
> arch_freq_get_on_cpu, and move it to cpuinfo_cur_freq, which would align with
> relevant docs, though that will not work for x86, or we keep it only there and
> skip updating cpuinfo_cur_freq, going against the guidelines. Other options,
> purely theoretical, would involve making arch_freq_get_on_cpu aware of type of
> the info requested (hw vs sw) or adding yet another arch-specific implementation,
> and those are not really appealing alternatives to say at least.
> What's your opinion on this one ?
Hi Beata / Vanshidhar,
Lets forget for once what X86 and ARM may have done and think about it
once again. I also had a chat with Vincent today about this.
The documentation says it clearly, cpuinfo_cur_freq is the one
received from hardware and scaling_cur_freq is the one requested from
software.
Now, I know that X86 has made both of them quite similar and I
suggested to make them all aligned (and never received a reply on my
previous message).
There are few reasons why it may be worth keeping the definition (and
behavior) of the sysfs files as is, at least for ARM:
- First is that the documentation says so.
- There is no point providing the same information via both the
interfaces, there are two interfaces here for a reason.
- There maybe tools around which depend on the documented behavior.
- From userspace, currently there is only one way to know the exact
frequency that the cpufreq governors have requested from a platform,
i.e. the value from scaling_cur_freq. If we make it similar to
cpuinfo_cur_freq, then userspace will never know about the requested
frequency and the eventual one and if they are same or different.
Lets keep the behavior as is and update only cpuinfo_cur_freq with
arch_freq_get_on_cpu().
Makes sense ?
--
viresh
On Mon, Apr 29, 2024 at 02:55:15PM +0530, Viresh Kumar wrote:
>On 26-04-24, 12:45, Beata Michalska wrote:
>> It seems that we might need to revisit the discussion we've had around
>> scaling_cur_freq and cpuinfo_cur_freq and the use of arch_freq_get_on_cpu.
>> As Vanshi has raised, having both utilizing arch specific feedback for
>> getting current frequency is bit problematic and might be confusing at best.
>> As arch_freq_get_on_cpu is already used by show_scaling_cur_freq there are not
>> many options we are left with, if we want to kee all archs aligned:
>> we can either try to rework show_scaling_cur_freq and it's use of
>> arch_freq_get_on_cpu, and move it to cpuinfo_cur_freq, which would align with
>> relevant docs, though that will not work for x86, or we keep it only there and
>> skip updating cpuinfo_cur_freq, going against the guidelines. Other options,
>> purely theoretical, would involve making arch_freq_get_on_cpu aware of type of
>> the info requested (hw vs sw) or adding yet another arch-specific implementation,
>> and those are not really appealing alternatives to say at least.
>> What's your opinion on this one ?
>
>Hi Beata / Vanshidhar,
>
>Lets forget for once what X86 and ARM may have done and think about it
>once again. I also had a chat with Vincent today about this.
>
>The documentation says it clearly, cpuinfo_cur_freq is the one
>received from hardware and scaling_cur_freq is the one requested from
>software.
>
>Now, I know that X86 has made both of them quite similar and I
>suggested to make them all aligned (and never received a reply on my
>previous message).
>
>There are few reasons why it may be worth keeping the definition (and
>behavior) of the sysfs files as is, at least for ARM:
>- First is that the documentation says so.
>- There is no point providing the same information via both the
> interfaces, there are two interfaces here for a reason.
>- There maybe tools around which depend on the documented behavior.
>- From userspace, currently there is only one way to know the exact
> frequency that the cpufreq governors have requested from a platform,
> i.e. the value from scaling_cur_freq. If we make it similar to
> cpuinfo_cur_freq, then userspace will never know about the requested
> frequency and the eventual one and if they are same or different.
>
>Lets keep the behavior as is and update only cpuinfo_cur_freq with
>arch_freq_get_on_cpu().
>
>Makes sense ?
I had the same concerns. It probably makes sense explicity note this in
the next version of the patch series; in the future readers may be
confused by x86 and ARM behave differntly on scaling_cur_freq.
Thanks,
Vanshi
>
>--
>viresh
On Mon, Apr 29, 2024 at 02:55:15PM +0530, Viresh Kumar wrote:
> On 26-04-24, 12:45, Beata Michalska wrote:
> > It seems that we might need to revisit the discussion we've had around
> > scaling_cur_freq and cpuinfo_cur_freq and the use of arch_freq_get_on_cpu.
> > As Vanshi has raised, having both utilizing arch specific feedback for
> > getting current frequency is bit problematic and might be confusing at best.
> > As arch_freq_get_on_cpu is already used by show_scaling_cur_freq there are not
> > many options we are left with, if we want to kee all archs aligned:
> > we can either try to rework show_scaling_cur_freq and it's use of
> > arch_freq_get_on_cpu, and move it to cpuinfo_cur_freq, which would align with
> > relevant docs, though that will not work for x86, or we keep it only there and
> > skip updating cpuinfo_cur_freq, going against the guidelines. Other options,
> > purely theoretical, would involve making arch_freq_get_on_cpu aware of type of
> > the info requested (hw vs sw) or adding yet another arch-specific implementation,
> > and those are not really appealing alternatives to say at least.
> > What's your opinion on this one ?
>
> Hi Beata / Vanshidhar,
>
> Lets forget for once what X86 and ARM may have done and think about it
> once again. I also had a chat with Vincent today about this.
>
> The documentation says it clearly, cpuinfo_cur_freq is the one
> received from hardware and scaling_cur_freq is the one requested from
> software.
>
> Now, I know that X86 has made both of them quite similar and I
> suggested to make them all aligned (and never received a reply on my
> previous message).
>
> There are few reasons why it may be worth keeping the definition (and
> behavior) of the sysfs files as is, at least for ARM:
> - First is that the documentation says so.
> - There is no point providing the same information via both the
> interfaces, there are two interfaces here for a reason.
> - There maybe tools around which depend on the documented behavior.
> - From userspace, currently there is only one way to know the exact
> frequency that the cpufreq governors have requested from a platform,
> i.e. the value from scaling_cur_freq. If we make it similar to
> cpuinfo_cur_freq, then userspace will never know about the requested
> frequency and the eventual one and if they are same or different.
>
> Lets keep the behavior as is and update only cpuinfo_cur_freq with
> arch_freq_get_on_cpu().
>
> Makes sense ?
>
First of all - apologies for late reply.
It all makes sense, though to clarify things up, for my own benefit, and to
avoid any potential confusion ....
Adding arch_freq_get_on_cpu to cpuinfo_cur_freq does seem to be the right
approach - no argue on this one. Doing that though means we need a way to
skip calling arch_freq_get_on_cpu() from show_scaling_cur_freq(), to avoid
having both providing the same information when that should not be the case.
In the initial approach [1], that was handled by checking whether the cpufreq
driver supports 'get' callback (and thus cpuinfo_cur_freq). In case it didn't,
things remained unchanged for scaling_cur_freq. That does not seem to be a viable
option though, as there are at least few drivers, that will support both:
cpuinfo_cur_freq alongside scaling_cur_freq (+ APERF/MPERF) and for those,
we would hit the initial problem of both relying on arch_freq_get_on_cpu.
So I guess we need another way of avoiding calling arch_freq_get_on_cpu
for show_scaling_cur_freq (and most probably avoid calling that one for
cpuinfo_cur_freq). Quick idea on how to not bring arch specificity into
cpufreq generic code would be to introduce a new flag for cpufreq drivers though
that seems a bit stretched. Will ponder a bit about that but in the meantime
suggestions are more than welcomed.
Aside: I will most probably send the changes separately from this series to not
mix things any further.
---
[1] https://lore.kernel.org/all/[email protected]/
---
BR
Beata
> --
> viresh
On Tue, May 07, 2024 at 10:31:52AM +0200, Beata Michalska wrote:
> On Mon, Apr 29, 2024 at 02:55:15PM +0530, Viresh Kumar wrote:
> > On 26-04-24, 12:45, Beata Michalska wrote:
> > > It seems that we might need to revisit the discussion we've had around
> > > scaling_cur_freq and cpuinfo_cur_freq and the use of arch_freq_get_on_cpu.
> > > As Vanshi has raised, having both utilizing arch specific feedback for
> > > getting current frequency is bit problematic and might be confusing at best.
> > > As arch_freq_get_on_cpu is already used by show_scaling_cur_freq there are not
> > > many options we are left with, if we want to kee all archs aligned:
> > > we can either try to rework show_scaling_cur_freq and it's use of
> > > arch_freq_get_on_cpu, and move it to cpuinfo_cur_freq, which would align with
> > > relevant docs, though that will not work for x86, or we keep it only there and
> > > skip updating cpuinfo_cur_freq, going against the guidelines. Other options,
> > > purely theoretical, would involve making arch_freq_get_on_cpu aware of type of
> > > the info requested (hw vs sw) or adding yet another arch-specific implementation,
> > > and those are not really appealing alternatives to say at least.
> > > What's your opinion on this one ?
> >
> > Hi Beata / Vanshidhar,
> >
> > Lets forget for once what X86 and ARM may have done and think about it
> > once again. I also had a chat with Vincent today about this.
> >
> > The documentation says it clearly, cpuinfo_cur_freq is the one
> > received from hardware and scaling_cur_freq is the one requested from
> > software.
> >
> > Now, I know that X86 has made both of them quite similar and I
> > suggested to make them all aligned (and never received a reply on my
> > previous message).
> >
> > There are few reasons why it may be worth keeping the definition (and
> > behavior) of the sysfs files as is, at least for ARM:
> > - First is that the documentation says so.
> > - There is no point providing the same information via both the
> > interfaces, there are two interfaces here for a reason.
> > - There maybe tools around which depend on the documented behavior.
> > - From userspace, currently there is only one way to know the exact
> > frequency that the cpufreq governors have requested from a platform,
> > i.e. the value from scaling_cur_freq. If we make it similar to
> > cpuinfo_cur_freq, then userspace will never know about the requested
> > frequency and the eventual one and if they are same or different.
> >
> > Lets keep the behavior as is and update only cpuinfo_cur_freq with
> > arch_freq_get_on_cpu().
> >
> > Makes sense ?
> >
> First of all - apologies for late reply.
>
> It all makes sense, though to clarify things up, for my own benefit, and to
> avoid any potential confusion ....
>
> Adding arch_freq_get_on_cpu to cpuinfo_cur_freq does seem to be the right
> approach - no argue on this one. Doing that though means we need a way to
> skip calling arch_freq_get_on_cpu() from show_scaling_cur_freq(), to avoid
> having both providing the same information when that should not be the case.
> In the initial approach [1], that was handled by checking whether the cpufreq
> driver supports 'get' callback (and thus cpuinfo_cur_freq). In case it didn't,
> things remained unchanged for scaling_cur_freq. That does not seem to be a viable
> option though, as there are at least few drivers, that will support both:
> cpuinfo_cur_freq alongside scaling_cur_freq (+ APERF/MPERF) and for those,
> we would hit the initial problem of both relying on arch_freq_get_on_cpu.
> So I guess we need another way of avoiding calling arch_freq_get_on_cpu
> for show_scaling_cur_freq (and most probably avoid calling that one for
> cpuinfo_cur_freq). Quick idea on how to not bring arch specificity into
> cpufreq generic code would be to introduce a new flag for cpufreq drivers though
> that seems a bit stretched. Will ponder a bit about that but in the meantime
> suggestions are more than welcomed.
Alternatively we could add a parameter to arch_freq_get_on_cpu specifying type
of feedback required: hw vs sw. Then the arch specific implementation could
decide which to provide when. It will get slightly counter-intuitive, especially
for cases when sw feedback provides hw one, as it is the case for current
arch_freq_get_on_cpu() and scaling_cur_freq but at least the changes would be
minimal and it will contain handling the tricky bits inside arch specific
implementation - hiding those messy bits.
---
BR
Beata
>
> Aside: I will most probably send the changes separately from this series to not
> mix things any further.
>
> ---
> [1] https://lore.kernel.org/all/[email protected]/
> ---
> BR
> Beata
>
>
> > --
> > viresh
On 07-05-24, 12:02, Beata Michalska wrote:
> On Tue, May 07, 2024 at 10:31:52AM +0200, Beata Michalska wrote:
> > On Mon, Apr 29, 2024 at 02:55:15PM +0530, Viresh Kumar wrote:
> > > Lets forget for once what X86 and ARM may have done and think about it
> > > once again. I also had a chat with Vincent today about this.
> > >
> > > The documentation says it clearly, cpuinfo_cur_freq is the one
> > > received from hardware and scaling_cur_freq is the one requested from
> > > software.
> > >
> > > Now, I know that X86 has made both of them quite similar and I
> > > suggested to make them all aligned (and never received a reply on my
> > > previous message).
> > >
> > > There are few reasons why it may be worth keeping the definition (and
> > > behavior) of the sysfs files as is, at least for ARM:
> > > - First is that the documentation says so.
> > > - There is no point providing the same information via both the
> > > interfaces, there are two interfaces here for a reason.
> > > - There maybe tools around which depend on the documented behavior.
> > > - From userspace, currently there is only one way to know the exact
> > > frequency that the cpufreq governors have requested from a platform,
> > > i.e. the value from scaling_cur_freq. If we make it similar to
> > > cpuinfo_cur_freq, then userspace will never know about the requested
> > > frequency and the eventual one and if they are same or different.
> > >
> > > Lets keep the behavior as is and update only cpuinfo_cur_freq with
> > > arch_freq_get_on_cpu().
> > >
> > > Makes sense ?
> > >
> > First of all - apologies for late reply.
> >
> > It all makes sense, though to clarify things up, for my own benefit, and to
> > avoid any potential confusion ....
> >
> > Adding arch_freq_get_on_cpu to cpuinfo_cur_freq does seem to be the right
> > approach - no argue on this one. Doing that though means we need a way to
> > skip calling arch_freq_get_on_cpu() from show_scaling_cur_freq(), to avoid
> > having both providing the same information when that should not be the case.
> > In the initial approach [1], that was handled by checking whether the cpufreq
> > driver supports 'get' callback (and thus cpuinfo_cur_freq). In case it didn't,
> > things remained unchanged for scaling_cur_freq. That does not seem to be a viable
> > option though, as there are at least few drivers, that will support both:
> > cpuinfo_cur_freq alongside scaling_cur_freq (+ APERF/MPERF) and for those,
> > we would hit the initial problem of both relying on arch_freq_get_on_cpu.
> > So I guess we need another way of avoiding calling arch_freq_get_on_cpu
> > for show_scaling_cur_freq (and most probably avoid calling that one for
> > cpuinfo_cur_freq). Quick idea on how to not bring arch specificity into
> > cpufreq generic code would be to introduce a new flag for cpufreq drivers though
> > that seems a bit stretched. Will ponder a bit about that but in the meantime
> > suggestions are more than welcomed.
> Alternatively we could add a parameter to arch_freq_get_on_cpu specifying type
> of feedback required: hw vs sw. Then the arch specific implementation could
> decide which to provide when. It will get slightly counter-intuitive, especially
> for cases when sw feedback provides hw one, as it is the case for current
> arch_freq_get_on_cpu() and scaling_cur_freq but at least the changes would be
> minimal and it will contain handling the tricky bits inside arch specific
> implementation - hiding those messy bits.
I think we should just move the call to arch_freq_get_on_cpu() from
show_scaling_cur_freq() to show_cpuinfo_cur_freq() and post it.
Lets see what X86 guys say to that. You can provide all the reasoning
we discussed here, which makes perfect sense.
--
viresh