2024-03-12 08:35:15

by Beata Michalska

Subject: [PATCH v3 0/3] Add support for AArch64 AMUv1-based arch_freq_get_on_cpu

Introducing an arm64-specific version of arch_freq_get_on_cpu, cashing in
on the existing implementation for FIE and AMUv1 support: the frequency
scale factor, updated on each sched tick, serves as a base for retrieving
the frequency for a given CPU, representing an average frequency
reported between the ticks - thus its accuracy is limited.
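
In short, the reported frequency is obtained by reversing the scale
factor computation (see patch 2/3), roughly:

	freq = (arch_scale_freq_capacity(cpu) * arch_scale_freq_ref(cpu))
			>> SCHED_CAPACITY_SHIFT;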

The changes have been rather lightly (due to some limitations) tested on
an FVP model.

Relevant discussions:
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/7eozim2xnepacnnkzxlbx34hib4otycnbn4dqymfziqou5lw5u@5xzpv3t7sxo3/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/lkml/[email protected]/T/#m4e74cb5a0aaa353c60fedc6cfb95ab7a6e381e3c

v3:
- dropping changes to cpufreq_verify_current_freq
- pulling in changes from Ionela initializing capacity_freq_ref to 0
(thanks for that!) and applying suggestions made by her during last review:
- switching to arch_scale_freq_capacity and arch_scale_freq_ref when
reversing freq scale factor computation
- swapping shift with multiplication
- adding time limit for considering last scale update as valid
- updating frequency scale factor upon entering idle

v2:
- Splitting the patches
- Adding comment for full dyntick mode
- Plugging arch_freq_get_on_cpu into cpufreq_verify_current_freq instead
of in show_cpuinfo_cur_freq to allow the framework to stay more in sync
with potential freq changes

Beata Michalska (2):
arm64: Provide an AMU-based version of arch_freq_get_on_cpu
arm64: Update AMU-based frequency scale factor on entering idle

Ionela Voinescu (1):
arch_topology: init capacity_freq_ref to 0

arch/arm64/kernel/topology.c | 116 +++++++++++++++++++++++++++++++----
drivers/base/arch_topology.c | 8 ++-
2 files changed, 110 insertions(+), 14 deletions(-)

--
2.25.1



2024-03-12 08:35:27

by Beata Michalska

Subject: [PATCH v3 1/3] arch_topology: init capacity_freq_ref to 0

From: Ionela Voinescu <[email protected]>

It's useful to have capacity_freq_ref initialized to 0 for users of
arch_scale_freq_ref() to detect when capacity_freq_ref was not
yet set.

The only scenario affected by this change in the init value is when a
cpufreq driver is never loaded. As a result, the only setter of a
cpu scale factor remains the call of topology_normalize_cpu_scale()
from parse_dt_topology(). There we cannot use the value 0 of
capacity_freq_ref so we have to compensate for its uninitialized state.
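
A user of arch_scale_freq_ref() can then simply bail out when the
reference frequency is not yet known, as the arm64 arch_freq_get_on_cpu()
introduced later in this series does:

	if (!arch_scale_freq_ref(cpu))
		return 0;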

Signed-off-by: Ionela Voinescu <[email protected]>
Signed-off-by: Beata Michalska <[email protected]>
---
drivers/base/arch_topology.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 024b78a0cfc1..7d4c92cd2bad 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -27,7 +27,7 @@
static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
static struct cpumask scale_freq_counters_mask;
static bool scale_freq_invariant;
-DEFINE_PER_CPU(unsigned long, capacity_freq_ref) = 1;
+DEFINE_PER_CPU(unsigned long, capacity_freq_ref) = 0;
EXPORT_PER_CPU_SYMBOL_GPL(capacity_freq_ref);

static bool supports_scale_freq_counters(const struct cpumask *cpus)
@@ -292,13 +292,15 @@ void topology_normalize_cpu_scale(void)

capacity_scale = 1;
for_each_possible_cpu(cpu) {
- capacity = raw_capacity[cpu] * per_cpu(capacity_freq_ref, cpu);
+ capacity = raw_capacity[cpu] *
+ (per_cpu(capacity_freq_ref, cpu) ?: 1);
capacity_scale = max(capacity, capacity_scale);
}

pr_debug("cpu_capacity: capacity_scale=%llu\n", capacity_scale);
for_each_possible_cpu(cpu) {
- capacity = raw_capacity[cpu] * per_cpu(capacity_freq_ref, cpu);
+ capacity = raw_capacity[cpu] *
+ (per_cpu(capacity_freq_ref, cpu) ?: 1);
capacity = div64_u64(capacity << SCHED_CAPACITY_SHIFT,
capacity_scale);
topology_set_cpu_scale(cpu, capacity);
--
2.25.1


2024-03-12 08:35:51

by Beata Michalska

Subject: [PATCH v3 3/3] arm64: Update AMU-based frequency scale factor on entering idle

Now that the frequency scale factor has been put to use for retrieving
the current frequency on a given CPU, trigger its update upon entering
idle. This will, to an extent, allow querying the last known frequency
in a non-invasive way. It will also improve the accuracy of the
frequency scale factor when a CPU entering idle has not received a tick
for a while.

Suggested-by: Vanshidhar Konda <[email protected]>
Signed-off-by: Beata Michalska <[email protected]>
---
arch/arm64/kernel/topology.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 42cb19c31719..77c6ac577cef 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -215,6 +215,18 @@ static struct scale_freq_data amu_sfd = {
.set_freq_scale = amu_scale_freq_tick,
};

+void arch_cpu_idle_enter(void)
+{
+ unsigned int cpu = smp_processor_id();
+
+ if (!cpumask_test_cpu(cpu, amu_fie_cpus))
+ return;
+
+ /* Kick in AMU update but only if one has not happened already */
+ if (time_is_before_jiffies(per_cpu(cpu_amu_samples.last_update, cpu)))
+ amu_scale_freq_tick();
+}
+
#define AMU_SAMPLE_EXP_MS 20

unsigned int arch_freq_get_on_cpu(int cpu)
@@ -242,7 +254,8 @@ unsigned int arch_freq_get_on_cpu(int cpu)
* try an alternative source for the counters (and thus freq scale),
* if available for given policy
*/
- if (time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
+ if (!idle_cpu(cpu) &&
+ time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
int ref_cpu = nr_cpu_ids;

--
2.25.1


2024-03-12 08:36:01

by Beata Michalska

Subject: [PATCH v3 2/3] arm64: Provide an AMU-based version of arch_freq_get_on_cpu

With the Frequency Invariance Engine (FIE) being already wired up with
sched tick and making use of relevant (core counter and constant
counter) AMU counters, getting the current frequency for a given CPU
on supported platforms can be achieved by utilizing the frequency scale
factor which reflects an average CPU frequency for the last tick period
length.

The solution is partially based on APERF/MPERF implementation of
arch_freq_get_on_cpu.
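
For reference, the scale factor is updated on each tick from the AMU
counter deltas, roughly as follows (simplified from
amu_scale_freq_tick() below):

	scale = core_cnt - prev_core_cnt;
	scale *= this_cpu_read(arch_max_freq_scale);
	scale = div64_u64(scale >> SCHED_CAPACITY_SHIFT,
			  const_cnt - prev_const_cnt);
	scale = min_t(unsigned long, scale, SCHED_CAPACITY_SCALE);

and the new arch_freq_get_on_cpu() simply reverses that against
arch_scale_freq_ref() to get back an average frequency.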

Suggested-by: Ionela Voinescu <[email protected]>
Signed-off-by: Beata Michalska <[email protected]>
---
arch/arm64/kernel/topology.c | 103 +++++++++++++++++++++++++++++++----
1 file changed, 92 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 1a2c72f3e7f8..42cb19c31719 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -17,6 +17,8 @@
#include <linux/cpufreq.h>
#include <linux/init.h>
#include <linux/percpu.h>
+#include <linux/sched/isolation.h>
+#include <linux/seqlock_types.h>

#include <asm/cpu.h>
#include <asm/cputype.h>
@@ -88,18 +90,31 @@ int __init parse_acpi_topology(void)
* initialized.
*/
static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale) = 1UL << (2 * SCHED_CAPACITY_SHIFT);
-static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
-static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
static cpumask_var_t amu_fie_cpus;

+struct amu_cntr_sample {
+ u64 arch_const_cycles_prev;
+ u64 arch_core_cycles_prev;
+ unsigned long last_update;
+ seqcount_t seq;
+};
+
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct amu_cntr_sample, cpu_amu_samples) = {
+ .seq = SEQCNT_ZERO(cpu_amu_samples.seq)
+};
+
void update_freq_counters_refs(void)
{
- this_cpu_write(arch_core_cycles_prev, read_corecnt());
- this_cpu_write(arch_const_cycles_prev, read_constcnt());
+ struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
+
+ amu_sample->arch_core_cycles_prev = read_corecnt();
+ amu_sample->arch_const_cycles_prev = read_constcnt();
}

static inline bool freq_counters_valid(int cpu)
{
+ struct amu_cntr_sample *amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
+
if ((cpu >= nr_cpu_ids) || !cpumask_test_cpu(cpu, cpu_present_mask))
return false;

@@ -108,8 +123,8 @@ static inline bool freq_counters_valid(int cpu)
return false;
}

- if (unlikely(!per_cpu(arch_const_cycles_prev, cpu) ||
- !per_cpu(arch_core_cycles_prev, cpu))) {
+ if (unlikely(!amu_sample->arch_const_cycles_prev ||
+ !amu_sample->arch_core_cycles_prev)) {
pr_debug("CPU%d: cycle counters are not enabled.\n", cpu);
return false;
}
@@ -152,20 +167,27 @@ void freq_inv_set_max_ratio(int cpu, u64 max_rate)

static void amu_scale_freq_tick(void)
{
+ struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
u64 prev_core_cnt, prev_const_cnt;
u64 core_cnt, const_cnt, scale;

- prev_const_cnt = this_cpu_read(arch_const_cycles_prev);
- prev_core_cnt = this_cpu_read(arch_core_cycles_prev);
+ prev_const_cnt = amu_sample->arch_const_cycles_prev;
+ prev_core_cnt = amu_sample->arch_core_cycles_prev;
+
+ write_seqcount_begin(&amu_sample->seq);

update_freq_counters_refs();

- const_cnt = this_cpu_read(arch_const_cycles_prev);
- core_cnt = this_cpu_read(arch_core_cycles_prev);
+ const_cnt = amu_sample->arch_const_cycles_prev;
+ core_cnt = amu_sample->arch_core_cycles_prev;

+ /*
+ * This should not happen unless the AMUs have been reset and the
+ * counter values have not been resroted - unlikely
+ */
if (unlikely(core_cnt <= prev_core_cnt ||
const_cnt <= prev_const_cnt))
- return;
+ goto leave;

/*
* /\core arch_max_freq_scale
@@ -182,6 +204,10 @@ static void amu_scale_freq_tick(void)

scale = min_t(unsigned long, scale, SCHED_CAPACITY_SCALE);
this_cpu_write(arch_freq_scale, (unsigned long)scale);
+
+ amu_sample->last_update = jiffies;
+leave:
+ write_seqcount_end(&amu_sample->seq);
}

static struct scale_freq_data amu_sfd = {
@@ -189,6 +215,61 @@ static struct scale_freq_data amu_sfd = {
.set_freq_scale = amu_scale_freq_tick,
};

+#define AMU_SAMPLE_EXP_MS 20
+
+unsigned int arch_freq_get_on_cpu(int cpu)
+{
+ struct amu_cntr_sample *amu_sample;
+ unsigned long last_update;
+ unsigned int seq;
+ unsigned int freq;
+ u64 scale;
+
+ if (!cpumask_test_cpu(cpu, amu_fie_cpus) || !arch_scale_freq_ref(cpu))
+ return 0;
+
+retry:
+ amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
+
+ do {
+ seq = raw_read_seqcount_begin(&amu_sample->seq);
+ last_update = amu_sample->last_update;
+ } while (read_seqcount_retry(&amu_sample->seq, seq));
+
+ /*
+ * For those CPUs that are in full dynticks mode,
+ * and those that have not seen tick for a while
+ * try an alternative source for the counters (and thus freq scale),
+ * if available for given policy
+ */
+ if (time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
+ struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
+ int ref_cpu = nr_cpu_ids;
+
+ if (cpumask_intersects(housekeeping_cpumask(HK_TYPE_TICK),
+ policy->cpus))
+ ref_cpu = cpumask_nth_and(cpu, policy->cpus,
+ housekeeping_cpumask(HK_TYPE_TICK));
+
+ cpufreq_cpu_put(policy);
+ if (ref_cpu >= nr_cpu_ids || ref_cpu == cpu)
+ /* No alternative to pull info from */
+ return 0;
+ cpu = ref_cpu;
+ goto retry;
+ }
+ /*
+ * Reversed computation to the one used to determine
+ * the arch_freq_scale value
+ * (see amu_scale_freq_tick for details)
+ */
+ scale = arch_scale_freq_capacity(cpu);
+ freq = scale * arch_scale_freq_ref(cpu);
+ freq >>= SCHED_CAPACITY_SHIFT;
+
+ return freq;
+}
+
static void amu_fie_setup(const struct cpumask *cpus)
{
int cpu;
--
2.25.1


2024-03-13 02:13:01

by Vanshidhar Konda

Subject: Re: [PATCH v3 2/3] arm64: Provide an AMU-based version of arch_freq_get_on_cpu

On Tue, Mar 12, 2024 at 08:34:30AM +0000, Beata Michalska wrote:
>With the Frequency Invariance Engine (FIE) being already wired up with
>sched tick and making use of relevant (core counter and constant
>counter) AMU counters, getting the current frequency for a given CPU
>on supported platforms can be achieved by utilizing the frequency scale
>factor which reflects an average CPU frequency for the last tick period
>length.
>
>The solution is partially based on APERF/MPERF implementation of
>arch_freq_get_on_cpu.
>
>Suggested-by: Ionela Voinescu <[email protected]>
>Signed-off-by: Beata Michalska <[email protected]>
>---
> arch/arm64/kernel/topology.c | 103 +++++++++++++++++++++++++++++++----
> 1 file changed, 92 insertions(+), 11 deletions(-)
>
>diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
>index 1a2c72f3e7f8..42cb19c31719 100644
>--- a/arch/arm64/kernel/topology.c
>+++ b/arch/arm64/kernel/topology.c
>@@ -17,6 +17,8 @@
> #include <linux/cpufreq.h>
> #include <linux/init.h>
> #include <linux/percpu.h>
>+#include <linux/sched/isolation.h>
>+#include <linux/seqlock_types.h>
>
> #include <asm/cpu.h>
> #include <asm/cputype.h>
>@@ -88,18 +90,31 @@ int __init parse_acpi_topology(void)
> * initialized.
> */
> static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale) = 1UL << (2 * SCHED_CAPACITY_SHIFT);
>-static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
>-static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
> static cpumask_var_t amu_fie_cpus;
>
>+struct amu_cntr_sample {
>+ u64 arch_const_cycles_prev;
>+ u64 arch_core_cycles_prev;
>+ unsigned long last_update;
>+ seqcount_t seq;
>+};
>+
>+static DEFINE_PER_CPU_SHARED_ALIGNED(struct amu_cntr_sample, cpu_amu_samples) = {
>+ .seq = SEQCNT_ZERO(cpu_amu_samples.seq)
>+};
>+
> void update_freq_counters_refs(void)
> {
>- this_cpu_write(arch_core_cycles_prev, read_corecnt());
>- this_cpu_write(arch_const_cycles_prev, read_constcnt());
>+ struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
>+
>+ amu_sample->arch_core_cycles_prev = read_corecnt();
>+ amu_sample->arch_const_cycles_prev = read_constcnt();
> }
>
> static inline bool freq_counters_valid(int cpu)
> {
>+ struct amu_cntr_sample *amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
>+
> if ((cpu >= nr_cpu_ids) || !cpumask_test_cpu(cpu, cpu_present_mask))
> return false;
>
>@@ -108,8 +123,8 @@ static inline bool freq_counters_valid(int cpu)
> return false;
> }
>
>- if (unlikely(!per_cpu(arch_const_cycles_prev, cpu) ||
>- !per_cpu(arch_core_cycles_prev, cpu))) {
>+ if (unlikely(!amu_sample->arch_const_cycles_prev ||
>+ !amu_sample->arch_core_cycles_prev)) {
> pr_debug("CPU%d: cycle counters are not enabled.\n", cpu);
> return false;
> }
>@@ -152,20 +167,27 @@ void freq_inv_set_max_ratio(int cpu, u64 max_rate)
>
> static void amu_scale_freq_tick(void)
> {
>+ struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
> u64 prev_core_cnt, prev_const_cnt;
> u64 core_cnt, const_cnt, scale;
>
>- prev_const_cnt = this_cpu_read(arch_const_cycles_prev);
>- prev_core_cnt = this_cpu_read(arch_core_cycles_prev);
>+ prev_const_cnt = amu_sample->arch_const_cycles_prev;
>+ prev_core_cnt = amu_sample->arch_core_cycles_prev;
>+
>+ write_seqcount_begin(&amu_sample->seq);
>
> update_freq_counters_refs();
>
>- const_cnt = this_cpu_read(arch_const_cycles_prev);
>- core_cnt = this_cpu_read(arch_core_cycles_prev);
>+ const_cnt = amu_sample->arch_const_cycles_prev;
>+ core_cnt = amu_sample->arch_core_cycles_prev;
>
>+ /*
>+ * This should not happen unless the AMUs have been reset and the
>+ * counter values have not been resroted - unlikely

/resroted/restored

>+ */
> if (unlikely(core_cnt <= prev_core_cnt ||
> const_cnt <= prev_const_cnt))
>- return;
>+ goto leave;
>
> /*
> * /\core arch_max_freq_scale
>@@ -182,6 +204,10 @@ static void amu_scale_freq_tick(void)
>
> scale = min_t(unsigned long, scale, SCHED_CAPACITY_SCALE);
> this_cpu_write(arch_freq_scale, (unsigned long)scale);
>+
>+ amu_sample->last_update = jiffies;
>+leave:
>+ write_seqcount_end(&amu_sample->seq);
> }
>
> static struct scale_freq_data amu_sfd = {
>@@ -189,6 +215,61 @@ static struct scale_freq_data amu_sfd = {
> .set_freq_scale = amu_scale_freq_tick,
> };
>
>+#define AMU_SAMPLE_EXP_MS 20
>+
>+unsigned int arch_freq_get_on_cpu(int cpu)
>+{
>+ struct amu_cntr_sample *amu_sample;
>+ unsigned long last_update;
>+ unsigned int seq;
>+ unsigned int freq;
>+ u64 scale;
>+
>+ if (!cpumask_test_cpu(cpu, amu_fie_cpus) || !arch_scale_freq_ref(cpu))
>+ return 0;
>+
>+retry:
>+ amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
>+
>+ do {
>+ seq = raw_read_seqcount_begin(&amu_sample->seq);
>+ last_update = amu_sample->last_update;
>+ } while (read_seqcount_retry(&amu_sample->seq, seq));
>+
>+ /*
>+ * For those CPUs that are in full dynticks mode,
>+ * and those that have not seen tick for a while
>+ * try an alternative source for the counters (and thus freq scale),
>+ * if available for given policy
>+ */
>+ if (time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
>+ struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
>+ int ref_cpu = nr_cpu_ids;
>+
>+ if (cpumask_intersects(housekeeping_cpumask(HK_TYPE_TICK),
>+ policy->cpus))
>+ ref_cpu = cpumask_nth_and(cpu, policy->cpus,
>+ housekeeping_cpumask(HK_TYPE_TICK));
>+

Could you help me understand why getting the frequency from another
housekeeping cpu would be better than returning 0? Wouldn't different
CPUs in the HK_TYPE_TICK domain be running at independent frequencies?
Maybe adding this explanation to the patch commit message would help
people who look at this in the future?

Thanks,
Vanshi

>+ cpufreq_cpu_put(policy);
>+ if (ref_cpu >= nr_cpu_ids || ref_cpu == cpu)
>+ /* No alternative to pull info from */
>+ return 0;
>+ cpu = ref_cpu;
>+ goto retry;
>+ }
>+ /*
>+ * Reversed computation to the one used to determine
>+ * the arch_freq_scale value
>+ * (see amu_scale_freq_tick for details)
>+ */
>+ scale = arch_scale_freq_capacity(cpu);
>+ freq = scale * arch_scale_freq_ref(cpu);
>+ freq >>= SCHED_CAPACITY_SHIFT;
>+
>+ return freq;
>+}
>+
> static void amu_fie_setup(const struct cpumask *cpus)
> {
> int cpu;
>--
>2.25.1
>

2024-03-13 12:20:30

by Ionela Voinescu

Subject: Re: [PATCH v3 2/3] arm64: Provide an AMU-based version of arch_freq_get_on_cpu

Hi Beata,

Thank you for the patches!

On Tuesday 12 Mar 2024 at 08:34:30 (+0000), Beata Michalska wrote:
> With the Frequency Invariance Engine (FIE) being already wired up with
> sched tick and making use of relevant (core counter and constant
> counter) AMU counters, getting the current frequency for a given CPU
> on supported platforms can be achieved by utilizing the frequency scale
> factor which reflects an average CPU frequency for the last tick period
> length.
>
> The solution is partially based on APERF/MPERF implementation of
> arch_freq_get_on_cpu.
>
> Suggested-by: Ionela Voinescu <[email protected]>
> Signed-off-by: Beata Michalska <[email protected]>
> ---
> arch/arm64/kernel/topology.c | 103 +++++++++++++++++++++++++++++++----
> 1 file changed, 92 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index 1a2c72f3e7f8..42cb19c31719 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -17,6 +17,8 @@
> #include <linux/cpufreq.h>
> #include <linux/init.h>
> #include <linux/percpu.h>
> +#include <linux/sched/isolation.h>
> +#include <linux/seqlock_types.h>
>
> #include <asm/cpu.h>
> #include <asm/cputype.h>
> @@ -88,18 +90,31 @@ int __init parse_acpi_topology(void)
> * initialized.
> */
> static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale) = 1UL << (2 * SCHED_CAPACITY_SHIFT);
> -static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
> -static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
> static cpumask_var_t amu_fie_cpus;
>
> +struct amu_cntr_sample {
> + u64 arch_const_cycles_prev;
> + u64 arch_core_cycles_prev;
> + unsigned long last_update;
> + seqcount_t seq;
> +};
> +
> +static DEFINE_PER_CPU_SHARED_ALIGNED(struct amu_cntr_sample, cpu_amu_samples) = {
> + .seq = SEQCNT_ZERO(cpu_amu_samples.seq)
> +};
> +
> void update_freq_counters_refs(void)
> {
> - this_cpu_write(arch_core_cycles_prev, read_corecnt());
> - this_cpu_write(arch_const_cycles_prev, read_constcnt());
> + struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
> +
> + amu_sample->arch_core_cycles_prev = read_corecnt();
> + amu_sample->arch_const_cycles_prev = read_constcnt();
> }
>
> static inline bool freq_counters_valid(int cpu)
> {
> + struct amu_cntr_sample *amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> +
> if ((cpu >= nr_cpu_ids) || !cpumask_test_cpu(cpu, cpu_present_mask))
> return false;
>
> @@ -108,8 +123,8 @@ static inline bool freq_counters_valid(int cpu)
> return false;
> }
>
> - if (unlikely(!per_cpu(arch_const_cycles_prev, cpu) ||
> - !per_cpu(arch_core_cycles_prev, cpu))) {
> + if (unlikely(!amu_sample->arch_const_cycles_prev ||
> + !amu_sample->arch_core_cycles_prev)) {
> pr_debug("CPU%d: cycle counters are not enabled.\n", cpu);
> return false;
> }
> @@ -152,20 +167,27 @@ void freq_inv_set_max_ratio(int cpu, u64 max_rate)
>
> static void amu_scale_freq_tick(void)
> {
> + struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
> u64 prev_core_cnt, prev_const_cnt;
> u64 core_cnt, const_cnt, scale;
>
> - prev_const_cnt = this_cpu_read(arch_const_cycles_prev);
> - prev_core_cnt = this_cpu_read(arch_core_cycles_prev);
> + prev_const_cnt = amu_sample->arch_const_cycles_prev;
> + prev_core_cnt = amu_sample->arch_core_cycles_prev;
> +
> + write_seqcount_begin(&amu_sample->seq);

The critical section here does not need to be this extensive, right?

The arch_freq_get_on_cpu() function only uses the frequency scale factor
and the last_update value, so this need only be placed above
"this_cpu_write(arch_freq_scale,..", if I'm not missing anything.

>
> update_freq_counters_refs();
>
> - const_cnt = this_cpu_read(arch_const_cycles_prev);
> - core_cnt = this_cpu_read(arch_core_cycles_prev);
> + const_cnt = amu_sample->arch_const_cycles_prev;
> + core_cnt = amu_sample->arch_core_cycles_prev;
>
> + /*
> + * This should not happen unless the AMUs have been reset and the
> + * counter values have not been resroted - unlikely
> + */
> if (unlikely(core_cnt <= prev_core_cnt ||
> const_cnt <= prev_const_cnt))
> - return;
> + goto leave;
>
> /*
> * /\core arch_max_freq_scale
> @@ -182,6 +204,10 @@ static void amu_scale_freq_tick(void)
>
> scale = min_t(unsigned long, scale, SCHED_CAPACITY_SCALE);
> this_cpu_write(arch_freq_scale, (unsigned long)scale);
> +
> + amu_sample->last_update = jiffies;
> +leave:
> + write_seqcount_end(&amu_sample->seq);
> }
>
> static struct scale_freq_data amu_sfd = {
> @@ -189,6 +215,61 @@ static struct scale_freq_data amu_sfd = {
> .set_freq_scale = amu_scale_freq_tick,
> };
>
> +#define AMU_SAMPLE_EXP_MS 20
> +
> +unsigned int arch_freq_get_on_cpu(int cpu)
> +{
> + struct amu_cntr_sample *amu_sample;
> + unsigned long last_update;
> + unsigned int seq;
> + unsigned int freq;
> + u64 scale;
> +
> + if (!cpumask_test_cpu(cpu, amu_fie_cpus) || !arch_scale_freq_ref(cpu))
> + return 0;
> +
> +retry:
> + amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> +
> + do {
> + seq = raw_read_seqcount_begin(&amu_sample->seq);
> + last_update = amu_sample->last_update;
> + } while (read_seqcount_retry(&amu_sample->seq, seq));

Related to the point above, this retry loop should also contain
"scale = arch_scale_freq_capacity(cpu)", otherwise there's not much point
to the synchronisation, as far as I can tell.
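
I.e., if the seqcount is kept, something like (untested):

	do {
		seq = raw_read_seqcount_begin(&amu_sample->seq);
		last_update = amu_sample->last_update;
		scale = arch_scale_freq_capacity(cpu);
	} while (read_seqcount_retry(&amu_sample->seq, seq));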

For x86, arch_freq_get_on_cpu() uses the counter deltas and it would be
bad if values from different ticks would be used. But here the only
benefit of synchronisation is to make sure that we're using the scale
factor computed at the last update time. For us, even skipping on the
synchronisation logic would still be acceptable, as we'd be ensuring that
there was a tick in the past 20ms and we'd always use the most recent
value of the frequency scale factor.

Hope it helps,
Ionela.

> +
> + /*
> + * For those CPUs that are in full dynticks mode,
> + * and those that have not seen tick for a while
> + * try an alternative source for the counters (and thus freq scale),
> + * if available for given policy
> + */
> + if (time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
> + struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> + int ref_cpu = nr_cpu_ids;
> +
> + if (cpumask_intersects(housekeeping_cpumask(HK_TYPE_TICK),
> + policy->cpus))
> + ref_cpu = cpumask_nth_and(cpu, policy->cpus,
> + housekeeping_cpumask(HK_TYPE_TICK));
> +
> + cpufreq_cpu_put(policy);
> + if (ref_cpu >= nr_cpu_ids || ref_cpu == cpu)
> + /* No alternative to pull info from */
> + return 0;
> + cpu = ref_cpu;
> + goto retry;
> + }
> + /*
> + * Reversed computation to the one used to determine
> + * the arch_freq_scale value
> + * (see amu_scale_freq_tick for details)
> + */
> + scale = arch_scale_freq_capacity(cpu);
> + freq = scale * arch_scale_freq_ref(cpu);
> + freq >>= SCHED_CAPACITY_SHIFT;
> +
> + return freq;
> +}
> +
> static void amu_fie_setup(const struct cpumask *cpus)
> {
> int cpu;
> --
> 2.25.1
>

2024-03-13 12:28:06

by Ionela Voinescu

Subject: Re: [PATCH v3 0/3] Add support for AArch64 AMUv1-based arch_freq_get_on_cpu

Hey,

On Tuesday 12 Mar 2024 at 08:34:28 (+0000), Beata Michalska wrote:
> Introducing arm64 specific version of arch_freq_get_on_cpu, cashing on
> existing implementation for FIE and AMUv1 support: the frequency scale
> factor, updated on each sched tick, serves as a base for retrieving
> the frequency for a given CPU, representing an average frequency
> reported between the ticks - thus its accuracy is limited.
>
> The changes have been rather lightly (due to some limitations) tested on
> an FVP model.
>
> Relevant discussions:
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/all/7eozim2xnepacnnkzxlbx34hib4otycnbn4dqymfziqou5lw5u@5xzpv3t7sxo3/
> [3] https://lore.kernel.org/all/[email protected]/
> [4] https://lore.kernel.org/lkml/[email protected]/T/#m4e74cb5a0aaa353c60fedc6cfb95ab7a6e381e3c
>
> v3:
> - dropping changes to cpufreq_verify_current_freq
> - pulling in changes from Ionela initializing capacity_freq_ref to 0
> (thanks for that!) and applying suggestions made by her during last review:
> - switching to arch_scale_freq_capacity and arch_scale_freq_ref when
> reversing freq scale factor computation
> - swapping shift with multiplication
> - adding time limit for considering last scale update as valid
> - updating frequency scale factor upon entering idle
>
> v2:
> - Splitting the patches
> - Adding comment for full dyntick mode
> - Plugging arch_freq_get_on_cpu into cpufreq_verify_current_freq instead
> of in show_cpuinfo_cur_freq to allow the framework to stay more in sync
> with potential freq changes
>
> Beata Michalska (2):
> arm64: Provide an AMU-based version of arch_freq_get_on_cpu
> arm64: Update AMU-based frequency scale factor on entering idle
>
> Ionela Voinescu (1):
> arch_topology: init capacity_freq_ref to 0
>

Should there have been a patch that adds a call to
arch_freq_get_on_cpu() from show_cpuinfo_cur_freq() as well?

My understanding from this [1] thread and others referenced there is
that was something we wanted.

[1] https://lore.kernel.org/lkml/[email protected]/

Thanks,
Ionela.


> arch/arm64/kernel/topology.c | 116 +++++++++++++++++++++++++++++++----
> drivers/base/arch_topology.c | 8 ++-
> 2 files changed, 110 insertions(+), 14 deletions(-)
>
> --
> 2.25.1
>

2024-03-13 21:51:57

by Beata Michalska

Subject: Re: [PATCH v3 2/3] arm64: Provide an AMU-based version of arch_freq_get_on_cpu

On Tue, Mar 12, 2024 at 07:12:37PM -0700, Vanshidhar Konda wrote:
> On Tue, Mar 12, 2024 at 08:34:30AM +0000, Beata Michalska wrote:
> > With the Frequency Invariance Engine (FIE) being already wired up with
> > sched tick and making use of relevant (core counter and constant
> > counter) AMU counters, getting the current frequency for a given CPU
> > on supported platforms can be achieved by utilizing the frequency scale
> > factor which reflects an average CPU frequency for the last tick period
> > length.
> >
> > The solution is partially based on APERF/MPERF implementation of
> > arch_freq_get_on_cpu.
> >
> > Suggested-by: Ionela Voinescu <[email protected]>
> > Signed-off-by: Beata Michalska <[email protected]>
> > ---
> > arch/arm64/kernel/topology.c | 103 +++++++++++++++++++++++++++++++----
> > 1 file changed, 92 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> > index 1a2c72f3e7f8..42cb19c31719 100644
> > --- a/arch/arm64/kernel/topology.c
> > +++ b/arch/arm64/kernel/topology.c
> > @@ -17,6 +17,8 @@
> > #include <linux/cpufreq.h>
> > #include <linux/init.h>
> > #include <linux/percpu.h>
> > +#include <linux/sched/isolation.h>
> > +#include <linux/seqlock_types.h>
> >
> > #include <asm/cpu.h>
> > #include <asm/cputype.h>
> > @@ -88,18 +90,31 @@ int __init parse_acpi_topology(void)
> > * initialized.
> > */
> > static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale) = 1UL << (2 * SCHED_CAPACITY_SHIFT);
> > -static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
> > -static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
> > static cpumask_var_t amu_fie_cpus;
> >
> > +struct amu_cntr_sample {
> > + u64 arch_const_cycles_prev;
> > + u64 arch_core_cycles_prev;
> > + unsigned long last_update;
> > + seqcount_t seq;
> > +};
> > +
> > +static DEFINE_PER_CPU_SHARED_ALIGNED(struct amu_cntr_sample, cpu_amu_samples) = {
> > + .seq = SEQCNT_ZERO(cpu_amu_samples.seq)
> > +};
> > +
> > void update_freq_counters_refs(void)
> > {
> > - this_cpu_write(arch_core_cycles_prev, read_corecnt());
> > - this_cpu_write(arch_const_cycles_prev, read_constcnt());
> > + struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
> > +
> > + amu_sample->arch_core_cycles_prev = read_corecnt();
> > + amu_sample->arch_const_cycles_prev = read_constcnt();
> > }
> >
> > static inline bool freq_counters_valid(int cpu)
> > {
> > + struct amu_cntr_sample *amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> > +
> > if ((cpu >= nr_cpu_ids) || !cpumask_test_cpu(cpu, cpu_present_mask))
> > return false;
> >
> > @@ -108,8 +123,8 @@ static inline bool freq_counters_valid(int cpu)
> > return false;
> > }
> >
> > - if (unlikely(!per_cpu(arch_const_cycles_prev, cpu) ||
> > - !per_cpu(arch_core_cycles_prev, cpu))) {
> > + if (unlikely(!amu_sample->arch_const_cycles_prev ||
> > + !amu_sample->arch_core_cycles_prev)) {
> > pr_debug("CPU%d: cycle counters are not enabled.\n", cpu);
> > return false;
> > }
> > @@ -152,20 +167,27 @@ void freq_inv_set_max_ratio(int cpu, u64 max_rate)
> >
> > static void amu_scale_freq_tick(void)
> > {
> > + struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
> > u64 prev_core_cnt, prev_const_cnt;
> > u64 core_cnt, const_cnt, scale;
> >
> > - prev_const_cnt = this_cpu_read(arch_const_cycles_prev);
> > - prev_core_cnt = this_cpu_read(arch_core_cycles_prev);
> > + prev_const_cnt = amu_sample->arch_const_cycles_prev;
> > + prev_core_cnt = amu_sample->arch_core_cycles_prev;
> > +
> > + write_seqcount_begin(&amu_sample->seq);
> >
> > update_freq_counters_refs();
> >
> > - const_cnt = this_cpu_read(arch_const_cycles_prev);
> > - core_cnt = this_cpu_read(arch_core_cycles_prev);
> > + const_cnt = amu_sample->arch_const_cycles_prev;
> > + core_cnt = amu_sample->arch_core_cycles_prev;
> >
> > + /*
> > + * This should not happen unless the AMUs have been reset and the
> > + * counter values have not been resroted - unlikely
>
> /resroted/restored
>
> > + */
> > if (unlikely(core_cnt <= prev_core_cnt ||
> > const_cnt <= prev_const_cnt))
> > - return;
> > + goto leave;
> >
> > /*
> > * /\core arch_max_freq_scale
> > @@ -182,6 +204,10 @@ static void amu_scale_freq_tick(void)
> >
> > scale = min_t(unsigned long, scale, SCHED_CAPACITY_SCALE);
> > this_cpu_write(arch_freq_scale, (unsigned long)scale);
> > +
> > + amu_sample->last_update = jiffies;
> > +leave:
> > + write_seqcount_end(&amu_sample->seq);
> > }
> >
> > static struct scale_freq_data amu_sfd = {
> > @@ -189,6 +215,61 @@ static struct scale_freq_data amu_sfd = {
> > .set_freq_scale = amu_scale_freq_tick,
> > };
> >
> > +#define AMU_SAMPLE_EXP_MS 20
> > +
> > +unsigned int arch_freq_get_on_cpu(int cpu)
> > +{
> > + struct amu_cntr_sample *amu_sample;
> > + unsigned long last_update;
> > + unsigned int seq;
> > + unsigned int freq;
> > + u64 scale;
> > +
> > + if (!cpumask_test_cpu(cpu, amu_fie_cpus) || !arch_scale_freq_ref(cpu))
> > + return 0;
> > +
> > +retry:
> > + amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> > +
> > + do {
> > + seq = raw_read_seqcount_begin(&amu_sample->seq);
> > + last_update = amu_sample->last_update;
> > + } while (read_seqcount_retry(&amu_sample->seq, seq));
> > +
> > + /*
> > + * For those CPUs that are in full dynticks mode,
> > + * and those that have not seen tick for a while
> > + * try an alternative source for the counters (and thus freq scale),
> > + * if available for given policy
> > + */
> > + if (time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
> > + struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> > + int ref_cpu = nr_cpu_ids;
> > +
> > + if (cpumask_intersects(housekeeping_cpumask(HK_TYPE_TICK),
> > + policy->cpus))
> > + ref_cpu = cpumask_nth_and(cpu, policy->cpus,
> > + housekeeping_cpumask(HK_TYPE_TICK));
> > +
>
> Could you help me understand why getting the frequency from another
> housekeeping cpu would be a better than returning 0? Wouldn't different
> CPUs in the HK_TYPE_TICK domain be running at independent frequencies?
> May be adding this explanation to the patch commit message would help
> people who look at this in the future?

If the last AMU sample taken has lost its assumed validity, we try another cpu
within the same frequency domain, choosing a housekeeping cpu as one that might
have seen a tick within the last, assumed, 20ms. We stick to the cpus within
the same policy, and thus the same frequency domain, which means those cpus do
operate at the same frequency. Now, in case of per-core dvfs this will bail out
with '0' as policy->cpus will contain a single CPU. Having said that, this code
is bogus as it does not handle the needed wrapping in case the currently
considered cpu is the last one in this policy's cpus mask - will send an update
soon.
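
Probably something along these lines (untested), to handle the wrapping:

	ref_cpu = cpumask_next_and(cpu, policy->cpus,
				   housekeeping_cpumask(HK_TYPE_TICK));
	if (ref_cpu >= nr_cpu_ids)
		ref_cpu = cpumask_first_and(policy->cpus,
					    housekeeping_cpumask(HK_TYPE_TICK));

with the existing 'ref_cpu == cpu' check still covering the case where the
requested cpu is the only housekeeping one within the policy.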

I will also try to make the comment above more readable.

---
BR
Beata
>
> Thanks,
> Vanshi
>
> > + cpufreq_cpu_put(policy);
> > + if (ref_cpu >= nr_cpu_ids || ref_cpu == cpu)
> > + /* No alternative to pull info from */
> > + return 0;
> > + cpu = ref_cpu;
> > + goto retry;
> > + }
> > + /*
> > + * Reversed computation to the one used to determine
> > + * the arch_freq_scale value
> > + * (see amu_scale_freq_tick for details)
> > + */
> > + scale = arch_scale_freq_capacity(cpu);
> > + freq = scale * arch_scale_freq_ref(cpu);
> > + freq >>= SCHED_CAPACITY_SHIFT;
> > +
> > + return freq;
> > +}
> > +
> > static void amu_fie_setup(const struct cpumask *cpus)
> > {
> > int cpu;
> > --
> > 2.25.1
> >

2024-03-13 23:47:09

by Beata Michalska

Subject: Re: [PATCH v3 2/3] arm64: Provide an AMU-based version of arch_freq_get_on_cpu

On Wed, Mar 13, 2024 at 12:20:16PM +0000, Ionela Voinescu wrote:
> Hi Beata,
>
> Thank you for the patches!
>
High time for those!

> On Tuesday 12 Mar 2024 at 08:34:30 (+0000), Beata Michalska wrote:
> > With the Frequency Invariance Engine (FIE) being already wired up with
> > sched tick and making use of relevant (core counter and constant
> > counter) AMU counters, getting the current frequency for a given CPU
> > on supported platforms can be achieved by utilizing the frequency scale
> > factor which reflects an average CPU frequency for the last tick period
> > length.
> >
> > The solution is partially based on APERF/MPERF implementation of
> > arch_freq_get_on_cpu.
> >
> > Suggested-by: Ionela Voinescu <[email protected]>
> > Signed-off-by: Beata Michalska <[email protected]>
> > ---
> > arch/arm64/kernel/topology.c | 103 +++++++++++++++++++++++++++++++----
> > 1 file changed, 92 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> > index 1a2c72f3e7f8..42cb19c31719 100644
> > --- a/arch/arm64/kernel/topology.c
> > +++ b/arch/arm64/kernel/topology.c
> > @@ -17,6 +17,8 @@
> > #include <linux/cpufreq.h>
> > #include <linux/init.h>
> > #include <linux/percpu.h>
> > +#include <linux/sched/isolation.h>
> > +#include <linux/seqlock_types.h>
> >
> > #include <asm/cpu.h>
> > #include <asm/cputype.h>
> > @@ -88,18 +90,31 @@ int __init parse_acpi_topology(void)
> > * initialized.
> > */
> > static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale) = 1UL << (2 * SCHED_CAPACITY_SHIFT);
> > -static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
> > -static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
> > static cpumask_var_t amu_fie_cpus;
> >
> > +struct amu_cntr_sample {
> > + u64 arch_const_cycles_prev;
> > + u64 arch_core_cycles_prev;
> > + unsigned long last_update;
> > + seqcount_t seq;
> > +};
> > +
> > +static DEFINE_PER_CPU_SHARED_ALIGNED(struct amu_cntr_sample, cpu_amu_samples) = {
> > + .seq = SEQCNT_ZERO(cpu_amu_samples.seq)
> > +};
> > +
> > void update_freq_counters_refs(void)
> > {
> > - this_cpu_write(arch_core_cycles_prev, read_corecnt());
> > - this_cpu_write(arch_const_cycles_prev, read_constcnt());
> > + struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
> > +
> > + amu_sample->arch_core_cycles_prev = read_corecnt();
> > + amu_sample->arch_const_cycles_prev = read_constcnt();
> > }
> >
> > static inline bool freq_counters_valid(int cpu)
> > {
> > + struct amu_cntr_sample *amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> > +
> > if ((cpu >= nr_cpu_ids) || !cpumask_test_cpu(cpu, cpu_present_mask))
> > return false;
> >
> > @@ -108,8 +123,8 @@ static inline bool freq_counters_valid(int cpu)
> > return false;
> > }
> >
> > - if (unlikely(!per_cpu(arch_const_cycles_prev, cpu) ||
> > - !per_cpu(arch_core_cycles_prev, cpu))) {
> > + if (unlikely(!amu_sample->arch_const_cycles_prev ||
> > + !amu_sample->arch_core_cycles_prev)) {
> > pr_debug("CPU%d: cycle counters are not enabled.\n", cpu);
> > return false;
> > }
> > @@ -152,20 +167,27 @@ void freq_inv_set_max_ratio(int cpu, u64 max_rate)
> >
> > static void amu_scale_freq_tick(void)
> > {
> > + struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
> > u64 prev_core_cnt, prev_const_cnt;
> > u64 core_cnt, const_cnt, scale;
> >
> > - prev_const_cnt = this_cpu_read(arch_const_cycles_prev);
> > - prev_core_cnt = this_cpu_read(arch_core_cycles_prev);
> > + prev_const_cnt = amu_sample->arch_const_cycles_prev;
> > + prev_core_cnt = amu_sample->arch_core_cycles_prev;
> > +
> > + write_seqcount_begin(&amu_sample->seq);
>
> The critical section here does not need to be this extensive, right?
>
> The arch_freq_get_on_cpu() function only uses the frequency scale factor
> and the last_update value, so this need only be placed above
> "this_cpu_write(arch_freq_scale,..", if I'm not missing anything.

You're not missing anything. The write side critical section could span only
those two, but having it extended gives a chance for the readers to get in on
the update, and as those are not really performance sensitive I thought it might
be a good option, especially if we can save the cycles on not needing to poke
the cpufreq driver. Furthermore, if the critical section is to span only the two,
then it does not really change much and can be dropped.

>
> >
> > update_freq_counters_refs();
> >
> > - const_cnt = this_cpu_read(arch_const_cycles_prev);
> > - core_cnt = this_cpu_read(arch_core_cycles_prev);
> > + const_cnt = amu_sample->arch_const_cycles_prev;
> > + core_cnt = amu_sample->arch_core_cycles_prev;
> >
> > + /*
> > + * This should not happen unless the AMUs have been reset and the
> > + * counter values have not been resroted - unlikely
> > + */
> > if (unlikely(core_cnt <= prev_core_cnt ||
> > const_cnt <= prev_const_cnt))
> > - return;
> > + goto leave;
> >
> > /*
> > * /\core arch_max_freq_scale
> > @@ -182,6 +204,10 @@ static void amu_scale_freq_tick(void)
> >
> > scale = min_t(unsigned long, scale, SCHED_CAPACITY_SCALE);
> > this_cpu_write(arch_freq_scale, (unsigned long)scale);
> > +
> > + amu_sample->last_update = jiffies;
> > +leave:
> > + write_seqcount_end(&amu_sample->seq);
> > }
> >
> > static struct scale_freq_data amu_sfd = {
> > @@ -189,6 +215,61 @@ static struct scale_freq_data amu_sfd = {
> > .set_freq_scale = amu_scale_freq_tick,
> > };
> >
> > +#define AMU_SAMPLE_EXP_MS 20
> > +
> > +unsigned int arch_freq_get_on_cpu(int cpu)
> > +{
> > + struct amu_cntr_sample *amu_sample;
> > + unsigned long last_update;
> > + unsigned int seq;
> > + unsigned int freq;
> > + u64 scale;
> > +
> > + if (!cpumask_test_cpu(cpu, amu_fie_cpus) || !arch_scale_freq_ref(cpu))
> > + return 0;
> > +
> > +retry:
> > + amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> > +
> > + do {
> > + seq = raw_read_seqcount_begin(&amu_sample->seq);
> > + last_update = amu_sample->last_update;
> > + } while (read_seqcount_retry(&amu_sample->seq, seq));
>
> Related to the point above, this retry loop should also contain
> "scale = arch_scale_freq_capacity(cpu)", otherwise there's no much point
> for synchronisation, as far as I can tell.
I'm not entirely sure why we would need to include the scale factor within
the read critical section. The aim here is to make sure we see the update if
one is ongoing and that the update to the timestamp is observed along with
one to the scale factor, which is what the write_seqcount_end will guarantee
(although the latter is not a hard sell as the update happens under interrupts
being disabled). If later on we fetch newer scale factor that's perfectly fine,
we do not want to see the stale one. Again, I can drop the seqcount (which is
slightly abused in this case I must admit) at a cost of potentially missing some
updates.
>
> For x86, arch_freq_get_on_cpu() uses the counter deltas and it would be
> bad if values from different ticks would be used. But here the only
> benefit of synchronisation is to make sure that we're using the scale
> factor computed at the last update time. For us, even skipping on the
> synchronisation logic would still be acceptable, as we'd be ensuring that
> there was a tick in the past 20ms and we'd always use the most recent
> value of the frequency scale factor.
How would we ensure there was a tick in the last 20ms?
>
> Hope it helps,
It does, thank you.

--
BR
Beata
> Ionela.
>
> > +
> > + /*
> > + * For those CPUs that are in full dynticks mode,
> > + * and those that have not seen tick for a while
> > + * try an alternative source for the counters (and thus freq scale),
> > + * if available for given policy
> > + */
> > + if (time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
> > + struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> > + int ref_cpu = nr_cpu_ids;
> > +
> > + if (cpumask_intersects(housekeeping_cpumask(HK_TYPE_TICK),
> > + policy->cpus))
> > + ref_cpu = cpumask_nth_and(cpu, policy->cpus,
> > + housekeeping_cpumask(HK_TYPE_TICK));
> > +
> > + cpufreq_cpu_put(policy);
> > + if (ref_cpu >= nr_cpu_ids || ref_cpu == cpu)
> > + /* No alternative to pull info from */
> > + return 0;
> > + cpu = ref_cpu;
> > + goto retry;
> > + }
> > + /*
> > + * Reversed computation to the one used to determine
> > + * the arch_freq_scale value
> > + * (see amu_scale_freq_tick for details)
> > + */
> > + scale = arch_scale_freq_capacity(cpu);
> > + freq = scale * arch_scale_freq_ref(cpu);
> > + freq >>= SCHED_CAPACITY_SHIFT;
> > +
> > + return freq;
> > +}
> > +
> > static void amu_fie_setup(const struct cpumask *cpus)
> > {
> > int cpu;
> > --
> > 2.25.1
> >

2024-03-13 23:51:17

by Beata Michalska

Subject: Re: [PATCH v3 0/3] Add support for AArch64 AMUv1-based arch_freq_get_on_cpu

On Wed, Mar 13, 2024 at 12:27:53PM +0000, Ionela Voinescu wrote:
> Hey,
>
> On Tuesday 12 Mar 2024 at 08:34:28 (+0000), Beata Michalska wrote:
> > Introducing arm64 specific version of arch_freq_get_on_cpu, cashing on
> > existing implementation for FIE and AMUv1 support: the frequency scale
> > factor, updated on each sched tick, serves as a base for retrieving
> > the frequency for a given CPU, representing an average frequency
> > reported between the ticks - thus its accuracy is limited.
> >
> > The changes have been rather lightly (due to some limitations) tested on
> > an FVP model.
> >
> > Relevant discussions:
> > [1] https://lore.kernel.org/all/[email protected]/
> > [2] https://lore.kernel.org/all/7eozim2xnepacnnkzxlbx34hib4otycnbn4dqymfziqou5lw5u@5xzpv3t7sxo3/
> > [3] https://lore.kernel.org/all/[email protected]/
> > [4] https://lore.kernel.org/lkml/[email protected]/T/#m4e74cb5a0aaa353c60fedc6cfb95ab7a6e381e3c
> >
> > v3:
> > - dropping changes to cpufreq_verify_current_freq
> > - pulling in changes from Ionela initializing capacity_freq_ref to 0
> > (thanks for that!) and applying suggestions made by her during last review:
> > - switching to arch_scale_freq_capacity and arch_scale_freq_ref when
> > reversing freq scale factor computation
> > - swapping shift with multiplication
> > - adding time limit for considering last scale update as valid
> > - updating frequency scale factor upon entering idle
> >
> > v2:
> > - Splitting the patches
> > - Adding comment for full dyntick mode
> > - Plugging arch_freq_get_on_cpu into cpufreq_verify_current_freq instead
> > of in show_cpuinfo_cur_freq to allow the framework to stay more in sync
> > with potential freq changes
> >
> > Beata Michalska (2):
> > arm64: Provide an AMU-based version of arch_freq_get_on_cpu
> > arm64: Update AMU-based frequency scale factor on entering idle
> >
> > Ionela Voinescu (1):
> > arch_topology: init capacity_freq_ref to 0
> >
>
> Should there have been a patch that adds a call to
> arch_freq_get_on_cpu() from show_cpuinfo_cur_freq() as well?
>
> My understanding from this [1] thread and others referenced there is
> that was something we wanted.
>
Right, so I must have misunderstood that, as the way I read it was that
it is acceptable to keep things as they are wrt the cpufreq sysfs entries.

---
BR
Beata
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Thanks,
> Ionela.
>
>
> > arch/arm64/kernel/topology.c | 116 +++++++++++++++++++++++++++++++----
> > drivers/base/arch_topology.c | 8 ++-
> > 2 files changed, 110 insertions(+), 14 deletions(-)
> >
> > --
> > 2.25.1
> >

2024-03-18 15:01:56

by Ionela Voinescu

Subject: Re: [PATCH v3 2/3] arm64: Provide an AMU-based version of arch_freq_get_on_cpu

Hey,

On Thursday 14 Mar 2024 at 00:46:19 (+0100), Beata Michalska wrote:
[..]
> > > static void amu_scale_freq_tick(void)
> > > {
> > > + struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
> > > u64 prev_core_cnt, prev_const_cnt;
> > > u64 core_cnt, const_cnt, scale;
> > >
> > > - prev_const_cnt = this_cpu_read(arch_const_cycles_prev);
> > > - prev_core_cnt = this_cpu_read(arch_core_cycles_prev);
> > > + prev_const_cnt = amu_sample->arch_const_cycles_prev;
> > > + prev_core_cnt = amu_sample->arch_core_cycles_prev;
> > > +
> > > + write_seqcount_begin(&amu_sample->seq);
> >
> > The critical section here does not need to be this extensive, right?
> >
> > The arch_freq_get_on_cpu() function only uses the frequency scale factor
> > and the last_update value, so this need only be placed above
> > "this_cpu_write(arch_freq_scale,..", if I'm not missing anything.
>
> You're not missing anything. The write side critical section could span only
> those two, but having it extended gives a chance for the readers to get in on
> the update and as those are not really performance sensitive I though it might
> be a good option, especially if we can save the cycles on not needing to poke
> the cpufeq driver. Furthermore, if the critical section is to span only the two,
> then it does not really change much and can be dropped.
>
> >
> > >
> > > update_freq_counters_refs();
> > >
> > > - const_cnt = this_cpu_read(arch_const_cycles_prev);
> > > - core_cnt = this_cpu_read(arch_core_cycles_prev);
> > > + const_cnt = amu_sample->arch_const_cycles_prev;
> > > + core_cnt = amu_sample->arch_core_cycles_prev;
> > >
> > > + /*
> > > + * This should not happen unless the AMUs have been reset and the
> > > + * counter values have not been resroted - unlikely
> > > + */
> > > if (unlikely(core_cnt <= prev_core_cnt ||
> > > const_cnt <= prev_const_cnt))
> > > - return;
> > > + goto leave;
> > >
> > > /*
> > > * /\core arch_max_freq_scale
> > > @@ -182,6 +204,10 @@ static void amu_scale_freq_tick(void)
> > >
> > > scale = min_t(unsigned long, scale, SCHED_CAPACITY_SCALE);
> > > this_cpu_write(arch_freq_scale, (unsigned long)scale);
> > > +
> > > + amu_sample->last_update = jiffies;
> > > +leave:
> > > + write_seqcount_end(&amu_sample->seq);
> > > }
> > >
> > > static struct scale_freq_data amu_sfd = {
> > > @@ -189,6 +215,61 @@ static struct scale_freq_data amu_sfd = {
> > > .set_freq_scale = amu_scale_freq_tick,
> > > };
> > >
> > > +#define AMU_SAMPLE_EXP_MS 20
> > > +
> > > +unsigned int arch_freq_get_on_cpu(int cpu)
> > > +{
> > > + struct amu_cntr_sample *amu_sample;
> > > + unsigned long last_update;
> > > + unsigned int seq;
> > > + unsigned int freq;
> > > + u64 scale;
> > > +
> > > + if (!cpumask_test_cpu(cpu, amu_fie_cpus) || !arch_scale_freq_ref(cpu))
> > > + return 0;
> > > +
> > > +retry:
> > > + amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> > > +
> > > + do {
> > > + seq = raw_read_seqcount_begin(&amu_sample->seq);
> > > + last_update = amu_sample->last_update;
> > > + } while (read_seqcount_retry(&amu_sample->seq, seq));
> >
> > Related to the point above, this retry loop should also contain
> > "scale = arch_scale_freq_capacity(cpu)", otherwise there's no much point
> > for synchronisation, as far as I can tell.
> I'm not entirely sure why we would need to include the scale factor within
> the read critical section. The aim here is to make sure we see the update if
> one is ongoing and that the update to the timestamp is observed along with
> one to the scale factor, which is what the write_seqcount_end will guarantee
> (although the latter is not a hard sell as the update happens under interrupts
> being disabled). If later on we fetch newer scale factor that's perfectly fine,
> we do not want to see the stale one. Again, I can drop the seqcount (which is
> slightly abused in this case I must admit) at a cost of potentially missing some
> updates.

Replying here for both comments, as they are related.

I fully agree, but I would be more inclined to drop the seqcount. It
would be a game of chance if there was an update in the last few ns of
the 20ms deadline which we might hit or miss due to the presence of an
extended write critical section or the lack of one.
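
Dropping it would leave just a plain read of the latest timestamp and
scale factor, along the lines of (untested):

	last_update = READ_ONCE(amu_sample->last_update);
	scale = arch_scale_freq_capacity(cpu);

with the tick side doing WRITE_ONCE(amu_sample->last_update, jiffies).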

> >
> > For x86, arch_freq_get_on_cpu() uses the counter deltas and it would be
> > bad if values from different ticks would be used. But here the only
> > benefit of synchronisation is to make sure that we're using the scale
> > factor computed at the last update time. For us, even skipping on the
> > synchronisation logic would still be acceptable, as we'd be ensuring that
> > there was a tick in the past 20ms and we'd always use the most recent
> > value of the frequency scale factor.
> How would we ensure there was a tick in last 20ms ?

I just meant that we'd observe the presence of a tick in the last 20ms
(if there was one) and we don't necessarily need to guarantee that we'd
use the scale factor obtained at that time. We could use the latest, as
you mentioned above as well.

Thanks,
Ionela.

2024-03-20 16:43:56

by Sumit Gupta

Subject: Re: [PATCH v3 2/3] arm64: Provide an AMU-based version of arch_freq_get_on_cpu



On 12/03/24 14:04, Beata Michalska wrote:
>
> With the Frequency Invariance Engine (FIE) being already wired up with
> sched tick and making use of relevant (core counter and constant
> counter) AMU counters, getting the current frequency for a given CPU
> on supported platforms can be achieved by utilizing the frequency scale
> factor which reflects an average CPU frequency for the last tick period
> length.
>
> The solution is partially based on APERF/MPERF implementation of
> arch_freq_get_on_cpu.
>
> Suggested-by: Ionela Voinescu <[email protected]>
> Signed-off-by: Beata Michalska <[email protected]>
> ---
> arch/arm64/kernel/topology.c | 103 +++++++++++++++++++++++++++++++----
> 1 file changed, 92 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index 1a2c72f3e7f8..42cb19c31719 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -17,6 +17,8 @@
> #include <linux/cpufreq.h>
> #include <linux/init.h>
> #include <linux/percpu.h>
> +#include <linux/sched/isolation.h>
> +#include <linux/seqlock_types.h>
>
> #include <asm/cpu.h>
> #include <asm/cputype.h>
> @@ -88,18 +90,31 @@ int __init parse_acpi_topology(void)
> * initialized.
> */
> static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale) = 1UL << (2 * SCHED_CAPACITY_SHIFT);
> -static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
> -static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
> static cpumask_var_t amu_fie_cpus;
>
> +struct amu_cntr_sample {
> + u64 arch_const_cycles_prev;
> + u64 arch_core_cycles_prev;
> + unsigned long last_update;
> + seqcount_t seq;
> +};
> +
> +static DEFINE_PER_CPU_SHARED_ALIGNED(struct amu_cntr_sample, cpu_amu_samples) = {
> + .seq = SEQCNT_ZERO(cpu_amu_samples.seq)
> +};
> +
> void update_freq_counters_refs(void)
> {
> - this_cpu_write(arch_core_cycles_prev, read_corecnt());
> - this_cpu_write(arch_const_cycles_prev, read_constcnt());
> + struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
> +
> + amu_sample->arch_core_cycles_prev = read_corecnt();
> + amu_sample->arch_const_cycles_prev = read_constcnt();
> }
>
> static inline bool freq_counters_valid(int cpu)
> {
> + struct amu_cntr_sample *amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> +
> if ((cpu >= nr_cpu_ids) || !cpumask_test_cpu(cpu, cpu_present_mask))
> return false;
>
> @@ -108,8 +123,8 @@ static inline bool freq_counters_valid(int cpu)
> return false;
> }
>
> - if (unlikely(!per_cpu(arch_const_cycles_prev, cpu) ||
> - !per_cpu(arch_core_cycles_prev, cpu))) {
> + if (unlikely(!amu_sample->arch_const_cycles_prev ||
> + !amu_sample->arch_core_cycles_prev)) {
> pr_debug("CPU%d: cycle counters are not enabled.\n", cpu);
> return false;
> }
> @@ -152,20 +167,27 @@ void freq_inv_set_max_ratio(int cpu, u64 max_rate)
>
> static void amu_scale_freq_tick(void)
> {
> + struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
> u64 prev_core_cnt, prev_const_cnt;
> u64 core_cnt, const_cnt, scale;
>
> - prev_const_cnt = this_cpu_read(arch_const_cycles_prev);
> - prev_core_cnt = this_cpu_read(arch_core_cycles_prev);
> + prev_const_cnt = amu_sample->arch_const_cycles_prev;
> + prev_core_cnt = amu_sample->arch_core_cycles_prev;
> +
> + write_seqcount_begin(&amu_sample->seq);
>
> update_freq_counters_refs();
>
> - const_cnt = this_cpu_read(arch_const_cycles_prev);
> - core_cnt = this_cpu_read(arch_core_cycles_prev);
> + const_cnt = amu_sample->arch_const_cycles_prev;
> + core_cnt = amu_sample->arch_core_cycles_prev;
>
> + /*
> + * This should not happen unless the AMUs have been reset and the
> + * counter values have not been resroted - unlikely
> + */
> if (unlikely(core_cnt <= prev_core_cnt ||
> const_cnt <= prev_const_cnt))
> - return;
> + goto leave;
>
> /*
> * /\core arch_max_freq_scale
> @@ -182,6 +204,10 @@ static void amu_scale_freq_tick(void)
>
> scale = min_t(unsigned long, scale, SCHED_CAPACITY_SCALE);
> this_cpu_write(arch_freq_scale, (unsigned long)scale);
> +
> + amu_sample->last_update = jiffies;
> +leave:
> + write_seqcount_end(&amu_sample->seq);
> }
>
> static struct scale_freq_data amu_sfd = {
> @@ -189,6 +215,61 @@ static struct scale_freq_data amu_sfd = {
> .set_freq_scale = amu_scale_freq_tick,
> };
>
> +#define AMU_SAMPLE_EXP_MS 20
> +
> +unsigned int arch_freq_get_on_cpu(int cpu)
> +{
> + struct amu_cntr_sample *amu_sample;
> + unsigned long last_update;
> + unsigned int seq;
> + unsigned int freq;
> + u64 scale;
> +
> + if (!cpumask_test_cpu(cpu, amu_fie_cpus) || !arch_scale_freq_ref(cpu))
> + return 0;
> +
> +retry:
> + amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> +
> + do {
> + seq = raw_read_seqcount_begin(&amu_sample->seq);
> + last_update = amu_sample->last_update;
> + } while (read_seqcount_retry(&amu_sample->seq, seq));
> +
> + /*
> + * For those CPUs that are in full dynticks mode,
> + * and those that have not seen a tick for a while,
> + * try an alternative source for the counters (and thus the freq scale),
> + * if available for the given policy
> + */
> + if (time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
> + struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> + int ref_cpu = nr_cpu_ids;
> +
> + if (cpumask_intersects(housekeeping_cpumask(HK_TYPE_TICK),
> + policy->cpus))
> + ref_cpu = cpumask_nth_and(cpu, policy->cpus,
> + housekeeping_cpumask(HK_TYPE_TICK));
> +

This is looking for any other HK CPU within the same policy for counters.
AFAIU, cpumask_nth_and() will return small_cpumask_bits/nr_cpu_ids
if the number of bits in both masks is different. Could you check
again if the current change is fine or needs something like below.
BTW, we have one CPU per policy.

cpumask_and(&mask, policy->cpus, housekeeping_cpumask(HK_TYPE_TICK));
retry:
....
cpumask_andnot(&mask, &mask, cpumask_of(cpu));
ref_cpu = cpumask_any(&mask);
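
Spelled out as a self-contained helper it would be roughly the below
(illustrative only - the helper name and signature are made up here and
a caller-provided scratch cpumask is assumed, none of this is taken from
the patch):

	static int amu_pick_ref_cpu(int cpu, struct cpufreq_policy *policy,
				    struct cpumask *scratch)
	{
		/* CPUs of this policy that still take the regular sched tick */
		cpumask_and(scratch, policy->cpus,
			    housekeeping_cpumask(HK_TYPE_TICK));
		/* never pick the CPU whose sample has already gone stale */
		cpumask_andnot(scratch, scratch, cpumask_of(cpu));

		/* >= nr_cpu_ids when the resulting mask is empty */
		return cpumask_any(scratch);
	}

With that, a policy whose only housekeeping CPU is the one being queried
ends up with an empty mask and the caller can simply bail out.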

Thank you,
Sumit Gupta

> + cpufreq_cpu_put(policy);
> + if (ref_cpu >= nr_cpu_ids || ref_cpu == cpu)
> + /* No alternative to pull info from */
> + return 0;
> + cpu = ref_cpu;
> + goto retry;
> + }
> + /*
> + * Reversed computation to the one used to determine
> + * the arch_freq_scale value
> + * (see amu_scale_freq_tick for details)
> + */
> + scale = arch_scale_freq_capacity(cpu);
> + freq = scale * arch_scale_freq_ref(cpu);
> + freq >>= SCHED_CAPACITY_SHIFT;
> +
> + return freq;
> +}
> +
> static void amu_fie_setup(const struct cpumask *cpus)
> {
> int cpu;
> --
> 2.25.1
>
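
As a quick sanity check on the reversed computation in the hunk above
(the numbers are purely illustrative): with arch_scale_freq_ref() returning
3000000 kHz and a stored scale of 512 - i.e. the CPU averaged half of its
reference frequency over the last tick period - the function returns

	freq = (512 * 3000000) >> SCHED_CAPACITY_SHIFT;	/* = 1500000 kHz */

which is just the per-tick scale computation run backwards.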

2024-03-20 16:53:04

by Sumit Gupta

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add support for AArch64 AMUv1-based arch_freq_get_on_cpu

Hi Beata,

>> On Tuesday 12 Mar 2024 at 08:34:28 (+0000), Beata Michalska wrote:
>>> Introducing arm64 specific version of arch_freq_get_on_cpu, cashing on
>>> existing implementation for FIE and AMUv1 support: the frequency scale
>>> factor, updated on each sched tick, serves as a base for retrieving
>>> the frequency for a given CPU, representing an average frequency
>>> reported between the ticks - thus its accuracy is limited.
>>>
>>> The changes have been rather lightly (due to some limitations) tested on
>>> an FVP model.
>>>
>>> Relevant discussions:
>>> [1] https://lore.kernel.org/all/[email protected]/
>>> [2] https://lore.kernel.org/all/7eozim2xnepacnnkzxlbx34hib4otycnbn4dqymfziqou5lw5u@5xzpv3t7sxo3/
>>> [3] https://lore.kernel.org/all/[email protected]/
>>> [4] https://lore.kernel.org/lkml/[email protected]/T/#m4e74cb5a0aaa353c60fedc6cfb95ab7a6e381e3c
>>>
>>> v3:
>>> - dropping changes to cpufreq_verify_current_freq
>>> - pulling in changes from Ionela initializing capacity_freq_ref to 0
>>> (thanks for that!) and applying suggestions made by her during last review:
>>> - switching to arch_scale_freq_capacity and arch_scale_freq_ref when
>>> reversing freq scale factor computation
>>> - swapping shift with multiplication
>>> - adding time limit for considering last scale update as valid
>>> - updating frequency scale factor upon entering idle
>>>
>>> v2:
>>> - Splitting the patches
>>> - Adding comment for full dyntick mode
>>> - Plugging arch_freq_get_on_cpu into cpufreq_verify_current_freq instead
>>> of in show_cpuinfo_cur_freq to allow the framework to stay more in sync
>>> with potential freq changes
>>>
>>> Beata Michalska (2):
>>> arm64: Provide an AMU-based version of arch_freq_get_on_cpu
>>> arm64: Update AMU-based frequency scale factor on entering idle
>>>
>>> Ionela Voinescu (1):
>>> arch_topology: init capacity_freq_ref to 0
>>>
>>
>> Should there have been a patch that adds a call to
>> arch_freq_get_on_cpu() from show_cpuinfo_cur_freq() as well?
>>
>> My understanding from this [1] thread and others referenced there is
>> that was something we wanted.
>>
> Right, so I must have misunderstood that, as the way I did read it was that
> it is acceptable to keep things as they are wrt cpufreq sysfs entries.
>
> ---
> BR
> Beata
>> [1] https://lore.kernel.org/lkml/[email protected]/
>>
>> Thanks,
>> Ionela.
>>

Yes, the change to show_cpuinfo_cur_freq from [1] is needed.

[1]
https://lore.kernel.org/lkml/[email protected]/

Thank you,
Sumit Gupta

2024-03-25 17:31:21

by Vanshidhar Konda

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add support for AArch64 AMUv1-based arch_freq_get_on_cpu

On Tue, Mar 12, 2024 at 08:34:28AM +0000, Beata Michalska wrote:
>Introducing arm64 specific version of arch_freq_get_on_cpu, cashing on
>existing implementation for FIE and AMUv1 support: the frequency scale
>factor, updated on each sched tick, serves as a base for retrieving
>the frequency for a given CPU, representing an average frequency
>reported between the ticks - thus its accuracy is limited.
>
>The changes have been rather lightly (due to some limitations) tested on
>an FVP model.
>

I tested these changes on an Ampere system. The results from reading
scaling_cur_freq look reasonable in the majority of cases I tested. I
only saw some unexpected behavior with cores that were configured for
nohz_full.

I observed the unexplained behavior when I tested as follows:
1. Run stress on all cores
stress-ng --cpu 186 --timeout 10m --metrics-brief
2. Observe scaling_cur_freq and cpuinfo_cur_freq for all cores
scaling_cur_freq values were within a few % of cpuinfo_cur_freq
3. Kill stress test
4. Observe scaling_cur_freq and cpuinfo_cur_freq for all cores
scaling_cur_freq values were within a few % of cpuinfo_cur_freq for
most cores except the ones configured with no_hz full.

nohz_full = 122-127
core scaling_cur_freq cpuinfo_cur_freq
[122]: 2997070 1000000
[123]: 2997070 1000000
[124]: 3000038 1000000
[125]: 2997070 1000000
[126]: 2997070 1000000
[127]: 2997070 1000000

These values were reflected for multiple seconds. I suspect the cores
entered WFI and there was no update to the scale while those cores were
idle.

Thanks,
Vanshi

2024-04-03 21:35:01

by Beata Michalska

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add support for AArch64 AMUv1-based arch_freq_get_on_cpu

On Mon, Mar 25, 2024 at 09:10:26AM -0700, Vanshidhar Konda wrote:
> On Tue, Mar 12, 2024 at 08:34:28AM +0000, Beata Michalska wrote:
> > Introducing arm64 specific version of arch_freq_get_on_cpu, cashing on
> > existing implementation for FIE and AMUv1 support: the frequency scale
> > factor, updated on each sched tick, serves as a base for retrieving
> > the frequency for a given CPU, representing an average frequency
> > reported between the ticks - thus its accuracy is limited.
> >
> > The changes have been rather lightly (due to some limitations) tested on
> > an FVP model.
> >
>
> I tested these changes on an Ampere system. The results from reading
> scaling_cur_freq look reasonable in the majority of cases I tested. I
> only saw some unexpected behavior with cores that were configured for
> nohz_full.
>
> I observed the unexplained behavior when I tested as follows:
> 1. Run stress on all cores
> stress-ng --cpu 186 --timeout 10m --metrics-brief
> 2. Observe scaling_cur_freq and cpuinfo_cur_freq for all cores
> scaling_cur_freq values were within a few % of cpuinfo_cur_freq
> 3. Kill stress test
> 4. Observe scaling_cur_freq and cpuinfo_cur_freq for all cores
> scaling_cur_freq values were within a few % of cpuinfo_cur_freq for
> most cores except the ones configured with no_hz full.
>
> nohz_full = 122-127
> core scaling_cur_freq cpuinfo_cur_freq
> [122]: 2997070 1000000
> [123]: 2997070 1000000
> [124]: 3000038 1000000
> [125]: 2997070 1000000
> [126]: 2997070 1000000
> [127]: 2997070 1000000
>
> These values were reflected for multiple seconds. I suspect the cores
> entered WFI and there was no update to the scale while those cores were
> idle.
>
Right, so the problem is with updating the counters upon entering idle, which at
this point is being done for all CPUs but should exclude the full dynticks
ones - otherwise it leads to such bad readings. So for nohz_full cores the
cpufreq driver will have to take care of getting the current frequency.

Will be sending a fix for that.
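
Roughly, the intention is something along these lines (an illustrative
sketch only, not the actual fix - the helper name and the hook point are
made up for the example):

	static void amu_update_counters_on_idle_entry(void)
	{
		/*
		 * nohz_full CPUs keep their (possibly stale) sample, so that
		 * arch_freq_get_on_cpu() falls back to an alternative source
		 * instead of reporting a pre-idle frequency.
		 */
		if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_TICK))
			return;

		update_freq_counters_refs();
	}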

Thank you very much for testing - appreciate that!

---
BR
Beata
> Thanks,
> Vanshi

2024-04-03 21:38:18

by Beata Michalska

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] arm64: Provide an AMU-based version of arch_freq_get_on_cpu

On Wed, Mar 20, 2024 at 10:13:18PM +0530, Sumit Gupta wrote:
>
>
> On 12/03/24 14:04, Beata Michalska wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > With the Frequency Invariance Engine (FIE) being already wired up with
> > sched tick and making use of relevant (core counter and constant
> > counter) AMU counters, getting the current frequency for a given CPU
> > on supported platforms can be achieved by utilizing the frequency scale
> > factor which reflects an average CPU frequency for the last tick period
> > length.
> >
> > The solution is partially based on APERF/MPERF implementation of
> > arch_freq_get_on_cpu.
> >
> > Suggested-by: Ionela Voinescu <[email protected]>
> > Signed-off-by: Beata Michalska <[email protected]>
> > ---
> > arch/arm64/kernel/topology.c | 103 +++++++++++++++++++++++++++++++----
> > 1 file changed, 92 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> > index 1a2c72f3e7f8..42cb19c31719 100644
> > --- a/arch/arm64/kernel/topology.c
> > +++ b/arch/arm64/kernel/topology.c
> > @@ -17,6 +17,8 @@
> > #include <linux/cpufreq.h>
> > #include <linux/init.h>
> > #include <linux/percpu.h>
> > +#include <linux/sched/isolation.h>
> > +#include <linux/seqlock_types.h>
> >
> > #include <asm/cpu.h>
> > #include <asm/cputype.h>
> > @@ -88,18 +90,31 @@ int __init parse_acpi_topology(void)
> > * initialized.
> > */
> > static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale) = 1UL << (2 * SCHED_CAPACITY_SHIFT);
> > -static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
> > -static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
> > static cpumask_var_t amu_fie_cpus;
> >
> > +struct amu_cntr_sample {
> > + u64 arch_const_cycles_prev;
> > + u64 arch_core_cycles_prev;
> > + unsigned long last_update;
> > + seqcount_t seq;
> > +};
> > +
> > +static DEFINE_PER_CPU_SHARED_ALIGNED(struct amu_cntr_sample, cpu_amu_samples) = {
> > + .seq = SEQCNT_ZERO(cpu_amu_samples.seq)
> > +};
> > +
> > void update_freq_counters_refs(void)
> > {
> > - this_cpu_write(arch_core_cycles_prev, read_corecnt());
> > - this_cpu_write(arch_const_cycles_prev, read_constcnt());
> > + struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
> > +
> > + amu_sample->arch_core_cycles_prev = read_corecnt();
> > + amu_sample->arch_const_cycles_prev = read_constcnt();
> > }
> >
> > static inline bool freq_counters_valid(int cpu)
> > {
> > + struct amu_cntr_sample *amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> > +
> > if ((cpu >= nr_cpu_ids) || !cpumask_test_cpu(cpu, cpu_present_mask))
> > return false;
> >
> > @@ -108,8 +123,8 @@ static inline bool freq_counters_valid(int cpu)
> > return false;
> > }
> >
> > - if (unlikely(!per_cpu(arch_const_cycles_prev, cpu) ||
> > - !per_cpu(arch_core_cycles_prev, cpu))) {
> > + if (unlikely(!amu_sample->arch_const_cycles_prev ||
> > + !amu_sample->arch_core_cycles_prev)) {
> > pr_debug("CPU%d: cycle counters are not enabled.\n", cpu);
> > return false;
> > }
> > @@ -152,20 +167,27 @@ void freq_inv_set_max_ratio(int cpu, u64 max_rate)
> >
> > static void amu_scale_freq_tick(void)
> > {
> > + struct amu_cntr_sample *amu_sample = this_cpu_ptr(&cpu_amu_samples);
> > u64 prev_core_cnt, prev_const_cnt;
> > u64 core_cnt, const_cnt, scale;
> >
> > - prev_const_cnt = this_cpu_read(arch_const_cycles_prev);
> > - prev_core_cnt = this_cpu_read(arch_core_cycles_prev);
> > + prev_const_cnt = amu_sample->arch_const_cycles_prev;
> > + prev_core_cnt = amu_sample->arch_core_cycles_prev;
> > +
> > + write_seqcount_begin(&amu_sample->seq);
> >
> > update_freq_counters_refs();
> >
> > - const_cnt = this_cpu_read(arch_const_cycles_prev);
> > - core_cnt = this_cpu_read(arch_core_cycles_prev);
> > + const_cnt = amu_sample->arch_const_cycles_prev;
> > + core_cnt = amu_sample->arch_core_cycles_prev;
> >
> > + /*
> > + * This should not happen unless the AMUs have been reset and the
> > + * counter values have not been restored - unlikely
> > + */
> > if (unlikely(core_cnt <= prev_core_cnt ||
> > const_cnt <= prev_const_cnt))
> > - return;
> > + goto leave;
> >
> > /*
> > * /\core arch_max_freq_scale
> > @@ -182,6 +204,10 @@ static void amu_scale_freq_tick(void)
> >
> > scale = min_t(unsigned long, scale, SCHED_CAPACITY_SCALE);
> > this_cpu_write(arch_freq_scale, (unsigned long)scale);
> > +
> > + amu_sample->last_update = jiffies;
> > +leave:
> > + write_seqcount_end(&amu_sample->seq);
> > }
> >
> > static struct scale_freq_data amu_sfd = {
> > @@ -189,6 +215,61 @@ static struct scale_freq_data amu_sfd = {
> > .set_freq_scale = amu_scale_freq_tick,
> > };
> >
> > +#define AMU_SAMPLE_EXP_MS 20
> > +
> > +unsigned int arch_freq_get_on_cpu(int cpu)
> > +{
> > + struct amu_cntr_sample *amu_sample;
> > + unsigned long last_update;
> > + unsigned int seq;
> > + unsigned int freq;
> > + u64 scale;
> > +
> > + if (!cpumask_test_cpu(cpu, amu_fie_cpus) || !arch_scale_freq_ref(cpu))
> > + return 0;
> > +
> > +retry:
> > + amu_sample = per_cpu_ptr(&cpu_amu_samples, cpu);
> > +
> > + do {
> > + seq = raw_read_seqcount_begin(&amu_sample->seq);
> > + last_update = amu_sample->last_update;
> > + } while (read_seqcount_retry(&amu_sample->seq, seq));
> > +
> > + /*
> > + * For those CPUs that are in full dynticks mode,
> > + * and those that have not seen a tick for a while,
> > + * try an alternative source for the counters (and thus the freq scale),
> > + * if available for the given policy
> > + */
> > + if (time_is_before_jiffies(last_update + msecs_to_jiffies(AMU_SAMPLE_EXP_MS))) {
> > + struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
> > + int ref_cpu = nr_cpu_ids;
> > +
> > + if (cpumask_intersects(housekeeping_cpumask(HK_TYPE_TICK),
> > + policy->cpus))
> > + ref_cpu = cpumask_nth_and(cpu, policy->cpus,
> > + housekeeping_cpumask(HK_TYPE_TICK));
> > +
>
> This is looking for any other HK CPU within the same policy for counters.
> AFAIU, cpumask_nth_and() will return small_cpumask_bits/nr_cpu_ids
> if the number of bits in both masks is different. Could you check
> again if the current change is fine or needs something like below.
> BTW, we have one CPU per policy.
>
> cpumask_and(&mask, policy->cpus, housekeeping_cpumask(HK_TYPE_TICK));
> retry:
> ....
> cpumask_andnot(&mask, &mask, cpumask_of(cpu));
> ref_cpu = cpumask_any(&mask);
>
At this point this is indeed bogus, though for a different reason.
I've rewritten that part a bit; still, it will bail out for single-CPU
policies.

---
BR
Beata


> Thank you,
> Sumit Gupta
>
> > + cpufreq_cpu_put(policy);
> > + if (ref_cpu >= nr_cpu_ids || ref_cpu == cpu)
> > + /* No alternative to pull info from */
> > + return 0;
> > + cpu = ref_cpu;
> > + goto retry;
> > + }
> > + /*
> > + * Reversed computation to the one used to determine
> > + * the arch_freq_scale value
> > + * (see amu_scale_freq_tick for details)
> > + */
> > + scale = arch_scale_freq_capacity(cpu);
> > + freq = scale * arch_scale_freq_ref(cpu);
> > + freq >>= SCHED_CAPACITY_SHIFT;
> > +
> > + return freq;
> > +}
> > +
> > static void amu_fie_setup(const struct cpumask *cpus)
> > {
> > int cpu;
> > --
> > 2.25.1
> >

2024-04-03 21:44:01

by Beata Michalska

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add support for AArch64 AMUv1-based arch_freq_get_on_cpu

On Wed, Mar 20, 2024 at 10:22:22PM +0530, Sumit Gupta wrote:
> Hi Beata,
>
> > > On Tuesday 12 Mar 2024 at 08:34:28 (+0000), Beata Michalska wrote:
> > > > Introducing arm64 specific version of arch_freq_get_on_cpu, cashing on
> > > > existing implementation for FIE and AMUv1 support: the frequency scale
> > > > factor, updated on each sched tick, serves as a base for retrieving
> > > > the frequency for a given CPU, representing an average frequency
> > > > reported between the ticks - thus its accuracy is limited.
> > > >
> > > > The changes have been rather lightly (due to some limitations) tested on
> > > > an FVP model.
> > > >
> > > > Relevant discussions:
> > > > [1] https://lore.kernel.org/all/[email protected]/
> > > > [2] https://lore.kernel.org/all/7eozim2xnepacnnkzxlbx34hib4otycnbn4dqymfziqou5lw5u@5xzpv3t7sxo3/
> > > > [3] https://lore.kernel.org/all/[email protected]/
> > > > [4] https://lore.kernel.org/lkml/[email protected]/T/#m4e74cb5a0aaa353c60fedc6cfb95ab7a6e381e3c
> > > >
> > > > v3:
> > > > - dropping changes to cpufreq_verify_current_freq
> > > > - pulling in changes from Ionela initializing capacity_freq_ref to 0
> > > > (thanks for that!) and applying suggestions made by her during last review:
> > > > - switching to arch_scale_freq_capacity and arch_scale_freq_ref when
> > > > reversing freq scale factor computation
> > > > - swapping shift with multiplication
> > > > - adding time limit for considering last scale update as valid
> > > > - updating frequency scale factor upon entering idle
> > > >
> > > > v2:
> > > > - Splitting the patches
> > > > - Adding comment for full dyntick mode
> > > > - Plugging arch_freq_get_on_cpu into cpufreq_verify_current_freq instead
> > > > of in show_cpuinfo_cur_freq to allow the framework to stay more in sync
> > > > with potential freq changes
> > > >
> > > > Beata Michalska (2):
> > > > arm64: Provide an AMU-based version of arch_freq_get_on_cpu
> > > > arm64: Update AMU-based frequency scale factor on entering idle
> > > >
> > > > Ionela Voinescu (1):
> > > > arch_topology: init capacity_freq_ref to 0
> > > >
> > >
> > > Should there have been a patch that adds a call to
> > > arch_freq_get_on_cpu() from show_cpuinfo_cur_freq() as well?
> > >
> > > My understanding from this [1] thread and others referenced there is
> > > that was something we wanted.
> > >
> > Right, so I must have misunderstood that, as the way I did read it was that
> > it is acceptable to keep things as they are wrt cpufreq sysfs entries.
> >
> > ---
> > BR
> > Beata
> > > [1] https://lore.kernel.org/lkml/[email protected]/
> > >
> > > Thanks,
> > > Ionela.
> > >
>
> Yes, the change to show_cpuinfo_cur_freq from [1] is needed.
>
Noted. Will send an update including fixes and this requested change.

---
BR
Beata
> [1]
> https://lore.kernel.org/lkml/[email protected]/
>
> Thank you,
> Sumit Gupta