LinuxLists.cc - [PATCH v2 0/1] Disable FIE on machines with slow counters

2022-07-28 22:16:29

Subject: [PATCH v2 0/1] Disable FIE on machines with slow counters

FIE assumes the delivered/reference perf registers are fast to read, so
it reads them quite frequently. On a couple of Arm platforms, though,
they end up in PCC regions, which require mailbox handshaking with
other parts of the platform.

This results in a lot of overhead in the cppc_fie task. As such, let's
disable FIE at runtime if we detect it enabled on one of those
platforms. Also allow the user to disable it manually via a module
parameter.

v1->v2:
Apply Rafael's review comments.
Move the MODULE_PARAM into the ifdef
Fix compiler warning when ACPI_CPPC_LIB is disabled.

Jeremy Linton (1):
ACPI: CPPC: Disable FIE if registers in PCC regions

drivers/acpi/cppc_acpi.c | 41 ++++++++++++++++++++++++++++++++++
drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
include/acpi/cppc_acpi.h | 5 +++++
3 files changed, 61 insertions(+), 4 deletions(-)

--
2.35.3

2022-07-28 22:19:13

by Jeremy Linton

Subject: [PATCH v2 1/1] ACPI: CPPC: Disable FIE if registers in PCC regions

PCC regions utilize a mailbox to set/retrieve register values used by
the CPPC code. This is fine as long as the operations are
infrequent. With the FIE code enabled, though, the overhead can range
from 2-11% of system CPU time (e.g., as measured by top) on Arm
based machines.

So, before enabling FIE, ensure that none of the registers used by
cppc_get_perf_ctrs() are in a PCC region. Furthermore, let's also
add a module parameter which can disable it at boot or module
reload.

Signed-off-by: Jeremy Linton <[email protected]>
---
drivers/acpi/cppc_acpi.c | 41 ++++++++++++++++++++++++++++++++++
drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
include/acpi/cppc_acpi.h | 5 +++++
3 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
index 3c6d4ef87be0..38b881db14c7 100644
--- a/drivers/acpi/cppc_acpi.c
+++ b/drivers/acpi/cppc_acpi.c
@@ -1246,6 +1246,47 @@ int cppc_get_perf_caps(int cpunum, struct cppc_perf_caps *perf_caps)
 }
 EXPORT_SYMBOL_GPL(cppc_get_perf_caps);

+/**
+ * cppc_perf_ctrs_in_pcc - Check if any perf counters are in a PCC region.
+ *
+ * CPPC has flexibility about how counters describing CPU perf are delivered.
+ * One of the choices is PCC regions, which can have a high access latency. This
+ * routine allows callers of cppc_get_perf_ctrs() to know this ahead of time.
+ *
+ * Return: true if any of the counters are in PCC regions, false otherwise
+ */
+bool cppc_perf_ctrs_in_pcc(void)
+{
+	int cpu;
+
+	for_each_present_cpu(cpu) {
+		struct cpc_register_resource *ref_perf_reg;
+		struct cpc_desc *cpc_desc;
+
+		cpc_desc = per_cpu(cpc_desc_ptr, cpu);
+
+		if (CPC_IN_PCC(&cpc_desc->cpc_regs[DELIVERED_CTR]) ||
+		    CPC_IN_PCC(&cpc_desc->cpc_regs[REFERENCE_CTR]) ||
+		    CPC_IN_PCC(&cpc_desc->cpc_regs[CTR_WRAP_TIME]))
+			return true;
+
+
+		ref_perf_reg = &cpc_desc->cpc_regs[REFERENCE_PERF];
+
+		/*
+		 * If reference perf register is not supported then we should
+		 * use the nominal perf value
+		 */
+		if (!CPC_SUPPORTED(ref_perf_reg))
+			ref_perf_reg = &cpc_desc->cpc_regs[NOMINAL_PERF];
+
+		if (CPC_IN_PCC(ref_perf_reg))
+			return true;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(cppc_perf_ctrs_in_pcc);
+
 /**
  * cppc_get_perf_ctrs - Read a CPU's performance feedback counters.
  * @cpunum: CPU from which to read counters.
diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index 24eaf0ec344d..ed607e27d6bb 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -63,7 +63,11 @@ static struct cppc_workaround_oem_info wa_info[] = {

 static struct cpufreq_driver cppc_cpufreq_driver;

+static bool fie_disabled;
+
 #ifdef CONFIG_ACPI_CPPC_CPUFREQ_FIE
+module_param(fie_disabled, bool, 0444);
+MODULE_PARM_DESC(fie_disabled, "Disable Frequency Invariance Engine (FIE)");

 /* Frequency invariance support */
 struct cppc_freq_invariance {
@@ -158,7 +162,7 @@ static void cppc_cpufreq_cpu_fie_init(struct cpufreq_policy *policy)
 	struct cppc_freq_invariance *cppc_fi;
 	int cpu, ret;

-	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+	if (fie_disabled)
 		return;

 	for_each_cpu(cpu, policy->cpus) {
@@ -199,7 +203,7 @@ static void cppc_cpufreq_cpu_fie_exit(struct cpufreq_policy *policy)
 	struct cppc_freq_invariance *cppc_fi;
 	int cpu;

-	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+	if (fie_disabled)
 		return;

 	/* policy->cpus will be empty here, use related_cpus instead */
@@ -229,7 +233,12 @@ static void __init cppc_freq_invariance_init(void)
 	};
 	int ret;

-	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+	if (cppc_perf_ctrs_in_pcc()) {
+		pr_debug("FIE not enabled on systems with registers in PCC\n");
+		fie_disabled = true;
+	}
+
+	if (fie_disabled)
 		return;

 	kworker_fie = kthread_create_worker(0, "cppc_fie");
@@ -247,7 +256,7 @@ static void __init cppc_freq_invariance_init(void)

 static void cppc_freq_invariance_exit(void)
 {
-	if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
+	if (fie_disabled)
 		return;

 	kthread_destroy_worker(kworker_fie);
@@ -940,6 +949,8 @@ static void cppc_check_hisi_workaround(void)
 		}
 	}

+	fie_disabled = true;
+
 	acpi_put_table(tbl);
 }

diff --git a/include/acpi/cppc_acpi.h b/include/acpi/cppc_acpi.h
index d389bab54241..fe6dc3e5a454 100644
--- a/include/acpi/cppc_acpi.h
+++ b/include/acpi/cppc_acpi.h
@@ -140,6 +140,7 @@ extern int cppc_get_perf_ctrs(int cpu, struct cppc_perf_fb_ctrs *perf_fb_ctrs);
 extern int cppc_set_perf(int cpu, struct cppc_perf_ctrls *perf_ctrls);
 extern int cppc_set_enable(int cpu, bool enable);
 extern int cppc_get_perf_caps(int cpu, struct cppc_perf_caps *caps);
+extern bool cppc_perf_ctrs_in_pcc(void);
 extern bool acpi_cpc_valid(void);
 extern bool cppc_allow_fast_switch(void);
 extern int acpi_get_psd_map(unsigned int cpu, struct cppc_cpudata *cpu_data);
@@ -173,6 +174,10 @@ static inline int cppc_get_perf_caps(int cpu, struct cppc_perf_caps *caps)
 {
 	return -ENOTSUPP;
 }
+static inline bool cppc_perf_ctrs_in_pcc(void)
+{
+	return false;
+}
 static inline bool acpi_cpc_valid(void)
 {
 	return false;
--
2.35.3
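
For readers without a kernel tree at hand, the check performed by
cppc_perf_ctrs_in_pcc() in the patch above can be modelled in plain
userspace C. This is only a sketch under stated assumptions: struct
cpu_regs, enum reg_space, in_pcc() and any_ctrs_in_pcc() are hypothetical
stand-ins for the kernel's cpc_desc, CPC_IN_PCC() and per-CPU iteration,
and SPACE_NONE stands in for "register not supported" (the kernel's
CPC_SUPPORTED() test is more involved).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of where a CPC register lives. */
enum reg_space { SPACE_NONE, SPACE_SYSTEM_MEMORY, SPACE_FFH, SPACE_PCC };

/* Same register names as the patch, used here as array indices. */
enum reg_index {
	DELIVERED_CTR, REFERENCE_CTR, CTR_WRAP_TIME,
	REFERENCE_PERF, NOMINAL_PERF, NUM_REGS
};

/* Hypothetical per-CPU description: one address space per register. */
struct cpu_regs {
	enum reg_space space[NUM_REGS];
};

static bool in_pcc(const struct cpu_regs *c, enum reg_index i)
{
	return c->space[i] == SPACE_PCC;
}

/* True if any register on the counter-read path of any CPU is in PCC. */
static bool any_ctrs_in_pcc(const struct cpu_regs *cpus, size_t ncpus)
{
	assert(cpus != NULL || ncpus == 0);

	for (size_t i = 0; i < ncpus; i++) {
		const struct cpu_regs *c = &cpus[i];
		enum reg_index ref = REFERENCE_PERF;

		if (in_pcc(c, DELIVERED_CTR) || in_pcc(c, REFERENCE_CTR) ||
		    in_pcc(c, CTR_WRAP_TIME))
			return true;

		/* Fall back to nominal perf if reference perf is unsupported. */
		if (c->space[REFERENCE_PERF] == SPACE_NONE)
			ref = NOMINAL_PERF;

		if (in_pcc(c, ref))
			return true;
	}
	return false;
}
```

Note that a single CPU with a single PCC-backed register is enough to
report true, which in the patch disables FIE for the whole system.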

2022-07-29 13:28:44

by Punit Agrawal

Subject: Re: [PATCH v2 1/1] ACPI: CPPC: Disable FIE if registers in PCC regions

Hi Jeremy,

One comment / query below.

Jeremy Linton <[email protected]> writes:

> PCC regions utilize a mailbox to set/retrieve register values used by
> the CPPC code. This is fine as long as the operations are
> infrequent. With the FIE code enabled though the overhead can range
> from 2-11% of system CPU overhead (ex: as measured by top) on Arm
> based machines.
>
> So, before enabling FIE assure none of the registers used by
> cppc_get_perf_ctrs() are in the PCC region. Furthermore lets also
> enable a module parameter which can also disable it at boot or module
> reload.
>
> Signed-off-by: Jeremy Linton <[email protected]>
> ---
> drivers/acpi/cppc_acpi.c | 41 ++++++++++++++++++++++++++++++++++
> drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
> include/acpi/cppc_acpi.h | 5 +++++
> 3 files changed, 61 insertions(+), 4 deletions(-)
>

[...]

> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> index 24eaf0ec344d..ed607e27d6bb 100644
> --- a/drivers/cpufreq/cppc_cpufreq.c
> +++ b/drivers/cpufreq/cppc_cpufreq.c

[...]

> @@ -229,7 +233,12 @@ static void __init cppc_freq_invariance_init(void)
> };
> int ret;
>
> - if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
> + if (cppc_perf_ctrs_in_pcc()) {
> + pr_debug("FIE not enabled on systems with registers in PCC\n");

The message should probably be promoted to a pr_info() and exposed as
part of the kernel logs. It is a change in the default behaviour we've
had until now. The message will provide some hint about why it was
disabled.

Thoughts?

[...]

2022-07-29 15:24:07

by Jeremy Linton

Subject: Re: [PATCH v2 1/1] ACPI: CPPC: Disable FIE if registers in PCC regions

Hi,

On 7/29/22 07:59, Punit Agrawal wrote:
> Hi Jeremy,
>
> One comment / query below.
>
> Jeremy Linton <[email protected]> writes:
>
>> PCC regions utilize a mailbox to set/retrieve register values used by
>> the CPPC code. This is fine as long as the operations are
>> infrequent. With the FIE code enabled though the overhead can range
>> from 2-11% of system CPU overhead (ex: as measured by top) on Arm
>> based machines.
>>
>> So, before enabling FIE assure none of the registers used by
>> cppc_get_perf_ctrs() are in the PCC region. Furthermore lets also
>> enable a module parameter which can also disable it at boot or module
>> reload.
>>
>> Signed-off-by: Jeremy Linton <[email protected]>
>> ---
>> drivers/acpi/cppc_acpi.c | 41 ++++++++++++++++++++++++++++++++++
>> drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
>> include/acpi/cppc_acpi.h | 5 +++++
>> 3 files changed, 61 insertions(+), 4 deletions(-)
>>
>
> [...]
>
>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
>> index 24eaf0ec344d..ed607e27d6bb 100644
>> --- a/drivers/cpufreq/cppc_cpufreq.c
>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>
> [...]
>
>> @@ -229,7 +233,12 @@ static void __init cppc_freq_invariance_init(void)
>> };
>> int ret;
>>
>> - if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
>> + if (cppc_perf_ctrs_in_pcc()) {
>> + pr_debug("FIE not enabled on systems with registers in PCC\n");
>
> The message should probably be promoted to a pr_info() and exposed as
> part of the kernel logs. It is a change in the default behaviour we've
> had until now. The message will provide some hint about why it was
> disabled.
>
> Thoughts?

I personally flip-flopped between making it pr_info or pr_debug and
settled on debug because no one else was complaining about the cppc_fie
consumption. To me, that meant that the users of platforms utilizing
PCC regions weren't sensitive to the problem, or weren't yet running a
distro/kernel with it enabled, or any number of other reasons why the
problem wasn't getting more attention. Mostly I concluded the FIE code
hadn't shown up in "enterprise" distros yet.

But, yeah, if no one is going to complain about the extra messages,
pr_info() is a better plan.

Thanks for looking at this!

2022-08-01 12:57:04

by Punit Agrawal

Subject: Re: [PATCH v2 1/1] ACPI: CPPC: Disable FIE if registers in PCC regions

Jeremy Linton <[email protected]> writes:

> Hi,
>
> On 7/29/22 07:59, Punit Agrawal wrote:
>> Hi Jeremy,
>> One comment / query below.
>> Jeremy Linton <[email protected]> writes:
>>
>>> PCC regions utilize a mailbox to set/retrieve register values used by
>>> the CPPC code. This is fine as long as the operations are
>>> infrequent. With the FIE code enabled though the overhead can range
>>> from 2-11% of system CPU overhead (ex: as measured by top) on Arm
>>> based machines.
>>>
>>> So, before enabling FIE assure none of the registers used by
>>> cppc_get_perf_ctrs() are in the PCC region. Furthermore lets also
>>> enable a module parameter which can also disable it at boot or module
>>> reload.
>>>
>>> Signed-off-by: Jeremy Linton <[email protected]>
>>> ---
>>> drivers/acpi/cppc_acpi.c | 41 ++++++++++++++++++++++++++++++++++
>>> drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
>>> include/acpi/cppc_acpi.h | 5 +++++
>>> 3 files changed, 61 insertions(+), 4 deletions(-)
>>>
>> [...]
>>
>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
>>> index 24eaf0ec344d..ed607e27d6bb 100644
>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>> [...]
>>
>>> @@ -229,7 +233,12 @@ static void __init cppc_freq_invariance_init(void)
>>> };
>>> int ret;
>>> - if (cppc_cpufreq_driver.get == hisi_cppc_cpufreq_get_rate)
>>> + if (cppc_perf_ctrs_in_pcc()) {
>>> + pr_debug("FIE not enabled on systems with registers in PCC\n");
>> The message should probably be promoted to a pr_info() and exposed
>> as
>> part of the kernel logs. It is a change in the default behaviour we've
>> had until now. The message will provide some hint about why it was
>> disabled.
>> Thoughts?
>
> I personally flip flopped between making it pr_info or pr_debug and
> settled on debug because no one else was complaining about the
> cppc_fie consumption. Which to me, meant that the users of platforms
> utilizing PCC regions weren't sensitive to the problem, or weren't yet
> running a distro/kernel with it enabled, or any number of other
> reasons why the problem wasn't getting more attention. Mostly I
> concluded the FIE code hadn't shown up in "enterprise" distros yet..

I too was thinking that likely enterprise users haven't started digging
into the performance impact of enabling frequency invariance.

> But, yah, if no one is going to complain about the extra messages
> pr_info() is a better plan.

Thanks! I'll look out for the updated patch.

FIE was designed to improve load balancing (and arguably fairness
too). Hopefully, the message will help users look more closely and
complain to the system vendor / upstream if it matters to their workloads.

2022-08-10 12:55:28

by Lukasz Luba

Subject: Re: [PATCH v2 1/1] ACPI: CPPC: Disable FIE if registers in PCC regions

Hi Jeremy,

+CC Valentin since he might be interested in this finding
+CC Ionela, Dietmar

I have a few comments for this patch.

On 7/28/22 23:10, Jeremy Linton wrote:
> PCC regions utilize a mailbox to set/retrieve register values used by
> the CPPC code. This is fine as long as the operations are
> infrequent. With the FIE code enabled though the overhead can range
> from 2-11% of system CPU overhead (ex: as measured by top) on Arm
> based machines.
>
> So, before enabling FIE assure none of the registers used by
> cppc_get_perf_ctrs() are in the PCC region. Furthermore lets also
> enable a module parameter which can also disable it at boot or module
> reload.
>
> Signed-off-by: Jeremy Linton <[email protected]>
> ---
> drivers/acpi/cppc_acpi.c | 41 ++++++++++++++++++++++++++++++++++
> drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
> include/acpi/cppc_acpi.h | 5 +++++
> 3 files changed, 61 insertions(+), 4 deletions(-)

1. You assume that all platforms would have this big overhead when
   they have the PCC regions for this purpose.
   Do we know which versions of the HW mailbox have been implemented
   and used on the platforms that show this 2-11% overhead?
   Do more recent MHUs also have such issues, so that we could block
   them by default (like in your code)?

2. I would prefer to simply change the default Kconfig value to 'n' for
   ACPI_CPPC_CPUFREQ_FIE, instead of adding runtime check code which
   disables it.
   We probably introduced this overhead for older platforms with
   this commit:

commit 4c38f2df71c8e33c0b64865992d693f5022eeaad
Author: Viresh Kumar <[email protected]>
Date: Tue Jun 23 15:49:40 2020 +0530

cpufreq: CPPC: Add support for frequency invariance

   If the test server with this config enabled performs well
   in the stress-tests, then on the production server the config may be
   set to 'y' (or 'm' and loaded).

I would vote to not add extra code, which after a while might need to
be extended because some HW turns out to be capable after all (so
we could check at runtime and enable it). IMO this creates additional
complexity in our diverse configuration/tunable space in our code.

When we don't compile this in, we should fall back to the old-style
FIE, which has been used on these old platforms.

BTW (I have to leave it here) the first-class solution for those servers
is to implement AMU counters, so the overhead to retrieve this info is
really low.

Regards,
Lukasz

2022-08-10 12:55:30

by Ionela Voinescu

Subject: Re: [PATCH v2 1/1] ACPI: CPPC: Disable FIE if registers in PCC regions

Hi folks,

On Wednesday 10 Aug 2022 at 13:29:08 (+0100), Lukasz Luba wrote:
> Hi Jeremy,
>
> +CC Valentin since he might be interested in this finding
> +CC Ionela, Dietmar
>
> I have a few comments for this patch.
>
>
> On 7/28/22 23:10, Jeremy Linton wrote:
> > PCC regions utilize a mailbox to set/retrieve register values used by
> > the CPPC code. This is fine as long as the operations are
> > infrequent. With the FIE code enabled though the overhead can range
> > from 2-11% of system CPU overhead (ex: as measured by top) on Arm
> > based machines.
> >
> > So, before enabling FIE assure none of the registers used by
> > cppc_get_perf_ctrs() are in the PCC region. Furthermore lets also
> > enable a module parameter which can also disable it at boot or module
> > reload.
> >
> > Signed-off-by: Jeremy Linton <[email protected]>
> > ---
> > drivers/acpi/cppc_acpi.c | 41 ++++++++++++++++++++++++++++++++++
> > drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
> > include/acpi/cppc_acpi.h | 5 +++++
> > 3 files changed, 61 insertions(+), 4 deletions(-)
>
>
> 1. You assume that all platforms would have this big overhead when
> they have the PCC regions for this purpose.
> Do we know which version of HW mailbox have been implemented
> and used that have this 2-11% overhead in a platform?
> Do also more recent MHU have such issues, so we could block
> them by default (like in your code)?
>
> 2. I would prefer to simply change the default Kconfig value to 'n' for
> the ACPI_CPPC_CPUFREQ_FIE, instead of creating a runtime
> check code which disables it.
> We have probably introduce this overhead for older platforms with
> this commit:
>
> commit 4c38f2df71c8e33c0b64865992d693f5022eeaad
> Author: Viresh Kumar <[email protected]>
> Date: Tue Jun 23 15:49:40 2020 +0530
>
> cpufreq: CPPC: Add support for frequency invariance
>
>
>
> If the test server with this config enabled performs well
> in the stress-tests, then on production server the config may be
> set to 'y' (or 'm' and loaded).
>
> I would vote to not add extra code, which then after a while might be
> decided to bw extended because actually some HW is actually capable (so
> we could check in runtime and enable it). IMO this create an additional
> complexity in our diverse configuration/tunnable space in our code.
>

I agree that having CONFIG_ACPI_CPPC_CPUFREQ_FIE default to no is the
simpler solution, but it puts the decision in the hands of platform
providers, which might result in this functionality not being used most
of the time, if at all. That being said, the use of CPPC counters is
meant as a last resort for FIE, if the platform does not have AMUs. This
is why I recommended that this default to no in the review of the original
patches.

But I don't see these runtime options as adding a lot of complexity
and therefore agree with the idea of this patch, versus the config
change above, with two design comments:
 - Rather than having a check for fie_disabled in multiple init and exit
   functions, I think the code should be slightly redesigned to elegantly
   bail out of most functions if cppc_freq_invariance_init() failed.
 - Given the multiple options to disable this functionality (config,
   PCC check), I don't see a need for a module parameter or runtime user
   input, unless we make that override all previous decisions, as in: if
   CONFIG_ACPI_CPPC_CPUFREQ_FIE=y, even if cppc_perf_ctrs_in_pcc(), if
   the fie_disabled module parameter is no, then counters should be used
   for FIE.

Thanks,
Ionela.

> When we don't compile-in this, we should fallback to old-style
> FIE, which has been used on these old platforms.
>
> BTW (I have to leave it here) the first-class solution for those servers
> is to implement AMU counters, so the overhead to retrieve this info is
> really low.
>
> Regards,
> Lukasz
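
The override semantics suggested in the second design comment above can
be sketched as a small decision function. This is a userspace model, not
code from the patch: enum fie_param and fie_should_run() are hypothetical
names, and the tri-state parameter (unset / explicit disable / explicit
enable) is an assumption about how "the module parameter is no" could be
told apart from "the user said nothing".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tri-state view of the fie_disabled module parameter. */
enum fie_param {
	FIE_PARAM_UNSET,	/* user gave no parameter */
	FIE_PARAM_DISABLE,	/* fie_disabled=Y */
	FIE_PARAM_ENABLE	/* fie_disabled=N, given explicitly */
};

/*
 * Decide whether CPPC counters should be used for FIE.
 * compiled_in models CONFIG_ACPI_CPPC_CPUFREQ_FIE=y,
 * ctrs_in_pcc models the result of cppc_perf_ctrs_in_pcc().
 */
static bool fie_should_run(bool compiled_in, enum fie_param param,
			   bool ctrs_in_pcc)
{
	/* Kconfig wins: nothing to enable if the code is not built. */
	if (!compiled_in)
		return false;

	switch (param) {
	case FIE_PARAM_DISABLE:
		return false;
	case FIE_PARAM_ENABLE:
		return true;	/* explicit user choice beats the PCC check */
	case FIE_PARAM_UNSET:
		return !ctrs_in_pcc;
	}

	assert(!"unreachable: unknown fie_param value");
	return false;
}
```

With this ordering the PCC check only acts as a default; an explicit
user setting in either direction takes precedence over it.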

2022-08-10 14:18:48

by Lukasz Luba

Subject: Re: [PATCH v2 1/1] ACPI: CPPC: Disable FIE if registers in PCC regions

On 8/10/22 13:51, Ionela Voinescu wrote:
> Hi folks,
>
> On Wednesday 10 Aug 2022 at 13:29:08 (+0100), Lukasz Luba wrote:
>> Hi Jeremy,
>>
>> +CC Valentin since he might be interested in this finding
>> +CC Ionela, Dietmar
>>
>> I have a few comments for this patch.
>>
>>
>> On 7/28/22 23:10, Jeremy Linton wrote:
>>> PCC regions utilize a mailbox to set/retrieve register values used by
>>> the CPPC code. This is fine as long as the operations are
>>> infrequent. With the FIE code enabled though the overhead can range
>>> from 2-11% of system CPU overhead (ex: as measured by top) on Arm
>>> based machines.
>>>
>>> So, before enabling FIE assure none of the registers used by
>>> cppc_get_perf_ctrs() are in the PCC region. Furthermore lets also
>>> enable a module parameter which can also disable it at boot or module
>>> reload.
>>>
>>> Signed-off-by: Jeremy Linton <[email protected]>
>>> ---
>>> drivers/acpi/cppc_acpi.c | 41 ++++++++++++++++++++++++++++++++++
>>> drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
>>> include/acpi/cppc_acpi.h | 5 +++++
>>> 3 files changed, 61 insertions(+), 4 deletions(-)
>>
>>
>> 1. You assume that all platforms would have this big overhead when
>> they have the PCC regions for this purpose.
>> Do we know which version of HW mailbox have been implemented
>> and used that have this 2-11% overhead in a platform?
>> Do also more recent MHU have such issues, so we could block
>> them by default (like in your code)?
>>
>> 2. I would prefer to simply change the default Kconfig value to 'n' for
>> the ACPI_CPPC_CPUFREQ_FIE, instead of creating a runtime
>> check code which disables it.
>> We have probably introduce this overhead for older platforms with
>> this commit:
>>
>> commit 4c38f2df71c8e33c0b64865992d693f5022eeaad
>> Author: Viresh Kumar <[email protected]>
>> Date: Tue Jun 23 15:49:40 2020 +0530
>>
>> cpufreq: CPPC: Add support for frequency invariance
>>
>>
>>
>> If the test server with this config enabled performs well
>> in the stress-tests, then on production server the config may be
>> set to 'y' (or 'm' and loaded).
>>
>> I would vote to not add extra code, which then after a while might be
>> decided to bw extended because actually some HW is actually capable (so
>> we could check in runtime and enable it). IMO this create an additional
>> complexity in our diverse configuration/tunnable space in our code.
>>
>
> I agree that having CONFIG_ACPI_CPPC_CPUFREQ_FIE default to no is the
> simpler solution but it puts the decision in the hands of platform
> providers which might result in this functionality not being used most
> of the times, if at all. This being said, the use of CPPC counters is
> meant as a last resort for FIE, if the platform does not have AMUs. This
> is why I recommended this to default to no in the review of the original
> patches.
>
> But I don't see these runtime options as adding a lot of complexity
> and therefore agree with the idea of this patch, versus the config
> change above, with two design comments:
> - Rather than having a check for fie_disabled in multiple init and exit
> functions I think the code should be slightly redesigned to elegantly
> bail out of most functions if cppc_freq_invariance_init() failed.
> - Given the multiple options to disable this functionality (config,
> PCC check), I don't see a need for a module parameter or runtime user
> input, unless we make that overwrite all previous decisions, as in: if
> CONFIG_ACPI_CPPC_CPUFREQ_FIE=y, even if cppc_perf_ctrs_in_pcc(), if
> the fie_disabled module parameter is no, then counters should be used
> for FIE.
>

A few things:
1. With this default CONFIG_ACPI_CPPC_CPUFREQ_FIE=y we've introduced
   a performance regression on older HW servers, which is not good IMO.
   It looks like it wasn't a good idea. FIE, which is used in the tick,
   going through the mailbox and FW sounds like a bad design.
   You need to have a really fast HW mailbox, FW and uC running it,
   to be able to provide decent performance.
2. Keeping code which is not used on a server because at runtime we
   discover this PCC overhead issue doesn't make sense.
3. System integrators or distro engineers should be able to experiment
   with different kernel config options on the platform and disable/
   enable this option on a particular server. I am afraid that we cannot
   figure out and assume performance at runtime in this code and say
   whether it would be good or not to use it. Only stress-tests would
   tell this.

2022-08-10 14:22:43

by Jeremy Linton

Subject: Re: [PATCH v2 1/1] ACPI: CPPC: Disable FIE if registers in PCC regions

Hi,

On 8/10/22 07:29, Lukasz Luba wrote:
> Hi Jeremy,
>
> +CC Valentin since he might be interested in this finding
> +CC Ionela, Dietmar
>
> I have a few comments for this patch.
>
>
> On 7/28/22 23:10, Jeremy Linton wrote:
>> PCC regions utilize a mailbox to set/retrieve register values used by
>> the CPPC code. This is fine as long as the operations are
>> infrequent. With the FIE code enabled though the overhead can range
>> from 2-11% of system CPU overhead (ex: as measured by top) on Arm
>> based machines.
>>
>> So, before enabling FIE assure none of the registers used by
>> cppc_get_perf_ctrs() are in the PCC region. Furthermore lets also
>> enable a module parameter which can also disable it at boot or module
>> reload.
>>
>> Signed-off-by: Jeremy Linton <[email protected]>
>> ---
>> drivers/acpi/cppc_acpi.c       | 41 ++++++++++++++++++++++++++++++++++
>> drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
>> include/acpi/cppc_acpi.h       | 5 +++++
>> 3 files changed, 61 insertions(+), 4 deletions(-)
>
>
> 1. You assume that all platforms would have this big overhead when
>    they have the PCC regions for this purpose.
>    Do we know which version of HW mailbox have been implemented
>    and used that have this 2-11% overhead in a platform?
>    Do also more recent MHU have such issues, so we could block
>    them by default (like in your code)?

Well, the mailbox nature of PCC pretty much ensures it's "slow" relative
to the alternative of providing an actual register. If a platform provides
direct access to, say, MHU registers, then of course they won't actually
be in a PCC region and the FIE will remain on.

>
> 2. I would prefer to simply change the default Kconfig value to 'n' for
>    the ACPI_CPPC_CPUFREQ_FIE, instead of creating a runtime
>    check code which disables it.
>    We have probably introduce this overhead for older platforms with
>    this commit:

The problem here is that these ACPI kernels are being shipped as single
images in distros which expect them to run on a wide range of platforms
(including x86/amd in this case), and perform optimally on all of them.

So the 'n' option is basically saying that the latest FIE code doesn't
provide a benefit anywhere?

>
> commit 4c38f2df71c8e33c0b64865992d693f5022eeaad
> Author: Viresh Kumar <[email protected]>
> Date:   Tue Jun 23 15:49:40 2020 +0530
>
>     cpufreq: CPPC: Add support for frequency invariance
>
>
>
> If the test server with this config enabled performs well
> in the stress-tests, then on production server the config may be
> set to 'y' (or 'm' and loaded).
>
> I would vote to not add extra code, which then after a while might be
> decided to bw extended because actually some HW is actually capable (so
> we could check in runtime and enable it). IMO this create an additional
> complexity in our diverse configuration/tunnable space in our code.
>
> When we don't compile-in this, we should fallback to old-style
> FIE, which has been used on these old platforms.
>
> BTW (I have to leave it here) the first-class solution for those servers
> is to implement AMU counters, so the overhead to retrieve this info is
> really low.
>
> Regards,
> Lukasz

2022-08-10 14:41:55

by Lukasz Luba

Subject: Re: [PATCH v2 1/1] ACPI: CPPC: Disable FIE if registers in PCC regions

On 8/10/22 15:30, Jeremy Linton wrote:
> Hi,
>
> On 8/10/22 07:29, Lukasz Luba wrote:
>> Hi Jeremy,
>>
>> +CC Valentin since he might be interested in this finding
>> +CC Ionela, Dietmar
>>
>> I have a few comments for this patch.
>>
>>
>> On 7/28/22 23:10, Jeremy Linton wrote:
>>> PCC regions utilize a mailbox to set/retrieve register values used by
>>> the CPPC code. This is fine as long as the operations are
>>> infrequent. With the FIE code enabled though the overhead can range
>>> from 2-11% of system CPU overhead (ex: as measured by top) on Arm
>>> based machines.
>>>
>>> So, before enabling FIE assure none of the registers used by
>>> cppc_get_perf_ctrs() are in the PCC region. Furthermore lets also
>>> enable a module parameter which can also disable it at boot or module
>>> reload.
>>>
>>> Signed-off-by: Jeremy Linton <[email protected]>
>>> ---
>>> drivers/acpi/cppc_acpi.c       | 41 ++++++++++++++++++++++++++++++++++
>>> drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
>>> include/acpi/cppc_acpi.h       | 5 +++++
>>> 3 files changed, 61 insertions(+), 4 deletions(-)
>>
>>
>> 1. You assume that all platforms would have this big overhead when
>>     they have the PCC regions for this purpose.
>>     Do we know which version of HW mailbox have been implemented
>>     and used that have this 2-11% overhead in a platform?
>>     Do also more recent MHU have such issues, so we could block
>>     them by default (like in your code)?
>
> I posted that other email before being awake and conflated MHU with AMU
> (which could potentially expose the values directly). But the CPPC code
> isn't aware of whether a MHU or some other mailbox is in use. Either
> way, its hard to imagine a general mailbox with a doorbell/wait for
> completion handshake will ever be fast enough to consider running at the
> granularity this code is running at. If there were a case like that, the
> kernel would have to benchmark it at runtime to differentiate it from
> something that is talking over a slow link to a slowly responding mgmt
> processor.

Exactly, I fear the same: that we will never get such a fast
mailbox-based platform. Newer platforms would just use AMU, so
completely different code, and no one would even bother to test if
their HW mailbox is fast enough for this FIE purpose ;)
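
To make the "benchmark it at runtime" idea from the quoted paragraph
concrete, here is a minimal userspace sketch. Nothing like this is
proposed in the patch or in this thread; ctr_read_is_fast(), struct
sim_platform and the latency numbers are all hypothetical, and the clock
is injected by the caller so the logic can be exercised against a
simulated platform rather than real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t (*clock_ns_fn)(void *ctx);	/* returns "now" in ns */
typedef void (*read_ctrs_fn)(void *ctx);	/* performs one counter read */

/*
 * Time `samples` counter reads and report whether the average cost
 * per read stays within budget_ns.
 */
static bool ctr_read_is_fast(read_ctrs_fn read_ctrs, clock_ns_fn now_ns,
			     void *ctx, unsigned int samples,
			     uint64_t budget_ns)
{
	uint64_t start, elapsed;

	assert(samples > 0);

	start = now_ns(ctx);
	for (unsigned int i = 0; i < samples; i++)
		read_ctrs(ctx);
	elapsed = now_ns(ctx) - start;

	return elapsed / samples <= budget_ns;
}

/* Simulated platform: every read advances a fake clock by a fixed cost. */
struct sim_platform {
	uint64_t now_ns;
	uint64_t read_cost_ns;
};

static uint64_t sim_now(void *ctx)
{
	return ((struct sim_platform *)ctx)->now_ns;
}

static void sim_read(void *ctx)
{
	struct sim_platform *p = ctx;

	p->now_ns += p->read_cost_ns;
}
```

For example, with an arbitrary 1000 ns budget, a simulated platform
costing 50 ns per read passes the check, while one costing 20000 ns per
read (standing in for a mailbox round trip) fails it. A real measurement
would also have to cope with jitter and with the mailbox being busy,
which this sketch ignores.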

2022-08-10 14:52:08

by Jeremy Linton

Subject: Re: [PATCH v2 1/1] ACPI: CPPC: Disable FIE if registers in PCC regions

Hi,

On 8/10/22 07:29, Lukasz Luba wrote:
> Hi Jeremy,
>
> +CC Valentin since he might be interested in this finding
> +CC Ionela, Dietmar
>
> I have a few comments for this patch.
>
>
> On 7/28/22 23:10, Jeremy Linton wrote:
>> PCC regions utilize a mailbox to set/retrieve register values used by
>> the CPPC code. This is fine as long as the operations are
>> infrequent. With the FIE code enabled though the overhead can range
>> from 2-11% of system CPU overhead (ex: as measured by top) on Arm
>> based machines.
>>
>> So, before enabling FIE assure none of the registers used by
>> cppc_get_perf_ctrs() are in the PCC region. Furthermore lets also
>> enable a module parameter which can also disable it at boot or module
>> reload.
>>
>> Signed-off-by: Jeremy Linton <[email protected]>
>> ---
>> drivers/acpi/cppc_acpi.c       | 41 ++++++++++++++++++++++++++++++++++
>> drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
>> include/acpi/cppc_acpi.h       | 5 +++++
>> 3 files changed, 61 insertions(+), 4 deletions(-)
>
>
> 1. You assume that all platforms would have this big overhead when
>    they have the PCC regions for this purpose.
>    Do we know which version of HW mailbox have been implemented
>    and used that have this 2-11% overhead in a platform?
>    Do also more recent MHU have such issues, so we could block
>    them by default (like in your code)?

I posted that other email before being awake and conflated MHU with AMU
(which could potentially expose the values directly). But the CPPC code
isn't aware of whether an MHU or some other mailbox is in use. Either
way, it's hard to imagine that a general mailbox with a doorbell and
wait-for-completion handshake will ever be fast enough to consider
running at the granularity this code is running at. If there were a case
like that, the kernel would have to benchmark it at runtime to
differentiate it from something that is talking over a slow link to a
slowly responding mgmt processor.
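
For reference, a minimal userspace sketch of the kind of check being
discussed. It looks only at each register's ACPI address space ID, which
is why it cannot tell an MHU from any other mailbox. The struct and
function names are hypothetical and this is not the code from the patch;
0x0A is the address space ID the ACPI specification assigns to PCC.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ACPI Generic Address Structure address space IDs used in this sketch. */
#define SPACE_SYSTEM_MEMORY 0x00
#define SPACE_PCC           0x0A /* Platform Communications Channel */
#define SPACE_FFH           0x7F /* Functional Fixed Hardware */

/* Hypothetical, simplified view of one CPPC register descriptor. */
struct perf_reg {
	const char *name;
	uint8_t space_id;
};

/*
 * Return true if any of the given registers lives in a PCC region.
 * Only the address space ID is inspected; nothing here knows (or can
 * know) how fast the mailbox behind the PCC channel is.
 */
bool any_reg_in_pcc(const struct perf_reg *regs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (regs[i].space_id == SPACE_PCC)
			return true;
	}
	return false;
}
```

With every counter behind FFH or system memory the function returns
false and FIE would stay enabled; a single PCC-backed register is enough
to flip the result.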

>
> 2. I would prefer to simply change the default Kconfig value to 'n' for
>    the ACPI_CPPC_CPUFREQ_FIE, instead of creating a runtime
>    check code which disables it.
>    We have probably introduce this overhead for older platforms with
>    this commit:
>
> commit 4c38f2df71c8e33c0b64865992d693f5022eeaad
> Author: Viresh Kumar <[email protected]>
> Date:   Tue Jun 23 15:49:40 2020 +0530
>
>     cpufreq: CPPC: Add support for frequency invariance
>
>
>
> If the test server with this config enabled performs well
> in the stress-tests, then on production server the config may be
> set to 'y' (or 'm' and loaded).
>
> I would vote to not add extra code, which then after a while might be
> decided to bw extended because actually some HW is actually capable (so
> we could check in runtime and enable it). IMO this create an additional
> complexity in our diverse configuration/tunnable space in our code.
>
> When we don't compile-in this, we should fallback to old-style
> FIE, which has been used on these old platforms.
>
> BTW (I have to leave it here) the first-class solution for those servers
> is to implement AMU counters, so the overhead to retrieve this info is
> really low.
>
> Regards,
> Lukasz

2022-08-10 14:57:34

by Lukasz Luba

Subject: Re: [PATCH v2 1/1] ACPI: CPPC: Disable FIE if registers in PCC regions

On 8/10/22 15:08, Jeremy Linton wrote:
> Hi,
>
> On 8/10/22 07:29, Lukasz Luba wrote:
>> Hi Jeremy,
>>
>> +CC Valentin since he might be interested in this finding
>> +CC Ionela, Dietmar
>>
>> I have a few comments for this patch.
>>
>>
>> On 7/28/22 23:10, Jeremy Linton wrote:
>>> PCC regions utilize a mailbox to set/retrieve register values used by
>>> the CPPC code. This is fine as long as the operations are
>>> infrequent. With the FIE code enabled though the overhead can range
>>> from 2-11% of system CPU overhead (ex: as measured by top) on Arm
>>> based machines.
>>>
>>> So, before enabling FIE assure none of the registers used by
>>> cppc_get_perf_ctrs() are in the PCC region. Furthermore lets also
>>> enable a module parameter which can also disable it at boot or module
>>> reload.
>>>
>>> Signed-off-by: Jeremy Linton <[email protected]>
>>> ---
>>> drivers/acpi/cppc_acpi.c       | 41 ++++++++++++++++++++++++++++++++++
>>> drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
>>> include/acpi/cppc_acpi.h       | 5 +++++
>>> 3 files changed, 61 insertions(+), 4 deletions(-)
>>
>>
>> 1. You assume that all platforms would have this big overhead when
>>     they have the PCC regions for this purpose.
>>     Do we know which version of HW mailbox have been implemented
>>     and used that have this 2-11% overhead in a platform?
>>     Do also more recent MHU have such issues, so we could block
>>     them by default (like in your code)?
>
> Well, the mailbox nature of PCC pretty much assures its "slow", relative
> the alternative of providing an actual register. If a platform provides
> direct access to say MHU registers, then of course they won't actually
> be in a PCC region and the FIE will remain on.
>
>
>>
>> 2. I would prefer to simply change the default Kconfig value to 'n' for
>>     the ACPI_CPPC_CPUFREQ_FIE, instead of creating a runtime
>>     check code which disables it.
>>     We have probably introduce this overhead for older platforms with
>>     this commit:
>
> The problem here is that these ACPI kernels are being shipped as single
> images in distro's which expect them to run on a wide range of platforms
> (including x86/amd in this case), and preform optimally on all of them.
>
> So the 'n' option basically is saying that the latest FIE code doesn't
> provide a befit anywhere?

How do we define the 'benefit' here? It's better task utilization.
How much better would it be vs. the previous approach with old-style FIE?

TBH, I haven't found any test results from the development of the patch
set. Maybe someone could point me to test results which demonstrate
this benefit of better utilization.

In the RFC I could find this statement [1]:

"This is tested with some hacks, as I didn't have access to the right
hardware, on the ARM64 hikey board to check the overall functionality
and that works fine."

There should be a rule that such code is tested on a real server with
many CPUs under some stress-test.

Ionela, do you have any test results where this new FIE feature
brings a meaningful accuracy improvement to task utilization?

With this overhead measured on a real server platform, I think
it's not worth keeping it 'y' by default.

The design is heavy, as stated in the commit message:
" On an invocation of cppc_scale_freq_tick(), we schedule an irq work
(since we reach here from hard-irq context), which then schedules a
normal work item and cppc_scale_freq_workfn() updates the per_cpu
arch_freq_scale variable based on the counter updates since the last
tick.
"

As you said, Jeremy, this mailbox will always come with overhead. IMO,
until we can be sure we have some powerful new HW mailbox, this
feature should be disabled.
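
To make the last step of the commit message quoted above concrete, here
is a rough userspace sketch of how a scale factor could be derived from
the counter deltas between two ticks. The struct, the function name, the
1024 fixed-point scale and the fallback for a window with no reference
ticks are all assumptions of this sketch, not the driver's actual code.

```c
#include <stdint.h>

#define CAPACITY_SHIFT 10                       /* assumed fixed-point shift */
#define CAPACITY_SCALE (1ULL << CAPACITY_SHIFT) /* 1024 == running at highest perf */

/* Hypothetical snapshot of the two feedback counters taken at a tick. */
struct fb_ctrs {
	uint64_t delivered;
	uint64_t reference;
};

/*
 * Estimate the frequency scale for the window between two snapshots.
 * Unsigned subtraction keeps the deltas correct across one counter wrap.
 * Note: reference_perf * d_delivered can overflow for very large deltas;
 * a real implementation would need to guard against that.
 */
uint64_t freq_scale(const struct fb_ctrs *prev, const struct fb_ctrs *now,
		    uint64_t reference_perf, uint64_t highest_perf)
{
	uint64_t d_delivered = now->delivered - prev->delivered;
	uint64_t d_reference = now->reference - prev->reference;
	uint64_t perf, scale;

	/* No usable data in this window: report full scale (sketch assumption). */
	if (d_reference == 0 || highest_perf == 0)
		return CAPACITY_SCALE;

	/* Delivered performance over the window, in abstract perf units. */
	perf = reference_perf * d_delivered / d_reference;

	/* Normalise against the highest performance level and clamp. */
	scale = (perf << CAPACITY_SHIFT) / highest_perf;
	return scale > CAPACITY_SCALE ? CAPACITY_SCALE : scale;
}
```

For example, 500 delivered ticks against 1000 reference ticks with
reference_perf == highest_perf gives 512, i.e. half of full scale. The
arithmetic itself is cheap; the cost discussed in this thread comes from
obtaining the two counter snapshots through the mailbox.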

[1]
https://lore.kernel.org/lkml/[email protected]/

2022-08-10 15:52:59

by Pierre Gondois

Subject: Re: [PATCH v2 1/1] ACPI: CPPC: Disable FIE if registers in PCC regions

On 8/10/22 16:37, Lukasz Luba wrote:
>
>
> On 8/10/22 15:30, Jeremy Linton wrote:
>> Hi,
>>
>> On 8/10/22 07:29, Lukasz Luba wrote:
>>> Hi Jeremy,
>>>
>>> +CC Valentin since he might be interested in this finding
>>> +CC Ionela, Dietmar
>>>
>>> I have a few comments for this patch.
>>>
>>>
>>> On 7/28/22 23:10, Jeremy Linton wrote:
>>>> PCC regions utilize a mailbox to set/retrieve register values used by
>>>> the CPPC code. This is fine as long as the operations are
>>>> infrequent. With the FIE code enabled though the overhead can range
>>>> from 2-11% of system CPU overhead (ex: as measured by top) on Arm
>>>> based machines.
>>>>
>>>> So, before enabling FIE assure none of the registers used by
>>>> cppc_get_perf_ctrs() are in the PCC region. Furthermore lets also
>>>> enable a module parameter which can also disable it at boot or module
>>>> reload.
>>>>
>>>> Signed-off-by: Jeremy Linton <[email protected]>
>>>> ---
>>>> drivers/acpi/cppc_acpi.c       | 41 ++++++++++++++++++++++++++++++++++
>>>> drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
>>>> include/acpi/cppc_acpi.h       | 5 +++++
>>>> 3 files changed, 61 insertions(+), 4 deletions(-)
>>>
>>>
>>> 1. You assume that all platforms would have this big overhead when
>>>     they have the PCC regions for this purpose.
>>>     Do we know which version of HW mailbox have been implemented
>>>     and used that have this 2-11% overhead in a platform?
>>>     Do also more recent MHU have such issues, so we could block
>>>     them by default (like in your code)?
>>
>> I posted that other email before being awake and conflated MHU with AMU
>> (which could potentially expose the values directly). But the CPPC code
>> isn't aware of whether a MHU or some other mailbox is in use. Either
>> way, its hard to imagine a general mailbox with a doorbell/wait for
>> completion handshake will ever be fast enough to consider running at the
>> granularity this code is running at. If there were a case like that, the
>> kernel would have to benchmark it at runtime to differentiate it from
>> something that is talking over a slow link to a slowly responding mgmt
>> processor.
>
> Exactly, I'm afraid the same, that we would never get such fast
> mailbox-based platform. Newer platforms would just use AMU, so
> completely different code and no one would even bother to test if
> their HW mailbox is fast-enough for this FIE purpose ;)

To add some platform information, the following platforms are using
CPPC through PCC channels (so mailboxes):
- Cavium ThunderX2
- Ampere eMAG
- Ampere Altra

Fwiw, I can confirm the cppc_fie kthread can represent a significant load,
with a utilization between 2% and 30%.

Regards,
Pierre

2022-08-10 18:30:47

by Jeremy Linton

Subject: Re: [PATCH v2 1/1] ACPI: CPPC: Disable FIE if registers in PCC regions

Hi,

On 8/10/22 07:51, Ionela Voinescu wrote:
> Hi folks,
>
> On Wednesday 10 Aug 2022 at 13:29:08 (+0100), Lukasz Luba wrote:
>> Hi Jeremy,
>>
>> +CC Valentin since he might be interested in this finding
>> +CC Ionela, Dietmar
>>
>> I have a few comments for this patch.
>>
>>
>> On 7/28/22 23:10, Jeremy Linton wrote:
>>> PCC regions utilize a mailbox to set/retrieve register values used by
>>> the CPPC code. This is fine as long as the operations are
>>> infrequent. With the FIE code enabled though the overhead can range
>>> from 2-11% of system CPU overhead (ex: as measured by top) on Arm
>>> based machines.
>>>
>>> So, before enabling FIE assure none of the registers used by
>>> cppc_get_perf_ctrs() are in the PCC region. Furthermore lets also
>>> enable a module parameter which can also disable it at boot or module
>>> reload.
>>>
>>> Signed-off-by: Jeremy Linton <[email protected]>
>>> ---
>>> drivers/acpi/cppc_acpi.c | 41 ++++++++++++++++++++++++++++++++++
>>> drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
>>> include/acpi/cppc_acpi.h | 5 +++++
>>> 3 files changed, 61 insertions(+), 4 deletions(-)
>>
>>
>> 1. You assume that all platforms would have this big overhead when
>> they have the PCC regions for this purpose.
>> Do we know which version of HW mailbox have been implemented
>> and used that have this 2-11% overhead in a platform?
>> Do also more recent MHU have such issues, so we could block
>> them by default (like in your code)?
>>
>> 2. I would prefer to simply change the default Kconfig value to 'n' for
>> the ACPI_CPPC_CPUFREQ_FIE, instead of creating a runtime
>> check code which disables it.
>> We have probably introduce this overhead for older platforms with
>> this commit:
>>
>> commit 4c38f2df71c8e33c0b64865992d693f5022eeaad
>> Author: Viresh Kumar <[email protected]>
>> Date: Tue Jun 23 15:49:40 2020 +0530
>>
>> cpufreq: CPPC: Add support for frequency invariance
>>
>>
>>
>> If the test server with this config enabled performs well
>> in the stress-tests, then on production server the config may be
>> set to 'y' (or 'm' and loaded).
>>
>> I would vote to not add extra code, which then after a while might be
>> decided to bw extended because actually some HW is actually capable (so
>> we could check in runtime and enable it). IMO this create an additional
>> complexity in our diverse configuration/tunnable space in our code.
>>
>
> I agree that having CONFIG_ACPI_CPPC_CPUFREQ_FIE default to no is the
> simpler solution but it puts the decision in the hands of platform
> providers which might result in this functionality not being used most
> of the times, if at all. This being said, the use of CPPC counters is
> meant as a last resort for FIE, if the platform does not have AMUs. This
> is why I recommended this to default to no in the review of the original
> patches.
>
> But I don't see these runtime options as adding a lot of complexity
> and therefore agree with the idea of this patch, versus the config
> change above, with two design comments:
> - Rather than having a check for fie_disabled in multiple init and exit
> functions I think the code should be slightly redesigned to elegantly
> bail out of most functions if cppc_freq_invariance_init() failed.

I'm not sure what that would look like; I will have to mess with it a
bit more. But as you can see, it's really just the two init entry points
(one for the module, and one for the registered cpufreq) and their
associated exits. I'm not sure I see a way to simplify that short
of maybe creating a second cpufreq_driver table, which replaces the
.init calls with ones which include cppc_cpufreq_cpu_fie_init. The
alternative is setting the .init at runtime to switch between an init
with FIE and one without. I'm not sure that clarifies what is happening
in the code, and I thought in general dynamic runtime dispatch was to be
avoided in the ACPI code when possible. Neither choice, of course,
affects actual runtime because they both fire only during module
load/unload.

> - Given the multiple options to disable this functionality (config,
> PCC check), I don't see a need for a module parameter or runtime user
> input, unless we make that overwrite all previous decisions, as in: if
> CONFIG_ACPI_CPPC_CPUFREQ_FIE=y, even if cppc_perf_ctrs_in_pcc(), if
> the fie_disabled module parameter is no, then counters should be used
> for FIE.

Tristating the module parameter with default=detect, ON, OFF is a
reasonable idea, and one I considered but dropped, because in the hisi
quirk case even with ON it will have to be OFF. So it really ends up
with 4 states: default=detect, request ON, ON, OFF.
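
A minimal sketch of that decision table, with hypothetical names (this
is not what the patch implements). The user-visible parameter has three
values, but because a platform quirk can veto a requested ON, the
effective result distinguishes "request ON" from an ON that is actually
honoured.

```c
#include <stdbool.h>

/* Hypothetical tri-state module parameter values. */
enum fie_param {
	FIE_DETECT,    /* default: decide from the PCC check */
	FIE_FORCE_ON,  /* user requests FIE even if counters are in PCC */
	FIE_FORCE_OFF, /* user disables FIE unconditionally */
};

/*
 * Resolve whether FIE ends up enabled.
 *   param          - value of the (hypothetical) module parameter
 *   ctrs_in_pcc    - result of the "are the counters in PCC?" check
 *   platform_quirk - true on platforms where FIE can never be used
 */
bool fie_enabled(enum fie_param param, bool ctrs_in_pcc, bool platform_quirk)
{
	/* A quirk wins over everything, including an explicit ON request. */
	if (platform_quirk)
		return false;

	switch (param) {
	case FIE_FORCE_ON:
		return true;
	case FIE_FORCE_OFF:
		return false;
	case FIE_DETECT:
	default:
		return !ctrs_in_pcc;
	}
}
```

Here fie_enabled(FIE_FORCE_ON, true, false) is true while
fie_enabled(FIE_FORCE_ON, false, true) is false, which is the extra
"request ON" state: the parameter says ON, yet the feature stays off.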

I'm good with any of this if people feel strongly about it.

>
> Thanks,
> Ionela.
>
>
>> When we don't compile-in this, we should fallback to old-style
>> FIE, which has been used on these old platforms.
>>
>> BTW (I have to leave it here) the first-class solution for those servers
>> is to implement AMU counters, so the overhead to retrieve this info is
>> really low.
>>
>> Regards,
>> Lukasz

Thanks for looking at this!

2022-08-10 18:31:20

by Jeremy Linton

Subject: Re: [PATCH v2 1/1] ACPI: CPPC: Disable FIE if registers in PCC regions

Hi,

On 8/10/22 09:32, Lukasz Luba wrote:
>
>
> On 8/10/22 15:08, Jeremy Linton wrote:
>> Hi,
>>
>> On 8/10/22 07:29, Lukasz Luba wrote:
>>> Hi Jeremy,
>>>
>>> +CC Valentin since he might be interested in this finding
>>> +CC Ionela, Dietmar
>>>
>>> I have a few comments for this patch.
>>>
>>>
>>> On 7/28/22 23:10, Jeremy Linton wrote:
>>>> PCC regions utilize a mailbox to set/retrieve register values used by
>>>> the CPPC code. This is fine as long as the operations are
>>>> infrequent. With the FIE code enabled though the overhead can range
>>>> from 2-11% of system CPU overhead (ex: as measured by top) on Arm
>>>> based machines.
>>>>
>>>> So, before enabling FIE assure none of the registers used by
>>>> cppc_get_perf_ctrs() are in the PCC region. Furthermore lets also
>>>> enable a module parameter which can also disable it at boot or module
>>>> reload.
>>>>
>>>> Signed-off-by: Jeremy Linton <[email protected]>
>>>> ---
>>>> drivers/acpi/cppc_acpi.c       | 41 ++++++++++++++++++++++++++++++++++
>>>> drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
>>>> include/acpi/cppc_acpi.h       | 5 +++++
>>>> 3 files changed, 61 insertions(+), 4 deletions(-)
>>>
>>>
>>> 1. You assume that all platforms would have this big overhead when
>>>     they have the PCC regions for this purpose.
>>>     Do we know which version of HW mailbox have been implemented
>>>     and used that have this 2-11% overhead in a platform?
>>>     Do also more recent MHU have such issues, so we could block
>>>     them by default (like in your code)?
>>
>> Well, the mailbox nature of PCC pretty much assures its "slow",
>> relative the alternative of providing an actual register. If a
>> platform provides direct access to say MHU registers, then of course
>> they won't actually be in a PCC region and the FIE will remain on.
>>
>>
>>>
>>> 2. I would prefer to simply change the default Kconfig value to 'n' for
>>>     the ACPI_CPPC_CPUFREQ_FIE, instead of creating a runtime
>>>     check code which disables it.
>>>     We have probably introduce this overhead for older platforms with
>>>     this commit:
>>
>> The problem here is that these ACPI kernels are being shipped as
>> single images in distro's which expect them to run on a wide range of
>> platforms (including x86/amd in this case), and preform optimally on
>> all of them.
>>
>> So the 'n' option basically is saying that the latest FIE code doesn't
>> provide a befit anywhere?
>
> How we define the 'benefit' here - it's a better task utilization.
> How much better it would be vs. previous approach with old-style FIE?
>
> TBH, I haven't found any test results from the development of the patch
> set. Maybe someone could point me to the test results which bring
> this benefit of better utilization.
>
> In the RFC I could find that statement [1]:
>
> "This is tested with some hacks, as I didn't have access to the right
> hardware, on the ARM64 hikey board to check the overall functionality
> and that works fine."
>
> There should be a rule that such code is tested on a real server with
> many CPUs under some stress-test.
>
> Ionela do you have some test results where this new FIE feature
> introduces some better & meaningful accuracy improvement to the
> tasks utilization?
>
> With this overhead measured on a real server platform I think
> it's not worth to keep it 'y' in default.
>
> The design is heavy, as stated in the commit message:
> "    On an invocation of cppc_scale_freq_tick(), we schedule an irq work
>     (since we reach here from hard-irq context), which then schedules a
>     normal work item and cppc_scale_freq_workfn() updates the per_cpu
>     arch_freq_scale variable based on the counter updates since the last
>     tick.
> "
>
> As you said Jeremy, this mailbox would always be with overhead. IMO
> untill we cannot be sure we have some powerful new HW mailbox, this
> feature should be disabled.

Right, the design of the feature would be completely different if it
were a simple register read to get the delivered perf, avoiding all the
jumping around you quoted.

Which sorta implies that it's not really fixable as is. IMHO that means
'n' isn't really strong enough; if such a change were made, it should
probably be under CONFIG_EXPERT as well, to discourage its use.

>
> [1]
> https://lore.kernel.org/lkml/[email protected]/

2022-08-11 07:43:21

by Lukasz Luba

Subject: Re: [PATCH v2 1/1] ACPI: CPPC: Disable FIE if registers in PCC regions

On 8/10/22 19:04, Jeremy Linton wrote:
> Hi,
>
> On 8/10/22 09:32, Lukasz Luba wrote:
>>
>>
>> On 8/10/22 15:08, Jeremy Linton wrote:
>>> Hi,
>>>
>>> On 8/10/22 07:29, Lukasz Luba wrote:
>>>> Hi Jeremy,
>>>>
>>>> +CC Valentin since he might be interested in this finding
>>>> +CC Ionela, Dietmar
>>>>
>>>> I have a few comments for this patch.
>>>>
>>>>
>>>> On 7/28/22 23:10, Jeremy Linton wrote:
>>>>> PCC regions utilize a mailbox to set/retrieve register values used by
>>>>> the CPPC code. This is fine as long as the operations are
>>>>> infrequent. With the FIE code enabled though the overhead can range
>>>>> from 2-11% of system CPU overhead (ex: as measured by top) on Arm
>>>>> based machines.
>>>>>
>>>>> So, before enabling FIE assure none of the registers used by
>>>>> cppc_get_perf_ctrs() are in the PCC region. Furthermore lets also
>>>>> enable a module parameter which can also disable it at boot or module
>>>>> reload.
>>>>>
>>>>> Signed-off-by: Jeremy Linton <[email protected]>
>>>>> ---
>>>>> drivers/acpi/cppc_acpi.c       | 41 ++++++++++++++++++++++++++++++++++
>>>>> drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
>>>>> include/acpi/cppc_acpi.h       | 5 +++++
>>>>> 3 files changed, 61 insertions(+), 4 deletions(-)
>>>>
>>>>
>>>> 1. You assume that all platforms would have this big overhead when
>>>>     they have the PCC regions for this purpose.
>>>>     Do we know which version of HW mailbox have been implemented
>>>>     and used that have this 2-11% overhead in a platform?
>>>>     Do also more recent MHU have such issues, so we could block
>>>>     them by default (like in your code)?
>>>
>>> Well, the mailbox nature of PCC pretty much assures its "slow",
>>> relative the alternative of providing an actual register. If a
>>> platform provides direct access to say MHU registers, then of course
>>> they won't actually be in a PCC region and the FIE will remain on.
>>>
>>>
>>>>
>>>> 2. I would prefer to simply change the default Kconfig value to 'n' for
>>>>     the ACPI_CPPC_CPUFREQ_FIE, instead of creating a runtime
>>>>     check code which disables it.
>>>>     We have probably introduce this overhead for older platforms with
>>>>     this commit:
>>>
>>> The problem here is that these ACPI kernels are being shipped as
>>> single images in distro's which expect them to run on a wide range of
>>> platforms (including x86/amd in this case), and preform optimally on
>>> all of them.
>>>
>>> So the 'n' option basically is saying that the latest FIE code
>>> doesn't provide a befit anywhere?
>>
>> How we define the 'benefit' here - it's a better task utilization.
>> How much better it would be vs. previous approach with old-style FIE?
>>
>> TBH, I haven't found any test results from the development of the patch
>> set. Maybe someone could point me to the test results which bring
>> this benefit of better utilization.
>>
>> In the RFC I could find that statement [1]:
>>
>> "This is tested with some hacks, as I didn't have access to the right
>> hardware, on the ARM64 hikey board to check the overall functionality
>> and that works fine."
>>
>> There should be a rule that such code is tested on a real server with
>> many CPUs under some stress-test.
>>
>> Ionela do you have some test results where this new FIE feature
>> introduces some better & meaningful accuracy improvement to the
>> tasks utilization?
>>
>> With this overhead measured on a real server platform I think
>> it's not worth to keep it 'y' in default.
>>
>> The design is heavy, as stated in the commit message:
>> "    On an invocation of cppc_scale_freq_tick(), we schedule an irq work
>>      (since we reach here from hard-irq context), which then schedules a
>>      normal work item and cppc_scale_freq_workfn() updates the per_cpu
>>      arch_freq_scale variable based on the counter updates since the last
>>      tick.
>> "
>>
>> As you said Jeremy, this mailbox would always be with overhead. IMO
>> untill we cannot be sure we have some powerful new HW mailbox, this
>> feature should be disabled.
>
>
> Right, the design of the feature would be completely different if it
> were a simple register read to get the delivered perf avoiding all the
> jumping around you quoted.
>
> Which sorta implies that its not really fixable as is, which IMHO means
> that 'n' isn't really strong enough, it should probably be under
> CONFIG_EXPERT as well if such a change were made to discourage its use.
>

That's something that I also started to consider, since we are aware of
the impact.

You have my vote when you decide to go forward with that config change.

2022-08-11 07:48:16

by Lukasz Luba

Subject: Re: [PATCH v2 1/1] ACPI: CPPC: Disable FIE if registers in PCC regions

On 8/10/22 16:32, Pierre Gondois wrote:
>
>
> On 8/10/22 16:37, Lukasz Luba wrote:
>>
>>
>> On 8/10/22 15:30, Jeremy Linton wrote:
>>> Hi,
>>>
>>> On 8/10/22 07:29, Lukasz Luba wrote:
>>>> Hi Jeremy,
>>>>
>>>> +CC Valentin since he might be interested in this finding
>>>> +CC Ionela, Dietmar
>>>>
>>>> I have a few comments for this patch.
>>>>
>>>>
>>>> On 7/28/22 23:10, Jeremy Linton wrote:
>>>>> PCC regions utilize a mailbox to set/retrieve register values used by
>>>>> the CPPC code. This is fine as long as the operations are
>>>>> infrequent. With the FIE code enabled though the overhead can range
>>>>> from 2-11% of system CPU overhead (ex: as measured by top) on Arm
>>>>> based machines.
>>>>>
>>>>> So, before enabling FIE assure none of the registers used by
>>>>> cppc_get_perf_ctrs() are in the PCC region. Furthermore lets also
>>>>> enable a module parameter which can also disable it at boot or module
>>>>> reload.
>>>>>
>>>>> Signed-off-by: Jeremy Linton <[email protected]>
>>>>> ---
>>>>>    drivers/acpi/cppc_acpi.c       | 41 ++++++++++++++++++++++++++++++++++
>>>>>    drivers/cpufreq/cppc_cpufreq.c | 19 ++++++++++++----
>>>>>    include/acpi/cppc_acpi.h       | 5 +++++
>>>>>    3 files changed, 61 insertions(+), 4 deletions(-)
>>>>
>>>>
>>>> 1. You assume that all platforms would have this big overhead when
>>>>      they have the PCC regions for this purpose.
>>>>      Do we know which version of HW mailbox have been implemented
>>>>      and used that have this 2-11% overhead in a platform?
>>>>      Do also more recent MHU have such issues, so we could block
>>>>      them by default (like in your code)?
>>>
>>> I posted that other email before being awake and conflated MHU with AMU
>>> (which could potentially expose the values directly). But the CPPC code
>>> isn't aware of whether a MHU or some other mailbox is in use. Either
>>> way, its hard to imagine a general mailbox with a doorbell/wait for
>>> completion handshake will ever be fast enough to consider running at the
>>> granularity this code is running at. If there were a case like that, the
>>> kernel would have to benchmark it at runtime to differentiate it from
>>> something that is talking over a slow link to a slowly responding mgmt
>>> processor.
>>
>> Exactly, I'm afraid the same, that we would never get such fast
>> mailbox-based platform. Newer platforms would just use AMU, so
>> completely different code and no one would even bother to test if
>> their HW mailbox is fast-enough for this FIE purpose ;)
>
> To add some platform information, the following platforms are using
> CPPC through PCC channels (so mailboxes):
> - Cavium ThunderX2
> - Ampere eMAG
> - Ampere Altra
>
> Fwiw, I can confirm the cppc_fie kthread can represent a significant load,
> with a utilization between 2% and 30%.
>

Thank you, Pierre, for the test results. I have also been told about a
platform under stress-test where the cppc_fie kthread reached "up to 50%
CPU utilization". I don't know how many additional wake-ups they would
see.

We also don't know whether task utilization on these machines is
noticeably better thanks to that feature (or whether it was an issue in
the first place).

These numbers are not acceptable on a server.