2023-03-28 19:40:53

by Yang Shi

[permalink] [raw]
Subject: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error

When testing CPPC cpufreq on our platform, we noticed the error may be quite
high and the high error may happen quite often. For example, on a platform
with a maximum frequency of 2.8GHz when the CPUs were fully loaded (100% load),
we saw cpuinfo_cur_freq may show 4GHz, it means the error is > 40%. And the
high error (> 1%) happened 256 times out of 2127 samples (sampled every 3
seconds) in an approximate 2hrs test.

We tried to enlarge the delay, and tested with 100us, 1ms and 10ms. The
below is the results.

100us:
The highest error is 4GHz, 22 times out of 3623 samples

1ms:
The highest error is 3.3GHz, 3 times out of 2814 samples

10ms:
No high error anymore

Increase the measurement delay in cppc_cpufreq_get_rate to 10ms to avoid
high measurement errors.

Signed-off-by: Yang Shi <[email protected]>
---
drivers/cpufreq/cppc_cpufreq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index 022e3555407c..c2bf65448d3d 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -851,7 +851,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
if (ret)
return ret;

- udelay(2); /* 2usec delay between sampling */
+ mdelay(10); /* 10msec delay between sampling */

ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
if (ret)
--
2.39.2


2023-03-29 18:50:11

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error

On Tue, Mar 28, 2023 at 9:39 PM Yang Shi <[email protected]> wrote:
>
> When testing CPPC cpufreq on our platform, we noticed the error may be quite
> high and the high error may happen quite often. For example, on a platform
> with a maximum frequency of 2.8GHz when the CPUs were fully loaded (100% load),
> we saw cpuinfo_cur_freq may show 4GHz, it means the error is > 40%. And the
> high error (> 1%) happened 256 times out of 2127 samples (sampled every 3
> seconds) in an approximate 2hrs test.

The description above is a bit cryptic IMV. For example, it is not
particularly clear what "high error" means.

> We tried to enlarge the delay, and tested with 100us, 1ms and 10ms. The
> below is the results.
>
> 100us:
> The highest error is 4GHz, 22 times out of 3623 samples
>
> 1ms:
> The highest error is 3.3GHz, 3 times out of 2814 samples
>
> 10ms:
> No high error anymore
>
> Increase the measurement delay in cppc_cpufreq_get_rate to 10ms to avoid
> high measurement errors.
>
> Signed-off-by: Yang Shi <[email protected]>
> ---
> drivers/cpufreq/cppc_cpufreq.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> index 022e3555407c..c2bf65448d3d 100644
> --- a/drivers/cpufreq/cppc_cpufreq.c
> +++ b/drivers/cpufreq/cppc_cpufreq.c
> @@ -851,7 +851,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
> if (ret)
> return ret;
>
> - udelay(2); /* 2usec delay between sampling */
> + mdelay(10); /* 10msec delay between sampling */

This function can be called with interrupts off, so it cannot spin for 10 ms.

> ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
> if (ret)
> --

2023-03-29 23:24:14

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error



On 3/29/23 11:43 AM, Rafael J. Wysocki wrote:
> On Tue, Mar 28, 2023 at 9:39 PM Yang Shi <[email protected]> wrote:
>> When testing CPPC cpufreq on our platform, we noticed the error may be quite
>> high and the high error may happen quite often. For example, on a platform
>> with a maximum frequency of 2.8GHz when the CPUs were fully loaded (100% load),
>> we saw cpuinfo_cur_freq may show 4GHz, it means the error is > 40%. And the
>> high error (> 1%) happened 256 times out of 2127 samples (sampled every 3
>> seconds) in an approximate 2hrs test.
> The description above is a bit cryptic IMV. For example, it is not
> particularly clear what "high error" means.

The max freq is approximately 2.8GHz, but reading cpuinfo_cur_freq
showed 3.x GHz ~ 4GHz sometimes. The error seems high.

>
>> We tried to enlarge the delay, and tested with 100us, 1ms and 10ms. The
>> below is the results.
>>
>> 100us:
>> The highest error is 4GHz, 22 times out of 3623 samples
>>
>> 1ms:
>> The highest error is 3.3GHz, 3 times out of 2814 samples
>>
>> 10ms:
>> No high error anymore
>>
>> Increase the measurement delay in cppc_cpufreq_get_rate to 10ms to avoid
>> high measurement errors.
>>
>> Signed-off-by: Yang Shi <[email protected]>
>> ---
>> drivers/cpufreq/cppc_cpufreq.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
>> index 022e3555407c..c2bf65448d3d 100644
>> --- a/drivers/cpufreq/cppc_cpufreq.c
>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>> @@ -851,7 +851,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
>> if (ret)
>> return ret;
>>
>> - udelay(2); /* 2usec delay between sampling */
>> + mdelay(10); /* 10msec delay between sampling */
> This function can be called with interrupts off, so it cannot spin for 10 ms.

I didn't notice it may be called with irq off, if so 10ms spin sounds
too long. Is 100us fine? It can't eliminate the high error completely,
but alleviate it a lot. Or any other suggestion?

Thanks,
Yang

>
>> ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>> if (ret)
>> --

2023-03-30 03:56:59

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error

+ few folks.

On 28-03-23, 12:38, Yang Shi wrote:
> When testing CPPC cpufreq on our platform, we noticed the error may be quite
> high and the high error may happen quite often. For example, on a platform
> with a maximum frequency of 2.8GHz when the CPUs were fully loaded (100% load),
> we saw cpuinfo_cur_freq may show 4GHz, it means the error is > 40%. And the
> high error (> 1%) happened 256 times out of 2127 samples (sampled every 3
> seconds) in an approximate 2hrs test.
>
> We tried to enlarge the delay, and tested with 100us, 1ms and 10ms. The
> below is the results.
>
> 100us:
> The highest error is 4GHz, 22 times out of 3623 samples
>
> 1ms:
> The highest error is 3.3GHz, 3 times out of 2814 samples
>
> 10ms:
> No high error anymore
>
> Increase the measurement delay in cppc_cpufreq_get_rate to 10ms to avoid
> high measurement errors.
>
> Signed-off-by: Yang Shi <[email protected]>
> ---
> drivers/cpufreq/cppc_cpufreq.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> index 022e3555407c..c2bf65448d3d 100644
> --- a/drivers/cpufreq/cppc_cpufreq.c
> +++ b/drivers/cpufreq/cppc_cpufreq.c
> @@ -851,7 +851,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
> if (ret)
> return ret;
>
> - udelay(2); /* 2usec delay between sampling */
> + mdelay(10); /* 10msec delay between sampling */
>
> ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
> if (ret)
> --
> 2.39.2

--
viresh

2023-03-31 09:56:13

by Pierre Gondois

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error

Hello,

On 3/30/23 05:56, Viresh Kumar wrote:
> + few folks.
>
> On 28-03-23, 12:38, Yang Shi wrote:
>> When testing CPPC cpufreq on our platform, we noticed the error may be quite
>> high and the high error may happen quite often. For example, on a platform
>> with a maximum frequency of 2.8GHz when the CPUs were fully loaded (100% load),
>> we saw cpuinfo_cur_freq may show 4GHz, it means the error is > 40%. And the
>> high error (> 1%) happened 256 times out of 2127 samples (sampled every 3
>> seconds) in an approximate 2hrs test.
>>
>> We tried to enlarge the delay, and tested with 100us, 1ms and 10ms. The
>> below is the results.
>>
>> 100us:
>> The highest error is 4GHz, 22 times out of 3623 samples
>>
>> 1ms:
>> The highest error is 3.3GHz, 3 times out of 2814 samples
>>
>> 10ms:
>> No high error anymore
>>
>> Increase the measurement delay in cppc_cpufreq_get_rate to 10ms to avoid
>> high measurement errors.
>>
>> Signed-off-by: Yang Shi <[email protected]>
>> ---
>> drivers/cpufreq/cppc_cpufreq.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
>> index 022e3555407c..c2bf65448d3d 100644
>> --- a/drivers/cpufreq/cppc_cpufreq.c
>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>> @@ -851,7 +851,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
>> if (ret)
>> return ret;
>>
>> - udelay(2); /* 2usec delay between sampling */
>> + mdelay(10); /* 10msec delay between sampling */
>>
>> ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>> if (ret)
>> --
>> 2.39.2
>

Just 2 considerations:
-
When using the schedutil governor, frequencies should be updated with
a period of cppc_cpufreq_get_transition_delay_us().
This period should be 1ms if CPPC doesn't rely on PCC channels, otherwise
the value depends on the PCC channel (cf. cppc_get_transition_latency()).

If the evaluation duration for the perf/ref counters is higher than this
period, I think this would mean that multiple frequency update would happen
while trying to evaluate the current frequency of a CPU.

-
There is a TimeWindowRegister field in CPPC (cf. enum cppc_regs::TIME_WINDOW
and ACPI 6.5 s8.4.6.1.2.5 "Time Window Register") that should approximately
match what this patch aims to solve.
"""
When Autonomous Selection is enabled, values written to the Time Window Register are ignored. Reads of the Time
Window register indicate minimum length of time (in ms) between successive reads of the platform’s performance
counters.
"""
The only issue being that we should be in the case where Autonomous Selection
is disabled, where the description of the register is different.

Regards,
Pierre

2023-04-04 19:02:59

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error



On 3/31/23 2:53 AM, Pierre Gondois wrote:
> Hello,
>
> On 3/30/23 05:56, Viresh Kumar wrote:
>> + few folks.
>>
>> On 28-03-23, 12:38, Yang Shi wrote:
>>> When testing CPPC cpufreq on our platform, we noticed the error may
>>> be quite
>>> high and the high error may happen quite often.  For example, on a
>>> platform
>>> with a maximum frequency of 2.8GHz when the CPUs were fully loaded
>>> (100% load),
>>> we saw cpuinfo_cur_freq may show 4GHz, it means the error is > 40%. 
>>> And the
>>> high error (> 1%) happened 256 times out of 2127 samples (sampled
>>> every 3
>>> seconds) in an approximate 2hrs test.
>>>
>>> We tried to enlarge the delay, and tested with 100us, 1ms and 10ms. 
>>> The
>>> below is the results.
>>>
>>> 100us:
>>> The highest error is 4GHz, 22 times out of 3623 samples
>>>
>>> 1ms:
>>> The highest error is 3.3GHz, 3 times out of 2814 samples
>>>
>>> 10ms:
>>> No high error anymore
>>>
>>> Increase the measurement delay in cppc_cpufreq_get_rate to 10ms to
>>> avoid
>>> high measurement errors.
>>>
>>> Signed-off-by: Yang Shi <[email protected]>
>>> ---
>>>   drivers/cpufreq/cppc_cpufreq.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c
>>> b/drivers/cpufreq/cppc_cpufreq.c
>>> index 022e3555407c..c2bf65448d3d 100644
>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>> @@ -851,7 +851,7 @@ static unsigned int
>>> cppc_cpufreq_get_rate(unsigned int cpu)
>>>       if (ret)
>>>           return ret;
>>>   -    udelay(2); /* 2usec delay between sampling */
>>> +    mdelay(10); /* 10msec delay between sampling */
>>>         ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>>>       if (ret)
>>> --
>>> 2.39.2
>>
>
> Just 2 considerations:
> -
> When using the schedutil governor, frequencies should be updated with
> a period of cppc_cpufreq_get_transition_delay_us().
> This period should be 1ms if CPPC doesn't rely on PCC channels, otherwise
> the value depends on the PCC channel (cf. cppc_get_transition_latency()).
>
> If the evaluation duration for the perf/ref counters is higher than this
> period, I think this would mean that multiple frequency update would
> happen
> while trying to evaluate the current frequency of a CPU.
>
> -
> There is a TimeWindowRegister field in CPPC (cf. enum
> cppc_regs::TIME_WINDOW
> and ACPI 6.5 s8.4.6.1.2.5 "Time Window Register") that should
> approximately
> match what this patch aims to solve.
> """
> When Autonomous Selection is enabled, values written to the Time
> Window Register are ignored. Reads of the Time
> Window register indicate minimum length of time (in ms) between
> successive reads of the platform’s performance
> counters.
> """
> The only issue being that we should be in the case where Autonomous
> Selection
> is disabled, where the description of the register is different.
>

Thanks for the points. IIUC, the delay should be limited by the two
factors. So it should be max(cppc_cpufreq_get_transition_delay_us(),
Time Winder Register)?

> Regards,
> Pierre

2023-04-04 19:08:11

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error



On 3/29/23 11:43 AM, Rafael J. Wysocki wrote:
> On Tue, Mar 28, 2023 at 9:39 PM Yang Shi <[email protected]> wrote:
>> When testing CPPC cpufreq on our platform, we noticed the error may be quite
>> high and the high error may happen quite often. For example, on a platform
>> with a maximum frequency of 2.8GHz when the CPUs were fully loaded (100% load),
>> we saw cpuinfo_cur_freq may show 4GHz, it means the error is > 40%. And the
>> high error (> 1%) happened 256 times out of 2127 samples (sampled every 3
>> seconds) in an approximate 2hrs test.
> The description above is a bit cryptic IMV. For example, it is not
> particularly clear what "high error" means.
>
>> We tried to enlarge the delay, and tested with 100us, 1ms and 10ms. The
>> below is the results.
>>
>> 100us:
>> The highest error is 4GHz, 22 times out of 3623 samples
>>
>> 1ms:
>> The highest error is 3.3GHz, 3 times out of 2814 samples
>>
>> 10ms:
>> No high error anymore
>>
>> Increase the measurement delay in cppc_cpufreq_get_rate to 10ms to avoid
>> high measurement errors.
>>
>> Signed-off-by: Yang Shi <[email protected]>
>> ---
>> drivers/cpufreq/cppc_cpufreq.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
>> index 022e3555407c..c2bf65448d3d 100644
>> --- a/drivers/cpufreq/cppc_cpufreq.c
>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>> @@ -851,7 +851,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
>> if (ret)
>> return ret;
>>
>> - udelay(2); /* 2usec delay between sampling */
>> + mdelay(10); /* 10msec delay between sampling */
> This function can be called with interrupts off, so it cannot spin for 10 ms.

Per Pierre's comment, the delay may still be ms. Is it still too long? A
quick look at the code shows cpufreq_quick_get() is the only caller with
irq off IIRC. So can we have another callback for it, for example,
get_quick() which does spin for shorter time (for example, keep 2us
delay). Then have ->get() callback use longer delay?


>
>> ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>> if (ret)
>> --

2023-04-05 17:58:30

by Pierre Gondois

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error



On 4/4/23 21:07, Yang Shi wrote:
>
>
> On 3/29/23 11:43 AM, Rafael J. Wysocki wrote:
>> On Tue, Mar 28, 2023 at 9:39 PM Yang Shi <[email protected]> wrote:
>>> When testing CPPC cpufreq on our platform, we noticed the error may be quite
>>> high and the high error may happen quite often. For example, on a platform
>>> with a maximum frequency of 2.8GHz when the CPUs were fully loaded (100% load),
>>> we saw cpuinfo_cur_freq may show 4GHz, it means the error is > 40%. And the
>>> high error (> 1%) happened 256 times out of 2127 samples (sampled every 3
>>> seconds) in an approximate 2hrs test.
>> The description above is a bit cryptic IMV. For example, it is not
>> particularly clear what "high error" means.
>>
>>> We tried to enlarge the delay, and tested with 100us, 1ms and 10ms. The
>>> below is the results.
>>>
>>> 100us:
>>> The highest error is 4GHz, 22 times out of 3623 samples
>>>
>>> 1ms:
>>> The highest error is 3.3GHz, 3 times out of 2814 samples
>>>
>>> 10ms:
>>> No high error anymore
>>>
>>> Increase the measurement delay in cppc_cpufreq_get_rate to 10ms to avoid
>>> high measurement errors.
>>>
>>> Signed-off-by: Yang Shi <[email protected]>
>>> ---
>>> drivers/cpufreq/cppc_cpufreq.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
>>> index 022e3555407c..c2bf65448d3d 100644
>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>> @@ -851,7 +851,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
>>> if (ret)
>>> return ret;
>>>
>>> - udelay(2); /* 2usec delay between sampling */
>>> + mdelay(10); /* 10msec delay between sampling */
>> This function can be called with interrupts off, so it cannot spin for 10 ms.
>
> Per Pierre's comment, the delay may still be ms. Is it still too long? A
> quick look at the code shows cpufreq_quick_get() is the only caller with
> irq off IIRC. So can we have another callback for it, for example,
> get_quick() which does spin for shorter time (for example, keep 2us
> delay). Then have ->get() callback use longer delay?

-
According to the same ACPI 6.5 s8.4.6.1.2.5 "Time Window Register" paragraph,
and assuming we are in the 'Autonomous Selection is not enabled' case,
the OS is supposed to write (not read) the delta between successive reads of the
counter, so using this field as is would be bending the definition I think.

-
It is correct that the "Time Window Register" field specifies a value in ms,
but it seems a long time to wait for with irqs off.

-
Theoretically, the perf/ref counters should accumulate to allow computing
a correct frequency. Is it possible to know how these counters are accessed ?
Is it through PCC channels and there is some undesired delay between the
reads of the perf/ref counters ?

-
About making the delay:
max(cppc_cpufreq_get_transition_delay_us(), Time Winder Register)
I think it would be good to know why the values of the counters
don't accumulate correctly, ideally by getting a trace where a frequency
above the maximum frequency is computed, and with the timestamps at which
the counters are read.
If the values are coming from PCC channels / the firmware, it might be difficult
to get.

Just for reference, the value of 2us was picked in this thread:
https://lore.kernel.org/all/[email protected]/T/#u

Regards,
Pierre


>
>
>>
>>> ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>>> if (ret)
>>> --
>

2023-04-06 21:56:15

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error



On 4/5/23 10:57 AM, Pierre Gondois wrote:
>
>
> On 4/4/23 21:07, Yang Shi wrote:
>>
>>
>> On 3/29/23 11:43 AM, Rafael J. Wysocki wrote:
>>> On Tue, Mar 28, 2023 at 9:39 PM Yang Shi
>>> <[email protected]> wrote:
>>>> When testing CPPC cpufreq on our platform, we noticed the error may
>>>> be quite
>>>> high and the high error may happen quite often.  For example, on a
>>>> platform
>>>> with a maximum frequency of 2.8GHz when the CPUs were fully loaded
>>>> (100% load),
>>>> we saw cpuinfo_cur_freq may show 4GHz, it means the error is >
>>>> 40%.  And the
>>>> high error (> 1%) happened 256 times out of 2127 samples (sampled
>>>> every 3
>>>> seconds) in an approximate 2hrs test.
>>> The description above is a bit cryptic IMV.  For example, it is not
>>> particularly clear what "high error" means.
>>>
>>>> We tried to enlarge the delay, and tested with 100us, 1ms and
>>>> 10ms.  The
>>>> below is the results.
>>>>
>>>> 100us:
>>>> The highest error is 4GHz, 22 times out of 3623 samples
>>>>
>>>> 1ms:
>>>> The highest error is 3.3GHz, 3 times out of 2814 samples
>>>>
>>>> 10ms:
>>>> No high error anymore
>>>>
>>>> Increase the measurement delay in cppc_cpufreq_get_rate to 10ms to
>>>> avoid
>>>> high measurement errors.
>>>>
>>>> Signed-off-by: Yang Shi <[email protected]>
>>>> ---
>>>>    drivers/cpufreq/cppc_cpufreq.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c
>>>> b/drivers/cpufreq/cppc_cpufreq.c
>>>> index 022e3555407c..c2bf65448d3d 100644
>>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>>> @@ -851,7 +851,7 @@ static unsigned int
>>>> cppc_cpufreq_get_rate(unsigned int cpu)
>>>>           if (ret)
>>>>                   return ret;
>>>>
>>>> -       udelay(2); /* 2usec delay between sampling */
>>>> +       mdelay(10); /* 10msec delay between sampling */
>>> This function can be called with interrupts off, so it cannot spin
>>> for 10 ms.
>>
>> Per Pierre's comment, the delay may still be ms. Is it still too long? A
>> quick look at the code shows cpufreq_quick_get() is the only caller with
>> irq off IIRC. So can we have another callback for it, for example,
>> get_quick() which does spin for shorter time (for example, keep 2us
>> delay). Then have ->get() callback use longer delay?
>
> -
> According to the same ACPI 6.5 s8.4.6.1.2.5 "Time Window Register"
> paragraph,
> and assuming we are in the 'Autonomous Selection is not enabled' case,
> the OS is supposed to write (not read) the delta between successive
> reads of the
> counter, so using this field as is would be bending the definition I
> think.
>
> -
> It is correct that the "Time Window Register" field specifies a value
> in ms,
> but it seems a long time to wait for with irqs off.

AFAIK, our platforms don't support "Time Window Register".

>
> -
> Theoretically, the perf/ref counters should accumulate to allow computing
> a correct frequency. Is it possible to know how these counters are
> accessed ?
> Is it through PCC channels and there is some undesired delay between the
> reads of the perf/ref counters ?

The counters are implemented via mmio instead of PCC channels. So the
cpc_read() calls should go to ACPI_ADR_SPACE_SYSTEM_MEMORY IIRC.

>
> -
> About making the delay:
>     max(cppc_cpufreq_get_transition_delay_us(), Time Winder Register)
> I think it would be good to know why the values of the counters
> don't accumulate correctly, ideally by getting a trace where a frequency
> above the maximum frequency is computed, and with the timestamps at which
> the counters are read.
> If the values are coming from PCC channels / the firmware, it might be
> difficult
> to get.

I wrote a bpftrace script to trace the below data:
    - The CPU number
    - The frequency
    - The start and end timestamp of the first cppc_get_perf_ctrs() call
    - The duration/latency of the first cppc_get_perf_ctrs() call
    - The start and end timestamp of the second cppc_get_perf_ctrs() call
    - The duration/latency of the second cppc_get_perf_ctrs() call

The typical logs look like below.
Good
CPU: 1
Freq: 2801485KHz
First:  2489382384  2489387084 4700ns
Second: 2489390824  2489394024  3200ns
--------------------------------------------------
CPU:    2
Freq:   2797956KHz
First:  2490406524  2490411204  4680ns
Second: 2490414764  2490417684  2920ns

Bad:
CPU:    55
Freq:   3969372KHz
First:  875659868  875721568  61700ns
Second: 875725148  875727708  2560ns
--------------------------------------------------
CPU: 65
Freq: 3829744KHz
First:  3854951136  3854995896 44760ns
Second: 3854999416  3855002696 3280ns
--------------------------------------------------
CPU: 21
Freq: 4279242KHz
First:  240834204  240910484 76280ns
Second: 240914264  240916944  2680ns


The first line is cpu number, the second line is frequency returned by
cppc_cpufreq_get_rate(), the third line is the start and end timestamps
and duration of the first cppc_get_perf_ctrs(), the fourth line is the
start and end timestamps and duration of the second cppc_get_perf_ctrs().

So per the log I think we can tell basically the longer the duration the
higher the error. The 2us delay is not long enough to offset the impact
from unexpected latency of reading the counters.

In the worst case the frequency is 4279242KHz, comparing 2800000KHz the
error is over 50%. So the delay should be 4ms ~ 5ms in order to offset
the impact from reading the counters if I do the math correctly.

Hope the trace data is helpful to diagnose the problem.

>
> Just for reference, the value of 2us was picked in this thread:
> https://lore.kernel.org/all/[email protected]/T/#u
>
>
> Regards,
> Pierre
>
>
>>
>>
>>>
>>>>           ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>>>>           if (ret)
>>>> --
>>

2023-04-07 08:33:49

by Pierre Gondois

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error

Hello Yang,

On 4/6/23 23:52, Yang Shi wrote:
>
>
> On 4/5/23 10:57 AM, Pierre Gondois wrote:
>>
>>
>> On 4/4/23 21:07, Yang Shi wrote:
>>>
>>>
>>> On 3/29/23 11:43 AM, Rafael J. Wysocki wrote:
>>>> On Tue, Mar 28, 2023 at 9:39 PM Yang Shi
>>>> <[email protected]> wrote:
>>>>> When testing CPPC cpufreq on our platform, we noticed the error may
>>>>> be quite
>>>>> high and the high error may happen quite often.  For example, on a
>>>>> platform
>>>>> with a maximum frequency of 2.8GHz when the CPUs were fully loaded
>>>>> (100% load),
>>>>> we saw cpuinfo_cur_freq may show 4GHz, it means the error is >
>>>>> 40%.  And the
>>>>> high error (> 1%) happened 256 times out of 2127 samples (sampled
>>>>> every 3
>>>>> seconds) in an approximate 2hrs test.
>>>> The description above is a bit cryptic IMV.  For example, it is not
>>>> particularly clear what "high error" means.
>>>>
>>>>> We tried to enlarge the delay, and tested with 100us, 1ms and
>>>>> 10ms.  The
>>>>> below is the results.
>>>>>
>>>>> 100us:
>>>>> The highest error is 4GHz, 22 times out of 3623 samples
>>>>>
>>>>> 1ms:
>>>>> The highest error is 3.3GHz, 3 times out of 2814 samples
>>>>>
>>>>> 10ms:
>>>>> No high error anymore
>>>>>
>>>>> Increase the measurement delay in cppc_cpufreq_get_rate to 10ms to
>>>>> avoid
>>>>> high measurement errors.
>>>>>
>>>>> Signed-off-by: Yang Shi <[email protected]>
>>>>> ---
>>>>>    drivers/cpufreq/cppc_cpufreq.c | 2 +-
>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c
>>>>> b/drivers/cpufreq/cppc_cpufreq.c
>>>>> index 022e3555407c..c2bf65448d3d 100644
>>>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>>>> @@ -851,7 +851,7 @@ static unsigned int
>>>>> cppc_cpufreq_get_rate(unsigned int cpu)
>>>>>           if (ret)
>>>>>                   return ret;
>>>>>
>>>>> -       udelay(2); /* 2usec delay between sampling */
>>>>> +       mdelay(10); /* 10msec delay between sampling */
>>>> This function can be called with interrupts off, so it cannot spin
>>>> for 10 ms.
>>>
>>> Per Pierre's comment, the delay may still be ms. Is it still too long? A
>>> quick look at the code shows cpufreq_quick_get() is the only caller with
>>> irq off IIRC. So can we have another callback for it, for example,
>>> get_quick() which does spin for shorter time (for example, keep 2us
>>> delay). Then have ->get() callback use longer delay?
>>
>> -
>> According to the same ACPI 6.5 s8.4.6.1.2.5 "Time Window Register"
>> paragraph,
>> and assuming we are in the 'Autonomous Selection is not enabled' case,
>> the OS is supposed to write (not read) the delta between successive
>> reads of the
>> counter, so using this field as is would be bending the definition I
>> think.
>>
>> -
>> It is correct that the "Time Window Register" field specifies a value
>> in ms,
>> but it seems a long time to wait for with irqs off.
>
> AFAIK, our platforms don't support "Time Window Register".
>
>>
>> -
>> Theoretically, the perf/ref counters should accumulate to allow computing
>> a correct frequency. Is it possible to know how these counters are
>> accessed ?
>> Is it through PCC channels and there is some undesired delay between the
>> reads of the perf/ref counters ?
>
> The counters are implemented via mmio instead of PCC channels. So the
> cpc_read() calls should go to ACPI_ADR_SPACE_SYSTEM_MEMORY IIRC.
>
>>
>> -
>> About making the delay:
>>     max(cppc_cpufreq_get_transition_delay_us(), Time Winder Register)
>> I think it would be good to know why the values of the counters
>> don't accumulate correctly, ideally by getting a trace where a frequency
>> above the maximum frequency is computed, and with the timestamps at which
>> the counters are read.
>> If the values are coming from PCC channels / the firmware, it might be
>> difficult
>> to get.
>
> I wrote a bpftrace script to trace the below data:
>     - The CPU number
>     - The frequency
>     - The start and end timestamp of the first cppc_get_perf_ctrs() call
>     - The duration/latency of the first cppc_get_perf_ctrs() call
>     - The start and end timestamp of the second cppc_get_perf_ctrs() call
>     - The duration/latency of the second cppc_get_perf_ctrs() call
>
> The typical logs look like below.
> Good
> CPU: 1
> Freq: 2801485KHz
> First:  2489382384  2489387084 4700ns
> Second: 2489390824  2489394024  3200ns
> --------------------------------------------------
> CPU:    2
> Freq:   2797956KHz
> First:  2490406524  2490411204  4680ns
> Second: 2490414764  2490417684  2920ns
>
> Bad:
> CPU:    55
> Freq:   3969372KHz
> First:  875659868  875721568  61700ns
> Second: 875725148  875727708  2560ns
> --------------------------------------------------
> CPU: 65
> Freq: 3829744KHz
> First:  3854951136  3854995896 44760ns
> Second: 3854999416  3855002696 3280ns
> --------------------------------------------------
> CPU: 21
> Freq: 4279242KHz
> First:  240834204  240910484 76280ns
> Second: 240914264  240916944  2680ns
>
>
> The first line is cpu number, the second line is frequency returned by
> cppc_cpufreq_get_rate(), the third line is the start and end timestamps
> and duration of the first cppc_get_perf_ctrs(), the fourth line is the
> start and end timestamps and duration of the second cppc_get_perf_ctrs().
>
> So per the log I think we can tell basically the longer the duration the
> higher the error. The 2us delay is not long enough to offset the impact
> from unexpected latency of reading the counters.
>
> In the worst case the frequency is 4279242KHz, comparing 2800000KHz the
> error is over 50%. So the delay should be 4ms ~ 5ms in order to offset
> the impact from reading the counters if I do the math correctly.
>
> Hope the trace data is helpful to diagnose the problem.


Thanks for the data. I was thinking the following was happening:

cppc_get_perf_ctrs()[0] cppc_get_perf_ctrs()[1]
/ \ / \
ref[0] delivered[0] ref[1] delivered[1]
| | | |
v v v v
----------------------------------------------------------------------> time
<-delta[0]-> <-------------2us------------> <----delta[1]---->

If delta[0] is really different from delta[1] like above, then the
reference and delivered counters would have accumulated during different
intervals, resulting in a wrong frequency.
If more/less than 2us elapse between the two cppc_get_perf_ctrs() calls,
then it shouldn't have any impact. So waiting ~10ms should theoretically
not solve the issue.

freq = ref_freq * (delivered[1] - delivered[0]) / (ref[1] - ref[0])

If the counters are accessed through mmio, I don't see anything that would
make delta[x] vary when calling cppc_get_perf_ctrs(), cf. cpc_read().
Do you know if the address represents real counters or a place in memory
representing something else ?

Would it be possible to try setting the CPU frequency to one unique value
and get a serie of values like:
[timestamp, ref_counter_value, deliverd_counter_value]

This would allow to check that the counters are accumulating at a valid
pace. Also you said there were frequencies above the maximum value, but
are there also frequencies below the minimum value ?

Regards,
Pierre

2023-04-07 22:23:39

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error



On 4/7/23 1:31 AM, Pierre Gondois wrote:
> Hello Yang,
>
> On 4/6/23 23:52, Yang Shi wrote:
>>
>>
>> On 4/5/23 10:57 AM, Pierre Gondois wrote:
>>>
>>>
>>> On 4/4/23 21:07, Yang Shi wrote:
>>>>
>>>>
>>>> On 3/29/23 11:43 AM, Rafael J. Wysocki wrote:
>>>>> On Tue, Mar 28, 2023 at 9:39 PM Yang Shi
>>>>> <[email protected]> wrote:
>>>>>> When testing CPPC cpufreq on our platform, we noticed the error may
>>>>>> be quite
>>>>>> high and the high error may happen quite often.  For example, on a
>>>>>> platform
>>>>>> with a maximum frequency of 2.8GHz when the CPUs were fully loaded
>>>>>> (100% load),
>>>>>> we saw cpuinfo_cur_freq may show 4GHz, it means the error is >
>>>>>> 40%.  And the
>>>>>> high error (> 1%) happened 256 times out of 2127 samples (sampled
>>>>>> every 3
>>>>>> seconds) in an approximate 2hrs test.
>>>>> The description above is a bit cryptic IMV.  For example, it is not
>>>>> particularly clear what "high error" means.
>>>>>
>>>>>> We tried to enlarge the delay, and tested with 100us, 1ms and
>>>>>> 10ms.  The
>>>>>> below is the results.
>>>>>>
>>>>>> 100us:
>>>>>> The highest error is 4GHz, 22 times out of 3623 samples
>>>>>>
>>>>>> 1ms:
>>>>>> The highest error is 3.3GHz, 3 times out of 2814 samples
>>>>>>
>>>>>> 10ms:
>>>>>> No high error anymore
>>>>>>
>>>>>> Increase the measurement delay in cppc_cpufreq_get_rate to 10ms to
>>>>>> avoid
>>>>>> high measurement errors.
>>>>>>
>>>>>> Signed-off-by: Yang Shi <[email protected]>
>>>>>> ---
>>>>>>     drivers/cpufreq/cppc_cpufreq.c | 2 +-
>>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c
>>>>>> b/drivers/cpufreq/cppc_cpufreq.c
>>>>>> index 022e3555407c..c2bf65448d3d 100644
>>>>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>>>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>>>>> @@ -851,7 +851,7 @@ static unsigned int
>>>>>> cppc_cpufreq_get_rate(unsigned int cpu)
>>>>>>            if (ret)
>>>>>>                    return ret;
>>>>>>
>>>>>> -       udelay(2); /* 2usec delay between sampling */
>>>>>> +       mdelay(10); /* 10msec delay between sampling */
>>>>> This function can be called with interrupts off, so it cannot spin
>>>>> for 10 ms.
>>>>
>>>> Per Pierre's comment, the delay may still be ms. Is it still too
>>>> long? A
>>>> quick look at the code shows cpufreq_quick_get() is the only caller
>>>> with
>>>> irq off IIRC. So can we have another callback for it, for example,
>>>> get_quick() which does spin for shorter time (for example, keep 2us
>>>> delay). Then have ->get() callback use longer delay?
>>>
>>> -
>>> According to the same ACPI 6.5 s8.4.6.1.2.5 "Time Window Register"
>>> paragraph,
>>> and assuming we are in the 'Autonomous Selection is not enabled' case,
>>> the OS is supposed to write (not read) the delta between successive
>>> reads of the
>>> counter, so using this field as is would be bending the definition I
>>> think.
>>>
>>> -
>>> It is correct that the "Time Window Register" field specifies a value
>>> in ms,
>>> but it seems a long time to wait for with irqs off.
>>
>> AFAIK, our platforms don't support "Time Window Register".
>>
>>>
>>> -
>>> Theoretically, the perf/ref counters should accumulate to allow
>>> computing
>>> a correct frequency. Is it possible to know how these counters are
>>> accessed ?
>>> Is it through PCC channels and there is some undesired delay between
>>> the
>>> reads of the perf/ref counters ?
>>
>> The counters are implemented via mmio instead of PCC channels. So the
>> cpc_read() calls should go to ACPI_ADR_SPACE_SYSTEM_MEMORY IIRC.
>>
>>>
>>> -
>>> About making the delay:
>>>      max(cppc_cpufreq_get_transition_delay_us(), Time Winder Register)
>>> I think it would be good to know why the values of the counters
>>> don't accumulate correctly, ideally by getting a trace where a
>>> frequency
>>> above the maximum frequency is computed, and with the timestamps at
>>> which
>>> the counters are read.
>>> If the values are coming from PCC channels / the firmware, it might be
>>> difficult
>>> to get.
>>
>> I wrote a bpftrace script to trace the below data:
>>       - The CPU number
>>       - The frequency
>>       - The start and end timestamp of the first cppc_get_perf_ctrs()
>> call
>>       - The duration/latency of the first cppc_get_perf_ctrs() call
>>       - The start and end timestamp of the second
>> cppc_get_perf_ctrs() call
>>       - The duration/latency of the second cppc_get_perf_ctrs() call
>>
>> The typical logs look like below.
>> Good
>> CPU: 1
>> Freq: 2801485KHz
>> First:  2489382384  2489387084 4700ns
>> Second: 2489390824  2489394024  3200ns
>> --------------------------------------------------
>> CPU:    2
>> Freq:   2797956KHz
>> First:  2490406524  2490411204  4680ns
>> Second: 2490414764  2490417684  2920ns
>>
>> Bad:
>> CPU:    55
>> Freq:   3969372KHz
>> First:  875659868  875721568  61700ns
>> Second: 875725148  875727708  2560ns
>> --------------------------------------------------
>> CPU: 65
>> Freq: 3829744KHz
>> First:  3854951136  3854995896 44760ns
>> Second: 3854999416  3855002696 3280ns
>> --------------------------------------------------
>> CPU: 21
>> Freq: 4279242KHz
>> First:  240834204  240910484 76280ns
>> Second: 240914264  240916944  2680ns
>>
>>
>> The first line is cpu number, the second line is frequency returned by
>> cppc_cpufreq_get_rate(), the third line is the start and end timestamps
>> and duration of the first cppc_get_perf_ctrs(), the fourth line is the
>> start and end timestamps and duration of the second
>> cppc_get_perf_ctrs().
>>
>> So per the log I think we can tell basically the longer the duration the
>> higher the error. The 2us delay is not long enough to offset the impact
>> from unexpected latency of reading the counters.
>>
>> In the worst case the frequency is 4279242KHz, comparing 2800000KHz the
>> error is over 50%. So the delay should be 4ms ~ 5ms in order to offset
>> the impact from reading the counters if I do the math correctly.
>>
>> Hope the trace data is helpful to diagnose the problem.
>
>
> Thanks for the data. I was thinking the following was happening:
>
>  cppc_get_perf_ctrs()[0] cppc_get_perf_ctrs()[1]
> /                    \ /                         \
> ref[0]    delivered[0]                    ref[1] delivered[1]
>   |            |                              |                  |
>   v            v                              v                  v
> ---------------------------------------------------------------------->
> time
>    <-delta[0]-> <-------------2us------------> <----delta[1]---->
>
> If delta[0] is really different from delta[1] like above, then the
> reference and delivered counters would have accumulated during different
> intervals, resulting in a wrong frequency.

Yeah, it looks like so.

> If more/less than 2us elapse between the two cppc_get_perf_ctrs() calls,
> then it shouldn't have any impact. So waiting ~10ms should theoretically
> not solve the issue.

I'm not sure whether the 10ms delay really resolved the issue, but it
did reduce the magnitude of the error.

BTW, I don't see irq is disabled when reading cpuinfo_cur_freq, so it
looks like interrupts could easily result in the difference between
delta[0] and delta[1]. And it seems like the difference matters.

And the counters are accessed through an interconnect on our platform,
so the interconnect congestion may result in the difference as well.

>
> freq = ref_freq * (delivered[1] - delivered[0]) / (ref[1] - ref[0])
>
> If the counters are accessed through mmio, I don't see anything that
> would
> make delta[x] vary when calling cppc_get_perf_ctrs(), cf. cpc_read().
> Do you know if the address represents real counters or a place in memory
> representing something else ?

The address does represent real counters.

>
> Would it be possible to try setting the CPU frequency to one unique value
> and get a serie of values like:
> [timestamp, ref_counter_value, deliverd_counter_value]

Could you please elaborate regarding "setting the CPU frequency to one
unique value"? What value is unique?

>
> This would allow to check that the counters are accumulating at a valid
> pace. Also you said there were frequencies above the maximum value, but
> are there also frequencies below the minimum value ?

I've never seen the frequency below the minimum value.

>
> Regards,
> Pierre

2023-04-11 09:05:31

by Pierre Gondois

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error

Hello Yang,

>>>>
>>>> -
>>>> According to the same ACPI 6.5 s8.4.6.1.2.5 "Time Window Register"
>>>> paragraph,
>>>> and assuming we are in the 'Autonomous Selection is not enabled' case,
>>>> the OS is supposed to write (not read) the delta between successive
>>>> reads of the
>>>> counter, so using this field as is would be bending the definition I
>>>> think.
>>>>
>>>> -
>>>> It is correct that the "Time Window Register" field specifies a value
>>>> in ms,
>>>> but it seems a long time to wait for with irqs off.
>>>
>>> AFAIK, our platforms don't support "Time Window Register".
>>>
>>>>
>>>> -
>>>> Theoretically, the perf/ref counters should accumulate to allow
>>>> computing
>>>> a correct frequency. Is it possible to know how these counters are
>>>> accessed ?
>>>> Is it through PCC channels and there is some undesired delay between
>>>> the
>>>> reads of the perf/ref counters ?
>>>
>>> The counters are implemented via mmio instead of PCC channels. So the
>>> cpc_read() calls should go to ACPI_ADR_SPACE_SYSTEM_MEMORY IIRC.
>>>
>>>>
>>>> -
>>>> About making the delay:
>>>>      max(cppc_cpufreq_get_transition_delay_us(), Time Winder Register)
>>>> I think it would be good to know why the values of the counters
>>>> don't accumulate correctly, ideally by getting a trace where a
>>>> frequency
>>>> above the maximum frequency is computed, and with the timestamps at
>>>> which
>>>> the counters are read.
>>>> If the values are coming from PCC channels / the firmware, it might be
>>>> difficult
>>>> to get.
>>>
>>> I wrote a bpftrace script to trace the below data:
>>>       - The CPU number
>>>       - The frequency
>>>       - The start and end timestamp of the first cppc_get_perf_ctrs()
>>> call
>>>       - The duration/latency of the first cppc_get_perf_ctrs() call
>>>       - The start and end timestamp of the second
>>> cppc_get_perf_ctrs() call
>>>       - The duration/latency of the second cppc_get_perf_ctrs() call
>>>
>>> The typical logs look like below.
>>> Good
>>> CPU: 1
>>> Freq: 2801485KHz
>>> First:  2489382384  2489387084 4700ns
>>> Second: 2489390824  2489394024  3200ns
>>> --------------------------------------------------
>>> CPU:    2
>>> Freq:   2797956KHz
>>> First:  2490406524  2490411204  4680ns
>>> Second: 2490414764  2490417684  2920ns
>>>
>>> Bad:
>>> CPU:    55
>>> Freq:   3969372KHz
>>> First:  875659868  875721568  61700ns
>>> Second: 875725148  875727708  2560ns
>>> --------------------------------------------------
>>> CPU: 65
>>> Freq: 3829744KHz
>>> First:  3854951136  3854995896 44760ns
>>> Second: 3854999416  3855002696 3280ns
>>> --------------------------------------------------
>>> CPU: 21
>>> Freq: 4279242KHz
>>> First:  240834204  240910484 76280ns
>>> Second: 240914264  240916944  2680ns
>>>
>>>
>>> The first line is cpu number, the second line is frequency returned by
>>> cppc_cpufreq_get_rate(), the third line is the start and end timestamps
>>> and duration of the first cppc_get_perf_ctrs(), the fourth line is the
>>> start and end timestamps and duration of the second
>>> cppc_get_perf_ctrs().
>>>
>>> So per the log I think we can tell basically the longer the duration the
>>> higher the error. The 2us delay is not long enough to offset the impact
>>> from unexpected latency of reading the counters.
>>>
>>> In the worst case the frequency is 4279242KHz, comparing 2800000KHz the
>>> error is over 50%. So the delay should be 4ms ~ 5ms in order to offset
>>> the impact from reading the counters if I do the math correctly.
>>>
>>> Hope the trace data is helpful to diagnose the problem.
>>
>>
>> Thanks for the data. I was thinking the following was happening:
>>
>>  cppc_get_perf_ctrs()[0] cppc_get_perf_ctrs()[1]
>> /                    \ /                         \
>> ref[0]    delivered[0]                    ref[1] delivered[1]
>>   |            |                              |                  |
>>   v            v                              v                  v
>> ---------------------------------------------------------------------->
>> time
>>    <-delta[0]-> <-------------2us------------> <----delta[1]---->
>>
>> If delta[0] is really different from delta[1] like above, then the
>> reference and delivered counters would have accumulated during different
>> intervals, resulting in a wrong frequency.
>
> Yeah, it looks like so.
>
>> If more/less than 2us elapse between the two cppc_get_perf_ctrs() calls,
>> then it shouldn't have any impact. So waiting ~10ms should theoretically
>> not solve the issue.
>
> I'm not sure whether the 10ms delay really resolved the issue, but it
> did reduce the magnitude of the error.
>
> BTW, I don't see irq is disabled when reading cpuinfo_cur_freq, so it
> looks like interrupts could easily result in the difference between
> delta[0] and delta[1]. And it seems like the difference matters.

Ok, maybe disabling irqs would have an impact ?

>
> And the counters are accessed through an interconnect on our platform,
> so the interconnect congestion may result in the difference as well.
>
>>
>> freq = ref_freq * (delivered[1] - delivered[0]) / (ref[1] - ref[0])
>>
>> If the counters are accessed through mmio, I don't see anything that
>> would
>> make delta[x] vary when calling cppc_get_perf_ctrs(), cf. cpc_read().
>> Do you know if the address represents real counters or a place in memory
>> representing something else ?
>
> The address does represent real counters.

Oh ok, is it possible to know what is there ?

>
>>
>> Would it be possible to try setting the CPU frequency to one unique value
>> and get a serie of values like:
>> [timestamp, ref_counter_value, deliverd_counter_value]
>
> Could you please elaborate regarding "setting the CPU frequency to one
> unique value"? What value is unique?

I meant having the CPUs using only on frequency. The following should work:
cat /sys/devices/system/cpu/cpu[X]/cpufreq/scaling_min_freq > /sys/devices/system/cpu/cpu[X]/cpufreq/scaling_max_freq

>
>>
>> This would allow to check that the counters are accumulating at a valid
>> pace. Also you said there were frequencies above the maximum value, but
>> are there also frequencies below the minimum value ?
>
> I've never seen the frequency below the minimum value.

Maybe this is because the CPUs are running at their maximum frequency,
so when the computed frequency is below the actual running frequency,
it still doesn't go below the minimum frequency.
Meaning that if the CPUs were limited to their lowest frequency (with
the command above), maybe the get_rate() function would return values
in a range [[1]:[2]]:
- [1]: frequency below the lowest frequency
- [2]: frequency above the lowest frequency, but below the maximum
frequency of 2.8GHz

Would it be possible to do the following:

# Shut down all the CPUs except the last one just to have less logs
for i in /sys/devices/system/cpu/cpu[0-9]* ; do echo 0 > $i/online ; done

cd /sys/kernel/debug/tracing

# Add a kprobe to cppc_cpufreq_get_rate to get the computed freq
echo 'r:myretprobe cppc_cpufreq_get_rate $retval:u32' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > events/kprobes/enable

# Setup ftrace to trace cppc_cpufreq_get_rate() calls
# (and maybe see if something undesired happens in the call)
echo function_graph > current_tracer
echo funcgraph-abstime > trace_options
echo cppc_cpufreq_get_rate > set_graph_function

# Do the tracing
echo 1 > tracing_on
# Wait a bit for a call to cppc_cpufreq_get_rate() to happen
echo 0 > tracing_on

Also adding the following in cppc_perf_from_fbctrs() should allow
tracking the counter values:
trace_printk("get_cntrs: ref0=%lx ref1=%lx del0=%lx del1=%lx",
fb_ctrs_t0->reference, fb_ctrs_t1->reference,
fb_ctrs_t0->delivered, fb_ctrs_t1->delivered);

Regards,
Pierre

2023-04-18 18:37:55

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error



On 4/11/23 1:51 AM, Pierre Gondois wrote:
> Hello Yang,
>
>>>>>
>>>>> -
>>>>> According to the same ACPI 6.5 s8.4.6.1.2.5 "Time Window Register"
>>>>> paragraph,
>>>>> and assuming we are in the 'Autonomous Selection is not enabled'
>>>>> case,
>>>>> the OS is supposed to write (not read) the delta between successive
>>>>> reads of the
>>>>> counter, so using this field as is would be bending the definition I
>>>>> think.
>>>>>
>>>>> -
>>>>> It is correct that the "Time Window Register" field specifies a value
>>>>> in ms,
>>>>> but it seems a long time to wait for with irqs off.
>>>>
>>>> AFAIK, our platforms don't support "Time Window Register".
>>>>
>>>>>
>>>>> -
>>>>> Theoretically, the perf/ref counters should accumulate to allow
>>>>> computing
>>>>> a correct frequency. Is it possible to know how these counters are
>>>>> accessed ?
>>>>> Is it through PCC channels and there is some undesired delay between
>>>>> the
>>>>> reads of the perf/ref counters ?
>>>>
>>>> The counters are implemented via mmio instead of PCC channels. So the
>>>> cpc_read() calls should go to ACPI_ADR_SPACE_SYSTEM_MEMORY IIRC.
>>>>
>>>>>
>>>>> -
>>>>> About making the delay:
>>>>>       max(cppc_cpufreq_get_transition_delay_us(), Time Winder
>>>>> Register)
>>>>> I think it would be good to know why the values of the counters
>>>>> don't accumulate correctly, ideally by getting a trace where a
>>>>> frequency
>>>>> above the maximum frequency is computed, and with the timestamps at
>>>>> which
>>>>> the counters are read.
>>>>> If the values are coming from PCC channels / the firmware, it
>>>>> might be
>>>>> difficult
>>>>> to get.
>>>>
>>>> I wrote a bpftrace script to trace the below data:
>>>>        - The CPU number
>>>>        - The frequency
>>>>        - The start and end timestamp of the first cppc_get_perf_ctrs()
>>>> call
>>>>        - The duration/latency of the first cppc_get_perf_ctrs() call
>>>>        - The start and end timestamp of the second
>>>> cppc_get_perf_ctrs() call
>>>>        - The duration/latency of the second cppc_get_perf_ctrs() call
>>>>
>>>> The typical logs look like below.
>>>> Good
>>>> CPU: 1
>>>> Freq: 2801485KHz
>>>> First:  2489382384  2489387084 4700ns
>>>> Second: 2489390824  2489394024  3200ns
>>>> --------------------------------------------------
>>>> CPU:    2
>>>> Freq:   2797956KHz
>>>> First:  2490406524  2490411204  4680ns
>>>> Second: 2490414764  2490417684  2920ns
>>>>
>>>> Bad:
>>>> CPU:    55
>>>> Freq:   3969372KHz
>>>> First:  875659868  875721568  61700ns
>>>> Second: 875725148  875727708  2560ns
>>>> --------------------------------------------------
>>>> CPU: 65
>>>> Freq: 3829744KHz
>>>> First:  3854951136  3854995896 44760ns
>>>> Second: 3854999416  3855002696 3280ns
>>>> --------------------------------------------------
>>>> CPU: 21
>>>> Freq: 4279242KHz
>>>> First:  240834204  240910484 76280ns
>>>> Second: 240914264  240916944  2680ns
>>>>
>>>>
>>>> The first line is cpu number, the second line is frequency returned by
>>>> cppc_cpufreq_get_rate(), the third line is the start and end
>>>> timestamps
>>>> and duration of the first cppc_get_perf_ctrs(), the fourth line is the
>>>> start and end timestamps and duration of the second
>>>> cppc_get_perf_ctrs().
>>>>
>>>> So per the log I think we can tell basically the longer the
>>>> duration the
>>>> higher the error. The 2us delay is not long enough to offset the
>>>> impact
>>>> from unexpected latency of reading the counters.
>>>>
>>>> In the worst case the frequency is 4279242KHz, comparing 2800000KHz
>>>> the
>>>> error is over 50%. So the delay should be 4ms ~ 5ms in order to offset
>>>> the impact from reading the counters if I do the math correctly.
>>>>
>>>> Hope the trace data is helpful to diagnose the problem.
>>>
>>>
>>> Thanks for the data. I was thinking the following was happening:
>>>
>>>   cppc_get_perf_ctrs()[0] cppc_get_perf_ctrs()[1]
>>> /                    \ /                         \
>>> ref[0]    delivered[0]                    ref[1] delivered[1]
>>>    |            | |                  |
>>>    v            v v                  v
>>> ---------------------------------------------------------------------->
>>> time
>>>     <-delta[0]-> <-------------2us------------> <----delta[1]---->
>>>
>>> If delta[0] is really different from delta[1] like above, then the
>>> reference and delivered counters would have accumulated during
>>> different
>>> intervals, resulting in a wrong frequency.
>>
>> Yeah, it looks like so.
>>
>>> If more/less than 2us elapse between the two cppc_get_perf_ctrs()
>>> calls,
>>> then it shouldn't have any impact. So waiting ~10ms should
>>> theoretically
>>> not solve the issue.
>>
>> I'm not sure whether the 10ms delay really resolved the issue, but it
>> did reduce the magnitude of the error.
>>
>> BTW, I don't see irq is disabled when reading cpuinfo_cur_freq, so it
>> looks like interrupts could easily result in the difference between
>> delta[0] and delta[1]. And it seems like the difference matters.
>
> Ok, maybe disabling irqs would have an impact ?

Yes, I ran some tracing to try to capture what happened when we saw high
error. And it seems like IRQs is one of the contributing factors.
Interconnect congestion is also a contributing factor. The below trace
data supported the conclusion.

4081929KHz
# tracer: function_graph
#
#     TIME        CPU  DURATION                  FUNCTION CALLS
#      |          |     |   |                     |   |   | |
13378.411863 |     0)               |  cppc_cpufreq_get_rate
[cppc_cpufreq]() {
13378.411864 |     0)               |    cpufreq_cpu_get() {
13378.411865 |     0)               |      _raw_read_lock_irqsave() {
13378.411865 |     0)   0.680 us    | preempt_count_add();
13378.411867 |     0)   2.020 us    | }
13378.411868 |     0)               | _raw_read_unlock_irqrestore() {
13378.411868 |     0)   0.520 us    | preempt_count_sub();
13378.411869 |     0)   1.340 us    | }
13378.411869 |     0)   5.160 us    | }
13378.411870 |     0)   0.440 us    | cpufreq_cpu_put();
13378.411871 |     0)               |    cppc_get_perf_ctrs() {
13378.411872 |     0)               |      cpc_read.isra.0() {
13378.411874 |     0)               |        irq_enter_rcu() {
13378.411874 |     0)   0.460 us    | preempt_count_add();
13378.411875 |     0)   0.660 us    | irqtime_account_irq();
13378.411876 |     0)   2.480 us    | }
13378.411877 |     0)               |        do_interrupt_handler() {
13378.411877 |     0)   ==========> |
13378.411877 |     0) + 49.260 us   | gic_handle_irq();
13378.411877 |     0)   <========== |
13378.411928 |     0) + 51.000 us   | }
13378.411928 |     0)               |        irq_exit_rcu() {
13378.411929 |     0)   0.540 us    | irqtime_account_irq();
13378.411929 |     0)   0.420 us    | preempt_count_sub();
13378.411930 |     0)   0.640 us    | ksoftirqd_running();
13378.411931 |     0) + 16.680 us   | do_softirq_own_stack();
13378.411948 |     0)   0.440 us    | idle_cpu();
13378.411949 |     0) + 21.200 us   |        }
13378.411950 |     0) + 77.480 us   |      }
13378.411950 |     0)   0.740 us    |      cpc_read.isra.0();
13378.411951 |     0)   0.440 us    |      cpc_read.isra.0();
13378.411952 |     0)   0.660 us    |      cpc_read.isra.0();
13378.411953 |     0) + 81.980 us   |    }
13378.411954 |     0)               |    __delay() {
13378.411954 |     0)   0.580 us    | arch_timer_evtstrm_available();
13378.411956 |     0)   2.460 us    |    }
13378.411956 |     0)               |    cppc_get_perf_ctrs() {
13378.411957 |     0)   0.540 us    | cpc_read.isra.0();
13378.411958 |     0)   0.520 us    | cpc_read.isra.0();
13378.411959 |     0)   0.420 us    | cpc_read.isra.0();
13378.411959 |     0)   0.440 us    |      cpc_read.isra.0();

The above trace data captured IRQ when cppc_cpufreq_get_rate() returns
4081929KHz.

3052367KHz
# tracer: function_graph
#
#     TIME        CPU  DURATION                  FUNCTION CALLS
#      |          |     |   |                     |   |   | |
19177.226497 |     0)               |  cppc_cpufreq_get_rate
[cppc_cpufreq]() {
19177.226498 |     0)               |    cpufreq_cpu_get() {
19177.226499 |     0)               |      _raw_read_lock_irqsave() {
19177.226499 |     0)   0.600 us    |        preempt_count_add();
19177.226500 |     0)   1.820 us    |      }
19177.226501 |     0)               | _raw_read_unlock_irqrestore() {
19177.226502 |     0)   0.480 us    |        preempt_count_sub();
19177.226502 |     0)   1.260 us    |      }
19177.226503 |     0)   4.780 us    |    }
19177.226503 |     0)   0.420 us    |    cpufreq_cpu_put();
19177.226504 |     0)               |    cppc_get_perf_ctrs() {
19177.226505 |     0)   0.720 us    |      cpc_read.isra.0();
19177.226506 |     0)   1.180 us    |      cpc_read.isra.0();
19177.226508 |     0)   0.400 us    |      cpc_read.isra.0();
19177.226508 |     0)   0.600 us    |      cpc_read.isra.0();
19177.226509 |     0)   5.240 us    |    }
19177.226510 |     0)               |    __delay() {
19177.226510 |     0)   0.520 us    | arch_timer_evtstrm_available();
19177.226512 |     0)   2.420 us    |    }
19177.226513 |     0)               |    cppc_get_perf_ctrs() {
19177.226513 |     0)   0.540 us    |      cpc_read.isra.0();
19177.226514 |     0)   0.540 us    |      cpc_read.isra.0();
19177.226515 |     0)   0.400 us    |      cpc_read.isra.0();
19177.226516 |     0)   0.440 us    |      cpc_read.isra.0();
19177.226516 |     0)   3.660 us    |    }
19177.226517 |     0)   0.780 us    |    cppc_cpufreq_perf_to_khz
[cppc_cpufreq]();
19177.226518 |     0) + 23.320 us   |  }

The above trace data shows some cpc reading took a little bit longer
than usual. I suspected it was caused by interconnect congestion.


So it looks like IRQ is the major contributing factor of high error
(4xxxxxx KHz), interconnect congestion is the major contributing factor
of low error (3xxxxxx KHz).

So I did the below test:
1. Disable IRQ: The high errors were gone (> 3700000KHz), but low errors
were still seen.
2.10us delay: The high errors were still seen.
3. Disable IRQ + 10us delay: all the errors were gone.

I think the test result also supports the tracing data.


I also got some confusion about calling cppc_cpufreq_get_rate() with irq
disabled. Rafael thought 10ms delay is too long because the function may
be called with irq disabled. But a deeper look at the function shows it
should *NOT* be called with irq disabled at all.

First, if pcc channel is used, cpc reading may take over 100ms, it is
way larger the proposed 10ms delay.
Second, reading from cpc channel needs to take a semaphore, so it may
sleep. But sleep with IRQ disabled is not allowed.
Third, if the counters are implemented by AMU, cpc_read_ffh() needs to
send IPI so it requires IRQ enabled.

>
>>
>> And the counters are accessed through an interconnect on our platform,
>> so the interconnect congestion may result in the difference as well.
>>
>>>
>>> freq = ref_freq * (delivered[1] - delivered[0]) / (ref[1] - ref[0])
>>>
>>> If the counters are accessed through mmio, I don't see anything that
>>> would
>>> make delta[x] vary when calling cppc_get_perf_ctrs(), cf. cpc_read().
>>> Do you know if the address represents real counters or a place in
>>> memory
>>> representing something else ?
>>
>> The address does represent real counters.
>
> Oh ok, is it possible to know what is there ?

AFAIK, it is memory mapped access to the core's AMU counters.

>
>>
>>>
>>> Would it be possible to try setting the CPU frequency to one unique
>>> value
>>> and get a serie of values like:
>>> [timestamp, ref_counter_value, deliverd_counter_value]
>>
>> Could you please elaborate regarding "setting the CPU frequency to one
>> unique value"? What value is unique?
>
> I meant having the CPUs using only on frequency. The following should
> work:
> cat /sys/devices/system/cpu/cpu[X]/cpufreq/scaling_min_freq >
> /sys/devices/system/cpu/cpu[X]/cpufreq/scaling_max_freq
>
>>
>>>
>>> This would allow to check that the counters are accumulating at a valid
>>> pace. Also you said there were frequencies above the maximum value, but
>>> are there also frequencies below the minimum value ?
>>
>> I've never seen the frequency below the minimum value.
>
> Maybe this is because the CPUs are running at their maximum frequency,
> so when the computed frequency is below the actual running frequency,
> it still doesn't go below the minimum frequency.
> Meaning that if the CPUs were limited to their lowest frequency (with
> the command above), maybe the get_rate() function would return values
> in a range [[1]:[2]]:
> - [1]: frequency below the lowest frequency
> - [2]: frequency above the lowest frequency, but below the maximum
>   frequency of 2.8GHz

Yes, with writing scaling_min_freq to scaling_max_freq, I can see the
frequency was lower than min freq (1000000KHz) occasionally.

>
> Would it be possible to do the following:
>
> # Shut down all the CPUs except the last one just to have less logs
> for i in /sys/devices/system/cpu/cpu[0-9]* ; do echo 0 > $i/online ; done
>
> cd /sys/kernel/debug/tracing
>
> # Add a kprobe to cppc_cpufreq_get_rate to get the computed freq
> echo 'r:myretprobe cppc_cpufreq_get_rate $retval:u32' >>
> /sys/kernel/debug/tracing/kprobe_events
> echo 1 > events/kprobes/enable
>
> # Setup ftrace to trace cppc_cpufreq_get_rate() calls
> # (and maybe see if something undesired happens in the call)
> echo function_graph > current_tracer
> echo funcgraph-abstime > trace_options
> echo cppc_cpufreq_get_rate > set_graph_function
>
> # Do the tracing
> echo 1 > tracing_on
> # Wait a bit for a call to cppc_cpufreq_get_rate() to happen
> echo 0 > tracing_on
>
> Also adding the following in cppc_perf_from_fbctrs() should allow
> tracking the counter values:
> trace_printk("get_cntrs: ref0=%lx ref1=%lx del0=%lx del1=%lx",
>     fb_ctrs_t0->reference, fb_ctrs_t1->reference,
>     fb_ctrs_t0->delivered, fb_ctrs_t1->delivered);

Sorry I forgot to ask whether I shall run the tracing for the lower than
min frequency case or not? Since the latency of reading cpc is quite
small (sub microsecond), so it is quite sensitive to extra
latency/delay. It seems like the overhead of tracing is not negligible.

I got some data with higher than max frequency.

Good:
 159.077799 |     0)               |  /* get_cntrs: ref0=1513476688
ref1=1513482268 del0=329890515 del1=329905431 */
 159.607913 |     0)               |  /* get_cntrs: ref0=1137490668
ref1=1137496188 del0=3198347141 del1=3198362993 */

Bad:
4965.786566 |     0)               |  /* get_cntrs: ref0=766025872
ref1=766034192 del0=2288598970 del1=2288624170 */
5251.933966 |     0)               |  /* get_cntrs: ref0=2472241696
ref1=2472250376 del0=1156772954 del1=1156799208 */

Hopefully this is helpful.

>
> Regards,
> Pierre

2023-04-19 09:48:16

by Pierre Gondois

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error



On 4/18/23 20:24, Yang Shi wrote:
>
>
> On 4/11/23 1:51 AM, Pierre Gondois wrote:
>> Hello Yang,
>>
>>>>>>
>>>>>> -
>>>>>> According to the same ACPI 6.5 s8.4.6.1.2.5 "Time Window Register"
>>>>>> paragraph,
>>>>>> and assuming we are in the 'Autonomous Selection is not enabled'
>>>>>> case,
>>>>>> the OS is supposed to write (not read) the delta between successive
>>>>>> reads of the
>>>>>> counter, so using this field as is would be bending the definition I
>>>>>> think.
>>>>>>
>>>>>> -
>>>>>> It is correct that the "Time Window Register" field specifies a value
>>>>>> in ms,
>>>>>> but it seems a long time to wait for with irqs off.
>>>>>
>>>>> AFAIK, our platforms don't support "Time Window Register".
>>>>>
>>>>>>
>>>>>> -
>>>>>> Theoretically, the perf/ref counters should accumulate to allow
>>>>>> computing
>>>>>> a correct frequency. Is it possible to know how these counters are
>>>>>> accessed ?
>>>>>> Is it through PCC channels and there is some undesired delay between
>>>>>> the
>>>>>> reads of the perf/ref counters ?
>>>>>
>>>>> The counters are implemented via mmio instead of PCC channels. So the
>>>>> cpc_read() calls should go to ACPI_ADR_SPACE_SYSTEM_MEMORY IIRC.
>>>>>
>>>>>>
>>>>>> -
>>>>>> About making the delay:
>>>>>>       max(cppc_cpufreq_get_transition_delay_us(), Time Winder
>>>>>> Register)
>>>>>> I think it would be good to know why the values of the counters
>>>>>> don't accumulate correctly, ideally by getting a trace where a
>>>>>> frequency
>>>>>> above the maximum frequency is computed, and with the timestamps at
>>>>>> which
>>>>>> the counters are read.
>>>>>> If the values are coming from PCC channels / the firmware, it
>>>>>> might be
>>>>>> difficult
>>>>>> to get.
>>>>>
>>>>> I wrote a bpftrace script to trace the below data:
>>>>>        - The CPU number
>>>>>        - The frequency
>>>>>        - The start and end timestamp of the first cppc_get_perf_ctrs()
>>>>> call
>>>>>        - The duration/latency of the first cppc_get_perf_ctrs() call
>>>>>        - The start and end timestamp of the second
>>>>> cppc_get_perf_ctrs() call
>>>>>        - The duration/latency of the second cppc_get_perf_ctrs() call
>>>>>
>>>>> The typical logs look like below.
>>>>> Good
>>>>> CPU: 1
>>>>> Freq: 2801485KHz
>>>>> First:  2489382384  2489387084 4700ns
>>>>> Second: 2489390824  2489394024  3200ns
>>>>> --------------------------------------------------
>>>>> CPU:    2
>>>>> Freq:   2797956KHz
>>>>> First:  2490406524  2490411204  4680ns
>>>>> Second: 2490414764  2490417684  2920ns
>>>>>
>>>>> Bad:
>>>>> CPU:    55
>>>>> Freq:   3969372KHz
>>>>> First:  875659868  875721568  61700ns
>>>>> Second: 875725148  875727708  2560ns
>>>>> --------------------------------------------------
>>>>> CPU: 65
>>>>> Freq: 3829744KHz
>>>>> First:  3854951136  3854995896 44760ns
>>>>> Second: 3854999416  3855002696 3280ns
>>>>> --------------------------------------------------
>>>>> CPU: 21
>>>>> Freq: 4279242KHz
>>>>> First:  240834204  240910484 76280ns
>>>>> Second: 240914264  240916944  2680ns
>>>>>
>>>>>
>>>>> The first line is cpu number, the second line is frequency returned by
>>>>> cppc_cpufreq_get_rate(), the third line is the start and end
>>>>> timestamps
>>>>> and duration of the first cppc_get_perf_ctrs(), the fourth line is the
>>>>> start and end timestamps and duration of the second
>>>>> cppc_get_perf_ctrs().
>>>>>
>>>>> So per the log I think we can tell basically the longer the
>>>>> duration the
>>>>> higher the error. The 2us delay is not long enough to offset the
>>>>> impact
>>>>> from unexpected latency of reading the counters.
>>>>>
>>>>> In the worst case the frequency is 4279242KHz, comparing 2800000KHz
>>>>> the
>>>>> error is over 50%. So the delay should be 4ms ~ 5ms in order to offset
>>>>> the impact from reading the counters if I do the math correctly.
>>>>>
>>>>> Hope the trace data is helpful to diagnose the problem.
>>>>
>>>>
>>>> Thanks for the data. I was thinking the following was happening:
>>>>
>>>>   cppc_get_perf_ctrs()[0] cppc_get_perf_ctrs()[1]
>>>> /                    \ /                         \
>>>> ref[0]    delivered[0]                    ref[1] delivered[1]
>>>>    |            | |                  |
>>>>    v            v v                  v
>>>> ---------------------------------------------------------------------->
>>>> time
>>>>     <-delta[0]-> <-------------2us------------> <----delta[1]---->
>>>>
>>>> If delta[0] is really different from delta[1] like above, then the
>>>> reference and delivered counters would have accumulated during
>>>> different
>>>> intervals, resulting in a wrong frequency.
>>>
>>> Yeah, it looks like so.
>>>
>>>> If more/less than 2us elapse between the two cppc_get_perf_ctrs()
>>>> calls,
>>>> then it shouldn't have any impact. So waiting ~10ms should
>>>> theoretically
>>>> not solve the issue.
>>>
>>> I'm not sure whether the 10ms delay really resolved the issue, but it
>>> did reduce the magnitude of the error.
>>>
>>> BTW, I don't see irq is disabled when reading cpuinfo_cur_freq, so it
>>> looks like interrupts could easily result in the difference between
>>> delta[0] and delta[1]. And it seems like the difference matters.
>>
>> Ok, maybe disabling irqs would have an impact ?
>
> Yes, I ran some tracing to try to capture what happened when we saw high
> error. And it seems like IRQs is one of the contributing factors.
> Interconnect congestion is also a contributing factor. The below trace
> data supported the conclusion.
>
> 4081929KHz
> # tracer: function_graph
> #
> #     TIME        CPU  DURATION                  FUNCTION CALLS
> #      |          |     |   |                     |   |   | |
> 13378.411863 |     0)               |  cppc_cpufreq_get_rate
> [cppc_cpufreq]() {
> 13378.411864 |     0)               |    cpufreq_cpu_get() {
> 13378.411865 |     0)               |      _raw_read_lock_irqsave() {
> 13378.411865 |     0)   0.680 us    | preempt_count_add();
> 13378.411867 |     0)   2.020 us    | }
> 13378.411868 |     0)               | _raw_read_unlock_irqrestore() {
> 13378.411868 |     0)   0.520 us    | preempt_count_sub();
> 13378.411869 |     0)   1.340 us    | }
> 13378.411869 |     0)   5.160 us    | }
> 13378.411870 |     0)   0.440 us    | cpufreq_cpu_put();
> 13378.411871 |     0)               |    cppc_get_perf_ctrs() {
> 13378.411872 |     0)               |      cpc_read.isra.0() {
> 13378.411874 |     0)               |        irq_enter_rcu() {
> 13378.411874 |     0)   0.460 us    | preempt_count_add();
> 13378.411875 |     0)   0.660 us    | irqtime_account_irq();
> 13378.411876 |     0)   2.480 us    | }
> 13378.411877 |     0)               |        do_interrupt_handler() {
> 13378.411877 |     0)   ==========> |
> 13378.411877 |     0) + 49.260 us   | gic_handle_irq();
> 13378.411877 |     0)   <========== |
> 13378.411928 |     0) + 51.000 us   | }
> 13378.411928 |     0)               |        irq_exit_rcu() {
> 13378.411929 |     0)   0.540 us    | irqtime_account_irq();
> 13378.411929 |     0)   0.420 us    | preempt_count_sub();
> 13378.411930 |     0)   0.640 us    | ksoftirqd_running();
> 13378.411931 |     0) + 16.680 us   | do_softirq_own_stack();
> 13378.411948 |     0)   0.440 us    | idle_cpu();
> 13378.411949 |     0) + 21.200 us   |        }
> 13378.411950 |     0) + 77.480 us   |      }
> 13378.411950 |     0)   0.740 us    |      cpc_read.isra.0();
> 13378.411951 |     0)   0.440 us    |      cpc_read.isra.0();
> 13378.411952 |     0)   0.660 us    |      cpc_read.isra.0();
> 13378.411953 |     0) + 81.980 us   |    }
> 13378.411954 |     0)               |    __delay() {
> 13378.411954 |     0)   0.580 us    | arch_timer_evtstrm_available();
> 13378.411956 |     0)   2.460 us    |    }
> 13378.411956 |     0)               |    cppc_get_perf_ctrs() {
> 13378.411957 |     0)   0.540 us    | cpc_read.isra.0();
> 13378.411958 |     0)   0.520 us    | cpc_read.isra.0();
> 13378.411959 |     0)   0.420 us    | cpc_read.isra.0();
> 13378.411959 |     0)   0.440 us    |      cpc_read.isra.0();
>
> The above trace data captured IRQ when cppc_cpufreq_get_rate() returns
> 4081929KHz.
>
> 3052367KHz
> # tracer: function_graph
> #
> #     TIME        CPU  DURATION                  FUNCTION CALLS
> #      |          |     |   |                     |   |   | |
> 19177.226497 |     0)               |  cppc_cpufreq_get_rate
> [cppc_cpufreq]() {
> 19177.226498 |     0)               |    cpufreq_cpu_get() {
> 19177.226499 |     0)               |      _raw_read_lock_irqsave() {
> 19177.226499 |     0)   0.600 us    |        preempt_count_add();
> 19177.226500 |     0)   1.820 us    |      }
> 19177.226501 |     0)               | _raw_read_unlock_irqrestore() {
> 19177.226502 |     0)   0.480 us    |        preempt_count_sub();
> 19177.226502 |     0)   1.260 us    |      }
> 19177.226503 |     0)   4.780 us    |    }
> 19177.226503 |     0)   0.420 us    |    cpufreq_cpu_put();
> 19177.226504 |     0)               |    cppc_get_perf_ctrs() {
> 19177.226505 |     0)   0.720 us    |      cpc_read.isra.0();
> 19177.226506 |     0)   1.180 us    |      cpc_read.isra.0();
> 19177.226508 |     0)   0.400 us    |      cpc_read.isra.0();
> 19177.226508 |     0)   0.600 us    |      cpc_read.isra.0();
> 19177.226509 |     0)   5.240 us    |    }
> 19177.226510 |     0)               |    __delay() {
> 19177.226510 |     0)   0.520 us    | arch_timer_evtstrm_available();
> 19177.226512 |     0)   2.420 us    |    }
> 19177.226513 |     0)               |    cppc_get_perf_ctrs() {
> 19177.226513 |     0)   0.540 us    |      cpc_read.isra.0();
> 19177.226514 |     0)   0.540 us    |      cpc_read.isra.0();
> 19177.226515 |     0)   0.400 us    |      cpc_read.isra.0();
> 19177.226516 |     0)   0.440 us    |      cpc_read.isra.0();
> 19177.226516 |     0)   3.660 us    |    }
> 19177.226517 |     0)   0.780 us    |    cppc_cpufreq_perf_to_khz
> [cppc_cpufreq]();
> 19177.226518 |     0) + 23.320 us   |  }


In cppc_get_perf_ctrs(), the cpc registers are read in this order:
1- delivered
2- reference
3- ref_perf
4- ctr_wrap

In the following, cpc_read[first][2] represents the second call to cpc_read()
in the first cppc_get_perf_ctrs() call.

The duration between the 2 cpc_read() calls to get the 'delivered' register
should be:
= cppc_get_perf_ctrs[first]() - cpc_read[first][1]() + __delay() +
cpc_read[second][1]()
= 5.24 - 0.72 + 2.24 + 0.54 us
= 7.3 us

It is assumed here that in both cppc_get_perf_ctrs() calls:
- the duration between the entry cppc_get_perf_ctrs() and the the first call
to cpc_read() is identical
- the duration between each cpc_read() call is identical

Similarly for the 'reference' register:
= cppc_get_perf_ctrs[first]() - cpc_read[first][1]() - cpc_read[first][2]() +
__delay() + cpc_read[second][1]() + cpc_read[second][2]()
= 5.24 - 0.72 - 1.18 + 2.24 + 0.54 +0.54 us
= 6.66 us

So assuming the reference counter is at 1GHz and the delivered counter is at
2.8GHz, the computed_freq is:
= ref_cnt_freq * (delta_cnt_delivered() / delta_cnt_ref())
= 1GHz * ((2.8GHz * 7.3us) / (1GHz * 6.66us))
= 2.8 * (7.3 * 6.66) GHz
= 3.07 GHz

This is approximately what was obtained in the trace above (3.05GHz).

----

You say that the cause of this is a congestion in the interconnect. I don't
see a way to check that right now.
However your trace is on the CPU0, so maybe all the other cores were shutdown
in your test. If this is the case, do you think a congestion could happen with
only one CPU ?

Just 2 other comments:
a-
It might be interesting to change the order in which cpc registers are read
just to see if it has an impact, but even if it has, I m not sure how this
could be exploitable.
Just in case, I mean doing that, but I think that b. might be better to try.

diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
index c51d3ccb4cca..479b55006020 100644
--- a/drivers/acpi/cppc_acpi.c
+++ b/drivers/acpi/cppc_acpi.c
@@ -1350,8 +1350,8 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
}
}

- cpc_read(cpunum, delivered_reg, &delivered);
cpc_read(cpunum, reference_reg, &reference);
+ cpc_read(cpunum, delivered_reg, &delivered);
cpc_read(cpunum, ref_perf_reg, &ref_perf);

/*

b-
In the trace that you shared, the cpc_read() calls in the fist
cppc_get_perf_ctrs() calls seem to always take a bit more time than in the
second cppc_get_perf_ctrs() call.
Would it be possible to collect traces similar as above with 3 or 4 calls to
cppc_get_perf_ctrs() instead of 2 ? It would allow to check whether in the first
call, accessing the cpc registers takes more time than in the following calls,
due to cache misses or other reasons.
Ideally statistics on the result would be the best, or if you have a trace.dat
to share containing a trace with multiple cppc_cpufreq_get_rate() calls.

Example of code where we do 4 calls to cppc_get_perf_ctrs():

diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index 022e3555407c..6370f2f0bdad 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -853,6 +853,20 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)

udelay(2); /* 2usec delay between sampling */

+ ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
+ if (ret)
+ return ret;
+
+ udelay(2); /* 2usec delay between sampling */
+
+ /* Do a third call. */
+ ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
+ if (ret)
+ return ret;
+
+ udelay(2); /* 2usec delay between sampling */
+
+ /* Do a fourth call. */
ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
if (ret)
return ret;

>
> The above trace data shows some cpc reading took a little bit longer
> than usual. I suspected it was caused by interconnect congestion.
>
>
> So it looks like IRQ is the major contributing factor of high error
> (4xxxxxx KHz), interconnect congestion is the major contributing factor
> of low error (3xxxxxx KHz).
>
> So I did the below test:
> 1. Disable IRQ: The high errors were gone (> 3700000KHz), but low errors
> were still seen.
> 2.10us delay: The high errors were still seen.
> 3. Disable IRQ + 10us delay: all the errors were gone.
>
> I think the test result also supports the tracing data.
>
>
> I also got some confusion about calling cppc_cpufreq_get_rate() with irq
> disabled. Rafael thought 10ms delay is too long because the function may
> be called with irq disabled. But a deeper look at the function shows it
> should *NOT* be called with irq disabled at all.
>
> First, if pcc channel is used, cpc reading may take over 100ms, it is
> way larger the proposed 10ms delay.
> Second, reading from cpc channel needs to take a semaphore, so it may
> sleep. But sleep with IRQ disabled is not allowed.

Yes right, however the semaphore is not taken in between the sequence of
cpc_read() calls in cppc_get_perf_ctrs(). So maybe the change below should
be acceptable:

diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
index c51d3ccb4cca..105a7e2ffffa 100644
--- a/drivers/acpi/cppc_acpi.c
+++ b/drivers/acpi/cppc_acpi.c
@@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
struct cppc_pcc_data *pcc_ss_data = NULL;
u64 delivered, reference, ref_perf, ctr_wrap_time;
int ret = 0, regs_in_pcc = 0;
+ unsigned long flags;

if (!cpc_desc) {
pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
@@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
}
}

+ local_irq_save(flags);
+
cpc_read(cpunum, delivered_reg, &delivered);
cpc_read(cpunum, reference_reg, &reference);
cpc_read(cpunum, ref_perf_reg, &ref_perf);

+ local_irq_restore(flags);
+
/*
* Per spec, if ctr_wrap_time optional register is unsupported, then the
* performance counters are assumed to never wrap during the lifetime of

> Third, if the counters are implemented by AMU, cpc_read_ffh() needs to
> send IPI so it requires IRQ enabled.

If I'm not mistaken, the CPU calling cpc_read_ffh() might have IRQs disabled,
it should not prevent it to send IPIs no ?

>
>>
>>>
>>> And the counters are accessed through an interconnect on our platform,
>>> so the interconnect congestion may result in the difference as well.
>>>
>>>>
>>>> freq = ref_freq * (delivered[1] - delivered[0]) / (ref[1] - ref[0])
>>>>
>>>> If the counters are accessed through mmio, I don't see anything that
>>>> would
>>>> make delta[x] vary when calling cppc_get_perf_ctrs(), cf. cpc_read().
>>>> Do you know if the address represents real counters or a place in
>>>> memory
>>>> representing something else ?
>>>
>>> The address does represent real counters.
>>
>> Oh ok, is it possible to know what is there ?
>
> AFAIK, it is memory mapped access to the core's AMU counters.
>
>>
>>>
>>>>
>>>> Would it be possible to try setting the CPU frequency to one unique
>>>> value
>>>> and get a serie of values like:
>>>> [timestamp, ref_counter_value, deliverd_counter_value]
>>>
>>> Could you please elaborate regarding "setting the CPU frequency to one
>>> unique value"? What value is unique?
>>
>> I meant having the CPUs using only on frequency. The following should
>> work:
>> cat /sys/devices/system/cpu/cpu[X]/cpufreq/scaling_min_freq >
>> /sys/devices/system/cpu/cpu[X]/cpufreq/scaling_max_freq
>>
>>>
>>>>
>>>> This would allow to check that the counters are accumulating at a valid
>>>> pace. Also you said there were frequencies above the maximum value, but
>>>> are there also frequencies below the minimum value ?
>>>
>>> I've never seen the frequency below the minimum value.
>>
>> Maybe this is because the CPUs are running at their maximum frequency,
>> so when the computed frequency is below the actual running frequency,
>> it still doesn't go below the minimum frequency.
>> Meaning that if the CPUs were limited to their lowest frequency (with
>> the command above), maybe the get_rate() function would return values
>> in a range [[1]:[2]]:
>> - [1]: frequency below the lowest frequency
>> - [2]: frequency above the lowest frequency, but below the maximum
>>   frequency of 2.8GHz
>
> Yes, with writing scaling_min_freq to scaling_max_freq, I can see the
> frequency was lower than min freq (1000000KHz) occasionally.
>
>>
>> Would it be possible to do the following:
>>
>> # Shut down all the CPUs except the last one just to have less logs
>> for i in /sys/devices/system/cpu/cpu[0-9]* ; do echo 0 > $i/online ; done
>>
>> cd /sys/kernel/debug/tracing
>>
>> # Add a kprobe to cppc_cpufreq_get_rate to get the computed freq
>> echo 'r:myretprobe cppc_cpufreq_get_rate $retval:u32' >>
>> /sys/kernel/debug/tracing/kprobe_events
>> echo 1 > events/kprobes/enable
>>
>> # Setup ftrace to trace cppc_cpufreq_get_rate() calls
>> # (and maybe see if something undesired happens in the call)
>> echo function_graph > current_tracer
>> echo funcgraph-abstime > trace_options
>> echo cppc_cpufreq_get_rate > set_graph_function
>>
>> # Do the tracing
>> echo 1 > tracing_on
>> # Wait a bit for a call to cppc_cpufreq_get_rate() to happen
>> echo 0 > tracing_on
>>
>> Also adding the following in cppc_perf_from_fbctrs() should allow
>> tracking the counter values:
>> trace_printk("get_cntrs: ref0=%lx ref1=%lx del0=%lx del1=%lx",
>>     fb_ctrs_t0->reference, fb_ctrs_t1->reference,
>>     fb_ctrs_t0->delivered, fb_ctrs_t1->delivered);
>
> Sorry I forgot to ask whether I shall run the tracing for the lower than
> min frequency case or not? Since the latency of reading cpc is quite
> small (sub microsecond), so it is quite sensitive to extra
> latency/delay. It seems like the overhead of tracing is not negligible.
>
> I got some data with higher than max frequency.
>
> Good:
>  159.077799 |     0)               |  /* get_cntrs: ref0=1513476688
> ref1=1513482268 del0=329890515 del1=329905431 */
>  159.607913 |     0)               |  /* get_cntrs: ref0=1137490668
> ref1=1137496188 del0=3198347141 del1=3198362993 */
>
> Bad:
> 4965.786566 |     0)               |  /* get_cntrs: ref0=766025872
> ref1=766034192 del0=2288598970 del1=2288624170 */
> 5251.933966 |     0)               |  /* get_cntrs: ref0=2472241696
> ref1=2472250376 del0=1156772954 del1=1156799208 */
>
> Hopefully this is helpful.

Yes, thanks a lot. Using the min or max frequency was not important IMO,
it was just to avoid having the possibility to have a CPU with a changing
frequency.
Just a comment, the first trace was really helpfull, but I think it is
possible to have a function graph and the counter values as above in
the same trace.


Thanks again for the experiments,
Regards,
Pierre

2023-04-19 17:00:27

by Pierre Gondois

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error

>
> Just 2 other comments:
> a-
> It might be interesting to change the order in which cpc registers are read
> just to see if it has an impact, but even if it has, I m not sure how this
> could be exploitable.
> Just in case, I mean doing that, but I think that b. might be better to try.
>
> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
> index c51d3ccb4cca..479b55006020 100644
> --- a/drivers/acpi/cppc_acpi.c
> +++ b/drivers/acpi/cppc_acpi.c
> @@ -1350,8 +1350,8 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
> }
> }
>
> - cpc_read(cpunum, delivered_reg, &delivered);
> cpc_read(cpunum, reference_reg, &reference);
> + cpc_read(cpunum, delivered_reg, &delivered);
> cpc_read(cpunum, ref_perf_reg, &ref_perf);
>
> /*
>
> b-
> In the trace that you shared, the cpc_read() calls in the fist
> cppc_get_perf_ctrs() calls seem to always take a bit more time than in the
> second cppc_get_perf_ctrs() call.
> Would it be possible to collect traces similar as above with 3 or 4 calls to
> cppc_get_perf_ctrs() instead of 2 ? It would allow to check whether in the first
> call, accessing the cpc registers takes more time than in the following calls,
> due to cache misses or other reasons.
> Ideally statistics on the result would be the best, or if you have a trace.dat
> to share containing a trace with multiple cppc_cpufreq_get_rate() calls.
>
> Example of code where we do 4 calls to cppc_get_perf_ctrs():
>
> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> index 022e3555407c..6370f2f0bdad 100644
> --- a/drivers/cpufreq/cppc_cpufreq.c
> +++ b/drivers/cpufreq/cppc_cpufreq.c
> @@ -853,6 +853,20 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
>
> udelay(2); /* 2usec delay between sampling */
>
> + ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
> + if (ret)
> + return ret;
> +
> + udelay(2); /* 2usec delay between sampling */
> +
> + /* Do a third call. */
> + ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
> + if (ret)
> + return ret;
> +
> + udelay(2); /* 2usec delay between sampling */
> +
> + /* Do a fourth call. */
> ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
> if (ret)
> return ret;

And also, if the cpc_read() calls in the third/fourth call are actually faster,
would it be possible to check whether the computed frequency is more accurate
(i.e. no over/undershoot) ?

2023-04-19 20:52:56

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error



On 4/19/23 2:45 AM, Pierre Gondois wrote:
>
>
> On 4/18/23 20:24, Yang Shi wrote:
>>
>>
>> On 4/11/23 1:51 AM, Pierre Gondois wrote:
>>> Hello Yang,
>>>
>>>>>>>
>>>>>>> -
>>>>>>> According to the same ACPI 6.5 s8.4.6.1.2.5 "Time Window Register"
>>>>>>> paragraph,
>>>>>>> and assuming we are in the 'Autonomous Selection is not enabled'
>>>>>>> case,
>>>>>>> the OS is supposed to write (not read) the delta between successive
>>>>>>> reads of the
>>>>>>> counter, so using this field as is would be bending the
>>>>>>> definition I
>>>>>>> think.
>>>>>>>
>>>>>>> -
>>>>>>> It is correct that the "Time Window Register" field specifies a
>>>>>>> value
>>>>>>> in ms,
>>>>>>> but it seems a long time to wait for with irqs off.
>>>>>>
>>>>>> AFAIK, our platforms don't support "Time Window Register".
>>>>>>
>>>>>>>
>>>>>>> -
>>>>>>> Theoretically, the perf/ref counters should accumulate to allow
>>>>>>> computing
>>>>>>> a correct frequency. Is it possible to know how these counters are
>>>>>>> accessed ?
>>>>>>> Is it through PCC channels and there is some undesired delay
>>>>>>> between
>>>>>>> the
>>>>>>> reads of the perf/ref counters ?
>>>>>>
>>>>>> The counters are implemented via mmio instead of PCC channels. So
>>>>>> the
>>>>>> cpc_read() calls should go to ACPI_ADR_SPACE_SYSTEM_MEMORY IIRC.
>>>>>>
>>>>>>>
>>>>>>> -
>>>>>>> About making the delay:
>>>>>>>        max(cppc_cpufreq_get_transition_delay_us(), Time Winder
>>>>>>> Register)
>>>>>>> I think it would be good to know why the values of the counters
>>>>>>> don't accumulate correctly, ideally by getting a trace where a
>>>>>>> frequency
>>>>>>> above the maximum frequency is computed, and with the timestamps at
>>>>>>> which
>>>>>>> the counters are read.
>>>>>>> If the values are coming from PCC channels / the firmware, it
>>>>>>> might be
>>>>>>> difficult
>>>>>>> to get.
>>>>>>
>>>>>> I wrote a bpftrace script to trace the below data:
>>>>>>         - The CPU number
>>>>>>         - The frequency
>>>>>>         - The start and end timestamp of the first
>>>>>> cppc_get_perf_ctrs()
>>>>>> call
>>>>>>         - The duration/latency of the first cppc_get_perf_ctrs()
>>>>>> call
>>>>>>         - The start and end timestamp of the second
>>>>>> cppc_get_perf_ctrs() call
>>>>>>         - The duration/latency of the second cppc_get_perf_ctrs()
>>>>>> call
>>>>>>
>>>>>> The typical logs look like below.
>>>>>> Good
>>>>>> CPU: 1
>>>>>> Freq: 2801485KHz
>>>>>> First:  2489382384  2489387084 4700ns
>>>>>> Second: 2489390824  2489394024  3200ns
>>>>>> --------------------------------------------------
>>>>>> CPU:    2
>>>>>> Freq:   2797956KHz
>>>>>> First:  2490406524  2490411204  4680ns
>>>>>> Second: 2490414764  2490417684  2920ns
>>>>>>
>>>>>> Bad:
>>>>>> CPU:    55
>>>>>> Freq:   3969372KHz
>>>>>> First:  875659868  875721568  61700ns
>>>>>> Second: 875725148  875727708  2560ns
>>>>>> --------------------------------------------------
>>>>>> CPU: 65
>>>>>> Freq: 3829744KHz
>>>>>> First:  3854951136  3854995896 44760ns
>>>>>> Second: 3854999416  3855002696 3280ns
>>>>>> --------------------------------------------------
>>>>>> CPU: 21
>>>>>> Freq: 4279242KHz
>>>>>> First:  240834204  240910484 76280ns
>>>>>> Second: 240914264  240916944  2680ns
>>>>>>
>>>>>>
>>>>>> The first line is cpu number, the second line is frequency
>>>>>> returned by
>>>>>> cppc_cpufreq_get_rate(), the third line is the start and end
>>>>>> timestamps
>>>>>> and duration of the first cppc_get_perf_ctrs(), the fourth line
>>>>>> is the
>>>>>> start and end timestamps and duration of the second
>>>>>> cppc_get_perf_ctrs().
>>>>>>
>>>>>> So per the log I think we can tell basically the longer the
>>>>>> duration the
>>>>>> higher the error. The 2us delay is not long enough to offset the
>>>>>> impact
>>>>>> from unexpected latency of reading the counters.
>>>>>>
>>>>>> In the worst case the frequency is 4279242KHz, comparing 2800000KHz
>>>>>> the
>>>>>> error is over 50%. So the delay should be 4ms ~ 5ms in order to
>>>>>> offset
>>>>>> the impact from reading the counters if I do the math correctly.
>>>>>>
>>>>>> Hope the trace data is helpful to diagnose the problem.
>>>>>
>>>>>
>>>>> Thanks for the data. I was thinking the following was happening:
>>>>>
>>>>>    cppc_get_perf_ctrs()[0] cppc_get_perf_ctrs()[1]
>>>>> /                    \ /                         \
>>>>> ref[0]    delivered[0]                    ref[1] delivered[1]
>>>>>     |            | |                  |
>>>>>     v            v v                  v
>>>>> ---------------------------------------------------------------------->
>>>>>
>>>>> time
>>>>>      <-delta[0]-> <-------------2us------------> <----delta[1]---->
>>>>>
>>>>> If delta[0] is really different from delta[1] like above, then the
>>>>> reference and delivered counters would have accumulated during
>>>>> different
>>>>> intervals, resulting in a wrong frequency.
>>>>
>>>> Yeah, it looks like so.
>>>>
>>>>> If more/less than 2us elapse between the two cppc_get_perf_ctrs()
>>>>> calls,
>>>>> then it shouldn't have any impact. So waiting ~10ms should
>>>>> theoretically
>>>>> not solve the issue.
>>>>
>>>> I'm not sure whether the 10ms delay really resolved the issue, but it
>>>> did reduce the magnitude of the error.
>>>>
>>>> BTW, I don't see irq is disabled when reading cpuinfo_cur_freq, so it
>>>> looks like interrupts could easily result in the difference between
>>>> delta[0] and delta[1]. And it seems like the difference matters.
>>>
>>> Ok, maybe disabling irqs would have an impact ?
>>
>> Yes, I ran some tracing to try to capture what happened when we saw high
>> error. And it seems like IRQs is one of the contributing factors.
>> Interconnect congestion is also a contributing factor. The below trace
>> data supported the conclusion.
>>
>> 4081929KHz
>> # tracer: function_graph
>> #
>> #     TIME        CPU  DURATION                  FUNCTION CALLS
>> #      |          |     |   |                     |   |   | |
>> 13378.411863 |     0)               |  cppc_cpufreq_get_rate
>> [cppc_cpufreq]() {
>> 13378.411864 |     0)               |    cpufreq_cpu_get() {
>> 13378.411865 |     0)               | _raw_read_lock_irqsave() {
>> 13378.411865 |     0)   0.680 us    | preempt_count_add();
>> 13378.411867 |     0)   2.020 us    | }
>> 13378.411868 |     0)               | _raw_read_unlock_irqrestore() {
>> 13378.411868 |     0)   0.520 us    | preempt_count_sub();
>> 13378.411869 |     0)   1.340 us    | }
>> 13378.411869 |     0)   5.160 us    | }
>> 13378.411870 |     0)   0.440 us    | cpufreq_cpu_put();
>> 13378.411871 |     0)               |    cppc_get_perf_ctrs() {
>> 13378.411872 |     0)               |      cpc_read.isra.0() {
>> 13378.411874 |     0)               |        irq_enter_rcu() {
>> 13378.411874 |     0)   0.460 us    | preempt_count_add();
>> 13378.411875 |     0)   0.660 us    | irqtime_account_irq();
>> 13378.411876 |     0)   2.480 us    | }
>> 13378.411877 |     0)               | do_interrupt_handler() {
>> 13378.411877 |     0)   ==========> |
>> 13378.411877 |     0) + 49.260 us   | gic_handle_irq();
>> 13378.411877 |     0)   <========== |
>> 13378.411928 |     0) + 51.000 us   | }
>> 13378.411928 |     0)               |        irq_exit_rcu() {
>> 13378.411929 |     0)   0.540 us    | irqtime_account_irq();
>> 13378.411929 |     0)   0.420 us    | preempt_count_sub();
>> 13378.411930 |     0)   0.640 us    | ksoftirqd_running();
>> 13378.411931 |     0) + 16.680 us   | do_softirq_own_stack();
>> 13378.411948 |     0)   0.440 us    | idle_cpu();
>> 13378.411949 |     0) + 21.200 us   |        }
>> 13378.411950 |     0) + 77.480 us   |      }
>> 13378.411950 |     0)   0.740 us    |      cpc_read.isra.0();
>> 13378.411951 |     0)   0.440 us    |      cpc_read.isra.0();
>> 13378.411952 |     0)   0.660 us    |      cpc_read.isra.0();
>> 13378.411953 |     0) + 81.980 us   |    }
>> 13378.411954 |     0)               |    __delay() {
>> 13378.411954 |     0)   0.580 us    | arch_timer_evtstrm_available();
>> 13378.411956 |     0)   2.460 us    |    }
>> 13378.411956 |     0)               |    cppc_get_perf_ctrs() {
>> 13378.411957 |     0)   0.540 us    | cpc_read.isra.0();
>> 13378.411958 |     0)   0.520 us    | cpc_read.isra.0();
>> 13378.411959 |     0)   0.420 us    | cpc_read.isra.0();
>> 13378.411959 |     0)   0.440 us    |      cpc_read.isra.0();
>>
>> The above trace data captured IRQ when cppc_cpufreq_get_rate() returns
>> 4081929KHz.
>>
>> 3052367KHz
>> # tracer: function_graph
>> #
>> #     TIME        CPU  DURATION                  FUNCTION CALLS
>> #      |          |     |   |                     |   |   | |
>> 19177.226497 |     0)               |  cppc_cpufreq_get_rate
>> [cppc_cpufreq]() {
>> 19177.226498 |     0)               |    cpufreq_cpu_get() {
>> 19177.226499 |     0)               | _raw_read_lock_irqsave() {
>> 19177.226499 |     0)   0.600 us    | preempt_count_add();
>> 19177.226500 |     0)   1.820 us    |      }
>> 19177.226501 |     0)               | _raw_read_unlock_irqrestore() {
>> 19177.226502 |     0)   0.480 us    | preempt_count_sub();
>> 19177.226502 |     0)   1.260 us    |      }
>> 19177.226503 |     0)   4.780 us    |    }
>> 19177.226503 |     0)   0.420 us    |    cpufreq_cpu_put();
>> 19177.226504 |     0)               |    cppc_get_perf_ctrs() {
>> 19177.226505 |     0)   0.720 us    |      cpc_read.isra.0();
>> 19177.226506 |     0)   1.180 us    |      cpc_read.isra.0();
>> 19177.226508 |     0)   0.400 us    |      cpc_read.isra.0();
>> 19177.226508 |     0)   0.600 us    |      cpc_read.isra.0();
>> 19177.226509 |     0)   5.240 us    |    }
>> 19177.226510 |     0)               |    __delay() {
>> 19177.226510 |     0)   0.520 us    | arch_timer_evtstrm_available();
>> 19177.226512 |     0)   2.420 us    |    }
>> 19177.226513 |     0)               |    cppc_get_perf_ctrs() {
>> 19177.226513 |     0)   0.540 us    |      cpc_read.isra.0();
>> 19177.226514 |     0)   0.540 us    |      cpc_read.isra.0();
>> 19177.226515 |     0)   0.400 us    |      cpc_read.isra.0();
>> 19177.226516 |     0)   0.440 us    |      cpc_read.isra.0();
>> 19177.226516 |     0)   3.660 us    |    }
>> 19177.226517 |     0)   0.780 us    | cppc_cpufreq_perf_to_khz
>> [cppc_cpufreq]();
>> 19177.226518 |     0) + 23.320 us   |  }
>
>
> In cppc_get_perf_ctrs(), the cpc registers are read in this order:
> 1- delivered
> 2- reference
> 3- ref_perf
> 4- ctr_wrap
>
> In the following, cpc_read[first][2] represents the second call to
> cpc_read()
> in the first cppc_get_perf_ctrs() call.
>
> The duration between the 2 cpc_read() calls to get the 'delivered'
> register
> should be:
> = cppc_get_perf_ctrs[first]() - cpc_read[first][1]() + __delay() +
>   cpc_read[second][1]()
> = 5.24 - 0.72 + 2.24 + 0.54 us
> = 7.3 us
>
> It is assumed here that in both cppc_get_perf_ctrs() calls:
> - the duration between the entry cppc_get_perf_ctrs() and the the
> first call
>   to cpc_read() is identical
> - the duration between each cpc_read() call is identical
>
> Similarly for the 'reference' register:
> = cppc_get_perf_ctrs[first]() - cpc_read[first][1]() -
> cpc_read[first][2]() +
>   __delay() + cpc_read[second][1]() + cpc_read[second][2]()
> = 5.24 - 0.72 - 1.18 + 2.24 + 0.54 +0.54 us
> = 6.66 us
>
> So assuming the reference counter is at 1GHz and the delivered counter
> is at
> 2.8GHz, the computed_freq is:
> = ref_cnt_freq * (delta_cnt_delivered() / delta_cnt_ref())
> = 1GHz * ((2.8GHz * 7.3us) / (1GHz * 6.66us))
> = 2.8 * (7.3 * 6.66) GHz
> = 3.07 GHz
>
> This is approximately what was obtained in the trace above (3.05GHz).
>
> ----
>
> You say that the cause of this is a congestion in the interconnect. I
> don't
> see a way to check that right now.
> However your trace is on the CPU0, so maybe all the other cores were
> shutdown
> in your test. If this is the case, do you think a congestion could
> happen with
> only one CPU ?

No, other CPUs were not shut down in my test. I just ran "yes" on all
cores except CPU 0, then ran the reading freq script. Since all other
cores are busy, so the script should be always running on CPU 0.

Since the counters, memory and other devices are on the interconnect, so
the congestion may be caused by plenty of factors IIUC.

>
> Just 2 other comments:
> a-
> It might be interesting to change the order in which cpc registers are
> read
> just to see if it has an impact, but even if it has, I m not sure how
> this
> could be exploitable.
> Just in case, I mean doing that, but I think that b. might be better
> to try.
>
> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
> index c51d3ccb4cca..479b55006020 100644
> --- a/drivers/acpi/cppc_acpi.c
> +++ b/drivers/acpi/cppc_acpi.c
> @@ -1350,8 +1350,8 @@ int cppc_get_perf_ctrs(int cpunum, struct
> cppc_perf_fb_ctrs *perf_fb_ctrs)
>                 }
>         }
>
> -       cpc_read(cpunum, delivered_reg, &delivered);
>         cpc_read(cpunum, reference_reg, &reference);
> +       cpc_read(cpunum, delivered_reg, &delivered);
>         cpc_read(cpunum, ref_perf_reg, &ref_perf);
>
>         /*
>
> b-
> In the trace that you shared, the cpc_read() calls in the fist
> cppc_get_perf_ctrs() calls seem to always take a bit more time than in
> the
> second cppc_get_perf_ctrs() call.
> Would it be possible to collect traces similar as above with 3 or 4
> calls to
> cppc_get_perf_ctrs() instead of 2 ? It would allow to check whether in
> the first
> call, accessing the cpc registers takes more time than in the
> following calls,
> due to cache misses or other reasons.

Cache miss? The counters should be not cached and reading the counters
should not hit cache IIUC.

> Ideally statistics on the result would be the best, or if you have a
> trace.dat
> to share containing a trace with multiple cppc_cpufreq_get_rate() calls.

Tried option b, I managed to get histogram:

@hist_first_ns[cat]:
[4K, 8K)          112321
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)            212
|                                                    |
[16K, 32K)            25
|                                                    |
[32K, 64K)            59
|                                                    |
[64K, 128K)            6
|                                                    |
[128K, 256K)           9
|                                                    |

@hist_second_ns[cat]:
[2K, 4K)          112590
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)               4
|                                                    |
[8K, 16K)              0
|                                                    |
[16K, 32K)            15
|                                                    |
[32K, 64K)            18
|                                                    |
[64K, 128K)            1
|                                                    |
[128K, 256K)           4
|                                                    |

The "first" means the first cppc_get_perf_ctrs() call. But the bpftrace
script can't tell the second, the third and the fourth, so all them are
shown as "second". Anyway it seems fine. We can tell the first read took
longer than the later ones for the most time.

And a typical func_graph trace shows:

# tracer: function_graph
#
#     TIME        CPU  DURATION                  FUNCTION CALLS
#      |          |     |   |                     |   |   | |
 4447.171333 |     0)               |  cppc_cpufreq_get_rate
[cppc_cpufreq]() {
 4447.171334 |     0)               |    cpufreq_cpu_get() {
 4447.171334 |     0)   1.060 us    | _raw_read_lock_irqsave();
 4447.171336 |     0)   0.560 us    | _raw_read_unlock_irqrestore();
 4447.171337 |     0)   3.480 us    | }
 4447.171338 |     0)   0.400 us    | cpufreq_cpu_put();
 4447.171338 |     0)               |    cppc_get_perf_ctrs() {
 4447.171339 |     0)   0.720 us    | cpc_read.isra.0();
 4447.171341 |     0)   0.700 us    | cpc_read.isra.0();
 4447.171342 |     0)   0.380 us    | cpc_read.isra.0();
 4447.171342 |     0)   0.600 us    | cpc_read.isra.0();
 4447.171343 |     0)   4.900 us    | }
 4447.171344 |     0)               |    __delay() {
 4447.171344 |     0)   0.540 us    | arch_timer_evtstrm_available();
 4447.171346 |     0)   2.420 us    | }
 4447.171347 |     0)               |    cppc_get_perf_ctrs() {
 4447.171347 |     0)   0.540 us    | cpc_read.isra.0();
 4447.171348 |     0)   0.520 us    | cpc_read.isra.0();
 4447.171349 |     0)   0.400 us    | cpc_read.isra.0();
 4447.171350 |     0)   0.440 us    | cpc_read.isra.0();
 4447.171350 |     0)   3.660 us    | }
 4447.171351 |     0)               |    __delay() {
 4447.171351 |     0)   0.400 us    | arch_timer_evtstrm_available();
 4447.171353 |     0)   2.400 us    | }
 4447.171353 |     0)               |    cppc_get_perf_ctrs() {
 4447.171354 |     0)   0.540 us    | cpc_read.isra.0();
 4447.171355 |     0)   0.540 us    | cpc_read.isra.0();
 4447.171356 |     0)   0.380 us    | cpc_read.isra.0();
 4447.171356 |     0)   0.420 us    | cpc_read.isra.0();
 4447.171357 |     0)   3.640 us    | }
 4447.171357 |     0)               |    __delay() {
 4447.171358 |     0)   0.380 us    | arch_timer_evtstrm_available();
 4447.171360 |     0)   2.380 us    |    }
 4447.171360 |     0)               |    cppc_get_perf_ctrs() {
 4447.171361 |     0)   0.520 us    |      cpc_read.isra.0();
 4447.171361 |     0)   0.520 us    |      cpc_read.isra.0();
 4447.171362 |     0)   0.400 us    |      cpc_read.isra.0();
 4447.171363 |     0)   0.440 us    |      cpc_read.isra.0();
 4447.171364 |     0)   3.640 us    |    }
 4447.171364 |     0)   0.520 us    |    cppc_cpufreq_perf_to_khz
[cppc_cpufreq]();
 4447.171365 |     0) + 34.240 us   |  }

It also shows the first reading typically took longer than the later
ones. The second, the third and the fourth actually took almost same time.

I also tried to read perf_fb_ctrs_t0 twice (so total 3 reads, 2 for t0,
1 for t1, 2us delay between each read), but I didn't see noticeable
improvement. 4 reads (2 for t0, 2 for t1) does show some noticeable
improvement.

>
> Example of code where we do 4 calls to cppc_get_perf_ctrs():
>
> diff --git a/drivers/cpufreq/cppc_cpufreq.c
> b/drivers/cpufreq/cppc_cpufreq.c
> index 022e3555407c..6370f2f0bdad 100644
> --- a/drivers/cpufreq/cppc_cpufreq.c
> +++ b/drivers/cpufreq/cppc_cpufreq.c
> @@ -853,6 +853,20 @@ static unsigned int
> cppc_cpufreq_get_rate(unsigned int cpu)
>
>         udelay(2); /* 2usec delay between sampling */
>
> +       ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
> +       if (ret)
> +               return ret;
> +
> +       udelay(2); /* 2usec delay between sampling */
> +
> +       /* Do a third call. */
> +       ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
> +       if (ret)
> +               return ret;
> +
> +       udelay(2); /* 2usec delay between sampling */
> +
> +       /* Do a fourth call. */
>         ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>         if (ret)
>                 return ret;
>
>>
>> The above trace data shows some cpc reading took a little bit longer
>> than usual. I suspected it was caused by interconnect congestion.
>>
>>
>> So it looks like IRQ is the major contributing factor of high error
>> (4xxxxxx KHz), interconnect congestion is the major contributing factor
>> of low error (3xxxxxx KHz).
>>
>> So I did the below test:
>> 1. Disable IRQ: The high errors were gone (> 3700000KHz), but low errors
>> were still seen.
>> 2.10us delay: The high errors were still seen.
>> 3. Disable IRQ + 10us delay: all the errors were gone.
>>
>> I think the test result also supports the tracing data.
>>
>>
>> I also got some confusion about calling cppc_cpufreq_get_rate() with irq
>> disabled. Rafael thought 10ms delay is too long because the function may
>> be called with irq disabled. But a deeper look at the function shows it
>> should *NOT* be called with irq disabled at all.
>>
>> First, if pcc channel is used, cpc reading may take over 100ms, it is
>> way larger the proposed 10ms delay.
>> Second, reading from cpc channel needs to take a semaphore, so it may
>> sleep. But sleep with IRQ disabled is not allowed.
>
> Yes right, however the semaphore is not taken in between the sequence of
> cpc_read() calls in cppc_get_perf_ctrs(). So maybe the change below
> should
> be acceptable:

Yeah, we should be able to find a smaller irq disable section.

>
> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
> index c51d3ccb4cca..105a7e2ffffa 100644
> --- a/drivers/acpi/cppc_acpi.c
> +++ b/drivers/acpi/cppc_acpi.c
> @@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct
> cppc_perf_fb_ctrs *perf_fb_ctrs)
>         struct cppc_pcc_data *pcc_ss_data = NULL;
>         u64 delivered, reference, ref_perf, ctr_wrap_time;
>         int ret = 0, regs_in_pcc = 0;
> +       unsigned long flags;
>
>         if (!cpc_desc) {
>                 pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
> @@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct
> cppc_perf_fb_ctrs *perf_fb_ctrs)
>                 }
>         }
>
> +       local_irq_save(flags);
> +
>         cpc_read(cpunum, delivered_reg, &delivered);
>         cpc_read(cpunum, reference_reg, &reference);
>         cpc_read(cpunum, ref_perf_reg, &ref_perf);
>
> +       local_irq_restore(flags);
> +

cpc_read_ffh() would return -EPERM if irq is disabled.

So, the irq disabling must happen for mmio only in cpc_read(), for example:

diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
index c51d3ccb4cca..f3c92d844074 100644
--- a/drivers/acpi/cppc_acpi.c
+++ b/drivers/acpi/cppc_acpi.c
@@ -982,6 +982,7 @@ static int cpc_read(int cpu, struct
cpc_register_resource *reg_res, u64 *val)
        void __iomem *vaddr = NULL;
        int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpu);
        struct cpc_reg *reg = &reg_res->cpc_entry.reg;
+       unsigned long flags;

        if (reg_res->type == ACPI_TYPE_INTEGER) {
                *val = reg_res->cpc_entry.int_value;
@@ -1015,6 +1016,7 @@ static int cpc_read(int cpu, struct
cpc_register_resource *reg_res, u64 *val)
                return
acpi_os_read_memory((acpi_physical_address)reg->address,
                                val, reg->bit_width);

+       local_irq_save(flags);
        switch (reg->bit_width) {
        case 8:
                *val = readb_relaxed(vaddr);
@@ -1029,10 +1031,12 @@ static int cpc_read(int cpu, struct
cpc_register_resource *reg_res, u64 *val)
                *val = readq_relaxed(vaddr);
                break;
        default:
+               local_irq_restore(flags);
                pr_debug("Error: Cannot read %u bit width from PCC for
ss: %d\n",
                         reg->bit_width, pcc_ss_id);
                return -EFAULT;
        }
+       local_irq_restore(flags);

        return 0;
 }

>         /*
>          * Per spec, if ctr_wrap_time optional register is
> unsupported, then the
>          * performance counters are assumed to never wrap during the
> lifetime of
>
>> Third, if the counters are implemented by AMU, cpc_read_ffh() needs to
>> send IPI so it requires IRQ enabled.
>
> If I'm not mistaken, the CPU calling cpc_read_ffh() might have IRQs
> disabled,
> it should not prevent it to send IPIs no ?

It can't work with irq disabled. The comment in counters_read_on_cpu()
says "Abort call on counterless CPU or when interrupts are disabled -
can lead to deadlock in smp sync call."


And it just returns -EPERM and raise a warning if irq is disabled.

>
>>
>>>
>>>>
>>>> And the counters are accessed through an interconnect on our platform,
>>>> so the interconnect congestion may result in the difference as well.
>>>>
>>>>>
>>>>> freq = ref_freq * (delivered[1] - delivered[0]) / (ref[1] - ref[0])
>>>>>
>>>>> If the counters are accessed through mmio, I don't see anything that
>>>>> would
>>>>> make delta[x] vary when calling cppc_get_perf_ctrs(), cf. cpc_read().
>>>>> Do you know if the address represents real counters or a place in
>>>>> memory
>>>>> representing something else ?
>>>>
>>>> The address does represent real counters.
>>>
>>> Oh ok, is it possible to know what is there ?
>>
>> AFAIK, it is memory mapped access to the core's AMU counters.
>>
>>>
>>>>
>>>>>
>>>>> Would it be possible to try setting the CPU frequency to one unique
>>>>> value
>>>>> and get a serie of values like:
>>>>> [timestamp, ref_counter_value, deliverd_counter_value]
>>>>
>>>> Could you please elaborate regarding "setting the CPU frequency to one
>>>> unique value"? What value is unique?
>>>
>>> I meant having the CPUs using only on frequency. The following should
>>> work:
>>> cat /sys/devices/system/cpu/cpu[X]/cpufreq/scaling_min_freq >
>>> /sys/devices/system/cpu/cpu[X]/cpufreq/scaling_max_freq
>>>
>>>>
>>>>>
>>>>> This would allow to check that the counters are accumulating at a
>>>>> valid
>>>>> pace. Also you said there were frequencies above the maximum
>>>>> value, but
>>>>> are there also frequencies below the minimum value ?
>>>>
>>>> I've never seen the frequency below the minimum value.
>>>
>>> Maybe this is because the CPUs are running at their maximum frequency,
>>> so when the computed frequency is below the actual running frequency,
>>> it still doesn't go below the minimum frequency.
>>> Meaning that if the CPUs were limited to their lowest frequency (with
>>> the command above), maybe the get_rate() function would return values
>>> in a range [[1]:[2]]:
>>> - [1]: frequency below the lowest frequency
>>> - [2]: frequency above the lowest frequency, but below the maximum
>>>    frequency of 2.8GHz
>>
>> Yes, with writing scaling_min_freq to scaling_max_freq, I can see the
>> frequency was lower than min freq (1000000KHz) occasionally.
>>
>>>
>>> Would it be possible to do the following:
>>>
>>> # Shut down all the CPUs except the last one just to have less logs
>>> for i in /sys/devices/system/cpu/cpu[0-9]* ; do echo 0 > $i/online ;
>>> done
>>>
>>> cd /sys/kernel/debug/tracing
>>>
>>> # Add a kprobe to cppc_cpufreq_get_rate to get the computed freq
>>> echo 'r:myretprobe cppc_cpufreq_get_rate $retval:u32' >>
>>> /sys/kernel/debug/tracing/kprobe_events
>>> echo 1 > events/kprobes/enable
>>>
>>> # Setup ftrace to trace cppc_cpufreq_get_rate() calls
>>> # (and maybe see if something undesired happens in the call)
>>> echo function_graph > current_tracer
>>> echo funcgraph-abstime > trace_options
>>> echo cppc_cpufreq_get_rate > set_graph_function
>>>
>>> # Do the tracing
>>> echo 1 > tracing_on
>>> # Wait a bit for a call to cppc_cpufreq_get_rate() to happen
>>> echo 0 > tracing_on
>>>
>>> Also adding the following in cppc_perf_from_fbctrs() should allow
>>> tracking the counter values:
>>> trace_printk("get_cntrs: ref0=%lx ref1=%lx del0=%lx del1=%lx",
>>>      fb_ctrs_t0->reference, fb_ctrs_t1->reference,
>>>      fb_ctrs_t0->delivered, fb_ctrs_t1->delivered);
>>
>> Sorry I forgot to ask whether I shall run the tracing for the lower than
>> min frequency case or not? Since the latency of reading cpc is quite
>> small (sub microsecond), so it is quite sensitive to extra
>> latency/delay. It seems like the overhead of tracing is not negligible.
>>
>> I got some data with higher than max frequency.
>>
>> Good:
>>    159.077799 |     0)               |  /* get_cntrs: ref0=1513476688
>> ref1=1513482268 del0=329890515 del1=329905431 */
>>    159.607913 |     0)               |  /* get_cntrs: ref0=1137490668
>> ref1=1137496188 del0=3198347141 del1=3198362993 */
>>
>> Bad:
>> 4965.786566 |     0)               |  /* get_cntrs: ref0=766025872
>> ref1=766034192 del0=2288598970 del1=2288624170 */
>> 5251.933966 |     0)               |  /* get_cntrs: ref0=2472241696
>> ref1=2472250376 del0=1156772954 del1=1156799208 */
>>
>> Hopefully this is helpful.
>
> Yes, thanks a lot. Using the min or max frequency was not important IMO,
> it was just to avoid having the possibility to have a CPU with a changing
> frequency.
> Just a comment, the first trace was really helpfull, but I think it is
> possible to have a function graph and the counter values as above in
> the same trace.

I wish I could trace more. But the tracing overhead is more noticeable
and may impact the result. With the trace_printk() I actually already
saw much fewer high error happened. The higher the error the more useful
the trace.

>
>
> Thanks again for the experiments,

Thanks for guiding me debug this problem.

> Regards,
> Pierre

2023-04-20 16:02:58

by Pierre Gondois

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error


>>
>> You say that the cause of this is a congestion in the interconnect. I
>> don't
>> see a way to check that right now.
>> However your trace is on the CPU0, so maybe all the other cores were
>> shutdown
>> in your test. If this is the case, do you think a congestion could
>> happen with
>> only one CPU ?
>
> No, other CPUs were not shut down in my test. I just ran "yes" on all
> cores except CPU 0, then ran the reading freq script. Since all other
> cores are busy, so the script should be always running on CPU 0.
>
> Since the counters, memory and other devices are on the interconnect, so
> the congestion may be caused by plenty of factors IIUC.

+Ionela

Ionela pointed me to the following patch-set, which seems realated:
https://lore.kernel.org/all/[email protected]/

One thing that we didn't check I believe is and that Ionela pointed out
is that we don't know whether we are accessing the present CPU or a remote
CPU'AMUs. In the latter case there would be IPIs and possible delays in
waking up/accessing the remote CPU).

>
>>
>> Just 2 other comments:
>> a-
>> It might be interesting to change the order in which cpc registers are
>> read
>> just to see if it has an impact, but even if it has, I m not sure how
>> this
>> could be exploitable.
>> Just in case, I mean doing that, but I think that b. might be better
>> to try.
>>
>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>> index c51d3ccb4cca..479b55006020 100644
>> --- a/drivers/acpi/cppc_acpi.c
>> +++ b/drivers/acpi/cppc_acpi.c
>> @@ -1350,8 +1350,8 @@ int cppc_get_perf_ctrs(int cpunum, struct
>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>                 }
>>         }
>>
>> -       cpc_read(cpunum, delivered_reg, &delivered);
>>         cpc_read(cpunum, reference_reg, &reference);
>> +       cpc_read(cpunum, delivered_reg, &delivered);
>>         cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>
>>         /*
>>
>> b-
>> In the trace that you shared, the cpc_read() calls in the fist
>> cppc_get_perf_ctrs() calls seem to always take a bit more time than in
>> the
>> second cppc_get_perf_ctrs() call.
>> Would it be possible to collect traces similar as above with 3 or 4
>> calls to
>> cppc_get_perf_ctrs() instead of 2 ? It would allow to check whether in
>> the first
>> call, accessing the cpc registers takes more time than in the
>> following calls,
>> due to cache misses or other reasons.
>
> Cache miss? The counters should be not cached and reading the counters
> should not hit cache IIUC.

Yes you are correct, what I said is copmletely wrong.

>
>> Ideally statistics on the result would be the best, or if you have a
>> trace.dat
>> to share containing a trace with multiple cppc_cpufreq_get_rate() calls.
>
> Tried option b, I managed to get histogram:
>
> @hist_first_ns[cat]:
> [4K, 8K)          112321
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K)            212
> |                                                    |
> [16K, 32K)            25
> |                                                    |
> [32K, 64K)            59
> |                                                    |
> [64K, 128K)            6
> |                                                    |
> [128K, 256K)           9
> |                                                    |
>
> @hist_second_ns[cat]:
> [2K, 4K)          112590
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4K, 8K)               4
> |                                                    |
> [8K, 16K)              0
> |                                                    |
> [16K, 32K)            15
> |                                                    |
> [32K, 64K)            18
> |                                                    |
> [64K, 128K)            1
> |                                                    |
> [128K, 256K)           4
> |                                                    |
>
> The "first" means the first cppc_get_perf_ctrs() call. But the bpftrace
> script can't tell the second, the third and the fourth, so all them are
> shown as "second". Anyway it seems fine. We can tell the first read took
> longer than the later ones for the most time.
>
> And a typical func_graph trace shows:
>
> # tracer: function_graph
> #
> #     TIME        CPU  DURATION                  FUNCTION CALLS
> #      |          |     |   |                     |   |   | |
>  4447.171333 |     0)               |  cppc_cpufreq_get_rate
> [cppc_cpufreq]() {
>  4447.171334 |     0)               |    cpufreq_cpu_get() {
>  4447.171334 |     0)   1.060 us    | _raw_read_lock_irqsave();
>  4447.171336 |     0)   0.560 us    | _raw_read_unlock_irqrestore();
>  4447.171337 |     0)   3.480 us    | }
>  4447.171338 |     0)   0.400 us    | cpufreq_cpu_put();
>  4447.171338 |     0)               |    cppc_get_perf_ctrs() {
>  4447.171339 |     0)   0.720 us    | cpc_read.isra.0();
>  4447.171341 |     0)   0.700 us    | cpc_read.isra.0();
>  4447.171342 |     0)   0.380 us    | cpc_read.isra.0();
>  4447.171342 |     0)   0.600 us    | cpc_read.isra.0();
>  4447.171343 |     0)   4.900 us    | }
>  4447.171344 |     0)               |    __delay() {
>  4447.171344 |     0)   0.540 us    | arch_timer_evtstrm_available();
>  4447.171346 |     0)   2.420 us    | }
>  4447.171347 |     0)               |    cppc_get_perf_ctrs() {
>  4447.171347 |     0)   0.540 us    | cpc_read.isra.0();
>  4447.171348 |     0)   0.520 us    | cpc_read.isra.0();
>  4447.171349 |     0)   0.400 us    | cpc_read.isra.0();
>  4447.171350 |     0)   0.440 us    | cpc_read.isra.0();
>  4447.171350 |     0)   3.660 us    | }
>  4447.171351 |     0)               |    __delay() {
>  4447.171351 |     0)   0.400 us    | arch_timer_evtstrm_available();
>  4447.171353 |     0)   2.400 us    | }
>  4447.171353 |     0)               |    cppc_get_perf_ctrs() {
>  4447.171354 |     0)   0.540 us    | cpc_read.isra.0();
>  4447.171355 |     0)   0.540 us    | cpc_read.isra.0();
>  4447.171356 |     0)   0.380 us    | cpc_read.isra.0();
>  4447.171356 |     0)   0.420 us    | cpc_read.isra.0();
>  4447.171357 |     0)   3.640 us    | }
>  4447.171357 |     0)               |    __delay() {
>  4447.171358 |     0)   0.380 us    | arch_timer_evtstrm_available();
>  4447.171360 |     0)   2.380 us    |    }
>  4447.171360 |     0)               |    cppc_get_perf_ctrs() {
>  4447.171361 |     0)   0.520 us    |      cpc_read.isra.0();
>  4447.171361 |     0)   0.520 us    |      cpc_read.isra.0();
>  4447.171362 |     0)   0.400 us    |      cpc_read.isra.0();
>  4447.171363 |     0)   0.440 us    |      cpc_read.isra.0();
>  4447.171364 |     0)   3.640 us    |    }
>  4447.171364 |     0)   0.520 us    |    cppc_cpufreq_perf_to_khz
> [cppc_cpufreq]();
>  4447.171365 |     0) + 34.240 us   |  }
>
> It also shows the first reading typically took longer than the later
> ones. The second, the third and the fourth actually took almost same time.
>
> I also tried to read perf_fb_ctrs_t0 twice (so total 3 reads, 2 for t0,
> 1 for t1, 2us delay between each read), but I didn't see noticeable
> improvement. 4 reads (2 for t0, 2 for t1) does show some noticeable
> improvement.
>

Thanks for the new data.

>>
>> Example of code where we do 4 calls to cppc_get_perf_ctrs():
>>
>> diff --git a/drivers/cpufreq/cppc_cpufreq.c
>> b/drivers/cpufreq/cppc_cpufreq.c
>> index 022e3555407c..6370f2f0bdad 100644
>> --- a/drivers/cpufreq/cppc_cpufreq.c
>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>> @@ -853,6 +853,20 @@ static unsigned int
>> cppc_cpufreq_get_rate(unsigned int cpu)
>>
>>         udelay(2); /* 2usec delay between sampling */
>>
>> +       ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>> +       if (ret)
>> +               return ret;
>> +
>> +       udelay(2); /* 2usec delay between sampling */
>> +
>> +       /* Do a third call. */
>> +       ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>> +       if (ret)
>> +               return ret;
>> +
>> +       udelay(2); /* 2usec delay between sampling */
>> +
>> +       /* Do a fourth call. */
>>         ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>>         if (ret)
>>                 return ret;
>>
>>>
>>> The above trace data shows some cpc reading took a little bit longer
>>> than usual. I suspected it was caused by interconnect congestion.
>>>
>>>
>>> So it looks like IRQ is the major contributing factor of high error
>>> (4xxxxxx KHz), interconnect congestion is the major contributing factor
>>> of low error (3xxxxxx KHz).
>>>
>>> So I did the below test:
>>> 1. Disable IRQ: The high errors were gone (> 3700000KHz), but low errors
>>> were still seen.
>>> 2.10us delay: The high errors were still seen.
>>> 3. Disable IRQ + 10us delay: all the errors were gone.
>>>
>>> I think the test result also supports the tracing data.
>>>
>>>
>>> I also got some confusion about calling cppc_cpufreq_get_rate() with irq
>>> disabled. Rafael thought 10ms delay is too long because the function may
>>> be called with irq disabled. But a deeper look at the function shows it
>>> should *NOT* be called with irq disabled at all.
>>>
>>> First, if pcc channel is used, cpc reading may take over 100ms, it is
>>> way larger the proposed 10ms delay.
>>> Second, reading from cpc channel needs to take a semaphore, so it may
>>> sleep. But sleep with IRQ disabled is not allowed.
>>
>> Yes right, however the semaphore is not taken in between the sequence of
>> cpc_read() calls in cppc_get_perf_ctrs(). So maybe the change below
>> should
>> be acceptable:
>
> Yeah, we should be able to find a smaller irq disable section.
>
>>
>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>> index c51d3ccb4cca..105a7e2ffffa 100644
>> --- a/drivers/acpi/cppc_acpi.c
>> +++ b/drivers/acpi/cppc_acpi.c
>> @@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct
>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>         struct cppc_pcc_data *pcc_ss_data = NULL;
>>         u64 delivered, reference, ref_perf, ctr_wrap_time;
>>         int ret = 0, regs_in_pcc = 0;
>> +       unsigned long flags;
>>
>>         if (!cpc_desc) {
>>                 pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
>> @@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct
>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>                 }
>>         }
>>
>> +       local_irq_save(flags);
>> +
>>         cpc_read(cpunum, delivered_reg, &delivered);
>>         cpc_read(cpunum, reference_reg, &reference);
>>         cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>
>> +       local_irq_restore(flags);
>> +
>
> cpc_read_ffh() would return -EPERM if irq is disabled.
>
> So, the irq disabling must happen for mmio only in cpc_read(), for example:

I thought the issue was that irqs could happen in between cpc_read() functions,
the patch below would not cover it. If the frequency is more accurate
with this patch, I think I don't understand something.

(asking for more information)
Just to check, the core/perf counters are AMUs and the other CPPC registers
are mmio right ? Is it possible to know the CPPC registers that are
implemented on your platform ?
Also is it possible which platform you are using ?

>
> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
> index c51d3ccb4cca..f3c92d844074 100644
> --- a/drivers/acpi/cppc_acpi.c
> +++ b/drivers/acpi/cppc_acpi.c
> @@ -982,6 +982,7 @@ static int cpc_read(int cpu, struct
> cpc_register_resource *reg_res, u64 *val)
>         void __iomem *vaddr = NULL;
>         int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpu);
>         struct cpc_reg *reg = &reg_res->cpc_entry.reg;
> +       unsigned long flags;
>
>         if (reg_res->type == ACPI_TYPE_INTEGER) {
>                 *val = reg_res->cpc_entry.int_value;
> @@ -1015,6 +1016,7 @@ static int cpc_read(int cpu, struct
> cpc_register_resource *reg_res, u64 *val)
>                 return
> acpi_os_read_memory((acpi_physical_address)reg->address,
>                                 val, reg->bit_width);
>
> +       local_irq_save(flags);
>         switch (reg->bit_width) {
>         case 8:
>                 *val = readb_relaxed(vaddr);
> @@ -1029,10 +1031,12 @@ static int cpc_read(int cpu, struct
> cpc_register_resource *reg_res, u64 *val)
>                 *val = readq_relaxed(vaddr);
>                 break;
>         default:
> +               local_irq_restore(flags);
>                 pr_debug("Error: Cannot read %u bit width from PCC for
> ss: %d\n",
>                          reg->bit_width, pcc_ss_id);
>                 return -EFAULT;
>         }
> +       local_irq_restore(flags);
>
>         return 0;
>  }
>
>>         /*
>>          * Per spec, if ctr_wrap_time optional register is
>> unsupported, then the
>>          * performance counters are assumed to never wrap during the
>> lifetime of
>>
>>> Third, if the counters are implemented by AMU, cpc_read_ffh() needs to
>>> send IPI so it requires IRQ enabled.
>>
>> If I'm not mistaken, the CPU calling cpc_read_ffh() might have IRQs
>> disabled,
>> it should not prevent it to send IPIs no ?
>
> It can't work with irq disabled. The comment in counters_read_on_cpu()
> says "Abort call on counterless CPU or when interrupts are disabled -
> can lead to deadlock in smp sync call."
>
>
> And it just returns -EPERM and raise a warning if irq is disabled.

Ok right,

Regards,
Piere

2023-04-20 21:01:22

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error



On 4/20/23 9:01 AM, Pierre Gondois wrote:
>
>>>
>>> You say that the cause of this is a congestion in the interconnect. I
>>> don't
>>> see a way to check that right now.
>>> However your trace is on the CPU0, so maybe all the other cores were
>>> shutdown
>>> in your test. If this is the case, do you think a congestion could
>>> happen with
>>> only one CPU ?
>>
>> No, other CPUs were not shut down in my test. I just ran "yes" on all
>> cores except CPU 0, then ran the reading freq script. Since all other
>> cores are busy, so the script should be always running on CPU 0.
>>
>> Since the counters, memory and other devices are on the interconnect, so
>> the congestion may be caused by plenty of factors IIUC.
>
> +Ionela
>
> Ionela pointed me to the following patch-set, which seems realated:
> https://lore.kernel.org/all/[email protected]/

Thanks for the information. I think we do have the similar syndrome. But
I'm not sure how their counters are implemented, we may not have similar
root cause.

>
> One thing that we didn't check I believe is and that Ionela pointed out
> is that we don't know whether we are accessing the present CPU or a
> remote
> CPU'AMUs. In the latter case there would be IPIs and possible delays in
> waking up/accessing the remote CPU).
>
>>
>>>
>>> Just 2 other comments:
>>> a-
>>> It might be interesting to change the order in which cpc registers are
>>> read
>>> just to see if it has an impact, but even if it has, I m not sure how
>>> this
>>> could be exploitable.
>>> Just in case, I mean doing that, but I think that b. might be better
>>> to try.
>>>
>>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>>> index c51d3ccb4cca..479b55006020 100644
>>> --- a/drivers/acpi/cppc_acpi.c
>>> +++ b/drivers/acpi/cppc_acpi.c
>>> @@ -1350,8 +1350,8 @@ int cppc_get_perf_ctrs(int cpunum, struct
>>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>                  }
>>>          }
>>>
>>> -       cpc_read(cpunum, delivered_reg, &delivered);
>>>          cpc_read(cpunum, reference_reg, &reference);
>>> +       cpc_read(cpunum, delivered_reg, &delivered);
>>>          cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>>
>>>          /*
>>>
>>> b-
>>> In the trace that you shared, the cpc_read() calls in the fist
>>> cppc_get_perf_ctrs() calls seem to always take a bit more time than in
>>> the
>>> second cppc_get_perf_ctrs() call.
>>> Would it be possible to collect traces similar as above with 3 or 4
>>> calls to
>>> cppc_get_perf_ctrs() instead of 2 ? It would allow to check whether in
>>> the first
>>> call, accessing the cpc registers takes more time than in the
>>> following calls,
>>> due to cache misses or other reasons.
>>
>> Cache miss? The counters should be not cached and reading the counters
>> should not hit cache IIUC.
>
> Yes you are correct, what I said is copmletely wrong.
>
>>
>>> Ideally statistics on the result would be the best, or if you have a
>>> trace.dat
>>> to share containing a trace with multiple cppc_cpufreq_get_rate()
>>> calls.
>>
>> Tried option b, I managed to get histogram:
>>
>> @hist_first_ns[cat]:
>> [4K, 8K)          112321
>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [8K, 16K)            212
>> |                                                    |
>> [16K, 32K)            25
>> |                                                    |
>> [32K, 64K)            59
>> |                                                    |
>> [64K, 128K)            6
>> |                                                    |
>> [128K, 256K)           9
>> |                                                    |
>>
>> @hist_second_ns[cat]:
>> [2K, 4K)          112590
>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [4K, 8K)               4
>> |                                                    |
>> [8K, 16K)              0
>> |                                                    |
>> [16K, 32K)            15
>> |                                                    |
>> [32K, 64K)            18
>> |                                                    |
>> [64K, 128K)            1
>> |                                                    |
>> [128K, 256K)           4
>> |                                                    |
>>
>> The "first" means the first cppc_get_perf_ctrs() call. But the bpftrace
>> script can't tell the second, the third and the fourth, so all them are
>> shown as "second". Anyway it seems fine. We can tell the first read took
>> longer than the later ones for the most time.
>>
>> And a typical func_graph trace shows:
>>
>> # tracer: function_graph
>> #
>> #     TIME        CPU  DURATION                  FUNCTION CALLS
>> #      |          |     |   |                     |   |   | |
>>    4447.171333 |     0)               |  cppc_cpufreq_get_rate
>> [cppc_cpufreq]() {
>>    4447.171334 |     0)               |    cpufreq_cpu_get() {
>>    4447.171334 |     0)   1.060 us    | _raw_read_lock_irqsave();
>>    4447.171336 |     0)   0.560 us    | _raw_read_unlock_irqrestore();
>>    4447.171337 |     0)   3.480 us    | }
>>    4447.171338 |     0)   0.400 us    | cpufreq_cpu_put();
>>    4447.171338 |     0)               |    cppc_get_perf_ctrs() {
>>    4447.171339 |     0)   0.720 us    | cpc_read.isra.0();
>>    4447.171341 |     0)   0.700 us    | cpc_read.isra.0();
>>    4447.171342 |     0)   0.380 us    | cpc_read.isra.0();
>>    4447.171342 |     0)   0.600 us    | cpc_read.isra.0();
>>    4447.171343 |     0)   4.900 us    | }
>>    4447.171344 |     0)               |    __delay() {
>>    4447.171344 |     0)   0.540 us    | arch_timer_evtstrm_available();
>>    4447.171346 |     0)   2.420 us    | }
>>    4447.171347 |     0)               |    cppc_get_perf_ctrs() {
>>    4447.171347 |     0)   0.540 us    | cpc_read.isra.0();
>>    4447.171348 |     0)   0.520 us    | cpc_read.isra.0();
>>    4447.171349 |     0)   0.400 us    | cpc_read.isra.0();
>>    4447.171350 |     0)   0.440 us    | cpc_read.isra.0();
>>    4447.171350 |     0)   3.660 us    | }
>>    4447.171351 |     0)               |    __delay() {
>>    4447.171351 |     0)   0.400 us    | arch_timer_evtstrm_available();
>>    4447.171353 |     0)   2.400 us    | }
>>    4447.171353 |     0)               |    cppc_get_perf_ctrs() {
>>    4447.171354 |     0)   0.540 us    | cpc_read.isra.0();
>>    4447.171355 |     0)   0.540 us    | cpc_read.isra.0();
>>    4447.171356 |     0)   0.380 us    | cpc_read.isra.0();
>>    4447.171356 |     0)   0.420 us    | cpc_read.isra.0();
>>    4447.171357 |     0)   3.640 us    | }
>>    4447.171357 |     0)               |    __delay() {
>>    4447.171358 |     0)   0.380 us    | arch_timer_evtstrm_available();
>>    4447.171360 |     0)   2.380 us    |    }
>>    4447.171360 |     0)               |    cppc_get_perf_ctrs() {
>>    4447.171361 |     0)   0.520 us    |      cpc_read.isra.0();
>>    4447.171361 |     0)   0.520 us    |      cpc_read.isra.0();
>>    4447.171362 |     0)   0.400 us    |      cpc_read.isra.0();
>>    4447.171363 |     0)   0.440 us    |      cpc_read.isra.0();
>>    4447.171364 |     0)   3.640 us    |    }
>>    4447.171364 |     0)   0.520 us    | cppc_cpufreq_perf_to_khz
>> [cppc_cpufreq]();
>>    4447.171365 |     0) + 34.240 us   |  }
>>
>> It also shows the first reading typically took longer than the later
>> ones. The second, the third and the fourth actually took almost same
>> time.
>>
>> I also tried to read perf_fb_ctrs_t0 twice (so total 3 reads, 2 for t0,
>> 1 for t1, 2us delay between each read), but I didn't see noticeable
>> improvement. 4 reads (2 for t0, 2 for t1) does show some noticeable
>> improvement.
>>
>
> Thanks for the new data.
>
>>>
>>> Example of code where we do 4 calls to cppc_get_perf_ctrs():
>>>
>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c
>>> b/drivers/cpufreq/cppc_cpufreq.c
>>> index 022e3555407c..6370f2f0bdad 100644
>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>> @@ -853,6 +853,20 @@ static unsigned int
>>> cppc_cpufreq_get_rate(unsigned int cpu)
>>>
>>>          udelay(2); /* 2usec delay between sampling */
>>>
>>> +       ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>>> +       if (ret)
>>> +               return ret;
>>> +
>>> +       udelay(2); /* 2usec delay between sampling */
>>> +
>>> +       /* Do a third call. */
>>> +       ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>>> +       if (ret)
>>> +               return ret;
>>> +
>>> +       udelay(2); /* 2usec delay between sampling */
>>> +
>>> +       /* Do a fourth call. */
>>>          ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>>>          if (ret)
>>>                  return ret;
>>>
>>>>
>>>> The above trace data shows some cpc reading took a little bit longer
>>>> than usual. I suspected it was caused by interconnect congestion.
>>>>
>>>>
>>>> So it looks like IRQ is the major contributing factor of high error
>>>> (4xxxxxx KHz), interconnect congestion is the major contributing
>>>> factor
>>>> of low error (3xxxxxx KHz).
>>>>
>>>> So I did the below test:
>>>> 1. Disable IRQ: The high errors were gone (> 3700000KHz), but low
>>>> errors
>>>> were still seen.
>>>> 2.10us delay: The high errors were still seen.
>>>> 3. Disable IRQ + 10us delay: all the errors were gone.
>>>>
>>>> I think the test result also supports the tracing data.
>>>>
>>>>
>>>> I also got some confusion about calling cppc_cpufreq_get_rate()
>>>> with irq
>>>> disabled. Rafael thought 10ms delay is too long because the
>>>> function may
>>>> be called with irq disabled. But a deeper look at the function
>>>> shows it
>>>> should *NOT* be called with irq disabled at all.
>>>>
>>>> First, if pcc channel is used, cpc reading may take over 100ms, it is
>>>> way larger the proposed 10ms delay.
>>>> Second, reading from cpc channel needs to take a semaphore, so it may
>>>> sleep. But sleep with IRQ disabled is not allowed.
>>>
>>> Yes right, however the semaphore is not taken in between the
>>> sequence of
>>> cpc_read() calls in cppc_get_perf_ctrs(). So maybe the change below
>>> should
>>> be acceptable:
>>
>> Yeah, we should be able to find a smaller irq disable section.
>>
>>>
>>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>>> index c51d3ccb4cca..105a7e2ffffa 100644
>>> --- a/drivers/acpi/cppc_acpi.c
>>> +++ b/drivers/acpi/cppc_acpi.c
>>> @@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct
>>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>          struct cppc_pcc_data *pcc_ss_data = NULL;
>>>          u64 delivered, reference, ref_perf, ctr_wrap_time;
>>>          int ret = 0, regs_in_pcc = 0;
>>> +       unsigned long flags;
>>>
>>>          if (!cpc_desc) {
>>>                  pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
>>> @@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct
>>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>                  }
>>>          }
>>>
>>> +       local_irq_save(flags);
>>> +
>>>          cpc_read(cpunum, delivered_reg, &delivered);
>>>          cpc_read(cpunum, reference_reg, &reference);
>>>          cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>>
>>> +       local_irq_restore(flags);
>>> +
>>
>> cpc_read_ffh() would return -EPERM if irq is disabled.
>>
>> So, the irq disabling must happen for mmio only in cpc_read(), for
>> example:
>
> I thought the issue was that irqs could happen in between cpc_read()
> functions,
> the patch below would not cover it. If the frequency is more accurate
> with this patch, I think I don't understand something.

Yeah, you are correct. The irq disabling window has to cover all the
cpc_read(). I didn't test with this patch. My test was done conceptually
with:

disable irq
cppc_get_perf_ctrs(t0)
udelay(2)
cppc_get_perf_ctrs(t1)
enable irq

But this will break cpc_read_ffh().

>
> (asking for more information)
> Just to check, the core/perf counters are AMUs and the other CPPC
> registers
> are mmio right ? Is it possible to know the CPPC registers that are
> implemented on your platform ?

AFAIK, the perf counters are implemented by AMU and accessed via mmio,
the other CPPC registers are implemented by PCC.

> Also is it possible which platform you are using ?

Ampere One

>
>>
>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>> index c51d3ccb4cca..f3c92d844074 100644
>> --- a/drivers/acpi/cppc_acpi.c
>> +++ b/drivers/acpi/cppc_acpi.c
>> @@ -982,6 +982,7 @@ static int cpc_read(int cpu, struct
>> cpc_register_resource *reg_res, u64 *val)
>>           void __iomem *vaddr = NULL;
>>           int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpu);
>>           struct cpc_reg *reg = &reg_res->cpc_entry.reg;
>> +       unsigned long flags;
>>
>>           if (reg_res->type == ACPI_TYPE_INTEGER) {
>>                   *val = reg_res->cpc_entry.int_value;
>> @@ -1015,6 +1016,7 @@ static int cpc_read(int cpu, struct
>> cpc_register_resource *reg_res, u64 *val)
>>                   return
>> acpi_os_read_memory((acpi_physical_address)reg->address,
>>                                   val, reg->bit_width);
>>
>> +       local_irq_save(flags);
>>           switch (reg->bit_width) {
>>           case 8:
>>                   *val = readb_relaxed(vaddr);
>> @@ -1029,10 +1031,12 @@ static int cpc_read(int cpu, struct
>> cpc_register_resource *reg_res, u64 *val)
>>                   *val = readq_relaxed(vaddr);
>>                   break;
>>           default:
>> +               local_irq_restore(flags);
>>                   pr_debug("Error: Cannot read %u bit width from PCC for
>> ss: %d\n",
>>                            reg->bit_width, pcc_ss_id);
>>                   return -EFAULT;
>>           }
>> +       local_irq_restore(flags);
>>
>>           return 0;
>>    }
>>
>>>          /*
>>>           * Per spec, if ctr_wrap_time optional register is
>>> unsupported, then the
>>>           * performance counters are assumed to never wrap during the
>>> lifetime of
>>>
>>>> Third, if the counters are implemented by AMU, cpc_read_ffh() needs to
>>>> send IPI so it requires IRQ enabled.
>>>
>>> If I'm not mistaken, the CPU calling cpc_read_ffh() might have IRQs
>>> disabled,
>>> it should not prevent it to send IPIs no ?
>>
>> It can't work with irq disabled. The comment in counters_read_on_cpu()
>> says "Abort call on counterless CPU or when interrupts are disabled -
>> can lead to deadlock in smp sync call."
>>
>>
>> And it just returns -EPERM and raise a warning if irq is disabled.
>
> Ok right,
>
> Regards,
> Piere

2023-04-24 11:50:55

by Ionela Voinescu

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error

Hey,

On Thursday 20 Apr 2023 at 13:49:24 (-0700), Yang Shi wrote:
>
>
> On 4/20/23 9:01 AM, Pierre Gondois wrote:
> >
> > > >
> > > > You say that the cause of this is a congestion in the interconnect. I
> > > > don't
> > > > see a way to check that right now.
> > > > However your trace is on the CPU0, so maybe all the other cores were
> > > > shutdown
> > > > in your test. If this is the case, do you think a congestion could
> > > > happen with
> > > > only one CPU ?
> > >
> > > No, other CPUs were not shut down in my test. I just ran "yes" on all
> > > cores except CPU 0, then ran the reading freq script. Since all other
> > > cores are busy, so the script should be always running on CPU 0.
> > >
> > > Since the counters, memory and other devices are on the interconnect, so
> > > the congestion may be caused by plenty of factors IIUC.
> >
> > +Ionela
> >
> > Ionela pointed me to the following patch-set, which seems realated:
> > https://lore.kernel.org/all/[email protected]/
>
> Thanks for the information. I think we do have the similar syndrome. But I'm
> not sure how their counters are implemented, we may not have similar root
> cause.

Yes, my bad, I did not get the chance to read this full thread before
talking with Pierre. In your case you have AMUs accessed via MMIO and in that
case they are accessed though FFH (IPIs and system registers). The root
cause is indeed different.

>
> >
> > One thing that we didn't check I believe is and that Ionela pointed out
> > is that we don't know whether we are accessing the present CPU or a
> > remote
> > CPU'AMUs. In the latter case there would be IPIs and possible delays in
> > waking up/accessing the remote CPU).
> >
> > >
> > > >
> > > > Just 2 other comments:
> > > > a-
> > > > It might be interesting to change the order in which cpc registers are
> > > > read
> > > > just to see if it has an impact, but even if it has, I m not sure how
> > > > this
> > > > could be exploitable.
> > > > Just in case, I mean doing that, but I think that b. might be better
> > > > to try.
> > > >
> > > > diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
> > > > index c51d3ccb4cca..479b55006020 100644
> > > > --- a/drivers/acpi/cppc_acpi.c
> > > > +++ b/drivers/acpi/cppc_acpi.c
> > > > @@ -1350,8 +1350,8 @@ int cppc_get_perf_ctrs(int cpunum, struct
> > > > cppc_perf_fb_ctrs *perf_fb_ctrs)
> > > >                  }
> > > >          }
> > > >
> > > > -       cpc_read(cpunum, delivered_reg, &delivered);
> > > >          cpc_read(cpunum, reference_reg, &reference);
> > > > +       cpc_read(cpunum, delivered_reg, &delivered);
> > > >          cpc_read(cpunum, ref_perf_reg, &ref_perf);
> > > >
> > > >          /*
> > > >
> > > > b-
> > > > In the trace that you shared, the cpc_read() calls in the fist
> > > > cppc_get_perf_ctrs() calls seem to always take a bit more time than in
> > > > the
> > > > second cppc_get_perf_ctrs() call.
> > > > Would it be possible to collect traces similar as above with 3 or 4
> > > > calls to
> > > > cppc_get_perf_ctrs() instead of 2 ? It would allow to check whether in
> > > > the first
> > > > call, accessing the cpc registers takes more time than in the
> > > > following calls,
> > > > due to cache misses or other reasons.
> > >
> > > Cache miss? The counters should be not cached and reading the counters
> > > should not hit cache IIUC.
> >
> > Yes you are correct, what I said is copmletely wrong.
> >
> > >
> > > > Ideally statistics on the result would be the best, or if you have a
> > > > trace.dat
> > > > to share containing a trace with multiple
> > > > cppc_cpufreq_get_rate() calls.
> > >
> > > Tried option b, I managed to get histogram:
> > >
> > > @hist_first_ns[cat]:
> > > [4K, 8K)          112321
> > > |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > > [8K, 16K)            212
> > > |                                                    |
> > > [16K, 32K)            25
> > > |                                                    |
> > > [32K, 64K)            59
> > > |                                                    |
> > > [64K, 128K)            6
> > > |                                                    |
> > > [128K, 256K)           9
> > > |                                                    |
> > >
> > > @hist_second_ns[cat]:
> > > [2K, 4K)          112590
> > > |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > > [4K, 8K)               4
> > > |                                                    |
> > > [8K, 16K)              0
> > > |                                                    |
> > > [16K, 32K)            15
> > > |                                                    |
> > > [32K, 64K)            18
> > > |                                                    |
> > > [64K, 128K)            1
> > > |                                                    |
> > > [128K, 256K)           4
> > > |                                                    |
> > >
> > > The "first" means the first cppc_get_perf_ctrs() call. But the bpftrace
> > > script can't tell the second, the third and the fourth, so all them are
> > > shown as "second". Anyway it seems fine. We can tell the first read took
> > > longer than the later ones for the most time.
> > >
> > > And a typical func_graph trace shows:
> > >
> > > # tracer: function_graph
> > > #
> > > #     TIME        CPU  DURATION                  FUNCTION CALLS
> > > #      |          |     |   |                     |   |   | |
> > >    4447.171333 |     0)               |  cppc_cpufreq_get_rate
> > > [cppc_cpufreq]() {
> > >    4447.171334 |     0)               |    cpufreq_cpu_get() {
> > >    4447.171334 |     0)   1.060 us    | _raw_read_lock_irqsave();
> > >    4447.171336 |     0)   0.560 us    | _raw_read_unlock_irqrestore();
> > >    4447.171337 |     0)   3.480 us    | }
> > >    4447.171338 |     0)   0.400 us    | cpufreq_cpu_put();
> > >    4447.171338 |     0)               |    cppc_get_perf_ctrs() {
> > >    4447.171339 |     0)   0.720 us    | cpc_read.isra.0();
> > >    4447.171341 |     0)   0.700 us    | cpc_read.isra.0();
> > >    4447.171342 |     0)   0.380 us    | cpc_read.isra.0();
> > >    4447.171342 |     0)   0.600 us    | cpc_read.isra.0();
> > >    4447.171343 |     0)   4.900 us    | }
> > >    4447.171344 |     0)               |    __delay() {
> > >    4447.171344 |     0)   0.540 us    | arch_timer_evtstrm_available();
> > >    4447.171346 |     0)   2.420 us    | }
> > >    4447.171347 |     0)               |    cppc_get_perf_ctrs() {
> > >    4447.171347 |     0)   0.540 us    | cpc_read.isra.0();
> > >    4447.171348 |     0)   0.520 us    | cpc_read.isra.0();
> > >    4447.171349 |     0)   0.400 us    | cpc_read.isra.0();
> > >    4447.171350 |     0)   0.440 us    | cpc_read.isra.0();
> > >    4447.171350 |     0)   3.660 us    | }
> > >    4447.171351 |     0)               |    __delay() {
> > >    4447.171351 |     0)   0.400 us    | arch_timer_evtstrm_available();
> > >    4447.171353 |     0)   2.400 us    | }
> > >    4447.171353 |     0)               |    cppc_get_perf_ctrs() {
> > >    4447.171354 |     0)   0.540 us    | cpc_read.isra.0();
> > >    4447.171355 |     0)   0.540 us    | cpc_read.isra.0();
> > >    4447.171356 |     0)   0.380 us    | cpc_read.isra.0();
> > >    4447.171356 |     0)   0.420 us    | cpc_read.isra.0();
> > >    4447.171357 |     0)   3.640 us    | }
> > >    4447.171357 |     0)               |    __delay() {
> > >    4447.171358 |     0)   0.380 us    | arch_timer_evtstrm_available();
> > >    4447.171360 |     0)   2.380 us    |    }
> > >    4447.171360 |     0)               |    cppc_get_perf_ctrs() {
> > >    4447.171361 |     0)   0.520 us    |      cpc_read.isra.0();
> > >    4447.171361 |     0)   0.520 us    |      cpc_read.isra.0();
> > >    4447.171362 |     0)   0.400 us    |      cpc_read.isra.0();
> > >    4447.171363 |     0)   0.440 us    |      cpc_read.isra.0();
> > >    4447.171364 |     0)   3.640 us    |    }
> > >    4447.171364 |     0)   0.520 us    | cppc_cpufreq_perf_to_khz
> > > [cppc_cpufreq]();
> > >    4447.171365 |     0) + 34.240 us   |  }
> > >
> > > It also shows the first reading typically took longer than the later
> > > ones. The second, the third and the fourth actually took almost same
> > > time.
> > >

Do you happen to know if this is still the case if the delay in between
the two cppc_get_perf_ctrs() calls is larger (a few ms)?

> > > I also tried to read perf_fb_ctrs_t0 twice (so total 3 reads, 2 for t0,
> > > 1 for t1, 2us delay between each read), but I didn't see noticeable
> > > improvement. 4 reads (2 for t0, 2 for t1) does show some noticeable
> > > improvement.
> > >

I'm not sure I understand this. t0, t1? Can you expand on the sequence
of reads here? Are you reading the same counter multiple times instead
of reading all counters multiple times?

> >
> > Thanks for the new data.
> >
> > > >
> > > > Example of code where we do 4 calls to cppc_get_perf_ctrs():
> > > >
> > > > diff --git a/drivers/cpufreq/cppc_cpufreq.c
> > > > b/drivers/cpufreq/cppc_cpufreq.c
> > > > index 022e3555407c..6370f2f0bdad 100644
> > > > --- a/drivers/cpufreq/cppc_cpufreq.c
> > > > +++ b/drivers/cpufreq/cppc_cpufreq.c
> > > > @@ -853,6 +853,20 @@ static unsigned int
> > > > cppc_cpufreq_get_rate(unsigned int cpu)
> > > >
> > > >          udelay(2); /* 2usec delay between sampling */
> > > >
> > > > +       ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +
> > > > +       udelay(2); /* 2usec delay between sampling */
> > > > +
> > > > +       /* Do a third call. */
> > > > +       ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +
> > > > +       udelay(2); /* 2usec delay between sampling */
> > > > +
> > > > +       /* Do a fourth call. */
> > > >          ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
> > > >          if (ret)
> > > >                  return ret;
> > > >
> > > > >
> > > > > The above trace data shows some cpc reading took a little bit longer
> > > > > than usual. I suspected it was caused by interconnect congestion.
> > > > >
> > > > >
> > > > > So it looks like IRQ is the major contributing factor of high error
> > > > > (4xxxxxx KHz), interconnect congestion is the major
> > > > > contributing factor
> > > > > of low error (3xxxxxx KHz).
> > > > >
> > > > > So I did the below test:
> > > > > 1. Disable IRQ: The high errors were gone (> 3700000KHz),
> > > > > but low errors
> > > > > were still seen.
> > > > > 2.10us delay: The high errors were still seen.
> > > > > 3. Disable IRQ + 10us delay: all the errors were gone.
> > > > >
> > > > > I think the test result also supports the tracing data.
> > > > >
> > > > >
> > > > > I also got some confusion about calling
> > > > > cppc_cpufreq_get_rate() with irq
> > > > > disabled. Rafael thought 10ms delay is too long because the
> > > > > function may
> > > > > be called with irq disabled. But a deeper look at the
> > > > > function shows it
> > > > > should *NOT* be called with irq disabled at all.
> > > > >
> > > > > First, if pcc channel is used, cpc reading may take over 100ms, it is
> > > > > way larger the proposed 10ms delay.
> > > > > Second, reading from cpc channel needs to take a semaphore, so it may
> > > > > sleep. But sleep with IRQ disabled is not allowed.
> > > >
> > > > Yes right, however the semaphore is not taken in between the
> > > > sequence of
> > > > cpc_read() calls in cppc_get_perf_ctrs(). So maybe the change below
> > > > should
> > > > be acceptable:
> > >
> > > Yeah, we should be able to find a smaller irq disable section.
> > >
> > > >
> > > > diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
> > > > index c51d3ccb4cca..105a7e2ffffa 100644
> > > > --- a/drivers/acpi/cppc_acpi.c
> > > > +++ b/drivers/acpi/cppc_acpi.c
> > > > @@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct
> > > > cppc_perf_fb_ctrs *perf_fb_ctrs)
> > > >          struct cppc_pcc_data *pcc_ss_data = NULL;
> > > >          u64 delivered, reference, ref_perf, ctr_wrap_time;
> > > >          int ret = 0, regs_in_pcc = 0;
> > > > +       unsigned long flags;
> > > >
> > > >          if (!cpc_desc) {
> > > >                  pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
> > > > @@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct
> > > > cppc_perf_fb_ctrs *perf_fb_ctrs)
> > > >                  }
> > > >          }
> > > >
> > > > +       local_irq_save(flags);
> > > > +
> > > >          cpc_read(cpunum, delivered_reg, &delivered);
> > > >          cpc_read(cpunum, reference_reg, &reference);
> > > >          cpc_read(cpunum, ref_perf_reg, &ref_perf);
> > > >
> > > > +       local_irq_restore(flags);
> > > > +
> > >
> > > cpc_read_ffh() would return -EPERM if irq is disabled.
> > >
> > > So, the irq disabling must happen for mmio only in cpc_read(), for
> > > example:
> >
> > I thought the issue was that irqs could happen in between cpc_read()
> > functions,
> > the patch below would not cover it. If the frequency is more accurate
> > with this patch, I think I don't understand something.
>
> Yeah, you are correct. The irq disabling window has to cover all the
> cpc_read(). I didn't test with this patch. My test was done conceptually
> with:
>
> disable irq
> cppc_get_perf_ctrs(t0)
> udelay(2)
> cppc_get_perf_ctrs(t1)
> enable irq
>
> But this will break cpc_read_ffh().

Can you not disable IRQs in cppc_get_perf_ctrs() only if the registers
are CPC_IN_SYSTEM_MEMORY? Only spanning the reads of the delivered
register and the reference register, which should have minimal delay in
between?

As in:

if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
CPC_IN_SYSTEM_MEMORY(reference_reg))
...

This will and should not affect FFH - the fix for that would have to be
different.

>
> >
> > (asking for more information)
> > Just to check, the core/perf counters are AMUs and the other CPPC
> > registers
> > are mmio right ? Is it possible to know the CPPC registers that are
> > implemented on your platform ?
>
> AFAIK, the perf counters are implemented by AMU and accessed via mmio, the
> other CPPC registers are implemented by PCC.
>

Usually there is an implementation defined rate of sync between the CPU
counters and the values available through the memory mapped interface.
Is it possible that some inaccuracies are introduced by that as well, in
case there is a sync in between the two counter (delivered and reference)
reads?

Hope it helps,
Ionela.

> > Also is it possible which platform you are using ?
>
> Ampere One
>
> >
> > >
> > > diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
> > > index c51d3ccb4cca..f3c92d844074 100644
> > > --- a/drivers/acpi/cppc_acpi.c
> > > +++ b/drivers/acpi/cppc_acpi.c
> > > @@ -982,6 +982,7 @@ static int cpc_read(int cpu, struct
> > > cpc_register_resource *reg_res, u64 *val)
> > >           void __iomem *vaddr = NULL;
> > >           int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpu);
> > >           struct cpc_reg *reg = &reg_res->cpc_entry.reg;
> > > +       unsigned long flags;
> > >
> > >           if (reg_res->type == ACPI_TYPE_INTEGER) {
> > >                   *val = reg_res->cpc_entry.int_value;
> > > @@ -1015,6 +1016,7 @@ static int cpc_read(int cpu, struct
> > > cpc_register_resource *reg_res, u64 *val)
> > >                   return
> > > acpi_os_read_memory((acpi_physical_address)reg->address,
> > >                                   val, reg->bit_width);
> > >
> > > +       local_irq_save(flags);
> > >           switch (reg->bit_width) {
> > >           case 8:
> > >                   *val = readb_relaxed(vaddr);
> > > @@ -1029,10 +1031,12 @@ static int cpc_read(int cpu, struct
> > > cpc_register_resource *reg_res, u64 *val)
> > >                   *val = readq_relaxed(vaddr);
> > >                   break;
> > >           default:
> > > +               local_irq_restore(flags);
> > >                   pr_debug("Error: Cannot read %u bit width from PCC for
> > > ss: %d\n",
> > >                            reg->bit_width, pcc_ss_id);
> > >                   return -EFAULT;
> > >           }
> > > +       local_irq_restore(flags);
> > >
> > >           return 0;
> > >    }
> > >
> > > >          /*
> > > >           * Per spec, if ctr_wrap_time optional register is
> > > > unsupported, then the
> > > >           * performance counters are assumed to never wrap during the
> > > > lifetime of
> > > >
> > > > > Third, if the counters are implemented by AMU, cpc_read_ffh() needs to
> > > > > send IPI so it requires IRQ enabled.
> > > >
> > > > If I'm not mistaken, the CPU calling cpc_read_ffh() might have IRQs
> > > > disabled,
> > > > it should not prevent it to send IPIs no ?
> > >
> > > It can't work with irq disabled. The comment in counters_read_on_cpu()
> > > says "Abort call on counterless CPU or when interrupts are disabled -
> > > can lead to deadlock in smp sync call."
> > >
> > >
> > > And it just returns -EPERM and raise a warning if irq is disabled.
> >
> > Ok right,
> >
> > Regards,
> > Piere
>

2023-04-26 01:46:16

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error



On 4/24/23 4:44 AM, Ionela Voinescu wrote:
> Hey,
>
> On Thursday 20 Apr 2023 at 13:49:24 (-0700), Yang Shi wrote:
>> On 4/20/23 9:01 AM, Pierre Gondois wrote:
>>>>> You say that the cause of this is a congestion in the interconnect. I
>>>>> don't
>>>>> see a way to check that right now.
>>>>> However your trace is on the CPU0, so maybe all the other cores were
>>>>> shutdown
>>>>> in your test. If this is the case, do you think a congestion could
>>>>> happen with
>>>>> only one CPU ?
>>>> No, other CPUs were not shut down in my test. I just ran "yes" on all
>>>> cores except CPU 0, then ran the reading freq script. Since all other
>>>> cores are busy, so the script should be always running on CPU 0.
>>>>
>>>> Since the counters, memory and other devices are on the interconnect, so
>>>> the congestion may be caused by plenty of factors IIUC.
>>> +Ionela
>>>
>>> Ionela pointed me to the following patch-set, which seems realated:
>>> https://lore.kernel.org/all/[email protected]/
>> Thanks for the information. I think we do have the similar syndrome. But I'm
>> not sure how their counters are implemented, we may not have similar root
>> cause.
> Yes, my bad, I did not get the chance to read this full thread before
> talking with Pierre. In your case you have AMUs accessed via MMIO and in that
> case they are accessed though FFH (IPIs and system registers). The root
> cause is indeed different.

Yeah, but it seems like using larger delay could mitigate both issues.

>>> One thing that we didn't check I believe is and that Ionela pointed out
>>> is that we don't know whether we are accessing the present CPU or a
>>> remote
>>> CPU'AMUs. In the latter case there would be IPIs and possible delays in
>>> waking up/accessing the remote CPU).
>>>
>>>>> Just 2 other comments:
>>>>> a-
>>>>> It might be interesting to change the order in which cpc registers are
>>>>> read
>>>>> just to see if it has an impact, but even if it has, I m not sure how
>>>>> this
>>>>> could be exploitable.
>>>>> Just in case, I mean doing that, but I think that b. might be better
>>>>> to try.
>>>>>
>>>>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>>>>> index c51d3ccb4cca..479b55006020 100644
>>>>> --- a/drivers/acpi/cppc_acpi.c
>>>>> +++ b/drivers/acpi/cppc_acpi.c
>>>>> @@ -1350,8 +1350,8 @@ int cppc_get_perf_ctrs(int cpunum, struct
>>>>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>>>                  }
>>>>>          }
>>>>>
>>>>> -       cpc_read(cpunum, delivered_reg, &delivered);
>>>>>          cpc_read(cpunum, reference_reg, &reference);
>>>>> +       cpc_read(cpunum, delivered_reg, &delivered);
>>>>>          cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>>>>
>>>>>          /*
>>>>>
>>>>> b-
>>>>> In the trace that you shared, the cpc_read() calls in the fist
>>>>> cppc_get_perf_ctrs() calls seem to always take a bit more time than in
>>>>> the
>>>>> second cppc_get_perf_ctrs() call.
>>>>> Would it be possible to collect traces similar as above with 3 or 4
>>>>> calls to
>>>>> cppc_get_perf_ctrs() instead of 2 ? It would allow to check whether in
>>>>> the first
>>>>> call, accessing the cpc registers takes more time than in the
>>>>> following calls,
>>>>> due to cache misses or other reasons.
>>>> Cache miss? The counters should be not cached and reading the counters
>>>> should not hit cache IIUC.
>>> Yes you are correct, what I said is copmletely wrong.
>>>
>>>>> Ideally statistics on the result would be the best, or if you have a
>>>>> trace.dat
>>>>> to share containing a trace with multiple
>>>>> cppc_cpufreq_get_rate() calls.
>>>> Tried option b, I managed to get histogram:
>>>>
>>>> @hist_first_ns[cat]:
>>>> [4K, 8K)          112321
>>>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>>> [8K, 16K)            212
>>>> |                                                    |
>>>> [16K, 32K)            25
>>>> |                                                    |
>>>> [32K, 64K)            59
>>>> |                                                    |
>>>> [64K, 128K)            6
>>>> |                                                    |
>>>> [128K, 256K)           9
>>>> |                                                    |
>>>>
>>>> @hist_second_ns[cat]:
>>>> [2K, 4K)          112590
>>>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>>> [4K, 8K)               4
>>>> |                                                    |
>>>> [8K, 16K)              0
>>>> |                                                    |
>>>> [16K, 32K)            15
>>>> |                                                    |
>>>> [32K, 64K)            18
>>>> |                                                    |
>>>> [64K, 128K)            1
>>>> |                                                    |
>>>> [128K, 256K)           4
>>>> |                                                    |
>>>>
>>>> The "first" means the first cppc_get_perf_ctrs() call. But the bpftrace
>>>> script can't tell the second, the third and the fourth, so all them are
>>>> shown as "second". Anyway it seems fine. We can tell the first read took
>>>> longer than the later ones for the most time.
>>>>
>>>> And a typical func_graph trace shows:
>>>>
>>>> # tracer: function_graph
>>>> #
>>>> #     TIME        CPU  DURATION                  FUNCTION CALLS
>>>> #      |          |     |   |                     |   |   | |
>>>>    4447.171333 |     0)               |  cppc_cpufreq_get_rate
>>>> [cppc_cpufreq]() {
>>>>    4447.171334 |     0)               |    cpufreq_cpu_get() {
>>>>    4447.171334 |     0)   1.060 us    | _raw_read_lock_irqsave();
>>>>    4447.171336 |     0)   0.560 us    | _raw_read_unlock_irqrestore();
>>>>    4447.171337 |     0)   3.480 us    | }
>>>>    4447.171338 |     0)   0.400 us    | cpufreq_cpu_put();
>>>>    4447.171338 |     0)               |    cppc_get_perf_ctrs() {
>>>>    4447.171339 |     0)   0.720 us    | cpc_read.isra.0();
>>>>    4447.171341 |     0)   0.700 us    | cpc_read.isra.0();
>>>>    4447.171342 |     0)   0.380 us    | cpc_read.isra.0();
>>>>    4447.171342 |     0)   0.600 us    | cpc_read.isra.0();
>>>>    4447.171343 |     0)   4.900 us    | }
>>>>    4447.171344 |     0)               |    __delay() {
>>>>    4447.171344 |     0)   0.540 us    | arch_timer_evtstrm_available();
>>>>    4447.171346 |     0)   2.420 us    | }
>>>>    4447.171347 |     0)               |    cppc_get_perf_ctrs() {
>>>>    4447.171347 |     0)   0.540 us    | cpc_read.isra.0();
>>>>    4447.171348 |     0)   0.520 us    | cpc_read.isra.0();
>>>>    4447.171349 |     0)   0.400 us    | cpc_read.isra.0();
>>>>    4447.171350 |     0)   0.440 us    | cpc_read.isra.0();
>>>>    4447.171350 |     0)   3.660 us    | }
>>>>    4447.171351 |     0)               |    __delay() {
>>>>    4447.171351 |     0)   0.400 us    | arch_timer_evtstrm_available();
>>>>    4447.171353 |     0)   2.400 us    | }
>>>>    4447.171353 |     0)               |    cppc_get_perf_ctrs() {
>>>>    4447.171354 |     0)   0.540 us    | cpc_read.isra.0();
>>>>    4447.171355 |     0)   0.540 us    | cpc_read.isra.0();
>>>>    4447.171356 |     0)   0.380 us    | cpc_read.isra.0();
>>>>    4447.171356 |     0)   0.420 us    | cpc_read.isra.0();
>>>>    4447.171357 |     0)   3.640 us    | }
>>>>    4447.171357 |     0)               |    __delay() {
>>>>    4447.171358 |     0)   0.380 us    | arch_timer_evtstrm_available();
>>>>    4447.171360 |     0)   2.380 us    |    }
>>>>    4447.171360 |     0)               |    cppc_get_perf_ctrs() {
>>>>    4447.171361 |     0)   0.520 us    |      cpc_read.isra.0();
>>>>    4447.171361 |     0)   0.520 us    |      cpc_read.isra.0();
>>>>    4447.171362 |     0)   0.400 us    |      cpc_read.isra.0();
>>>>    4447.171363 |     0)   0.440 us    |      cpc_read.isra.0();
>>>>    4447.171364 |     0)   3.640 us    |    }
>>>>    4447.171364 |     0)   0.520 us    | cppc_cpufreq_perf_to_khz
>>>> [cppc_cpufreq]();
>>>>    4447.171365 |     0) + 34.240 us   |  }
>>>>
>>>> It also shows the first reading typically took longer than the later
>>>> ones. The second, the third and the fourth actually took almost same
>>>> time.
>>>>
> Do you happen to know if this is still the case if the delay in between
> the two cppc_get_perf_ctrs() calls is larger (a few ms)?

Do you mean whether the first reading takes longer if larger delay is
used? If so, then yes. It should be not related to the delay.

>>>> I also tried to read perf_fb_ctrs_t0 twice (so total 3 reads, 2 for t0,
>>>> 1 for t1, 2us delay between each read), but I didn't see noticeable
>>>> improvement. 4 reads (2 for t0, 2 for t1) does show some noticeable
>>>> improvement.
>>>>
> I'm not sure I understand this. t0, t1? Can you expand on the sequence
> of reads here? Are you reading the same counter multiple times instead
> of reading all counters multiple times?

t0 means perf_fb_ctrs_t0, t1 means perf_fb_ctrs_t1. The above experiment
actually does 4 reads (2 for perf_fb_ctrs_t0, 2 for perf_fb_ctrs_t1),
but just use the 2nd (perf_fb_ctrs_t0) result and the 4th
(perf_fb_ctrs_t1) result to do the calculation.

>>> Thanks for the new data.
>>>
>>>>> Example of code where we do 4 calls to cppc_get_perf_ctrs():
>>>>>
>>>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c
>>>>> b/drivers/cpufreq/cppc_cpufreq.c
>>>>> index 022e3555407c..6370f2f0bdad 100644
>>>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>>>> @@ -853,6 +853,20 @@ static unsigned int
>>>>> cppc_cpufreq_get_rate(unsigned int cpu)
>>>>>
>>>>>          udelay(2); /* 2usec delay between sampling */
>>>>>
>>>>> +       ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>>>>> +       if (ret)
>>>>> +               return ret;
>>>>> +
>>>>> +       udelay(2); /* 2usec delay between sampling */
>>>>> +
>>>>> +       /* Do a third call. */
>>>>> +       ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>>>>> +       if (ret)
>>>>> +               return ret;
>>>>> +
>>>>> +       udelay(2); /* 2usec delay between sampling */
>>>>> +
>>>>> +       /* Do a fourth call. */
>>>>>          ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>>>>>          if (ret)
>>>>>                  return ret;
>>>>>
>>>>>> The above trace data shows some cpc reading took a little bit longer
>>>>>> than usual. I suspected it was caused by interconnect congestion.
>>>>>>
>>>>>>
>>>>>> So it looks like IRQ is the major contributing factor of high error
>>>>>> (4xxxxxx KHz), interconnect congestion is the major
>>>>>> contributing factor
>>>>>> of low error (3xxxxxx KHz).
>>>>>>
>>>>>> So I did the below test:
>>>>>> 1. Disable IRQ: The high errors were gone (> 3700000KHz),
>>>>>> but low errors
>>>>>> were still seen.
>>>>>> 2.10us delay: The high errors were still seen.
>>>>>> 3. Disable IRQ + 10us delay: all the errors were gone.
>>>>>>
>>>>>> I think the test result also supports the tracing data.
>>>>>>
>>>>>>
>>>>>> I also got some confusion about calling
>>>>>> cppc_cpufreq_get_rate() with irq
>>>>>> disabled. Rafael thought 10ms delay is too long because the
>>>>>> function may
>>>>>> be called with irq disabled. But a deeper look at the
>>>>>> function shows it
>>>>>> should *NOT* be called with irq disabled at all.
>>>>>>
>>>>>> First, if pcc channel is used, cpc reading may take over 100ms, it is
>>>>>> way larger the proposed 10ms delay.
>>>>>> Second, reading from cpc channel needs to take a semaphore, so it may
>>>>>> sleep. But sleep with IRQ disabled is not allowed.
>>>>> Yes right, however the semaphore is not taken in between the
>>>>> sequence of
>>>>> cpc_read() calls in cppc_get_perf_ctrs(). So maybe the change below
>>>>> should
>>>>> be acceptable:
>>>> Yeah, we should be able to find a smaller irq disable section.
>>>>
>>>>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>>>>> index c51d3ccb4cca..105a7e2ffffa 100644
>>>>> --- a/drivers/acpi/cppc_acpi.c
>>>>> +++ b/drivers/acpi/cppc_acpi.c
>>>>> @@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct
>>>>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>>>          struct cppc_pcc_data *pcc_ss_data = NULL;
>>>>>          u64 delivered, reference, ref_perf, ctr_wrap_time;
>>>>>          int ret = 0, regs_in_pcc = 0;
>>>>> +       unsigned long flags;
>>>>>
>>>>>          if (!cpc_desc) {
>>>>>                  pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
>>>>> @@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct
>>>>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>>>                  }
>>>>>          }
>>>>>
>>>>> +       local_irq_save(flags);
>>>>> +
>>>>>          cpc_read(cpunum, delivered_reg, &delivered);
>>>>>          cpc_read(cpunum, reference_reg, &reference);
>>>>>          cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>>>>
>>>>> +       local_irq_restore(flags);
>>>>> +
>>>> cpc_read_ffh() would return -EPERM if irq is disabled.
>>>>
>>>> So, the irq disabling must happen for mmio only in cpc_read(), for
>>>> example:
>>> I thought the issue was that irqs could happen in between cpc_read()
>>> functions,
>>> the patch below would not cover it. If the frequency is more accurate
>>> with this patch, I think I don't understand something.
>> Yeah, you are correct. The irq disabling window has to cover all the
>> cpc_read(). I didn't test with this patch. My test was done conceptually
>> with:
>>
>> disable irq
>> cppc_get_perf_ctrs(t0)
>> udelay(2)
>> cppc_get_perf_ctrs(t1)
>> enable irq
>>
>> But this will break cpc_read_ffh().
> Can you not disable IRQs in cppc_get_perf_ctrs() only if the registers
> are CPC_IN_SYSTEM_MEMORY? Only spanning the reads of the delivered
> register and the reference register, which should have minimal delay in
> between?
>
> As in:
>
> if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
> CPC_IN_SYSTEM_MEMORY(reference_reg))
> ...
>
> This will and should not affect FFH - the fix for that would have to be
> different.

It won't work, right? The problem is cppc_get_perf_ctrs() calls
cpc_read()s to read delivered and reference respectively, we just can
tell whether they are CPC_IN_SYSTEM_MEMORY in cpc_read() instead of in
cppc_get_perf_ctrs(). So the resulting code should conceptually look like:

disable irq
read delivered
enable irq

disable irq
read reference
enable irq

But there still may be interrupts between the two reads. We actually want:

disable irq
read delivered
read reference
enable irq

>>> (asking for more information)
>>> Just to check, the core/perf counters are AMUs and the other CPPC
>>> registers
>>> are mmio right ? Is it possible to know the CPPC registers that are
>>> implemented on your platform ?
>> AFAIK, the perf counters are implemented by AMU and accessed via mmio, the
>> other CPPC registers are implemented by PCC.
>>
> Usually there is an implementation defined rate of sync between the CPU
> counters and the values available through the memory mapped interface.
> Is it possible that some inaccuracies are introduced by that as well, in
> case there is a sync in between the two counter (delivered and reference)
> reads?

AFAIK, the counter values are available through the memory mapped
interface immediately, there's no coarser rate of synchronization.

>
> Hope it helps,
> Ionela.
>
>>> Also is it possible which platform you are using ?
>> Ampere One
>>
>>>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>>>> index c51d3ccb4cca..f3c92d844074 100644
>>>> --- a/drivers/acpi/cppc_acpi.c
>>>> +++ b/drivers/acpi/cppc_acpi.c
>>>> @@ -982,6 +982,7 @@ static int cpc_read(int cpu, struct
>>>> cpc_register_resource *reg_res, u64 *val)
>>>>           void __iomem *vaddr = NULL;
>>>>           int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpu);
>>>>           struct cpc_reg *reg = &reg_res->cpc_entry.reg;
>>>> +       unsigned long flags;
>>>>
>>>>           if (reg_res->type == ACPI_TYPE_INTEGER) {
>>>>                   *val = reg_res->cpc_entry.int_value;
>>>> @@ -1015,6 +1016,7 @@ static int cpc_read(int cpu, struct
>>>> cpc_register_resource *reg_res, u64 *val)
>>>>                   return
>>>> acpi_os_read_memory((acpi_physical_address)reg->address,
>>>>                                   val, reg->bit_width);
>>>>
>>>> +       local_irq_save(flags);
>>>>           switch (reg->bit_width) {
>>>>           case 8:
>>>>                   *val = readb_relaxed(vaddr);
>>>> @@ -1029,10 +1031,12 @@ static int cpc_read(int cpu, struct
>>>> cpc_register_resource *reg_res, u64 *val)
>>>>                   *val = readq_relaxed(vaddr);
>>>>                   break;
>>>>           default:
>>>> +               local_irq_restore(flags);
>>>>                   pr_debug("Error: Cannot read %u bit width from PCC for
>>>> ss: %d\n",
>>>>                            reg->bit_width, pcc_ss_id);
>>>>                   return -EFAULT;
>>>>           }
>>>> +       local_irq_restore(flags);
>>>>
>>>>           return 0;
>>>>    }
>>>>
>>>>>          /*
>>>>>           * Per spec, if ctr_wrap_time optional register is
>>>>> unsupported, then the
>>>>>           * performance counters are assumed to never wrap during the
>>>>> lifetime of
>>>>>
>>>>>> Third, if the counters are implemented by AMU, cpc_read_ffh() needs to
>>>>>> send IPI so it requires IRQ enabled.
>>>>> If I'm not mistaken, the CPU calling cpc_read_ffh() might have IRQs
>>>>> disabled,
>>>>> it should not prevent it to send IPIs no ?
>>>> It can't work with irq disabled. The comment in counters_read_on_cpu()
>>>> says "Abort call on counterless CPU or when interrupts are disabled -
>>>> can lead to deadlock in smp sync call."
>>>>
>>>>
>>>> And it just returns -EPERM and raise a warning if irq is disabled.
>>> Ok right,
>>>
>>> Regards,
>>> Piere

2023-04-26 19:08:20

by Ionela Voinescu

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error

Hi,

+ Sumit

On Tuesday 25 Apr 2023 at 18:32:55 (-0700), Yang Shi wrote:
>
>
> On 4/24/23 4:44 AM, Ionela Voinescu wrote:
> > Hey,
> >
> > On Thursday 20 Apr 2023 at 13:49:24 (-0700), Yang Shi wrote:
> > > On 4/20/23 9:01 AM, Pierre Gondois wrote:
> > > > > > You say that the cause of this is a congestion in the interconnect. I
> > > > > > don't
> > > > > > see a way to check that right now.
> > > > > > However your trace is on the CPU0, so maybe all the other cores were
> > > > > > shutdown
> > > > > > in your test. If this is the case, do you think a congestion could
> > > > > > happen with
> > > > > > only one CPU ?
> > > > > No, other CPUs were not shut down in my test. I just ran "yes" on all
> > > > > cores except CPU 0, then ran the reading freq script. Since all other
> > > > > cores are busy, so the script should be always running on CPU 0.
> > > > >
> > > > > Since the counters, memory and other devices are on the interconnect, so
> > > > > the congestion may be caused by plenty of factors IIUC.
> > > > +Ionela
> > > >
> > > > Ionela pointed me to the following patch-set, which seems realated:
> > > > https://lore.kernel.org/all/[email protected]/
> > > Thanks for the information. I think we do have the similar syndrome. But I'm
> > > not sure how their counters are implemented, we may not have similar root
> > > cause.
> > Yes, my bad, I did not get the chance to read this full thread before
> > talking with Pierre. In your case you have AMUs accessed via MMIO and in that
> > case they are accessed though FFH (IPIs and system registers). The root
> > cause is indeed different.
>
> Yeah, but it seems like using larger delay could mitigate both issues.

Yes, there is a minimum delay required there of 25us due to the way that
the counters accumulate, which is not too bad even with interrupts
disabled (to cater to cpufreq_quick_get()).

>
[..]
> > > > > Yeah, we should be able to find a smaller irq disable section.
> > > > >
> > > > > > diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
> > > > > > index c51d3ccb4cca..105a7e2ffffa 100644
> > > > > > --- a/drivers/acpi/cppc_acpi.c
> > > > > > +++ b/drivers/acpi/cppc_acpi.c
> > > > > > @@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct
> > > > > > cppc_perf_fb_ctrs *perf_fb_ctrs)
> > > > > >          struct cppc_pcc_data *pcc_ss_data = NULL;
> > > > > >          u64 delivered, reference, ref_perf, ctr_wrap_time;
> > > > > >          int ret = 0, regs_in_pcc = 0;
> > > > > > +       unsigned long flags;
> > > > > >
> > > > > >          if (!cpc_desc) {
> > > > > >                  pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
> > > > > > @@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct
> > > > > > cppc_perf_fb_ctrs *perf_fb_ctrs)
> > > > > >                  }
> > > > > >          }
> > > > > >
> > > > > > +       local_irq_save(flags);
> > > > > > +
> > > > > >          cpc_read(cpunum, delivered_reg, &delivered);
> > > > > >          cpc_read(cpunum, reference_reg, &reference);
> > > > > >          cpc_read(cpunum, ref_perf_reg, &ref_perf);
> > > > > >
> > > > > > +       local_irq_restore(flags);
> > > > > > +
> > > > > cpc_read_ffh() would return -EPERM if irq is disabled.
> > > > >
> > > > > So, the irq disabling must happen for mmio only in cpc_read(), for
> > > > > example:
> > > > I thought the issue was that irqs could happen in between cpc_read()
> > > > functions,
> > > > the patch below would not cover it. If the frequency is more accurate
> > > > with this patch, I think I don't understand something.
> > > Yeah, you are correct. The irq disabling window has to cover all the
> > > cpc_read(). I didn't test with this patch. My test was done conceptually
> > > with:
> > >
> > > disable irq
> > > cppc_get_perf_ctrs(t0)
> > > udelay(2)
> > > cppc_get_perf_ctrs(t1)
> > > enable irq
> > >
> > > But this will break cpc_read_ffh().
> > Can you not disable IRQs in cppc_get_perf_ctrs() only if the registers
> > are CPC_IN_SYSTEM_MEMORY? Only spanning the reads of the delivered
> > register and the reference register, which should have minimal delay in
> > between?
> >
> > As in:
> >
> > if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
> > CPC_IN_SYSTEM_MEMORY(reference_reg))
> > ...
> >
> > This will and should not affect FFH - the fix for that would have to be
> > different.
>
> It won't work, right? The problem is cppc_get_perf_ctrs() calls cpc_read()s
> to read delivered and reference respectively, we just can tell whether they
> are CPC_IN_SYSTEM_MEMORY in cpc_read() instead of in cppc_get_perf_ctrs().
> So the resulting code should conceptually look like:
>
> disable irq
> read delivered
> enable irq
>
> disable irq
> read reference
> enable irq
>
> But there still may be interrupts between the two reads. We actually want:
>
> disable irq
> read delivered
> read reference
> enable irq

Yes, this is what I was suggesting above.

I've hacked up the following code. It covers the FFH case as well, with a
modified solution that Sumit proposed on the other thread:

diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
index 0f17b1c32718..7e828aed3693 100644
--- a/drivers/acpi/cppc_acpi.c
+++ b/drivers/acpi/cppc_acpi.c
@@ -110,6 +110,11 @@ static DEFINE_PER_CPU(struct cpc_desc *, cpc_desc_ptr);
(cpc)->cpc_entry.reg.space_id == \
ACPI_ADR_SPACE_SYSTEM_IO)

+/* Check if a CPC register is in FFH */
+#define CPC_IN_FFH(cpc) ((cpc)->type == ACPI_TYPE_BUFFER && \
+ (cpc)->cpc_entry.reg.space_id == \
+ ACPI_ADR_SPACE_FIXED_HARDWARE)
+
/* Evaluates to True if reg is a NULL register descriptor */
#define IS_NULL_REG(reg) ((reg)->space_id == ACPI_ADR_SPACE_SYSTEM_MEMORY && \
(reg)->address == 0 && \
@@ -1292,6 +1297,24 @@ EXPORT_SYMBOL_GPL(cppc_perf_ctrs_in_pcc);
*
* Return: 0 for success with perf_fb_ctrs populated else -ERRNO.
*/
+
+struct cycle_counters {
+ int cpunum;
+ struct cpc_register_resource *delivered_reg;
+ struct cpc_register_resource *reference_reg;
+ u64 *delivered;
+ u64 *reference;
+};
+
+static int cppc_get_cycle_ctrs(void *cycle_ctrs) {
+ struct cycle_counters *ctrs = cycle_ctrs;
+
+ cpc_read(ctrs->cpunum, ctrs->delivered_reg, ctrs->delivered);
+ cpc_read(ctrs->cpunum, ctrs->reference_reg, ctrs->reference);
+
+ return 0;
+}
+
int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
{
struct cpc_desc *cpc_desc = per_cpu(cpc_desc_ptr, cpunum);
@@ -1300,7 +1323,9 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpunum);
struct cppc_pcc_data *pcc_ss_data = NULL;
u64 delivered, reference, ref_perf, ctr_wrap_time;
+ struct cycle_counters ctrs = {0};
int ret = 0, regs_in_pcc = 0;
+ unsigned long flags;

if (!cpc_desc) {
pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
@@ -1336,8 +1361,25 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
}
}

- cpc_read(cpunum, delivered_reg, &delivered);
- cpc_read(cpunum, reference_reg, &reference);
+ ctrs.cpunum = cpunum;
+ ctrs.delivered_reg = delivered_reg;
+ ctrs.reference_reg = reference_reg;
+ ctrs.delivered = &delivered;
+ ctrs.reference = &reference;
+
+ if (CPC_IN_FFH(delivered_reg) && CPC_IN_FFH(reference_reg)) {
+ ret = smp_call_on_cpu(cpunum, cppc_get_cycle_ctrs, &ctrs, false);
+ } else {
+ if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
+ CPC_IN_SYSTEM_MEMORY(reference_reg)) {
+ local_irq_save(flags);
+ cppc_get_cycle_ctrs(&ctrs);
+ local_irq_restore(flags);
+ } else {
+ cppc_get_cycle_ctrs(&ctrs);
+ }
+ }
+
cpc_read(cpunum, ref_perf_reg, &ref_perf);

/*

I've only tested this on a model using FFH, with 10us delay, and it
worked well for me. Yang, Sumit, could you give it a try?

Even with a solution like the above (more refined, of course) there is an
additional improvement possible: we can implement arch_freq_get_on_cpu()
for arm64 systems that will use cached (on the tick) AMU cycle counter
samples and have this function called from show_cpuinfo_cur_freq()
before/instead of calling the .get() function. But this will only help
arm64 systems with AMUs accessible though system registers. We'll try to
submit patches on this soon. But as I mentioned to Sumit on the other
thread, the returned frequency value from this will be an average over 4ms
(with CONFIG_HZ=250) and could be up to 4ms old (more than that only if the
CPU was idle/isolated).

Thanks,
Ionela.

2023-04-27 20:42:34

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error



On 4/26/23 12:01 PM, Ionela Voinescu wrote:
> Hi,
>
> + Sumit
>
> On Tuesday 25 Apr 2023 at 18:32:55 (-0700), Yang Shi wrote:
>>
>> On 4/24/23 4:44 AM, Ionela Voinescu wrote:
>>> Hey,
>>>
>>> On Thursday 20 Apr 2023 at 13:49:24 (-0700), Yang Shi wrote:
>>>> On 4/20/23 9:01 AM, Pierre Gondois wrote:
>>>>>>> You say that the cause of this is a congestion in the interconnect. I
>>>>>>> don't
>>>>>>> see a way to check that right now.
>>>>>>> However your trace is on the CPU0, so maybe all the other cores were
>>>>>>> shutdown
>>>>>>> in your test. If this is the case, do you think a congestion could
>>>>>>> happen with
>>>>>>> only one CPU ?
>>>>>> No, other CPUs were not shut down in my test. I just ran "yes" on all
>>>>>> cores except CPU 0, then ran the reading freq script. Since all other
>>>>>> cores are busy, so the script should be always running on CPU 0.
>>>>>>
>>>>>> Since the counters, memory and other devices are on the interconnect, so
>>>>>> the congestion may be caused by plenty of factors IIUC.
>>>>> +Ionela
>>>>>
>>>>> Ionela pointed me to the following patch-set, which seems realated:
>>>>> https://lore.kernel.org/all/[email protected]/
>>>> Thanks for the information. I think we do have the similar syndrome. But I'm
>>>> not sure how their counters are implemented, we may not have similar root
>>>> cause.
>>> Yes, my bad, I did not get the chance to read this full thread before
>>> talking with Pierre. In your case you have AMUs accessed via MMIO and in that
>>> case they are accessed though FFH (IPIs and system registers). The root
>>> cause is indeed different.
>> Yeah, but it seems like using larger delay could mitigate both issues.
> Yes, there is a minimum delay required there of 25us due to the way that
> the counters accumulate, which is not too bad even with interrupts
> disabled (to cater to cpufreq_quick_get()).
>
> [..]
>>>>>> Yeah, we should be able to find a smaller irq disable section.
>>>>>>
>>>>>>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>>>>>>> index c51d3ccb4cca..105a7e2ffffa 100644
>>>>>>> --- a/drivers/acpi/cppc_acpi.c
>>>>>>> +++ b/drivers/acpi/cppc_acpi.c
>>>>>>> @@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct
>>>>>>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>>>>>          struct cppc_pcc_data *pcc_ss_data = NULL;
>>>>>>>          u64 delivered, reference, ref_perf, ctr_wrap_time;
>>>>>>>          int ret = 0, regs_in_pcc = 0;
>>>>>>> +       unsigned long flags;
>>>>>>>
>>>>>>>          if (!cpc_desc) {
>>>>>>>                  pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
>>>>>>> @@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct
>>>>>>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>>>>>                  }
>>>>>>>          }
>>>>>>>
>>>>>>> +       local_irq_save(flags);
>>>>>>> +
>>>>>>>          cpc_read(cpunum, delivered_reg, &delivered);
>>>>>>>          cpc_read(cpunum, reference_reg, &reference);
>>>>>>>          cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>>>>>>
>>>>>>> +       local_irq_restore(flags);
>>>>>>> +
>>>>>> cpc_read_ffh() would return -EPERM if irq is disabled.
>>>>>>
>>>>>> So, the irq disabling must happen for mmio only in cpc_read(), for
>>>>>> example:
>>>>> I thought the issue was that irqs could happen in between cpc_read()
>>>>> functions,
>>>>> the patch below would not cover it. If the frequency is more accurate
>>>>> with this patch, I think I don't understand something.
>>>> Yeah, you are correct. The irq disabling window has to cover all the
>>>> cpc_read(). I didn't test with this patch. My test was done conceptually
>>>> with:
>>>>
>>>> disable irq
>>>> cppc_get_perf_ctrs(t0)
>>>> udelay(2)
>>>> cppc_get_perf_ctrs(t1)
>>>> enable irq
>>>>
>>>> But this will break cpc_read_ffh().
>>> Can you not disable IRQs in cppc_get_perf_ctrs() only if the registers
>>> are CPC_IN_SYSTEM_MEMORY? Only spanning the reads of the delivered
>>> register and the reference register, which should have minimal delay in
>>> between?
>>>
>>> As in:
>>>
>>> if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
>>> CPC_IN_SYSTEM_MEMORY(reference_reg))
>>> ...
>>>
>>> This will and should not affect FFH - the fix for that would have to be
>>> different.
>> It won't work, right? The problem is cppc_get_perf_ctrs() calls cpc_read()s
>> to read delivered and reference respectively, we just can tell whether they
>> are CPC_IN_SYSTEM_MEMORY in cpc_read() instead of in cppc_get_perf_ctrs().
>> So the resulting code should conceptually look like:
>>
>> disable irq
>> read delivered
>> enable irq
>>
>> disable irq
>> read reference
>> enable irq
>>
>> But there still may be interrupts between the two reads. We actually want:
>>
>> disable irq
>> read delivered
>> read reference
>> enable irq
> Yes, this is what I was suggesting above.
>
> I've hacked up the following code. It covers the FFH case as well, with a
> modified solution that Sumit proposed on the other thread:
>
> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
> index 0f17b1c32718..7e828aed3693 100644
> --- a/drivers/acpi/cppc_acpi.c
> +++ b/drivers/acpi/cppc_acpi.c
> @@ -110,6 +110,11 @@ static DEFINE_PER_CPU(struct cpc_desc *, cpc_desc_ptr);
> (cpc)->cpc_entry.reg.space_id == \
> ACPI_ADR_SPACE_SYSTEM_IO)
>
> +/* Check if a CPC register is in FFH */
> +#define CPC_IN_FFH(cpc) ((cpc)->type == ACPI_TYPE_BUFFER && \
> + (cpc)->cpc_entry.reg.space_id == \
> + ACPI_ADR_SPACE_FIXED_HARDWARE)
> +
> /* Evaluates to True if reg is a NULL register descriptor */
> #define IS_NULL_REG(reg) ((reg)->space_id == ACPI_ADR_SPACE_SYSTEM_MEMORY && \
> (reg)->address == 0 && \
> @@ -1292,6 +1297,24 @@ EXPORT_SYMBOL_GPL(cppc_perf_ctrs_in_pcc);
> *
> * Return: 0 for success with perf_fb_ctrs populated else -ERRNO.
> */
> +
> +struct cycle_counters {
> + int cpunum;
> + struct cpc_register_resource *delivered_reg;
> + struct cpc_register_resource *reference_reg;
> + u64 *delivered;
> + u64 *reference;
> +};
> +
> +static int cppc_get_cycle_ctrs(void *cycle_ctrs) {
> + struct cycle_counters *ctrs = cycle_ctrs;
> +
> + cpc_read(ctrs->cpunum, ctrs->delivered_reg, ctrs->delivered);
> + cpc_read(ctrs->cpunum, ctrs->reference_reg, ctrs->reference);
> +
> + return 0;
> +}
> +
> int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
> {
> struct cpc_desc *cpc_desc = per_cpu(cpc_desc_ptr, cpunum);
> @@ -1300,7 +1323,9 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
> int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpunum);
> struct cppc_pcc_data *pcc_ss_data = NULL;
> u64 delivered, reference, ref_perf, ctr_wrap_time;
> + struct cycle_counters ctrs = {0};
> int ret = 0, regs_in_pcc = 0;
> + unsigned long flags;
>
> if (!cpc_desc) {
> pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
> @@ -1336,8 +1361,25 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
> }
> }
>
> - cpc_read(cpunum, delivered_reg, &delivered);
> - cpc_read(cpunum, reference_reg, &reference);
> + ctrs.cpunum = cpunum;
> + ctrs.delivered_reg = delivered_reg;
> + ctrs.reference_reg = reference_reg;
> + ctrs.delivered = &delivered;
> + ctrs.reference = &reference;
> +
> + if (CPC_IN_FFH(delivered_reg) && CPC_IN_FFH(reference_reg)) {
> + ret = smp_call_on_cpu(cpunum, cppc_get_cycle_ctrs, &ctrs, false);
> + } else {
> + if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
> + CPC_IN_SYSTEM_MEMORY(reference_reg)) {
> + local_irq_save(flags);
> + cppc_get_cycle_ctrs(&ctrs);
> + local_irq_restore(flags);
> + } else {
> + cppc_get_cycle_ctrs(&ctrs);
> + }
> + }
> +
> cpc_read(cpunum, ref_perf_reg, &ref_perf);
>
> /*
>
> I've only tested this on a model using FFH, with 10us delay, and it
> worked well for me. Yang, Sumit, could you give it a try?

Thanks for the patch. I tested it with 10us delay, it works well for me.
There was 0 high error in my 3 hours test.

>
> Even with a solution like the above (more refined, of course) there is an
> additional improvement possible: we can implement arch_freq_get_on_cpu()
> for arm64 systems that will use cached (on the tick) AMU cycle counter
> samples and have this function called from show_cpuinfo_cur_freq()
> before/instead of calling the .get() function. But this will only help
> arm64 systems with AMUs accessible though system registers. We'll try to
> submit patches on this soon. But as I mentioned to Sumit on the other
> thread, the returned frequency value from this will be an average over 4ms
> (with CONFIG_HZ=250) and could be up to 4ms old (more than that only if the
> CPU was idle/isolated).
>
> Thanks,
> Ionela.

2023-04-28 11:05:01

by Sumit Gupta

[permalink] [raw]
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error



On 28/04/23 02:10, Yang Shi wrote:
> External email: Use caution opening links or attachments
>
>
> On 4/26/23 12:01 PM, Ionela Voinescu wrote:
>> Hi,
>>
>> + Sumit
>>
>> On Tuesday 25 Apr 2023 at 18:32:55 (-0700), Yang Shi wrote:
>>>
>>> On 4/24/23 4:44 AM, Ionela Voinescu wrote:
>>>> Hey,
>>>>
>>>> On Thursday 20 Apr 2023 at 13:49:24 (-0700), Yang Shi wrote:
>>>>> On 4/20/23 9:01 AM, Pierre Gondois wrote:
>>>>>>>> You say that the cause of this is a congestion in the
>>>>>>>> interconnect. I
>>>>>>>> don't
>>>>>>>> see a way to check that right now.
>>>>>>>> However your trace is on the CPU0, so maybe all the other cores
>>>>>>>> were
>>>>>>>> shutdown
>>>>>>>> in your test. If this is the case, do you think a congestion could
>>>>>>>> happen with
>>>>>>>> only one CPU ?
>>>>>>> No, other CPUs were not shut down in my test. I just ran "yes" on
>>>>>>> all
>>>>>>> cores except CPU 0, then ran the reading freq script. Since all
>>>>>>> other
>>>>>>> cores are busy, so the script should be always running on CPU 0.
>>>>>>>
>>>>>>> Since the counters, memory and other devices are on the
>>>>>>> interconnect, so
>>>>>>> the congestion may be caused by plenty of factors IIUC.
>>>>>> +Ionela
>>>>>>
>>>>>> Ionela pointed me to the following patch-set, which seems realated:
>>>>>> https://lore.kernel.org/all/[email protected]/
>>>>> Thanks for the information. I think we do have the similar
>>>>> syndrome. But I'm
>>>>> not sure how their counters are implemented, we may not have
>>>>> similar root
>>>>> cause.
>>>> Yes, my bad, I did not get the chance to read this full thread before
>>>> talking with Pierre. In your case you have AMUs accessed via MMIO
>>>> and in that
>>>> case they are accessed though FFH (IPIs and system registers). The root
>>>> cause is indeed different.
>>> Yeah, but it seems like using larger delay could mitigate both issues.
>> Yes, there is a minimum delay required there of 25us due to the way that
>> the counters accumulate, which is not too bad even with interrupts
>> disabled (to cater to cpufreq_quick_get()).
>>
>> [..]
>>>>>>> Yeah, we should be able to find a smaller irq disable section.
>>>>>>>
>>>>>>>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>>>>>>>> index c51d3ccb4cca..105a7e2ffffa 100644
>>>>>>>> --- a/drivers/acpi/cppc_acpi.c
>>>>>>>> +++ b/drivers/acpi/cppc_acpi.c
>>>>>>>> @@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct
>>>>>>>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>>>>>>            struct cppc_pcc_data *pcc_ss_data = NULL;
>>>>>>>>            u64 delivered, reference, ref_perf, ctr_wrap_time;
>>>>>>>>            int ret = 0, regs_in_pcc = 0;
>>>>>>>> +       unsigned long flags;
>>>>>>>>
>>>>>>>>            if (!cpc_desc) {
>>>>>>>>                    pr_debug("No CPC descriptor for CPU:%d\n",
>>>>>>>> cpunum);
>>>>>>>> @@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct
>>>>>>>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>>>>>>                    }
>>>>>>>>            }
>>>>>>>>
>>>>>>>> +       local_irq_save(flags);
>>>>>>>> +
>>>>>>>>            cpc_read(cpunum, delivered_reg, &delivered);
>>>>>>>>            cpc_read(cpunum, reference_reg, &reference);
>>>>>>>>            cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>>>>>>>
>>>>>>>> +       local_irq_restore(flags);
>>>>>>>> +
>>>>>>> cpc_read_ffh() would return -EPERM if irq is disabled.
>>>>>>>
>>>>>>> So, the irq disabling must happen for mmio only in cpc_read(), for
>>>>>>> example:
>>>>>> I thought the issue was that irqs could happen in between cpc_read()
>>>>>> functions,
>>>>>> the patch below would not cover it. If the frequency is more accurate
>>>>>> with this patch, I think I don't understand something.
>>>>> Yeah, you are correct. The irq disabling window has to cover all the
>>>>> cpc_read(). I didn't test with this patch. My test was done
>>>>> conceptually
>>>>> with:
>>>>>
>>>>> disable irq
>>>>> cppc_get_perf_ctrs(t0)
>>>>> udelay(2)
>>>>> cppc_get_perf_ctrs(t1)
>>>>> enable irq
>>>>>
>>>>> But this will break cpc_read_ffh().
>>>> Can you not disable IRQs in cppc_get_perf_ctrs() only if the registers
>>>> are CPC_IN_SYSTEM_MEMORY? Only spanning the reads of the delivered
>>>> register and the reference register, which should have minimal delay in
>>>> between?
>>>>
>>>> As in:
>>>>
>>>> if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
>>>>       CPC_IN_SYSTEM_MEMORY(reference_reg))
>>>>     ...
>>>>
>>>> This will and should not affect FFH - the fix for that would have to be
>>>> different.
>>> It won't work, right? The problem is cppc_get_perf_ctrs() calls
>>> cpc_read()s
>>> to read delivered and reference respectively, we just can tell
>>> whether they
>>> are CPC_IN_SYSTEM_MEMORY in cpc_read() instead of in
>>> cppc_get_perf_ctrs().
>>> So the resulting code should conceptually look like:
>>>
>>> disable irq
>>> read delivered
>>> enable irq
>>>
>>> disable irq
>>> read reference
>>> enable irq
>>>
>>> But there still may be interrupts between the two reads. We actually
>>> want:
>>>
>>> disable irq
>>> read delivered
>>> read reference
>>> enable irq
>> Yes, this is what I was suggesting above.
>>
>> I've hacked up the following code. It covers the FFH case as well, with a
>> modified solution that Sumit proposed on the other thread:
>>
>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>> index 0f17b1c32718..7e828aed3693 100644
>> --- a/drivers/acpi/cppc_acpi.c
>> +++ b/drivers/acpi/cppc_acpi.c
>> @@ -110,6 +110,11 @@ static DEFINE_PER_CPU(struct cpc_desc *,
>> cpc_desc_ptr);
>>                               (cpc)->cpc_entry.reg.space_id ==        \
>>                               ACPI_ADR_SPACE_SYSTEM_IO)
>>
>> +/* Check if a CPC register is in FFH */
>> +#define CPC_IN_FFH(cpc) ((cpc)->type == ACPI_TYPE_BUFFER &&  \
>> +                             (cpc)->cpc_entry.reg.space_id ==        \
>> +                             ACPI_ADR_SPACE_FIXED_HARDWARE)
>> +
>>   /* Evaluates to True if reg is a NULL register descriptor */
>>   #define IS_NULL_REG(reg) ((reg)->space_id ==
>> ACPI_ADR_SPACE_SYSTEM_MEMORY && \
>>                               (reg)->address == 0 &&                  \
>> @@ -1292,6 +1297,24 @@ EXPORT_SYMBOL_GPL(cppc_perf_ctrs_in_pcc);
>>    *
>>    * Return: 0 for success with perf_fb_ctrs populated else -ERRNO.
>>    */
>> +
>> +struct cycle_counters {
>> +     int cpunum;
>> +     struct cpc_register_resource *delivered_reg;
>> +     struct cpc_register_resource *reference_reg;
>> +     u64 *delivered;
>> +     u64 *reference;
>> +};
>> +
>> +static int cppc_get_cycle_ctrs(void *cycle_ctrs) {
>> +     struct cycle_counters *ctrs = cycle_ctrs;
>> +
>> +     cpc_read(ctrs->cpunum, ctrs->delivered_reg, ctrs->delivered);
>> +     cpc_read(ctrs->cpunum, ctrs->reference_reg, ctrs->reference);
>> +
>> +     return 0;
>> +}
>> +
>>   int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs
>> *perf_fb_ctrs)
>>   {
>>       struct cpc_desc *cpc_desc = per_cpu(cpc_desc_ptr, cpunum);
>> @@ -1300,7 +1323,9 @@ int cppc_get_perf_ctrs(int cpunum, struct
>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>       int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpunum);
>>       struct cppc_pcc_data *pcc_ss_data = NULL;
>>       u64 delivered, reference, ref_perf, ctr_wrap_time;
>> +     struct cycle_counters ctrs = {0};
>>       int ret = 0, regs_in_pcc = 0;
>> +     unsigned long flags;
>>
>>       if (!cpc_desc) {
>>               pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
>> @@ -1336,8 +1361,25 @@ int cppc_get_perf_ctrs(int cpunum, struct
>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>               }
>>       }
>>
>> -     cpc_read(cpunum, delivered_reg, &delivered);
>> -     cpc_read(cpunum, reference_reg, &reference);
>> +     ctrs.cpunum = cpunum;
>> +     ctrs.delivered_reg = delivered_reg;
>> +     ctrs.reference_reg = reference_reg;
>> +     ctrs.delivered = &delivered;
>> +     ctrs.reference = &reference;
>> +
>> +     if (CPC_IN_FFH(delivered_reg) && CPC_IN_FFH(reference_reg)) {
>> +             ret = smp_call_on_cpu(cpunum, cppc_get_cycle_ctrs,
>> &ctrs, false);
>> +     } else {
>> +             if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
>> +                 CPC_IN_SYSTEM_MEMORY(reference_reg)) {
>> +                     local_irq_save(flags);
>> +                     cppc_get_cycle_ctrs(&ctrs);krish
>> +                     local_irq_restore(flags);
>> +             } else {
>> +                     cppc_get_cycle_ctrs(&ctrs);
>> +             }
>> +     }
>> +
>>       cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>
>>       /*
>>
>> I've only tested this on a model using FFH, with 10us delay, and it
>> worked well for me. Yang, Sumit, could you give it a try?
>
> Thanks for the patch. I tested it with 10us delay, it works well for me.
> There was 0 high error in my 3 hours test.
>
With this change, the delta between the set and get freq has reduced
from ~25% to <10% and occasionally ~15%. Tried with the delay of
10, 25, 50, 75 us.

With the change in [1] & '25us' delay in [2], the delta was <5%.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/

Thank you,
Sumit Gupta

>>
>> Even with a solution like the above (more refined, of course) there is an
>> additional improvement possible: we can implement arch_freq_get_on_cpu()
>> for arm64 systems that will use cached (on the tick) AMU cycle counter
>> samples and have this function called from show_cpuinfo_cur_freq()
>> before/instead of calling the .get() function. But this will only help
>> arm64 systems with AMUs accessible though system registers. We'll try to
>> submit patches on this soon. But as I mentioned to Sumit on the other
>> thread, the returned frequency value from this will be an average over
>> 4ms
>> (with CONFIG_HZ=250) and could be up to 4ms old (more than that only
>> if the
>> CPU was idle/isolated).
>>
>> Thanks,
>> Ionela.
>