In order for the scheduler to be frequency invariant, we measure the
ratio between the maximum cpu frequency and the actual cpu frequency.
During long tickless periods the calculations that keep track of
that ratio might overflow in the function scale_freq_tick():
if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
» goto error;
eventually forcing the kernel to disable the feature with the
message "Scheduler frequency invariance went wobbly, disabling!".
Let's avoid that by detecting long tickless periods and bypassing
the calculation for that tick.
This calculation updates the value of arch_freq_scale, used by the
capacity-aware scheduler to correct cpu duty cycles:
task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) /
max_frequency(cpu))
Consider a long tickless period: it should take about 60 minutes
for a tickless CPU running at 5GHz to trigger the acnt overflow.
In our testing the overflow took over 30 minutes to happen, but
since the exact time is frequency/platform dependent, we pick
10 minutes as the staleness threshold to stay on the safe side.
Fixes: e2b0d619b400 ("x86, sched: check for counters overflow in frequency invariant accounting")
Signed-off-by: Yair Podemsky <[email protected]>
---
arch/x86/kernel/cpu/aperfmperf.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index 1f60a2b27936..dfe356034a60 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -23,6 +23,13 @@
#include "cpu.h"
+/*
+ * Samples older than 10 minutes should not be processed.
+ * This time is long enough to prevent unneeded drops of data,
+ * but short enough to prevent overflows.
+ */
+#define MAX_SAMPLE_AGE_NOHZ ((unsigned long)HZ * 600)
+
struct aperfmperf {
seqcount_t seq;
unsigned long last_update;
@@ -373,6 +380,7 @@ static inline void scale_freq_tick(u64 acnt, u64 mcnt) { }
void arch_scale_freq_tick(void)
{
struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
+ unsigned long last = s->last_update;
u64 acnt, mcnt, aperf, mperf;
if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
@@ -392,7 +400,12 @@ void arch_scale_freq_tick(void)
s->mcnt = mcnt;
raw_write_seqcount_end(&s->seq);
- scale_freq_tick(acnt, mcnt);
+ /*
+ * Avoid calling scale_freq_tick() when the last update was too long ago,
+ * as it might overflow during calculation.
+ */
+ if ((jiffies - last) <= MAX_SAMPLE_AGE_NOHZ)
+ scale_freq_tick(acnt, mcnt);
}
/*
--
2.31.1
Friendly ping?
On Thu, 2022-08-04 at 16:17 +0300, Yair Podemsky wrote:
> In order for the scheduler to be frequency invariant we measure the
> ratio between the maximum cpu frequency and the actual cpu frequency.
> During long tickless periods of time the calculations that keep track
> of that might overflow, in the function scale_freq_tick():
>
> if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
> » goto error;
>
> eventually forcing the kernel to disable the feature with the
> message "Scheduler frequency invariance went wobbly, disabling!".
> Let's avoid that by detecting long tickless periods and bypassing
> the calculation for that tick.
>
> This calculation updates the value of arch_freq_scale, used by the
> capacity-aware scheduler to correct cpu duty cycles:
> task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) /
> max_frequency(cpu))
>
> Consider a long tickless period: it should take about 60 minutes
> for a tickless CPU running at 5GHz to trigger the acnt overflow.
> In our testing the overflow took over 30 minutes to happen, but
> since the exact time is frequency/platform dependent, we pick
> 10 minutes as the staleness threshold to stay on the safe side.
>
> Fixes: e2b0d619b400 ("x86, sched: check for counters overflow in
> frequency invariant accounting")
> Signed-off-by: Yair Podemsky <[email protected]>
> ---
> arch/x86/kernel/cpu/aperfmperf.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/aperfmperf.c
> b/arch/x86/kernel/cpu/aperfmperf.c
> index 1f60a2b27936..dfe356034a60 100644
> --- a/arch/x86/kernel/cpu/aperfmperf.c
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -23,6 +23,13 @@
>
> #include "cpu.h"
>
> +/*
> + * Samples older than 10 minutes should not be processed.
> + * This time is long enough to prevent unneeded drops of data,
> + * but short enough to prevent overflows.
> + */
> +#define MAX_SAMPLE_AGE_NOHZ ((unsigned long)HZ * 600)
> +
> struct aperfmperf {
> seqcount_t seq;
> unsigned long last_update;
> @@ -373,6 +380,7 @@ static inline void scale_freq_tick(u64 acnt, u64
> mcnt) { }
> void arch_scale_freq_tick(void)
> {
> struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
> + unsigned long last = s->last_update;
> u64 acnt, mcnt, aperf, mperf;
>
> if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
> @@ -392,7 +400,12 @@ void arch_scale_freq_tick(void)
> s->mcnt = mcnt;
> raw_write_seqcount_end(&s->seq);
>
> - scale_freq_tick(acnt, mcnt);
> + /*
> + * Avoid calling scale_freq_tick() when the last update was too long ago,
> + * as it might overflow during calculation.
> + */
> + if ((jiffies - last) <= MAX_SAMPLE_AGE_NOHZ)
> + scale_freq_tick(acnt, mcnt);
> }
>
> /*
On Thu, Aug 04, 2022 at 04:17:28PM +0300, Yair Podemsky wrote:
> In order for the scheduler to be frequency invariant we measure the
> ratio between the maximum cpu frequency and the actual cpu frequency.
> During long tickless periods of time the calculations that keep track
> of that might overflow, in the function scale_freq_tick():
>
> if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
> » goto error;
>
> eventually forcing the kernel to disable the feature with the
> message "Scheduler frequency invariance went wobbly, disabling!".
> Let's avoid that by detecting long tickless periods and bypassing
> the calculation for that tick.
>
> This calculation updates the value of arch_freq_scale, used by the
> capacity-aware scheduler to correct cpu duty cycles:
> task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) /
> max_frequency(cpu))
>
> Consider a long tickless period: it should take about 60 minutes
> for a tickless CPU running at 5GHz to trigger the acnt overflow.
> In our testing the overflow took over 30 minutes to happen, but
> since the exact time is frequency/platform dependent, we pick
> 10 minutes as the staleness threshold to stay on the safe side.
>
> Fixes: e2b0d619b400 ("x86, sched: check for counters overflow in frequency invariant accounting")
> Signed-off-by: Yair Podemsky <[email protected]>
> ---
> arch/x86/kernel/cpu/aperfmperf.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
> index 1f60a2b27936..dfe356034a60 100644
> --- a/arch/x86/kernel/cpu/aperfmperf.c
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -23,6 +23,13 @@
>
> #include "cpu.h"
>
> +/*
> + * Samples older than 10 minutes should not be processed.
> + * This time is long enough to prevent unneeded drops of data,
> + * but short enough to prevent overflows.
> + */
> +#define MAX_SAMPLE_AGE_NOHZ ((unsigned long)HZ * 600)
> +
> struct aperfmperf {
> seqcount_t seq;
> unsigned long last_update;
> @@ -373,6 +380,7 @@ static inline void scale_freq_tick(u64 acnt, u64 mcnt) { }
> void arch_scale_freq_tick(void)
> {
> struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
> + unsigned long last = s->last_update;
> u64 acnt, mcnt, aperf, mperf;
>
> if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
> @@ -392,7 +400,12 @@ void arch_scale_freq_tick(void)
> s->mcnt = mcnt;
> raw_write_seqcount_end(&s->seq);
>
> - scale_freq_tick(acnt, mcnt);
> + /*
> + * Avoid calling scale_freq_tick() when the last update was too long ago,
> + * as it might overflow during calculation.
> + */
> + if ((jiffies - last) <= MAX_SAMPLE_AGE_NOHZ)
> + scale_freq_tick(acnt, mcnt);
> }
All this patch does is avoid the warning; but afaict it doesn't make it
behave in a sane way.
I'm thinking that on nohz_full cpus you don't have load balancing, I'm
also thinking that on nohz_full cpus you don't have DVFS.
So *why* the heck are we setting this stuff to random values ? Should
you not simply kill the entire thing for nohz_full cpus?
On 06/09/22 16:54, Peter Zijlstra wrote:
> On Thu, Aug 04, 2022 at 04:17:28PM +0300, Yair Podemsky wrote:
>> @@ -392,7 +400,12 @@ void arch_scale_freq_tick(void)
>> s->mcnt = mcnt;
>> raw_write_seqcount_end(&s->seq);
>>
>> - scale_freq_tick(acnt, mcnt);
>> + /*
>> + * Avoid calling scale_freq_tick() when the last update was too long ago,
>> + * as it might overflow during calulation.
>> + */
>> + if ((jiffies - last) <= MAX_SAMPLE_AGE_NOHZ)
>> + scale_freq_tick(acnt, mcnt);
>> }
>
> All this patch does is avoid the warning; but afaict it doesn't make it
> behave in a sane way.
>
> I'm thinking that on nohz_full cpus you don't have load balancing, I'm
> also thinking that on nohz_full cpus you don't have DVFS.
>
> So *why* the heck are we setting this stuff to random values ? Should
> you not simply kill the entire thing for nohz_full cpus?
IIRC this stems from systems where nohz_full CPUs are not running tickless
at all times (you get transitions to/from latency-sensitive work). Also
from what I've seen isolation is (intentionally) done with just
isolcpus=managed_irq,<nohz_cpus>; there isn't the 'domain' flag so load
balancing isn't permanently disabled.
DVFS is another point, I don't remember seeing cpufreq governor changes in
the transitions, but I wouldn't be surprised if there were - so we'd move
from tickless, no-DVFS to ticking with DVFS (and would like that to behave
"sanely").
FWIW arm64 does something similar in that it just saves the counters but
doesn't update the scale when the delta overflows/wraps around, so that the
next tick can work with a sane delta, cf
arch/arm64/kernel/topology.c::amu_scale_freq_tick()
On Tue, 2022-09-06 at 17:17 +0100, Valentin Schneider wrote:
> On 06/09/22 16:54, Peter Zijlstra wrote:
> > On Thu, Aug 04, 2022 at 04:17:28PM +0300, Yair Podemsky wrote:
> > > @@ -392,7 +400,12 @@ void arch_scale_freq_tick(void)
> > > s->mcnt = mcnt;
> > > raw_write_seqcount_end(&s->seq);
> > >
> > > - scale_freq_tick(acnt, mcnt);
> > > + /*
> > > + * Avoid calling scale_freq_tick() when the last update was too long ago,
> > > + * as it might overflow during calculation.
> > > + */
> > > + if ((jiffies - last) <= MAX_SAMPLE_AGE_NOHZ)
> > > + scale_freq_tick(acnt, mcnt);
> > > }
> >
> > All this patch does is avoid the warning; but afaict it doesn't
> > make it
> > behave in a sane way.
It also avoids the disabling of frequency invariance accounting for
all cpus, which occurs immediately after the warning.
That is the bug being solved here, since it also affects
non-tickless cpus.
> >
> > I'm thinking that on nohz_full cpus you don't have load balancing,
> > I'm
> > also thinking that on nohz_full cpus you don't have DVFS.
> >
> > So *why* the heck are we setting this stuff to random values ?
> > Should
> > you not simply kill the entire thing for nohz_full cpus?
>
> IIRC this stems from systems where nohz_full CPUs are not running
> tickless
> at all times (you get transitions to/from latency-sensitive work).
> Also
> from what I've seen isolation is (intentionally) done with just
> isolcpus=managed_irq,<nohz_cpus>; there isn't the 'domain' flag so
> load
> balancing isn't permanently disabled.
>
> DVFS is another point, I don't remember seeing cpufreq governor
> changes in
> the transitions, but I wouldn't be surprised if there were - so we'd
> move
> from tickless, no-DVFS to ticking with DVFS (and would like that to
> behave
> "sanely").
>
> FWIW arm64 does something similar in that it just saves the counters
> but
> doesn't update the scale when the delta overflows/wraps around, so
> that the
> next tick can work with a sane delta, cf
>
> arch/arm64/kernel/topology.c::amu_scale_freq_tick()
>
Friendly ping?
On Wed, 2022-10-19 at 14:31 +0300, [email protected] wrote:
> On Tue, 2022-09-06 at 17:17 +0100, Valentin Schneider wrote:
> > On 06/09/22 16:54, Peter Zijlstra wrote:
> > > On Thu, Aug 04, 2022 at 04:17:28PM +0300, Yair Podemsky wrote:
> > > > @@ -392,7 +400,12 @@ void arch_scale_freq_tick(void)
> > > > s->mcnt = mcnt;
> > > > raw_write_seqcount_end(&s->seq);
> > > >
> > > > - scale_freq_tick(acnt, mcnt);
> > > > + /*
> > > > + * Avoid calling scale_freq_tick() when the last update was too long ago,
> > > > + * as it might overflow during calculation.
> > > > + */
> > > > + if ((jiffies - last) <= MAX_SAMPLE_AGE_NOHZ)
> > > > + scale_freq_tick(acnt, mcnt);
> > > > }
> > >
> > > All this patch does is avoid the warning; but afaict it doesn't
> > > make it
> > > behave in a sane way.
>
> It also avoids the disabling of frequency invariance accounting for
> all cpus, which occurs immediately after the warning.
> That is the bug being solved here, since it also affects
> non-tickless cpus.
>
> > > I'm thinking that on nohz_full cpus you don't have load
> > > balancing,
> > > I'm
> > > also thinking that on nohz_full cpus you don't have DVFS.
> > >
> > > So *why* the heck are we setting this stuff to random values ?
> > > Should
> > > you not simply kill the entire thing for nohz_full cpus?
> >
> > IIRC this stems from systems where nohz_full CPUs are not running
> > tickless
> > at all times (you get transitions to/from latency-sensitive work).
> > Also
> > from what I've seen isolation is (intentionally) done with just
> > isolcpus=managed_irq,<nohz_cpus>; there isn't the 'domain' flag so
> > load
> > balancing isn't permanently disabled.
> >
> > DVFS is another point, I don't remember seeing cpufreq governor
> > changes in
> > the transitions, but I wouldn't be surprised if there were - so we'd
> > move
> > from tickless, no-DVFS to ticking with DVFS (and would like that to
> > behave
> > "sanely").
> >
> > FWIW arm64 does something similar in that it just saves the
> > counters
> > but
> > doesn't update the scale when the delta overflows/wraps around, so
> > that the
> > next tick can work with a sane delta, cf
> >
> > arch/arm64/kernel/topology.c::amu_scale_freq_tick()
> >