Under certain extreme conditions, the tick-based cputime accounting may
produce inaccurate data. For instance, guest CPU usage is sensitive to
interrupts firing right before the tick's expiration. This forces the
guest into kernel context, and has that time slice wrongly accounted as
system time. This issue is exacerbated if the interrupt source is in
sync with the tick, significantly skewing usage metrics towards system
time.
On CPUs with full dynticks enabled, cputime accounting leverages the
context tracking subsystem to measure usage, and isn't susceptible to
this sort of race condition. However, this imposes a bigger overhead,
including additional accounting and the extra dyntick tracking during
user<->kernel<->guest transitions (RmW + mb).
So, in order to get the best of both worlds, introduce a cputime
configuration option that allows using the full dynticks accounting
scheme on NOHZ & NOHZ_IDLE CPUs, while avoiding the expensive
user<->kernel<->guest dyntick transitions.
Signed-off-by: Nicolas Saenz Julienne <[email protected]>
Signed-off-by: Jack Allister <[email protected]>
---
NOTE: This wasn't tested in depth, and it's mostly intended to highlight
the issue we're trying to solve. Also ccing KVM folks, since it's
relevant to guest CPU usage accounting.
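For anyone wanting to reproduce or experiment, a .config fragment along
these lines should exercise the new path (assuming the target
architecture provides HAVE_CONTEXT_TRACKING_USER and
HAVE_VIRT_CPU_ACCOUNTING_GEN):

  CONFIG_VIRT_CPU_ACCOUNTING_GEN_FORCE=y
  # Selected implicitly, per the Kconfig hunk below:
  # CONFIG_VIRT_CPU_ACCOUNTING=y
  # CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
  # CONFIG_CONTEXT_TRACKING_USER=y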
include/linux/context_tracking.h | 4 ++--
include/linux/vtime.h | 6 ++++--
init/Kconfig | 24 +++++++++++++++++++++++-
kernel/context_tracking.c | 25 ++++++++++++++++++++++---
4 files changed, 51 insertions(+), 8 deletions(-)
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 6e76b9dba00e7..dd9b500359aa6 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -102,11 +102,11 @@ static __always_inline void context_tracking_guest_exit(void) { }
#define CT_WARN_ON(cond) do { } while (0)
#endif /* !CONFIG_CONTEXT_TRACKING_USER */
-#ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
+#if defined(CONFIG_CONTEXT_TRACKING_USER_FORCE) || defined(CONFIG_VIRT_CPU_ACCOUNTING_GEN_FORCE)
extern void context_tracking_init(void);
#else
static inline void context_tracking_init(void) { }
-#endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
+#endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE || CONFIG_VIRT_CPU_ACCOUNTING_GEN_FORCE */
#ifdef CONFIG_CONTEXT_TRACKING_IDLE
extern void ct_idle_enter(void);
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 3684487d01e1c..d78d01eead6e9 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -79,12 +79,14 @@ static inline bool vtime_accounting_enabled(void)
static inline bool vtime_accounting_enabled_cpu(int cpu)
{
- return context_tracking_enabled_cpu(cpu);
+ return IS_ENABLED(CONFIG_VIRT_CPU_ACCOUNTING_GEN_FORCE) ||
+ context_tracking_enabled_cpu(cpu);
}
static inline bool vtime_accounting_enabled_this_cpu(void)
{
- return context_tracking_enabled_this_cpu();
+ return IS_ENABLED(CONFIG_VIRT_CPU_ACCOUNTING_GEN_FORCE) ||
+ context_tracking_enabled_this_cpu();
}
extern void vtime_task_switch_generic(struct task_struct *prev);
diff --git a/init/Kconfig b/init/Kconfig
index 9ffb103fc927b..86877e1f416fc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -473,6 +473,9 @@ menu "CPU/Task time and stats accounting"
config VIRT_CPU_ACCOUNTING
bool
+config VIRT_CPU_ACCOUNTING_GEN
+ bool
+
choice
prompt "Cputime accounting"
default TICK_CPU_ACCOUNTING
@@ -501,12 +504,13 @@ config VIRT_CPU_ACCOUNTING_NATIVE
this also enables accounting of stolen time on logically-partitioned
systems.
-config VIRT_CPU_ACCOUNTING_GEN
+config VIRT_CPU_ACCOUNTING_DYNTICKS
bool "Full dynticks CPU time accounting"
depends on HAVE_CONTEXT_TRACKING_USER
depends on HAVE_VIRT_CPU_ACCOUNTING_GEN
depends on GENERIC_CLOCKEVENTS
select VIRT_CPU_ACCOUNTING
+ select VIRT_CPU_ACCOUNTING_GEN
select CONTEXT_TRACKING_USER
help
Select this option to enable task and CPU time accounting on full
@@ -520,8 +524,26 @@ config VIRT_CPU_ACCOUNTING_GEN
If unsure, say N.
+config VIRT_CPU_ACCOUNTING_GEN_FORCE
+ bool "Force full dynticks CPU time accounting"
+ depends on HAVE_CONTEXT_TRACKING_USER
+ depends on HAVE_VIRT_CPU_ACCOUNTING_GEN
+ depends on GENERIC_CLOCKEVENTS
+ select VIRT_CPU_ACCOUNTING
+ select VIRT_CPU_ACCOUNTING_GEN
+ select CONTEXT_TRACKING_USER
+ help
+ Select this option to forcibly enable the full dynticks CPU time
+ accounting. This accounting is implemented by watching every
+ kernel-user boundary using the context tracking subsystem. The
+ accounting is thus performed at the expense of some overhead, but is
+ more precise than tick based CPU accounting.
+
+ If unsure, say N.
+
endchoice
+
config IRQ_TIME_ACCOUNTING
bool "Fine granularity task level IRQ time accounting"
depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 6ef0b35fc28c5..f70949430cf11 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -537,6 +537,13 @@ void noinstr __ct_user_enter(enum ctx_state state)
*/
raw_atomic_add(state, &ct->state);
}
+
+ if (IS_ENABLED(CONFIG_VIRT_CPU_ACCOUNTING_GEN_FORCE) &&
+ state == CONTEXT_USER) {
+ instrumentation_begin();
+ vtime_user_enter(current);
+ instrumentation_end();
+ }
}
}
context_tracking_recursion_exit();
@@ -645,6 +652,13 @@ void noinstr __ct_user_exit(enum ctx_state state)
*/
raw_atomic_sub(state, &ct->state);
}
+
+ if (IS_ENABLED(CONFIG_VIRT_CPU_ACCOUNTING_GEN_FORCE) &&
+ state == CONTEXT_USER) {
+ instrumentation_begin();
+ vtime_user_exit(current);
+ instrumentation_end();
+ }
}
}
context_tracking_recursion_exit();
@@ -715,13 +729,18 @@ void __init ct_cpu_track_user(int cpu)
initialized = true;
}
-#ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
+#if defined(CONFIG_CONTEXT_TRACKING_USER_FORCE) || defined(CONFIG_VIRT_CPU_ACCOUNTING_GEN_FORCE)
void __init context_tracking_init(void)
{
int cpu;
- for_each_possible_cpu(cpu)
- ct_cpu_track_user(cpu);
+ if (IS_ENABLED(CONFIG_CONTEXT_TRACKING_USER_FORCE)) {
+ for_each_possible_cpu(cpu)
+ ct_cpu_track_user(cpu);
+ }
+
+ if (IS_ENABLED(CONFIG_VIRT_CPU_ACCOUNTING_GEN_FORCE))
+ static_branch_inc(&context_tracking_key);
}
#endif
--
2.40.1
Hi Sean,
On Tue Feb 20, 2024 at 4:18 PM UTC, Sean Christopherson wrote:
> On Mon, Feb 19, 2024, Nicolas Saenz Julienne wrote:
> > Under certain extreme conditions, the tick-based cputime accounting may
> > produce inaccurate data. For instance, guest CPU usage is sensitive to
> > interrupts firing right before the tick's expiration. This forces the
> > guest into kernel context, and has that time slice wrongly accounted as
> > system time. This issue is exacerbated if the interrupt source is in
> > sync with the tick, significantly skewing usage metrics towards system
> > time.
>
> ...
>
> > NOTE: This wasn't tested in depth, and it's mostly intended to highlight
> > the issue we're trying to solve. Also ccing KVM folks, since it's
> > relevant to guest CPU usage accounting.
>
> How bad is the synchronization issue on upstream kernels? We tried to address
> that in commit 160457140187 ("KVM: x86: Defer vtime accounting 'til after IRQ handling").
>
> I don't expect it to be foolproof, but it'd be good to know if there's a blatant
> flaw and/or easily closed hole.
The issue is not really about the interrupts themselves, but their side
effects.
For instance, let's say the guest sets up a Hyper-V stimer that
consistently fires 1 us before the preemption tick. The preemption tick
will expire while the vCPU thread is running with !PF_VCPU (maybe inside
kvm_hv_process_stimers(), for example). As long as they both keep in sync,
you'll get 100% system usage. I was able to reproduce this one through
kvm-unit-tests, but the race window is too small to keep the interrupts
in sync for long periods of time, yet it's still capable of producing
random system usage bursts (which is unacceptable for some use-cases).
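To make the mechanics concrete, here's a rough paraphrase of the
tick-based attribution logic (modeled on kernel/sched/cputime.c, heavily
simplified): the whole tick period is charged according to where the CPU
happens to be at the instant the tick fires, and the slice only lands in
guest time if PF_VCPU is still set at that moment:

  void account_process_tick(struct task_struct *p, int user_tick)
  {
          u64 cputime = TICK_NSEC;

          if (user_tick)
                  account_user_time(p, cputime);
          else if (p != this_rq()->idle)
                  /* Lands in guest time only if p->flags & PF_VCPU. */
                  account_system_time(p, HARDIRQ_OFFSET, cputime);
          else
                  account_idle_time(cputime);
  }

Since vtime_account_guest_exit() clears PF_VCPU, any tick that fires
between guest exit and re-entry charges the full slice to system time,
no matter how little of it was actually spent in the kernel.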
Other use-cases have bigger race windows and managed to maintain high
system CPU usage over long periods of time. For example, with user-space
HPET emulation, or KVM+Xen (don't know the fine details on these, but
VIRT_CPU_ACCOUNTING_GEN fixes the mis-accounting). It all comes down to
the same situation. Something triggers an exit, and the vCPU thread goes
past 'vtime_account_guest_exit()' just in time for the tick interrupt to
show up.
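Schematically, the race window in all of these cases looks the same:

  PF_VCPU set       PF_VCPU clear                       PF_VCPU set
  ---- guest ----|---- exit + emulation (IRQs on) ----|---- guest ----
                 ^ vtime_account_guest_exit()         ^ vtime_account_guest_enter()
                        ^ host tick fires here => whole slice -> system time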
Note that we're running with 160457140187 ("KVM: x86: Defer vtime
accounting 'til after IRQ handling") on the kernel that reproduced
these issues. The RFC fix was tested against an upstream kernel by
tracing cputime accounting and making sure the right code-paths were
exercised.
Nicolas
On Tue, Feb 20, 2024, Nicolas Saenz Julienne wrote:
> Hi Sean,
>
> On Tue Feb 20, 2024 at 4:18 PM UTC, Sean Christopherson wrote:
> > On Mon, Feb 19, 2024, Nicolas Saenz Julienne wrote:
> > > Under certain extreme conditions, the tick-based cputime accounting may
> > > produce inaccurate data. For instance, guest CPU usage is sensitive to
> > > interrupts firing right before the tick's expiration.
Ah, this confused me. The "right before" is a bit misleading. It's more like
"shortly before", because if the interrupt that occurs due to the guest's tick
arrives _right_ before the host tick expires, then commit 160457140187 should
avoid horrific accounting.
> > > This forces the guest into kernel context, and has that time slice
> > > wrongly accounted as system time. This issue is exacerbated if the
> > > interrupt source is in sync with the tick,
It's worth calling out why this can happen, to make it clear that getting into
such syncopation can happen quite naturally. E.g. something like:
interrupt source is in sync with the tick, e.g. if the guest's tick
is configured to run at the same frequency as the host tick, and the
guest tick is ever so slightly ahead of the host tick.
> > > significantly skewing usage metrics towards system time.
> >
> > ...
> >
> > > NOTE: This wasn't tested in depth, and it's mostly intended to highlight
> > > the issue we're trying to solve. Also ccing KVM folks, since it's
> > > relevant to guest CPU usage accounting.
> >
> > How bad is the synchronization issue on upstream kernels? We tried to address
> > that in commit 160457140187 ("KVM: x86: Defer vtime accounting 'til after IRQ handling").
> >
> > I don't expect it to be foolproof, but it'd be good to know if there's a blatant
> > flaw and/or easily closed hole.
>
> The issue is not really about the interrupts themselves, but their side
> effects.
>
> For instance, let's say the guest sets up a Hyper-V stimer that
> consistently fires 1 us before the preemption tick. The preemption tick
> will expire while the vCPU thread is running with !PF_VCPU (maybe inside
> kvm_hv_process_stimers(), for example). As long as they both keep in sync,
> you'll get 100% system usage. I was able to reproduce this one through
> kvm-unit-tests, but the race window is too small to keep the interrupts
> in sync for long periods of time, yet it's still capable of producing
> random system usage bursts (which is unacceptable for some use-cases).
>
> Other use-cases have bigger race windows and managed to maintain high
> system CPU usage over long periods of time. For example, with user-space
> HPET emulation, or KVM+Xen (don't know the fine details on these, but
> VIRT_CPU_ACCOUNTING_GEN fixes the mis-accounting). It all comes down to
> the same situation. Something triggers an exit, and the vCPU thread goes
> past 'vtime_account_guest_exit()' just in time for the tick interrupt to
> show up.
I suspect the common "problem" with those flows is that emulating the guest timer
interrupt is (a) slow, relatively speaking and (b) done with interrupts enabled.
E.g. on VMX, the TSC deadline timer is emulated via VMX preemption timer, and both
the programming of the guest's TSC deadline timer and the handling of the expiration
interrupt is done in the VM-Exit fastpath with IRQs disabled. As a result, even
if the host tick interrupt is a hair behind the guest tick, it doesn't affect
accounting because the host tick interrupt will never be delivered while KVM is
emulating the guest's periodic tick.
I'm guessing that if you tested on SVM (or a guest that doesn't use the APIC timer
in deadline mode), which doesn't utilize the fastpath since KVM needs to bounce
through hrtimers, then you'd see similar accounting problems even without using
any of the problematic "slow" timer sources.
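For reference, a loose paraphrase of the VMX fastpath dispatch in
question (arch/x86/kvm/vmx/vmx.c; the exact shape varies by kernel
version):

  static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
  {
          switch (to_vmx(vcpu)->exit_reason.basic) {
          case EXIT_REASON_MSR_WRITE:
                  return handle_fastpath_set_msr_irqoff(vcpu);
          case EXIT_REASON_PREEMPTION_TIMER:
                  /* Guest TSC-deadline expiry, emulated with IRQs still off. */
                  return handle_fastpath_preemption_timer(vcpu);
          default:
                  return EXIT_FASTPATH_NONE;
          }
  }

Because this runs before IRQs are re-enabled, a pending host tick can't
land in the middle of the guest-tick emulation.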
On Wed Feb 21, 2024 at 4:24 PM UTC, Sean Christopherson wrote:
> On Tue, Feb 20, 2024, Nicolas Saenz Julienne wrote:
> > Hi Sean,
> >
> > On Tue Feb 20, 2024 at 4:18 PM UTC, Sean Christopherson wrote:
> > > On Mon, Feb 19, 2024, Nicolas Saenz Julienne wrote:
> > > > Under certain extreme conditions, the tick-based cputime accounting may
> > > > produce inaccurate data. For instance, guest CPU usage is sensitive to
> > > > interrupts firing right before the tick's expiration.
>
> Ah, this confused me. The "right before" is a bit misleading. It's more like
> "shortly before", because if the interrupt that occurs due to the guest's tick
> arrives _right_ before the host tick expires, then commit 160457140187 should
> avoid horrific accounting.
>
> > > > This forces the guest into kernel context, and has that time slice
> > > > wrongly accounted as system time. This issue is exacerbated if the
> > > > interrupt source is in sync with the tick,
>
> It's worth calling out why this can happen, to make it clear that getting into
> such syncopation can happen quite naturally. E.g. something like:
>
> interrupt source is in sync with the tick, e.g. if the guest's tick
> is configured to run at the same frequency as the host tick, and the
> guest tick is ever so slightly ahead of the host tick.
I'll incorporate both comments into the description. :)
> > > > significantly skewing usage metrics towards system time.
> > >
> > > ...
> > >
> > > > NOTE: This wasn't tested in depth, and it's mostly intended to highlight
> > > > the issue we're trying to solve. Also ccing KVM folks, since it's
> > > > relevant to guest CPU usage accounting.
> > >
> > > How bad is the synchronization issue on upstream kernels? We tried to address
> > > that in commit 160457140187 ("KVM: x86: Defer vtime accounting 'til after IRQ handling").
> > >
> > > I don't expect it to be foolproof, but it'd be good to know if there's a blatant
> > > flaw and/or easily closed hole.
> >
> > The issue is not really about the interrupts themselves, but their side
> > effects.
> >
> > For instance, let's say the guest sets up a Hyper-V stimer that
> > consistently fires 1 us before the preemption tick. The preemption tick
> > will expire while the vCPU thread is running with !PF_VCPU (maybe inside
> > kvm_hv_process_stimers(), for example). As long as they both keep in sync,
> > you'll get 100% system usage. I was able to reproduce this one through
> > kvm-unit-tests, but the race window is too small to keep the interrupts
> > in sync for long periods of time, yet it's still capable of producing
> > random system usage bursts (which is unacceptable for some use-cases).
> >
> > Other use-cases have bigger race windows and managed to maintain high
> > system CPU usage over long periods of time. For example, with user-space
> > HPET emulation, or KVM+Xen (don't know the fine details on these, but
> > VIRT_CPU_ACCOUNTING_GEN fixes the mis-accounting). It all comes down to
> > the same situation. Something triggers an exit, and the vCPU thread goes
> > past 'vtime_account_guest_exit()' just in time for the tick interrupt to
> > show up.
>
> I suspect the common "problem" with those flows is that emulating the guest timer
> interrupt is (a) slow, relatively speaking and (b) done with interrupts enabled.
>
> E.g. on VMX, the TSC deadline timer is emulated via VMX preemption timer, and both
> the programming of the guest's TSC deadline timer and the handling of the expiration
> interrupt is done in the VM-Exit fastpath with IRQs disabled. As a result, even
> if the host tick interrupt is a hair behind the guest tick, it doesn't affect
> accounting because the host tick interrupt will never be delivered while KVM is
> emulating the guest's periodic tick.
>
> I'm guessing that if you tested on SVM (or a guest that doesn't use the APIC timer
> in deadline mode), which doesn't utilize the fastpath since KVM needs to bounce
> through hrtimers, then you'd see similar accounting problems even without using
> any of the problematic "slow" timer sources.
That's right, the "problem" will show up when periodically emulating
something with interrupts enabled. The slower the emulation, the bigger
the race window. It's just a limitation of tick-based accounting; I have
the feeling there isn't much KVM can do.
Nicolas
Hi Frederic,
On Mon Feb 19, 2024 at 5:57 PM UTC, Nicolas Saenz Julienne wrote:
> Under certain extreme conditions, the tick-based cputime accounting may
> produce inaccurate data. For instance, guest CPU usage is sensitive to
> interrupts firing right before the tick's expiration. This forces the
> guest into kernel context, and has that time slice wrongly accounted as
> system time. This issue is exacerbated if the interrupt source is in
> sync with the tick, significantly skewing usage metrics towards system
> time.
>
> On CPUs with full dynticks enabled, cputime accounting leverages the
> context tracking subsystem to measure usage, and isn't susceptible to
> this sort of race condition. However, this imposes a bigger overhead,
> including additional accounting and the extra dyntick tracking during
> user<->kernel<->guest transitions (RmW + mb).
>
> So, in order to get the best of both worlds, introduce a cputime
> configuration option that allows using the full dynticks accounting
> scheme on NOHZ & NOHZ_IDLE CPUs, while avoiding the expensive
> user<->kernel<->guest dyntick transitions.
>
> Signed-off-by: Nicolas Saenz Julienne <[email protected]>
> Signed-off-by: Jack Allister <[email protected]>
> ---
Would you be opposed to introducing a config option like this? Any
alternatives you might have in mind?
Nicolas
On Mon, Mar 11, 2024 at 05:15:26PM +0000, Nicolas Saenz Julienne wrote:
> Hi Frederic,
>
> On Mon Feb 19, 2024 at 5:57 PM UTC, Nicolas Saenz Julienne wrote:
> > Under certain extreme conditions, the tick-based cputime accounting may
> > produce inaccurate data. For instance, guest CPU usage is sensitive to
> > interrupts firing right before the tick's expiration. This forces the
> > guest into kernel context, and has that time slice wrongly accounted as
> > system time. This issue is exacerbated if the interrupt source is in
> > sync with the tick, significantly skewing usage metrics towards system
> > time.
> >
> > On CPUs with full dynticks enabled, cputime accounting leverages the
> > context tracking subsystem to measure usage, and isn't susceptible to
> > this sort of race condition. However, this imposes a bigger overhead,
> > including additional accounting and the extra dyntick tracking during
> > user<->kernel<->guest transitions (RmW + mb).
> >
> > So, in order to get the best of both worlds, introduce a cputime
> > configuration option that allows using the full dynticks accounting
> > scheme on NOHZ & NOHZ_IDLE CPUs, while avoiding the expensive
> > user<->kernel<->guest dyntick transitions.
> >
> > Signed-off-by: Nicolas Saenz Julienne <[email protected]>
> > Signed-off-by: Jack Allister <[email protected]>
> > ---
>
> Would you be opposed to introducing a config option like this? Any
> alternatives you might have in mind?
I'm not opposed to the idea, no. It is not the first time I hear about people
using generic virt cputime accounting for precise stime/utime measurements on
benchmarks. But let me sit down and have a look at your patch, once I find
my way through performance regression reports and rcutorture splats anyway...
Thanks!