2024-05-11 02:06:42

by Leonardo Bras

Subject: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

As of today, KVM notes a quiescent state only on guest entry, which is good
as it avoids the guest being interrupted on behalf of current RCU operations.

While the guest vcpu runs, it can be interrupted by a timer IRQ that will
check for any RCU operations waiting for this CPU. In case there are any, it
invokes rcu_core() in order to sched-out the current thread and note a
quiescent state.

This occasional scheduler work introduces tens of microseconds of latency,
which is really bad for vcpus running latency-sensitive applications, such
as real-time workloads.

So, note a quiescent state on guest exit as well, so the interrupted guest is
able to deal with any pending RCU operations before being required to invoke
rcu_core(), and thus avoid the overhead of the related scheduler work.

Signed-off-by: Leonardo Bras <[email protected]>
---

ps: A patch fixing this same issue was discussed in this thread:
https://lore.kernel.org/all/[email protected]/

Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
to avoid having rcu_core() invoked for grace periods starting between guest
exit and the timer IRQ. This RCU option is being discussed in a sub-thread
of this message:
https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/


include/linux/context_tracking.h |  6 ++++--
include/linux/kvm_host.h         | 10 +++++++++-
2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 6e76b9dba00e..8a78fabeafc3 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -73,39 +73,41 @@ static inline void exception_exit(enum ctx_state prev_ctx)
 }
 
 static __always_inline bool context_tracking_guest_enter(void)
 {
 	if (context_tracking_enabled())
 		__ct_user_enter(CONTEXT_GUEST);
 
 	return context_tracking_enabled_this_cpu();
 }
 
-static __always_inline void context_tracking_guest_exit(void)
+static __always_inline bool context_tracking_guest_exit(void)
 {
 	if (context_tracking_enabled())
 		__ct_user_exit(CONTEXT_GUEST);
+
+	return context_tracking_enabled_this_cpu();
 }
 
 #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
 
 #else
 static inline void user_enter(void) { }
 static inline void user_exit(void) { }
 static inline void user_enter_irqoff(void) { }
 static inline void user_exit_irqoff(void) { }
 static inline int exception_enter(void) { return 0; }
 static inline void exception_exit(enum ctx_state prev_ctx) { }
 static inline int ct_state(void) { return -1; }
 static inline int __ct_state(void) { return -1; }
 static __always_inline bool context_tracking_guest_enter(void) { return false; }
-static __always_inline void context_tracking_guest_exit(void) { }
+static __always_inline bool context_tracking_guest_exit(void) { return false; }
 #define CT_WARN_ON(cond) do { } while (0)
 #endif /* !CONFIG_CONTEXT_TRACKING_USER */
 
 #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
 extern void context_tracking_init(void);
 #else
 static inline void context_tracking_init(void) { }
 #endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
 
 #ifdef CONFIG_CONTEXT_TRACKING_IDLE
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 48f31dcd318a..e37724c44843 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -480,21 +480,29 @@ static __always_inline void guest_state_enter_irqoff(void)
 /*
  * Exit guest context and exit an RCU extended quiescent state.
  *
  * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
  * unsafe to use any code which may directly or indirectly use RCU, tracing
  * (including IRQ flag tracing), or lockdep. All code in this period must be
  * non-instrumentable.
  */
 static __always_inline void guest_context_exit_irqoff(void)
 {
-	context_tracking_guest_exit();
+	/*
+	 * Guest mode is treated as a quiescent state, see
+	 * guest_context_enter_irqoff() for more details.
+	 */
+	if (!context_tracking_guest_exit()) {
+		instrumentation_begin();
+		rcu_virt_note_context_switch();
+		instrumentation_end();
+	}
 }
 
 /*
  * Stop accounting time towards a guest.
  * Must be called after exiting guest context.
  */
 static __always_inline void guest_timing_exit_irqoff(void)
 {
 	instrumentation_begin();
 	/* Flush the guest cputime we spent on the guest */
--
2.45.0



2024-05-11 02:11:54

by Leonardo Bras

Subject: Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> As of today, KVM notes a quiescent state only in guest entry, which is good
> as it avoids the guest being interrupted for current RCU operations.
>
> While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> check for any RCU operations waiting for this CPU. In case there are any of
> such, it invokes rcu_core() in order to sched-out the current thread and
> note a quiescent state.
>
> This occasional schedule work will introduce tens of microsseconds of
> latency, which is really bad for vcpus running latency-sensitive
> applications, such as real-time workloads.
>
> So, note a quiescent state in guest exit, so the interrupted guests is able

s/guests/guest

> to deal with any pending RCU operations before being required to invoke
> rcu_core(), and thus avoid the overhead of related scheduler work.
>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
>
> ps: A patch fixing this same issue was discussed in this thread:
> https://lore.kernel.org/all/[email protected]/
>
> Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> to avoid having invoke_rcu() being called on grace-periods starting between
> guest exit and the timer IRQ. This RCU option is being discussed in a
> sub-thread of this message:
> https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/
>
>
> include/linux/context_tracking.h | 6 ++++--
> include/linux/kvm_host.h | 10 +++++++++-
> 2 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 6e76b9dba00e..8a78fabeafc3 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -73,39 +73,41 @@ static inline void exception_exit(enum ctx_state prev_ctx)
> }
>
> static __always_inline bool context_tracking_guest_enter(void)
> {
> if (context_tracking_enabled())
> __ct_user_enter(CONTEXT_GUEST);
>
> return context_tracking_enabled_this_cpu();
> }
>
> -static __always_inline void context_tracking_guest_exit(void)
> +static __always_inline bool context_tracking_guest_exit(void)
> {
> if (context_tracking_enabled())
> __ct_user_exit(CONTEXT_GUEST);
> +
> + return context_tracking_enabled_this_cpu();
> }
>
> #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
>
> #else
> static inline void user_enter(void) { }
> static inline void user_exit(void) { }
> static inline void user_enter_irqoff(void) { }
> static inline void user_exit_irqoff(void) { }
> static inline int exception_enter(void) { return 0; }
> static inline void exception_exit(enum ctx_state prev_ctx) { }
> static inline int ct_state(void) { return -1; }
> static inline int __ct_state(void) { return -1; }
> static __always_inline bool context_tracking_guest_enter(void) { return false; }
> -static __always_inline void context_tracking_guest_exit(void) { }
> +static __always_inline bool context_tracking_guest_exit(void) { return false; }
> #define CT_WARN_ON(cond) do { } while (0)
> #endif /* !CONFIG_CONTEXT_TRACKING_USER */
>
> #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
> extern void context_tracking_init(void);
> #else
> static inline void context_tracking_init(void) { }
> #endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
>
> #ifdef CONFIG_CONTEXT_TRACKING_IDLE
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 48f31dcd318a..e37724c44843 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -480,21 +480,29 @@ static __always_inline void guest_state_enter_irqoff(void)
> /*
> * Exit guest context and exit an RCU extended quiescent state.
> *
> * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
> * unsafe to use any code which may directly or indirectly use RCU, tracing
> * (including IRQ flag tracing), or lockdep. All code in this period must be
> * non-instrumentable.
> */
> static __always_inline void guest_context_exit_irqoff(void)
> {
> - context_tracking_guest_exit();
> + /*
> + * Guest mode is treated as a quiescent state, see
> + * guest_context_enter_irqoff() for more details.
> + */
> + if (!context_tracking_guest_exit()) {
> + instrumentation_begin();
> + rcu_virt_note_context_switch();
> + instrumentation_end();
> + }
> }
>
> /*
> * Stop accounting time towards a guest.
> * Must be called after exiting guest context.
> */
> static __always_inline void guest_timing_exit_irqoff(void)
> {
> instrumentation_begin();
> /* Flush the guest cputime we spent on the guest */
> --
> 2.45.0
>


2024-05-11 14:56:07

by Paul E. McKenney

Subject: Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> As of today, KVM notes a quiescent state only in guest entry, which is good
> as it avoids the guest being interrupted for current RCU operations.
>
> While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> check for any RCU operations waiting for this CPU. In case there are any of
> such, it invokes rcu_core() in order to sched-out the current thread and
> note a quiescent state.
>
> This occasional schedule work will introduce tens of microsseconds of
> latency, which is really bad for vcpus running latency-sensitive
> applications, such as real-time workloads.
>
> So, note a quiescent state in guest exit, so the interrupted guests is able
> to deal with any pending RCU operations before being required to invoke
> rcu_core(), and thus avoid the overhead of related scheduler work.
>
> Signed-off-by: Leonardo Bras <[email protected]>

Acked-by: Paul E. McKenney <[email protected]>

> ---
>
> ps: A patch fixing this same issue was discussed in this thread:
> https://lore.kernel.org/all/[email protected]/
>
> Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> to avoid having invoke_rcu() being called on grace-periods starting between
> guest exit and the timer IRQ. This RCU option is being discussed in a
> sub-thread of this message:
> https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/
>
>
> include/linux/context_tracking.h | 6 ++++--
> include/linux/kvm_host.h | 10 +++++++++-
> 2 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 6e76b9dba00e..8a78fabeafc3 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -73,39 +73,41 @@ static inline void exception_exit(enum ctx_state prev_ctx)
> }
>
> static __always_inline bool context_tracking_guest_enter(void)
> {
> if (context_tracking_enabled())
> __ct_user_enter(CONTEXT_GUEST);
>
> return context_tracking_enabled_this_cpu();
> }
>
> -static __always_inline void context_tracking_guest_exit(void)
> +static __always_inline bool context_tracking_guest_exit(void)
> {
> if (context_tracking_enabled())
> __ct_user_exit(CONTEXT_GUEST);
> +
> + return context_tracking_enabled_this_cpu();
> }
>
> #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
>
> #else
> static inline void user_enter(void) { }
> static inline void user_exit(void) { }
> static inline void user_enter_irqoff(void) { }
> static inline void user_exit_irqoff(void) { }
> static inline int exception_enter(void) { return 0; }
> static inline void exception_exit(enum ctx_state prev_ctx) { }
> static inline int ct_state(void) { return -1; }
> static inline int __ct_state(void) { return -1; }
> static __always_inline bool context_tracking_guest_enter(void) { return false; }
> -static __always_inline void context_tracking_guest_exit(void) { }
> +static __always_inline bool context_tracking_guest_exit(void) { return false; }
> #define CT_WARN_ON(cond) do { } while (0)
> #endif /* !CONFIG_CONTEXT_TRACKING_USER */
>
> #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
> extern void context_tracking_init(void);
> #else
> static inline void context_tracking_init(void) { }
> #endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
>
> #ifdef CONFIG_CONTEXT_TRACKING_IDLE
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 48f31dcd318a..e37724c44843 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -480,21 +480,29 @@ static __always_inline void guest_state_enter_irqoff(void)
> /*
> * Exit guest context and exit an RCU extended quiescent state.
> *
> * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
> * unsafe to use any code which may directly or indirectly use RCU, tracing
> * (including IRQ flag tracing), or lockdep. All code in this period must be
> * non-instrumentable.
> */
> static __always_inline void guest_context_exit_irqoff(void)
> {
> - context_tracking_guest_exit();
> + /*
> + * Guest mode is treated as a quiescent state, see
> + * guest_context_enter_irqoff() for more details.
> + */
> + if (!context_tracking_guest_exit()) {
> + instrumentation_begin();
> + rcu_virt_note_context_switch();
> + instrumentation_end();
> + }
> }
>
> /*
> * Stop accounting time towards a guest.
> * Must be called after exiting guest context.
> */
> static __always_inline void guest_timing_exit_irqoff(void)
> {
> instrumentation_begin();
> /* Flush the guest cputime we spent on the guest */
> --
> 2.45.0
>

2024-05-11 20:37:03

by Leonardo Bras

Subject: Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

On Sat, May 11, 2024 at 07:55:55AM -0700, Paul E. McKenney wrote:
> On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> > As of today, KVM notes a quiescent state only in guest entry, which is good
> > as it avoids the guest being interrupted for current RCU operations.
> >
> > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > check for any RCU operations waiting for this CPU. In case there are any of
> > such, it invokes rcu_core() in order to sched-out the current thread and
> > note a quiescent state.
> >
> > This occasional schedule work will introduce tens of microsseconds of
> > latency, which is really bad for vcpus running latency-sensitive
> > applications, such as real-time workloads.
> >
> > So, note a quiescent state in guest exit, so the interrupted guests is able
> > to deal with any pending RCU operations before being required to invoke
> > rcu_core(), and thus avoid the overhead of related scheduler work.
> >
> > Signed-off-by: Leonardo Bras <[email protected]>
>
> Acked-by: Paul E. McKenney <[email protected]>

Thanks!
Leo

>
> > ---
> >
> > ps: A patch fixing this same issue was discussed in this thread:
> > https://lore.kernel.org/all/[email protected]/
> >
> > Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> > to avoid having invoke_rcu() being called on grace-periods starting between
> > guest exit and the timer IRQ. This RCU option is being discussed in a
> > sub-thread of this message:
> > https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/
> >
> >
> > include/linux/context_tracking.h | 6 ++++--
> > include/linux/kvm_host.h | 10 +++++++++-
> > 2 files changed, 13 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> > index 6e76b9dba00e..8a78fabeafc3 100644
> > --- a/include/linux/context_tracking.h
> > +++ b/include/linux/context_tracking.h
> > @@ -73,39 +73,41 @@ static inline void exception_exit(enum ctx_state prev_ctx)
> > }
> >
> > static __always_inline bool context_tracking_guest_enter(void)
> > {
> > if (context_tracking_enabled())
> > __ct_user_enter(CONTEXT_GUEST);
> >
> > return context_tracking_enabled_this_cpu();
> > }
> >
> > -static __always_inline void context_tracking_guest_exit(void)
> > +static __always_inline bool context_tracking_guest_exit(void)
> > {
> > if (context_tracking_enabled())
> > __ct_user_exit(CONTEXT_GUEST);
> > +
> > + return context_tracking_enabled_this_cpu();
> > }
> >
> > #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
> >
> > #else
> > static inline void user_enter(void) { }
> > static inline void user_exit(void) { }
> > static inline void user_enter_irqoff(void) { }
> > static inline void user_exit_irqoff(void) { }
> > static inline int exception_enter(void) { return 0; }
> > static inline void exception_exit(enum ctx_state prev_ctx) { }
> > static inline int ct_state(void) { return -1; }
> > static inline int __ct_state(void) { return -1; }
> > static __always_inline bool context_tracking_guest_enter(void) { return false; }
> > -static __always_inline void context_tracking_guest_exit(void) { }
> > +static __always_inline bool context_tracking_guest_exit(void) { return false; }
> > #define CT_WARN_ON(cond) do { } while (0)
> > #endif /* !CONFIG_CONTEXT_TRACKING_USER */
> >
> > #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
> > extern void context_tracking_init(void);
> > #else
> > static inline void context_tracking_init(void) { }
> > #endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
> >
> > #ifdef CONFIG_CONTEXT_TRACKING_IDLE
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 48f31dcd318a..e37724c44843 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -480,21 +480,29 @@ static __always_inline void guest_state_enter_irqoff(void)
> > /*
> > * Exit guest context and exit an RCU extended quiescent state.
> > *
> > * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
> > * unsafe to use any code which may directly or indirectly use RCU, tracing
> > * (including IRQ flag tracing), or lockdep. All code in this period must be
> > * non-instrumentable.
> > */
> > static __always_inline void guest_context_exit_irqoff(void)
> > {
> > - context_tracking_guest_exit();
> > + /*
> > + * Guest mode is treated as a quiescent state, see
> > + * guest_context_enter_irqoff() for more details.
> > + */
> > + if (!context_tracking_guest_exit()) {
> > + instrumentation_begin();
> > + rcu_virt_note_context_switch();
> > + instrumentation_end();
> > + }
> > }
> >
> > /*
> > * Stop accounting time towards a guest.
> > * Must be called after exiting guest context.
> > */
> > static __always_inline void guest_timing_exit_irqoff(void)
> > {
> > instrumentation_begin();
> > /* Flush the guest cputime we spent on the guest */
> > --
> > 2.45.0
> >
>


2024-05-13 01:06:49

by Marcelo Tosatti

Subject: Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

On Sun, May 12, 2024 at 06:44:23PM -0300, Marcelo Tosatti wrote:
> On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> > As of today, KVM notes a quiescent state only in guest entry, which is good
> > as it avoids the guest being interrupted for current RCU operations.
> >
> > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > check for any RCU operations waiting for this CPU. In case there are any of
> > such, it invokes rcu_core() in order to sched-out the current thread and
> > note a quiescent state.
> >
> > This occasional schedule work will introduce tens of microsseconds of
> > latency, which is really bad for vcpus running latency-sensitive
> > applications, such as real-time workloads.
> >
> > So, note a quiescent state in guest exit, so the interrupted guests is able
> > to deal with any pending RCU operations before being required to invoke
> > rcu_core(), and thus avoid the overhead of related scheduler work.
>
> This does not properly fix the current problem, as RCU work might be
> scheduled after the VM exit, followed by a timer interrupt.
>
> Correct?

Not that I am against the patch...

But, regarding the problem at hand, it does not fix it reliably.


2024-05-13 01:08:00

by Marcelo Tosatti

Subject: Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> As of today, KVM notes a quiescent state only in guest entry, which is good
> as it avoids the guest being interrupted for current RCU operations.
>
> While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> check for any RCU operations waiting for this CPU. In case there are any of
> such, it invokes rcu_core() in order to sched-out the current thread and
> note a quiescent state.
>
> This occasional schedule work will introduce tens of microsseconds of
> latency, which is really bad for vcpus running latency-sensitive
> applications, such as real-time workloads.
>
> So, note a quiescent state in guest exit, so the interrupted guests is able
> to deal with any pending RCU operations before being required to invoke
> rcu_core(), and thus avoid the overhead of related scheduler work.

This does not properly fix the current problem, as RCU work might be
scheduled after the VM exit, followed by a timer interrupt.

Correct?

>
> Signed-off-by: Leonardo Bras <[email protected]>
> ---
>
> ps: A patch fixing this same issue was discussed in this thread:
> https://lore.kernel.org/all/[email protected]/
>
> Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> to avoid having invoke_rcu() being called on grace-periods starting between
> guest exit and the timer IRQ. This RCU option is being discussed in a
> sub-thread of this message:
> https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/
>
>
> include/linux/context_tracking.h | 6 ++++--
> include/linux/kvm_host.h | 10 +++++++++-
> 2 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 6e76b9dba00e..8a78fabeafc3 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -73,39 +73,41 @@ static inline void exception_exit(enum ctx_state prev_ctx)
> }
>
> static __always_inline bool context_tracking_guest_enter(void)
> {
> if (context_tracking_enabled())
> __ct_user_enter(CONTEXT_GUEST);
>
> return context_tracking_enabled_this_cpu();
> }
>
> -static __always_inline void context_tracking_guest_exit(void)
> +static __always_inline bool context_tracking_guest_exit(void)
> {
> if (context_tracking_enabled())
> __ct_user_exit(CONTEXT_GUEST);
> +
> + return context_tracking_enabled_this_cpu();
> }
>
> #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
>
> #else
> static inline void user_enter(void) { }
> static inline void user_exit(void) { }
> static inline void user_enter_irqoff(void) { }
> static inline void user_exit_irqoff(void) { }
> static inline int exception_enter(void) { return 0; }
> static inline void exception_exit(enum ctx_state prev_ctx) { }
> static inline int ct_state(void) { return -1; }
> static inline int __ct_state(void) { return -1; }
> static __always_inline bool context_tracking_guest_enter(void) { return false; }
> -static __always_inline void context_tracking_guest_exit(void) { }
> +static __always_inline bool context_tracking_guest_exit(void) { return false; }
> #define CT_WARN_ON(cond) do { } while (0)
> #endif /* !CONFIG_CONTEXT_TRACKING_USER */
>
> #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
> extern void context_tracking_init(void);
> #else
> static inline void context_tracking_init(void) { }
> #endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
>
> #ifdef CONFIG_CONTEXT_TRACKING_IDLE
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 48f31dcd318a..e37724c44843 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -480,21 +480,29 @@ static __always_inline void guest_state_enter_irqoff(void)
> /*
> * Exit guest context and exit an RCU extended quiescent state.
> *
> * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
> * unsafe to use any code which may directly or indirectly use RCU, tracing
> * (including IRQ flag tracing), or lockdep. All code in this period must be
> * non-instrumentable.
> */
> static __always_inline void guest_context_exit_irqoff(void)
> {
> - context_tracking_guest_exit();
> + /*
> + * Guest mode is treated as a quiescent state, see
> + * guest_context_enter_irqoff() for more details.
> + */
> + if (!context_tracking_guest_exit()) {
> + instrumentation_begin();
> + rcu_virt_note_context_switch();
> + instrumentation_end();
> + }
> }
>
> /*
> * Stop accounting time towards a guest.
> * Must be called after exiting guest context.
> */
> static __always_inline void guest_timing_exit_irqoff(void)
> {
> instrumentation_begin();
> /* Flush the guest cputime we spent on the guest */
> --
> 2.45.0
>
>
>


2024-05-13 03:15:07

by Leonardo Bras

Subject: Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

On Sun, May 12, 2024 at 06:44:23PM -0300, Marcelo Tosatti wrote:
> On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> > As of today, KVM notes a quiescent state only in guest entry, which is good
> > as it avoids the guest being interrupted for current RCU operations.
> >
> > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > check for any RCU operations waiting for this CPU. In case there are any of
> > such, it invokes rcu_core() in order to sched-out the current thread and
> > note a quiescent state.
> >
> > This occasional schedule work will introduce tens of microsseconds of
> > latency, which is really bad for vcpus running latency-sensitive
> > applications, such as real-time workloads.
> >
> > So, note a quiescent state in guest exit, so the interrupted guests is able
> > to deal with any pending RCU operations before being required to invoke
> > rcu_core(), and thus avoid the overhead of related scheduler work.
>
> This does not properly fix the current problem, as RCU work might be
> scheduled after the VM exit, followed by a timer interrupt.
>
> Correct?

Correct, for this case, check the note below:

>
> >
> > Signed-off-by: Leonardo Bras <[email protected]>
> > ---
> >
> > ps: A patch fixing this same issue was discussed in this thread:
> > https://lore.kernel.org/all/[email protected]/
> >
> > Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> > to avoid having invoke_rcu() being called on grace-periods starting between
> > guest exit and the timer IRQ. This RCU option is being discussed in a
> > sub-thread of this message:
> > https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/

^ This one above.
The idea is to use this rcutree.nocb_patience_delay=N:
a new option we added to RCU that allows us to avoid invoking rcu_core() if
the grace period is younger than N milliseconds. This only works on
nohz_full cpus.

So with both the current patch and the one in the link above, we get the same
effect we previously had with last_guest_exit, with a cherry on top: we can
avoid rcu_core() getting called in situations where a grace period starts
just after we enter kernel code, and a timer interrupt happens before the CPU
can report a quiescent state again.

For our nohz_full vcpu thread scenario, we have:

- guest_exit notes a quiescent state
- let's say a grace period starts in the next cycle
- If the timer interrupts, rcu_core() is only required once the grace period
is older than N milliseconds
- If we configure a proper patience value, the end of the patience is never
reached before we go through guest_entry again and note a quiescent state
(a rough sketch of this check is below)
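
To make that concrete, here is a minimal sketch of the kind of check the tick
would do, assuming the patience value is in milliseconds. This is illustration
only, not the proposed RCU code: the helper name and the way the grace-period
start time is obtained are made up for the example.

#include <linux/jiffies.h>

/*
 * Illustration only: a timer tick would only bother invoking rcu_core()
 * once the current grace period has been pending for longer than the
 * configured patience (rcutree.nocb_patience_delay).
 */
static bool gp_older_than_patience(unsigned long gp_start_jiffies,
				   unsigned long patience_ms)
{
	return time_after(jiffies,
			  gp_start_jiffies + msecs_to_jiffies(patience_ms));
}

Roughly speaking, once guest exit notes a quiescent state, a grace period can
only start needing this CPU after that point, so as long as the vcpu
re-enters the guest within the patience window the check above never fires.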

What do you think?

Thanks!
Leo

> >
> >
> > include/linux/context_tracking.h | 6 ++++--
> > include/linux/kvm_host.h | 10 +++++++++-
> > 2 files changed, 13 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> > index 6e76b9dba00e..8a78fabeafc3 100644
> > --- a/include/linux/context_tracking.h
> > +++ b/include/linux/context_tracking.h
> > @@ -73,39 +73,41 @@ static inline void exception_exit(enum ctx_state prev_ctx)
> > }
> >
> > static __always_inline bool context_tracking_guest_enter(void)
> > {
> > if (context_tracking_enabled())
> > __ct_user_enter(CONTEXT_GUEST);
> >
> > return context_tracking_enabled_this_cpu();
> > }
> >
> > -static __always_inline void context_tracking_guest_exit(void)
> > +static __always_inline bool context_tracking_guest_exit(void)
> > {
> > if (context_tracking_enabled())
> > __ct_user_exit(CONTEXT_GUEST);
> > +
> > + return context_tracking_enabled_this_cpu();
> > }
> >
> > #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
> >
> > #else
> > static inline void user_enter(void) { }
> > static inline void user_exit(void) { }
> > static inline void user_enter_irqoff(void) { }
> > static inline void user_exit_irqoff(void) { }
> > static inline int exception_enter(void) { return 0; }
> > static inline void exception_exit(enum ctx_state prev_ctx) { }
> > static inline int ct_state(void) { return -1; }
> > static inline int __ct_state(void) { return -1; }
> > static __always_inline bool context_tracking_guest_enter(void) { return false; }
> > -static __always_inline void context_tracking_guest_exit(void) { }
> > +static __always_inline bool context_tracking_guest_exit(void) { return false; }
> > #define CT_WARN_ON(cond) do { } while (0)
> > #endif /* !CONFIG_CONTEXT_TRACKING_USER */
> >
> > #ifdef CONFIG_CONTEXT_TRACKING_USER_FORCE
> > extern void context_tracking_init(void);
> > #else
> > static inline void context_tracking_init(void) { }
> > #endif /* CONFIG_CONTEXT_TRACKING_USER_FORCE */
> >
> > #ifdef CONFIG_CONTEXT_TRACKING_IDLE
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 48f31dcd318a..e37724c44843 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -480,21 +480,29 @@ static __always_inline void guest_state_enter_irqoff(void)
> > /*
> > * Exit guest context and exit an RCU extended quiescent state.
> > *
> > * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
> > * unsafe to use any code which may directly or indirectly use RCU, tracing
> > * (including IRQ flag tracing), or lockdep. All code in this period must be
> > * non-instrumentable.
> > */
> > static __always_inline void guest_context_exit_irqoff(void)
> > {
> > - context_tracking_guest_exit();
> > + /*
> > + * Guest mode is treated as a quiescent state, see
> > + * guest_context_enter_irqoff() for more details.
> > + */
> > + if (!context_tracking_guest_exit()) {
> > + instrumentation_begin();
> > + rcu_virt_note_context_switch();
> > + instrumentation_end();
> > + }
> > }
> >
> > /*
> > * Stop accounting time towards a guest.
> > * Must be called after exiting guest context.
> > */
> > static __always_inline void guest_timing_exit_irqoff(void)
> > {
> > instrumentation_begin();
> > /* Flush the guest cputime we spent on the guest */
> > --
> > 2.45.0
> >
> >
> >
>


2024-05-13 19:14:31

by Marcelo Tosatti

Subject: Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

On Mon, May 13, 2024 at 12:14:32AM -0300, Leonardo Bras wrote:
> On Sun, May 12, 2024 at 06:44:23PM -0300, Marcelo Tosatti wrote:
> > On Fri, May 10, 2024 at 11:05:56PM -0300, Leonardo Bras wrote:
> > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > as it avoids the guest being interrupted for current RCU operations.
> > >
> > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > check for any RCU operations waiting for this CPU. In case there are any of
> > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > note a quiescent state.
> > >
> > > This occasional schedule work will introduce tens of microsseconds of
> > > latency, which is really bad for vcpus running latency-sensitive
> > > applications, such as real-time workloads.
> > >
> > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > to deal with any pending RCU operations before being required to invoke
> > > rcu_core(), and thus avoid the overhead of related scheduler work.
> >
> > This does not properly fix the current problem, as RCU work might be
> > scheduled after the VM exit, followed by a timer interrupt.
> >
> > Correct?
>
> Correct, for this case, check the note below:
>
> >
> > >
> > > Signed-off-by: Leonardo Bras <[email protected]>
> > > ---
> > >
> > > ps: A patch fixing this same issue was discussed in this thread:
> > > https://lore.kernel.org/all/[email protected]/
> > >
> > > Also, this can be paired with a new RCU option (rcutree.nocb_patience_delay)
> > > to avoid having invoke_rcu() being called on grace-periods starting between
> > > guest exit and the timer IRQ. This RCU option is being discussed in a
> > > sub-thread of this message:
> > > https://lore.kernel.org/all/5fd66909-1250-4a91-aa71-93cb36ed4ad5@paulmck-laptop/
>
> ^ This one above.
> The idea is to use this rcutree.nocb_patience_delay=N :
> a new option we added on RCU that allow us to avoid invoking rcu_core() if
> the grace_period < N miliseconds. This only works on nohz_full cpus.
>
> So with both the current patch and the one in above link, we have the same
> effect as we previously had with last_guest_exit, with a cherry on top: we
> can avoid rcu_core() getting called in situations where a grace period just
> started after going into kernel code, and a timer interrupt happened before
> it can report quiescent state again.
>
> For our nohz_full vcpu thread scenario, we have:
>
> - guest_exit note a quiescent state
> - let's say we start a grace period in the next cycle
> - If timer interrupts, it requires the grace period to be older than N
> miliseconds
> - If we configure a proper value for patience, it will never reach the
> end of patience before going guest_entry, and thus noting a quiescent
> state
>
> What do you think?

I don't fully understand all of the RCU details, but since the RCU quiescent
state marking happens in an IRQ-disabled section, there is no chance for a
timer interrupt to conflict with the marking of the quiescent state.

So it seems to make sense to me.


2024-05-13 19:40:59

by Sean Christopherson

Subject: Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

On Fri, May 10, 2024, Leonardo Bras wrote:
> As of today, KVM notes a quiescent state only in guest entry, which is good
> as it avoids the guest being interrupted for current RCU operations.
>
> While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> check for any RCU operations waiting for this CPU. In case there are any of
> such, it invokes rcu_core() in order to sched-out the current thread and
> note a quiescent state.
>
> This occasional schedule work will introduce tens of microsseconds of
> latency, which is really bad for vcpus running latency-sensitive
> applications, such as real-time workloads.
>
> So, note a quiescent state in guest exit, so the interrupted guests is able
> to deal with any pending RCU operations before being required to invoke
> rcu_core(), and thus avoid the overhead of related scheduler work.

Are there any downsides to this? E.g. extra latency or anything? KVM will note
a context switch on the next VM-Enter, so even if there is extra latency or
something, KVM will eventually take the hit in the common case no matter what.
But I know some setups are sensitive to handling select VM-Exits as soon as possible.

I ask mainly because it seems like a no brainer to me to have both VM-Entry and
VM-Exit note the context switch, which begs the question of why KVM isn't already
doing that. I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
use RCU extended quiescent state when running KVM guest") handled the VM-Entry
case?

2024-05-13 21:47:42

by Leonardo Bras

Subject: Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <[email protected]> wrote:
>
> On Fri, May 10, 2024, Leonardo Bras wrote:
> > As of today, KVM notes a quiescent state only in guest entry, which is good
> > as it avoids the guest being interrupted for current RCU operations.
> >
> > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > check for any RCU operations waiting for this CPU. In case there are any of
> > such, it invokes rcu_core() in order to sched-out the current thread and
> > note a quiescent state.
> >
> > This occasional schedule work will introduce tens of microsseconds of
> > latency, which is really bad for vcpus running latency-sensitive
> > applications, such as real-time workloads.
> >
> > So, note a quiescent state in guest exit, so the interrupted guests is able
> > to deal with any pending RCU operations before being required to invoke
> > rcu_core(), and thus avoid the overhead of related scheduler work.
>
> Are there any downsides to this? E.g. extra latency or anything? KVM will note
> a context switch on the next VM-Enter, so even if there is extra latency or
> something, KVM will eventually take the hit in the common case no matter what.
> But I know some setups are sensitive to handling select VM-Exits as soon as possible.
>
> I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> VM-Exit note the context switch, which begs the question of why KVM isn't already
> doing that. I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> case?

I don't know; from the lore archive I see it happening on guest entry since
it was first introduced at
https://lore.kernel.org/all/[email protected]/

Noting a quiescent state is cheap, but it may cost a few accesses to
possibly non-local cachelines. (I'm not an expert in this; Paul, please let
me know if I got it wrong.)

I don't have the historical context on why it was implemented only on
guest_entry, but the entry-only approach would make sense when latency is
not a concern:
- It saves the overhead of calling rcu_virt_note_context_switch()
twice per guest entry/exit cycle in the loop
- KVM will probably run guest entry soon after guest exit (in the loop),
so there is no need to run it twice
- Occasionally running rcu_core() may be cheaper than noting a quiescent
state on every guest entry/exit cycle

Upsides of the new strategy:
- Noting a quiescent state in guest exit avoids calling rcu_core() if a
grace period request arrived while the guest was running and a timer
interrupt then hits the cpu.
- If the loop re-enters quickly, there is a high chance that guest
entry's rcu_virt_note_context_switch() will be fast (local cacheline),
as there is a low probability of a grace period request happening
between exit & re-entry.
- It allows us to use the RCU patience strategy to avoid rcu_core()
running if any grace period request happens between guest exit and
guest re-entry, which is very important for low-latency workloads
running on guests as it reduces maximum latency in long runs.
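
For reference, here is a simplified sketch of where the two notes land in a
vcpu run loop. This is not any architecture's actual code: run_guest() and
handle_exit() are stand-ins, and the helper ordering follows the comments in
include/linux/kvm_host.h (guest_state_exit_irqoff() is what ends up calling
the guest_context_exit_irqoff() modified by this patch).

#include <linux/kvm_host.h>

/* Stand-ins for the architecture-specific pieces, for illustration only: */
static void run_guest(struct kvm_vcpu *vcpu);	/* VM-Enter, run, VM-Exit  */
static int handle_exit(struct kvm_vcpu *vcpu);	/* exit-reason handling    */

static int vcpu_run_once(struct kvm_vcpu *vcpu)
{
	local_irq_disable();

	guest_timing_enter_irqoff();
	guest_state_enter_irqoff();	/* enter an RCU EQS, or note a QS if
					 * context tracking is off here     */

	run_guest(vcpu);

	guest_state_exit_irqoff();	/* with this patch: also note a QS if
					 * context tracking is off here      */
	guest_timing_exit_irqoff();

	local_irq_enable();		/* the idea: a timer IRQ landing here
					 * no longer needs rcu_core() for RCU
					 * work that predates the exit       */
	return handle_exit(vcpu);
}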

What do you think?

Thanks!
Leo


2024-05-14 22:54:27

by Paul E. McKenney

Subject: Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > as it avoids the guest being interrupted for current RCU operations.
> > >
> > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > check for any RCU operations waiting for this CPU. In case there are any of
> > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > note a quiescent state.
> > >
> > > This occasional schedule work will introduce tens of microsseconds of
> > > latency, which is really bad for vcpus running latency-sensitive
> > > applications, such as real-time workloads.
> > >
> > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > to deal with any pending RCU operations before being required to invoke
> > > rcu_core(), and thus avoid the overhead of related scheduler work.
> >
> > Are there any downsides to this? E.g. extra latency or anything? KVM will note
> > a context switch on the next VM-Enter, so even if there is extra latency or
> > something, KVM will eventually take the hit in the common case no matter what.
> > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> >
> > I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > doing that. I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > case?
>
> I don't know, by the lore I see it happening in guest entry since the
> first time it was introduced at
> https://lore.kernel.org/all/[email protected]/
>
> Noting a quiescent state is cheap, but it may cost a few accesses to
> possibly non-local cachelines. (Not an expert in this, Paul please let
> me know if I got it wrong).

Yes, it is cheap, especially if interrupts are already disabled.
(As in the scheduler asks RCU to do the same amount of work on its
context-switch fastpath.)

> I don't have a historic context on why it was just implemented on
> guest_entry, but it would make sense when we don't worry about latency
> to take the entry-only approach:
> - It saves the overhead of calling rcu_virt_note_context_switch()
> twice per guest entry in the loop
> - KVM will probably run guest entry soon after guest exit (in loop),
> so there is no need to run it twice
> - Eventually running rcu_core() may be cheaper than noting quiescent
> state every guest entry/exit cycle
>
> Upsides of the new strategy:
> - Noting a quiescent state in guest exit avoids calling rcu_core() if
> there was a grace period request while guest was running, and timer
> interrupt hits the cpu.
> - If the loop re-enter quickly there is a high chance that guest
> entry's rcu_virt_note_context_switch() will be fast (local cacheline)
> as there is low probability of a grace period request happening
> between exit & re-entry.
> - It allows us to use the rcu patience strategy to avoid rcu_core()
> running if any grace period request happens between guest exit and
> guest re-entry, which is very important for low latency workloads
> running on guests as it reduces maximum latency in long runs.
>
> What do you think?

Try both on the workload of interest with appropriate tracing and
see what happens? The hardware's opinion overrides mine. ;-)

Thanx, Paul

2024-05-15 04:46:07

by Leonardo Bras

Subject: Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

On Tue, May 14, 2024 at 03:54:16PM -0700, Paul E. McKenney wrote:
> On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> > On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <[email protected]> wrote:
> > >
> > > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > > as it avoids the guest being interrupted for current RCU operations.
> > > >
> > > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > > check for any RCU operations waiting for this CPU. In case there are any of
> > > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > > note a quiescent state.
> > > >
> > > > This occasional schedule work will introduce tens of microsseconds of
> > > > latency, which is really bad for vcpus running latency-sensitive
> > > > applications, such as real-time workloads.
> > > >
> > > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > > to deal with any pending RCU operations before being required to invoke
> > > > rcu_core(), and thus avoid the overhead of related scheduler work.
> > >
> > > Are there any downsides to this? E.g. extra latency or anything? KVM will note
> > > a context switch on the next VM-Enter, so even if there is extra latency or
> > > something, KVM will eventually take the hit in the common case no matter what.
> > > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> > >
> > > I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> > > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > > doing that. I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > > case?
> >
> > I don't know, by the lore I see it happening in guest entry since the
> > first time it was introduced at
> > https://lore.kernel.org/all/[email protected]/
> >
> > Noting a quiescent state is cheap, but it may cost a few accesses to
> > possibly non-local cachelines. (Not an expert in this, Paul please let
> > me know if I got it wrong).
>
> Yes, it is cheap, especially if interrupts are already disabled.
> (As in the scheduler asks RCU to do the same amount of work on its
> context-switch fastpath.)

Thanks!

>
> > I don't have a historic context on why it was just implemented on
> > guest_entry, but it would make sense when we don't worry about latency
> > to take the entry-only approach:
> > - It saves the overhead of calling rcu_virt_note_context_switch()
> > twice per guest entry in the loop
> > - KVM will probably run guest entry soon after guest exit (in loop),
> > so there is no need to run it twice
> > - Eventually running rcu_core() may be cheaper than noting quiescent
> > state every guest entry/exit cycle
> >
> > Upsides of the new strategy:
> > - Noting a quiescent state in guest exit avoids calling rcu_core() if
> > there was a grace period request while guest was running, and timer
> > interrupt hits the cpu.
> > - If the loop re-enter quickly there is a high chance that guest
> > entry's rcu_virt_note_context_switch() will be fast (local cacheline)
> > as there is low probability of a grace period request happening
> > between exit & re-entry.
> > - It allows us to use the rcu patience strategy to avoid rcu_core()
> > running if any grace period request happens between guest exit and
> > guest re-entry, which is very important for low latency workloads
> > running on guests as it reduces maximum latency in long runs.
> >
> > What do you think?
>
> Try both on the workload of interest with appropriate tracing and
> see what happens? The hardware's opinion overrides mine. ;-)

That's a great approach!

But in this case I think noting a quiescent state in guest exit is
necessary to avoid a scenario in which the VM runs for longer than the RCU
patience and rcuc ends up running on a nohz_full cpu, even if the guest
exit itself was quite brief.

IIUC Sean's question is more along the lines of "Why doesn't KVM note a
quiescent state in guest exit already, if it does in guest entry?", and I
just came up with a few arguments to try to find a possible rationale, since
I could find no discussion on that topic in the lore archive for the
original commit.

Since noting a quiescent state in guest exit is cheap enough, avoids rcuc
being scheduled when a grace period starts during guest execution, and
enables a much more rational use of RCU patience, it seems safe to assume it
is a better way of dealing with RCU than the current implementation.

Sean, what do you think?

Thanks!
Leo

>
> Thanx, Paul
>


2024-05-15 14:57:54

by Paul E. McKenney

Subject: Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

On Wed, May 15, 2024 at 01:45:33AM -0300, Leonardo Bras wrote:
> On Tue, May 14, 2024 at 03:54:16PM -0700, Paul E. McKenney wrote:
> > On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> > > On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <[email protected]> wrote:
> > > >
> > > > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > > > as it avoids the guest being interrupted for current RCU operations.
> > > > >
> > > > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > > > check for any RCU operations waiting for this CPU. In case there are any of
> > > > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > > > note a quiescent state.
> > > > >
> > > > > This occasional schedule work will introduce tens of microsseconds of
> > > > > latency, which is really bad for vcpus running latency-sensitive
> > > > > applications, such as real-time workloads.
> > > > >
> > > > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > > > to deal with any pending RCU operations before being required to invoke
> > > > > rcu_core(), and thus avoid the overhead of related scheduler work.
> > > >
> > > > Are there any downsides to this? E.g. extra latency or anything? KVM will note
> > > > a context switch on the next VM-Enter, so even if there is extra latency or
> > > > something, KVM will eventually take the hit in the common case no matter what.
> > > > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> > > >
> > > > I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> > > > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > > > doing that. I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > > > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > > > case?
> > >
> > > I don't know, by the lore I see it happening in guest entry since the
> > > first time it was introduced at
> > > https://lore.kernel.org/all/[email protected]/
> > >
> > > Noting a quiescent state is cheap, but it may cost a few accesses to
> > > possibly non-local cachelines. (Not an expert in this, Paul please let
> > > me know if I got it wrong).
> >
> > Yes, it is cheap, especially if interrupts are already disabled.
> > (As in the scheduler asks RCU to do the same amount of work on its
> > context-switch fastpath.)
>
> Thanks!
>
> >
> > > I don't have a historic context on why it was just implemented on
> > > guest_entry, but it would make sense when we don't worry about latency
> > > to take the entry-only approach:
> > > - It saves the overhead of calling rcu_virt_note_context_switch()
> > > twice per guest entry in the loop
> > > - KVM will probably run guest entry soon after guest exit (in loop),
> > > so there is no need to run it twice
> > > - Eventually running rcu_core() may be cheaper than noting quiescent
> > > state every guest entry/exit cycle
> > >
> > > Upsides of the new strategy:
> > > - Noting a quiescent state in guest exit avoids calling rcu_core() if
> > > there was a grace period request while guest was running, and timer
> > > interrupt hits the cpu.
> > > - If the loop re-enter quickly there is a high chance that guest
> > > entry's rcu_virt_note_context_switch() will be fast (local cacheline)
> > > as there is low probability of a grace period request happening
> > > between exit & re-entry.
> > > - It allows us to use the rcu patience strategy to avoid rcu_core()
> > > running if any grace period request happens between guest exit and
> > > guest re-entry, which is very important for low latency workloads
> > > running on guests as it reduces maximum latency in long runs.
> > >
> > > What do you think?
> >
> > Try both on the workload of interest with appropriate tracing and
> > see what happens? The hardware's opinion overrides mine. ;-)
>
> That's a great approach!
>
> But in this case I think noting a quiescent state in guest exit is
> necessary to avoid a scenario in which a VM takes longer than RCU
> patience, and it ends up running rcuc in a nohz_full cpu, even if guest
> exit was quite brief.
>
> IIUC Sean's question is more on the tone of "Why KVM does not note a
> quiescent state in guest exit already, if it does in guest entry", and I
> just came with a few arguments to try finding a possible rationale, since
> I could find no discussion on that topic in the lore for the original
> commit.

Understood, and maybe trying it would answer that question quickly.
Don't get me wrong, just because it appears to work in a few tests doesn't
mean that it really works, but if it visibly blows up, that answers the
question quite quickly and easily. ;-)

But yes, if it appears to work, there must be a full investigation into
whether or not the change really is safe.

Thanx, Paul

> Since noting a quiescent state in guest exit is cheap enough, avoids rcuc
> schedules when grace period starts during guest execution, and enables a
> much more rational usage of RCU patience, it's a safe to assume it's a
> better way of dealing with RCU compared to current implementation.
>
> Sean, what do you think?
>
> Thanks!
> Leo
>
> >
> > Thanx, Paul
> >
>