2024-03-28 17:21:25

by Leonardo Bras

Subject: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

I am dealing with a latency issue inside a KVM guest, which is caused by
a sched_switch to rcuc[1].

During guest entry, kernel code will signal to RCU that the current CPU is
in a quiescent state, making sure no other CPU is waiting for this one.

If a vcpu just stopped running (guest_exit), and a synchronize_rcu() was
issued somewhere since guest entry, there is a chance a timer interrupt
will happen on that CPU, which will cause rcu_sched_clock_irq() to run.

rcu_sched_clock_irq() will check rcu_pending(), which will return true
and cause invoke_rcu_core() to be called, which will (in the current config)
cause rcuc/N to be scheduled onto the current CPU.

In rcu_pending(), I noticed we can avoid returning true (and thus invoking
rcu_core()) if the current CPU is nohz_full and the CPU came from either
idle or userspace, since both are considered quiescent states.

Since this is also true for guest context, my idea is to solve this latency
issue by avoiding rcu_core() invocation if the CPU was running a guest vcpu.

On the other hand, I could not find a way to reliably tell whether the
current CPU was running a guest vcpu, so patch #1 implements a per-cpu
variable that keeps the time (jiffies) of the last guest exit.

In patch #2 I compare the current time to that time, and if less than a second
has passed, we just skip the rcu_core() invocation, since there is a high
chance the CPU will just go back to the guest in a moment.
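
To make the idea concrete, here is a minimal sketch of what the two patches
do (guest_exit_last_time() is the helper from patch #1; the other names are
just for illustration):

/* Patch #1 (sketch): remember, per CPU, when the last guest exit happened. */
static DEFINE_PER_CPU(unsigned long, kvm_last_guest_exit);

static inline void guest_exit_record_time(void)	/* called from guest_exit() */
{
	__this_cpu_write(kvm_last_guest_exit, jiffies);
}

static inline unsigned long guest_exit_last_time(void)
{
	return __this_cpu_read(kvm_last_guest_exit);
}

/*
 * Patch #2 (sketch): in rcu_pending(), treat a nohz_full CPU that ran a
 * guest less than a second ago like one coming from userspace/idle.
 */
static bool rcu_recent_guest_exit(void)
{
	unsigned long last_exit = guest_exit_last_time();

	return last_exit && time_before(jiffies, last_exit + HZ);
}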

What I know is weird about this patch:
1 - Not sure if this is the best way of finding out if the cpu was
running a guest recently.

2 - This per-cpu variable needs to get set at each guest_exit(), so it adds
overhead, even though it's supposed to stay in the local cache. If that's
an issue, I would suggest having this part compiled out on
!CONFIG_NO_HZ_FULL (see the sketch after this list), but further checking
whether each CPU is nohz_full-enabled seems more expensive than just doing
the write.

3 - It checks if the guest exit happened more than 1 second ago. This 1
second value was copied from rcu_nohz_full_cpu() which checks if the
grace period started more than a second ago. If this value is bad,
I have no issue changing it.

4 - Even though I could detect no issue, I included linux/kvm_host.h into
rcu/tree_plugin.h, which is the first time it's getting included
outside of kvm or arch code, and that can be weird. An alternative would
be to create a new header for providing data to non-kvm code.
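
The compile-out alternative mentioned in item 2 would be something like the
sketch below (illustrative only); the point is that the #ifdef is resolved
at build time, whereas a tick_nohz_full_cpu() check would run on every
guest exit:

#ifdef CONFIG_NO_HZ_FULL
static inline void guest_exit_record_time(void)
{
	__this_cpu_write(kvm_last_guest_exit, jiffies);
}
#else
static inline void guest_exit_record_time(void) { }	/* no rcu_pending() user, skip the write */
#endif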

Please provide feedback.

Thanks!
Leo
...
[1]: It uses a PREEMPT_RT kernel, with the guest cpus running on isolated,
rcu_nocbs, nohz_full cpus.

Leonardo Bras (2):
kvm: Implement guest_exit_last_time()
rcu: Ignore RCU in nohz_full cpus if it was running a guest recently

include/linux/kvm_host.h | 13 +++++++++++++
kernel/rcu/tree_plugin.h | 14 ++++++++++++++
kernel/rcu/tree.c | 4 +++-
virt/kvm/kvm_main.c | 3 +++
4 files changed, 33 insertions(+), 1 deletion(-)


base-commit: 8d025e2092e29bfd13e56c78e22af25fac83c8ec
--
2.44.0



2024-04-01 20:21:42

by Sean Christopherson

Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Thu, Mar 28, 2024, Leonardo Bras wrote:
> I am dealing with a latency issue inside a KVM guest, which is caused by
> a sched_switch to rcuc[1].
>
> During guest entry, kernel code will signal to RCU that current CPU was on
> a quiescent state, making sure no other CPU is waiting for this one.
>
> If a vcpu just stopped running (guest_exit), and a synchronize_rcu() was
> issued somewhere since guest entry, there is a chance a timer interrupt
> will happen in that CPU, which will cause rcu_sched_clock_irq() to run.
>
> rcu_sched_clock_irq() will check rcu_pending() which will return true,
> and cause invoke_rcu_core() to be called, which will (in current config)
> cause rcuc/N to be scheduled into the current cpu.
>
> On rcu_pending(), I noticed we can avoid returning true (and thus invoking
> rcu_core()) if the current cpu is nohz_full, and the cpu came from either
> idle or userspace, since both are considered quiescent states.
>
> Since this is also true for guest context, my idea is to solve this latency
> issue by avoiding rcu_core() invocation if it was running a guest vcpu.
>
> On the other hand, I could not find a way of reliably saying the current
> cpu was running a guest vcpu, so patch #1 implements a per-cpu variable
> for keeping the time (jiffies) of the last guest exit.
>
> In patch #2 I compare current time to that time, and if less than a second
> has passed, we just skip rcu_core() invocation, since there is a high chance
> it will just go back to the guest in a moment.

What's the downside if there's a false positive?

> What I know is weird about this patch:
> 1 - Not sure if this is the best way of finding out if the cpu was
> running a guest recently.
>
> 2 - This per-cpu variable needs to get set at each guest_exit(), so it's
> overhead, even though it's supposed to be in local cache. If that's
> an issue, I would suggest having this part compiled out on
> !CONFIG_NO_HZ_FULL, but further checking each cpu for being nohz_full
> enabled seems more expensive than just setting this out.

A per-CPU write isn't problematic, but I suspect reading jiffies will be quite
imprecise, e.g. it'll be a full tick "behind" on many exits.

> 3 - It checks if the guest exit happened more than 1 second ago. This 1
> second value was copied from rcu_nohz_full_cpu() which checks if the
> grace period started more than a second ago. If this value is bad,
> I have no issue changing it.

IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
of what magic time threshold is used. IIUC, what you want is a way to detect if
a CPU is likely to _run_ a KVM vCPU in the near future. KVM can provide that
information with much better precision, e.g. KVM knows when it's in the core
vCPU run loop.

> 4 - Even though I could detect no issue, I included linux/kvm_host.h into
> rcu/tree_plugin.h, which is the first time it's getting included
> outside of kvm or arch code, and can be weird.

Heh, kvm_host.h isn't included outside of KVM because several architectures can
build KVM as a module, which means referencing global KVM variables from the kernel
proper won't work.

> An alternative would be to create a new header for providing data for
> non-kvm code.

I doubt a new .h or .c file is needed just for this, there's gotta be a decent
landing spot for a one-off variable. E.g. I wouldn't be at all surprised if there
is additional usefulness in knowing if a CPU is in KVM's core run loop and thus
likely to do a VM-Enter in the near future, at which point you could probably make
a good argument for adding a flag in "struct context_tracking". Even without a
separate use case, there's a good argument for adding that info to context_tracking.

2024-04-05 13:46:18

by Marcelo Tosatti

Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Mon, Apr 01, 2024 at 01:21:25PM -0700, Sean Christopherson wrote:
> On Thu, Mar 28, 2024, Leonardo Bras wrote:
> > I am dealing with a latency issue inside a KVM guest, which is caused by
> > a sched_switch to rcuc[1].
> >
> > During guest entry, kernel code will signal to RCU that current CPU was on
> > a quiescent state, making sure no other CPU is waiting for this one.
> >
> > If a vcpu just stopped running (guest_exit), and a synchronize_rcu() was
> > issued somewhere since guest entry, there is a chance a timer interrupt
> > will happen in that CPU, which will cause rcu_sched_clock_irq() to run.
> >
> > rcu_sched_clock_irq() will check rcu_pending() which will return true,
> > and cause invoke_rcu_core() to be called, which will (in current config)
> > cause rcuc/N to be scheduled into the current cpu.
> >
> > On rcu_pending(), I noticed we can avoid returning true (and thus invoking
> > rcu_core()) if the current cpu is nohz_full, and the cpu came from either
> > idle or userspace, since both are considered quiescent states.
> >
> > Since this is also true for guest context, my idea is to solve this latency
> > issue by avoiding rcu_core() invocation if it was running a guest vcpu.
> >
> > On the other hand, I could not find a way of reliably saying the current
> > cpu was running a guest vcpu, so patch #1 implements a per-cpu variable
> > for keeping the time (jiffies) of the last guest exit.
> >
> > In patch #2 I compare current time to that time, and if less than a second
> > has passed, we just skip rcu_core() invocation, since there is a high chance
> > it will just go back to the guest in a moment.
>
> What's the downside if there's a false positive?

rcuc wakes up (which might exceed the allowed latency threshold
for certain realtime apps).

> > What I know is weird about this patch:
> > 1 - Not sure if this is the best way of finding out if the cpu was
> > running a guest recently.
> >
> > 2 - This per-cpu variable needs to get set at each guest_exit(), so it's
> > overhead, even though it's supposed to be in local cache. If that's
> > an issue, I would suggest having this part compiled out on
> > !CONFIG_NO_HZ_FULL, but further checking each cpu for being nohz_full
> > enabled seems more expensive than just setting this out.
>
> A per-CPU write isn't problematic, but I suspect reading jiffies will be quite
> imprecise, e.g. it'll be a full tick "behind" on many exits.
>
> > 3 - It checks if the guest exit happened more than 1 second ago. This 1
> > second value was copied from rcu_nohz_full_cpu() which checks if the
> > grace period started more than a second ago. If this value is bad,
> > I have no issue changing it.
>
> IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> of what magic time threshold is used.

Why? It works for this particular purpose.

> IIUC, what you want is a way to detect if
> a CPU is likely to _run_ a KVM vCPU in the near future. KVM can provide that
> information with much better precision, e.g. KVM knows when it's in the core
> vCPU run loop.

ktime_t ktime_get(void)
{
	struct timekeeper *tk = &tk_core.timekeeper;
	unsigned int seq;
	ktime_t base;
	u64 nsecs;

	WARN_ON(timekeeping_suspended);

	do {
		seq = read_seqcount_begin(&tk_core.seq);
		base = tk->tkr_mono.base;
		nsecs = timekeeping_get_ns(&tk->tkr_mono);

	} while (read_seqcount_retry(&tk_core.seq, seq));

	return ktime_add_ns(base, nsecs);
}
EXPORT_SYMBOL_GPL(ktime_get);

ktime_get() is more expensive than unsigned long assignment.

What is done is: if the vcpu has entered guest mode in the past, then the
CPU has already passed through an RCU extended quiescent state, therefore
it is not necessary to wake up the rcu core.

The logic is copied from:

/*
 * Is this CPU a NO_HZ_FULL CPU that should ignore RCU so that the
 * grace-period kthread will do force_quiescent_state() processing?
 * The idea is to avoid waking up RCU core processing on such a
 * CPU unless the grace period has extended for too long.
 *
 * This code relies on the fact that all NO_HZ_FULL CPUs are also
 * RCU_NOCB_CPU CPUs.
 */
static bool rcu_nohz_full_cpu(void)
{
#ifdef CONFIG_NO_HZ_FULL
	if (tick_nohz_full_cpu(smp_processor_id()) &&
	    (!rcu_gp_in_progress() ||
	     time_before(jiffies, READ_ONCE(rcu_state.gp_start) + HZ)))
		return true;
#endif /* #ifdef CONFIG_NO_HZ_FULL */
	return false;
}

Note:

avoid waking up RCU core processing on such a
CPU unless the grace period has extended for too long.

> > 4 - Even though I could detect no issue, I included linux/kvm_host.h into
> > rcu/tree_plugin.h, which is the first time it's getting included
> > outside of kvm or arch code, and can be weird.
>
> Heh, kvm_host.h isn't included outside of KVM because several architectures can
> build KVM as a module, which means referencing global KVM variables from the kernel
> proper won't work.
>
> > An alternative would be to create a new header for providing data for
> > non-kvm code.
>
> I doubt a new .h or .c file is needed just for this, there's gotta be a decent
> landing spot for a one-off variable. E.g. I wouldn't be at all surprised if there
> is additional usefulness in knowing if a CPU is in KVM's core run loop and thus
> likely to do a VM-Enter in the near future, at which point you could probably make
> a good argument for adding a flag in "struct context_tracking". Even without a
> separate use case, there's a good argument for adding that info to context_tracking.

Well, jiffies is cheap and just works.

Perhaps we can add higher resolution later if required?


2024-04-05 14:42:48

by Sean Christopherson

Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Fri, Apr 05, 2024, Marcelo Tosatti wrote:
> On Mon, Apr 01, 2024 at 01:21:25PM -0700, Sean Christopherson wrote:
> > On Thu, Mar 28, 2024, Leonardo Bras wrote:
> > > I am dealing with a latency issue inside a KVM guest, which is caused by
> > > a sched_switch to rcuc[1].
> > >
> > > During guest entry, kernel code will signal to RCU that current CPU was on
> > > a quiescent state, making sure no other CPU is waiting for this one.
> > >
> > > If a vcpu just stopped running (guest_exit), and a synchronize_rcu() was
> > > issued somewhere since guest entry, there is a chance a timer interrupt
> > > will happen in that CPU, which will cause rcu_sched_clock_irq() to run.
> > >
> > > rcu_sched_clock_irq() will check rcu_pending() which will return true,
> > > and cause invoke_rcu_core() to be called, which will (in current config)
> > > cause rcuc/N to be scheduled into the current cpu.
> > >
> > > On rcu_pending(), I noticed we can avoid returning true (and thus invoking
> > > rcu_core()) if the current cpu is nohz_full, and the cpu came from either
> > > idle or userspace, since both are considered quiescent states.
> > >
> > > Since this is also true for guest context, my idea is to solve this latency
> > > issue by avoiding rcu_core() invocation if it was running a guest vcpu.
> > >
> > > On the other hand, I could not find a way of reliably saying the current
> > > cpu was running a guest vcpu, so patch #1 implements a per-cpu variable
> > > for keeping the time (jiffies) of the last guest exit.
> > >
> > > In patch #2 I compare current time to that time, and if less than a second
> > > has passed, we just skip rcu_core() invocation, since there is a high chance
> > > it will just go back to the guest in a moment.
> >
> > What's the downside if there's a false positive?
>
> rcuc wakes up (which might exceed the allowed latency threshold
> for certain realtime apps).

Isn't that a false negative? (RCU doesn't detect that a CPU is about to (re)enter
a guest) I was trying to ask about the case where RCU thinks a CPU is about to
enter a guest, but the CPU never does (at least, not in the immediate future).

Or am I just not understanding how RCU's kthreads work?

> > > What I know is weird about this patch:
> > > 1 - Not sure if this is the best way of finding out if the cpu was
> > > running a guest recently.
> > >
> > > 2 - This per-cpu variable needs to get set at each guest_exit(), so it's
> > > overhead, even though it's supposed to be in local cache. If that's
> > > an issue, I would suggest having this part compiled out on
> > > !CONFIG_NO_HZ_FULL, but further checking each cpu for being nohz_full
> > > enabled seems more expensive than just setting this out.
> >
> > A per-CPU write isn't problematic, but I suspect reading jiffies will be quite
> > imprecise, e.g. it'll be a full tick "behind" on many exits.
> >
> > > 3 - It checks if the guest exit happened more than 1 second ago. This 1
> > > second value was copied from rcu_nohz_full_cpu() which checks if the
> > > grace period started more than a second ago. If this value is bad,
> > > I have no issue changing it.
> >
> > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> > of what magic time threshold is used.
>
> Why? It works for this particular purpose.

Because maintaining magic numbers is no fun, AFAICT the heuristic doesn't guard
against edge cases, and I'm pretty sure we can do better with about the same amount
of effort/churn.

> > IIUC, what you want is a way to detect if a CPU is likely to _run_ a KVM
> > vCPU in the near future. KVM can provide that information with much better
> > precision, e.g. KVM knows when it's in the core vCPU run loop.
>
> ktime_t ktime_get(void)
> {
> 	struct timekeeper *tk = &tk_core.timekeeper;
> 	unsigned int seq;
> 	ktime_t base;
> 	u64 nsecs;
>
> 	WARN_ON(timekeeping_suspended);
>
> 	do {
> 		seq = read_seqcount_begin(&tk_core.seq);
> 		base = tk->tkr_mono.base;
> 		nsecs = timekeeping_get_ns(&tk->tkr_mono);
>
> 	} while (read_seqcount_retry(&tk_core.seq, seq));
>
> 	return ktime_add_ns(base, nsecs);
> }
> EXPORT_SYMBOL_GPL(ktime_get);
>
> ktime_get() is more expensive than unsigned long assignment.

Huh? What does ktime_get() have to do with anything? I'm suggesting something
like the below (wants_to_run is from an in-flight patch,
https://lore.kernel.org/all/[email protected]).

---
include/linux/context_tracking.h | 12 ++++++++++++
include/linux/context_tracking_state.h | 3 +++
kernel/rcu/tree.c | 9 +++++++--
virt/kvm/kvm_main.c | 7 +++++++
4 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 6e76b9dba00e..59bc855701c5 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -86,6 +86,16 @@ static __always_inline void context_tracking_guest_exit(void)
__ct_user_exit(CONTEXT_GUEST);
}

+static inline void context_tracking_guest_start_run_loop(void)
+{
+ __this_cpu_write(context_tracking.in_guest_run_loop, true);
+}
+
+static inline void context_tracking_guest_stop_run_loop(void)
+{
+ __this_cpu_write(context_tracking.in_guest_run_loop, false);
+}
+
#define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))

#else
@@ -99,6 +109,8 @@ static inline int ct_state(void) { return -1; }
static inline int __ct_state(void) { return -1; }
static __always_inline bool context_tracking_guest_enter(void) { return false; }
static __always_inline void context_tracking_guest_exit(void) { }
+static inline void context_tracking_guest_start_run_loop(void) { }
+static inline void context_tracking_guest_stop_run_loop(void) { }
#define CT_WARN_ON(cond) do { } while (0)
#endif /* !CONFIG_CONTEXT_TRACKING_USER */

diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index bbff5f7f8803..629ada1a4d81 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -25,6 +25,9 @@ enum ctx_state {
#define CT_DYNTICKS_MASK (~CT_STATE_MASK)

struct context_tracking {
+#if IS_ENABLED(CONFIG_KVM)
+ bool in_guest_run_loop;
+#endif
#ifdef CONFIG_CONTEXT_TRACKING_USER
/*
* When active is false, probes are unset in order
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d9642dd06c25..303ae9ae1c53 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3937,8 +3937,13 @@ static int rcu_pending(int user)
if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
return 1;

- /* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
- if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
+ /*
+ * Is this a nohz_full CPU in userspace, idle, or likely to enter a
+ * guest in the near future? (Ignore RCU if so.)
+ */
+ if ((user || rcu_is_cpu_rrupt_from_idle() ||
+ __this_cpu_read(context_tracking.in_guest_run_loop)) &&
+ rcu_nohz_full_cpu())
return 0;

/* Is the RCU core waiting for a quiescent state from this CPU? */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index bfb2b52a1416..5a7efc669a0f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
{
int cpu = get_cpu();

+ if (vcpu->wants_to_run)
+ context_tracking_guest_start_run_loop();
+
__this_cpu_write(kvm_running_vcpu, vcpu);
preempt_notifier_register(&vcpu->preempt_notifier);
kvm_arch_vcpu_load(vcpu, cpu);
@@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
kvm_arch_vcpu_put(vcpu);
preempt_notifier_unregister(&vcpu->preempt_notifier);
__this_cpu_write(kvm_running_vcpu, NULL);
+
+ if (vcpu->wants_to_run)
+ context_tracking_guest_stop_run_loop();
+
preempt_enable();
}
EXPORT_SYMBOL_GPL(vcpu_put);

base-commit: 619e56a3810c88b8d16d7b9553932ad05f0d4968
--



2024-04-06 00:03:50

by Paul E. McKenney

Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Fri, Apr 05, 2024 at 07:42:35AM -0700, Sean Christopherson wrote:
> On Fri, Apr 05, 2024, Marcelo Tosatti wrote:
> > On Mon, Apr 01, 2024 at 01:21:25PM -0700, Sean Christopherson wrote:
> > > On Thu, Mar 28, 2024, Leonardo Bras wrote:
> > > > I am dealing with a latency issue inside a KVM guest, which is caused by
> > > > a sched_switch to rcuc[1].
> > > >
> > > > During guest entry, kernel code will signal to RCU that current CPU was on
> > > > a quiescent state, making sure no other CPU is waiting for this one.
> > > >
> > > > If a vcpu just stopped running (guest_exit), and a synchronize_rcu() was
> > > > issued somewhere since guest entry, there is a chance a timer interrupt
> > > > will happen in that CPU, which will cause rcu_sched_clock_irq() to run.
> > > >
> > > > rcu_sched_clock_irq() will check rcu_pending() which will return true,
> > > > and cause invoke_rcu_core() to be called, which will (in current config)
> > > > cause rcuc/N to be scheduled into the current cpu.
> > > >
> > > > On rcu_pending(), I noticed we can avoid returning true (and thus invoking
> > > > rcu_core()) if the current cpu is nohz_full, and the cpu came from either
> > > > idle or userspace, since both are considered quiescent states.
> > > >
> > > > Since this is also true for guest context, my idea is to solve this latency
> > > > issue by avoiding rcu_core() invocation if it was running a guest vcpu.
> > > >
> > > > On the other hand, I could not find a way of reliably saying the current
> > > > cpu was running a guest vcpu, so patch #1 implements a per-cpu variable
> > > > for keeping the time (jiffies) of the last guest exit.
> > > >
> > > > In patch #2 I compare current time to that time, and if less than a second
> > > > has passed, we just skip rcu_core() invocation, since there is a high chance
> > > > it will just go back to the guest in a moment.
> > >
> > > What's the downside if there's a false positive?
> >
> > rcuc wakes up (which might exceed the allowed latency threshold
> > for certain realtime apps).
>
> Isn't that a false negative? (RCU doesn't detect that a CPU is about to (re)enter
> a guest) I was trying to ask about the case where RCU thinks a CPU is about to
> enter a guest, but the CPU never does (at least, not in the immediate future).
>
> Or am I just not understanding how RCU's kthreads work?

It is quite possible that the current rcu_pending() code needs help,
given the possibility of vCPU preemption. I have heard of people doing
nested KVM virtualization -- or is that no longer a thing?

But the help might well involve RCU telling the hypervisor that a given
vCPU needs to run. Not sure how that would go over, though it has been
prototyped a couple times in the context of RCU priority boosting.

> > > > What I know is weird about this patch:
> > > > 1 - Not sure if this is the best way of finding out if the cpu was
> > > > running a guest recently.
> > > >
> > > > 2 - This per-cpu variable needs to get set at each guest_exit(), so it's
> > > > overhead, even though it's supposed to be in local cache. If that's
> > > > an issue, I would suggest having this part compiled out on
> > > > !CONFIG_NO_HZ_FULL, but further checking each cpu for being nohz_full
> > > > enabled seems more expensive than just setting this out.
> > >
> > > A per-CPU write isn't problematic, but I suspect reading jiffies will be quite
> > > imprecise, e.g. it'll be a full tick "behind" on many exits.
> > >
> > > > 3 - It checks if the guest exit happened more than 1 second ago. This 1
> > > > second value was copied from rcu_nohz_full_cpu() which checks if the
> > > > grace period started more than a second ago. If this value is bad,
> > > > I have no issue changing it.
> > >
> > > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> > > of what magic time threshold is used.
> >
> > Why? It works for this particular purpose.
>
> Because maintaining magic numbers is no fun, AFAICT the heuristic doesn't guard
> against edge cases, and I'm pretty sure we can do better with about the same amount
> of effort/churn.

Beyond a certain point, we have no choice. How long should RCU let
a CPU run with preemption disabled before complaining? We choose 21
seconds in mainline and some distros choose 60 seconds. Android chooses
20 milliseconds for synchronize_rcu_expedited() grace periods.

> > > IIUC, what you want is a way to detect if a CPU is likely to _run_ a KVM
> > > vCPU in the near future. KVM can provide that information with much better
> > > precision, e.g. KVM knows when it's in the core vCPU run loop.
> >
> > ktime_t ktime_get(void)
> > {
> > 	struct timekeeper *tk = &tk_core.timekeeper;
> > 	unsigned int seq;
> > 	ktime_t base;
> > 	u64 nsecs;
> >
> > 	WARN_ON(timekeeping_suspended);
> >
> > 	do {
> > 		seq = read_seqcount_begin(&tk_core.seq);
> > 		base = tk->tkr_mono.base;
> > 		nsecs = timekeeping_get_ns(&tk->tkr_mono);
> >
> > 	} while (read_seqcount_retry(&tk_core.seq, seq));
> >
> > 	return ktime_add_ns(base, nsecs);
> > }
> > EXPORT_SYMBOL_GPL(ktime_get);
> >
> > ktime_get() is more expensive than unsigned long assignment.
>
> Huh? What does ktime_get() have to do with anything? I'm suggesting something
> like the below (wants_to_run is from an in-flight patch,
> https://lore.kernel.org/all/[email protected]).

Interesting. Some questions below, especially if we are doing nested
virtualization.

> ---
> include/linux/context_tracking.h | 12 ++++++++++++
> include/linux/context_tracking_state.h | 3 +++
> kernel/rcu/tree.c | 9 +++++++--
> virt/kvm/kvm_main.c | 7 +++++++
> 4 files changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 6e76b9dba00e..59bc855701c5 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -86,6 +86,16 @@ static __always_inline void context_tracking_guest_exit(void)
> __ct_user_exit(CONTEXT_GUEST);
> }
>
> +static inline void context_tracking_guest_start_run_loop(void)
> +{
> + __this_cpu_write(context_tracking.in_guest_run_loop, true);
> +}
> +
> +static inline void context_tracking_guest_stop_run_loop(void)
> +{
> + __this_cpu_write(context_tracking.in_guest_run_loop, false);
> +}
> +
> #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
>
> #else
> @@ -99,6 +109,8 @@ static inline int ct_state(void) { return -1; }
> static inline int __ct_state(void) { return -1; }
> static __always_inline bool context_tracking_guest_enter(void) { return false; }
> static __always_inline void context_tracking_guest_exit(void) { }
> +static inline void context_tracking_guest_start_run_loop(void) { }
> +static inline void context_tracking_guest_stop_run_loop(void) { }
> #define CT_WARN_ON(cond) do { } while (0)
> #endif /* !CONFIG_CONTEXT_TRACKING_USER */
>
> diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
> index bbff5f7f8803..629ada1a4d81 100644
> --- a/include/linux/context_tracking_state.h
> +++ b/include/linux/context_tracking_state.h
> @@ -25,6 +25,9 @@ enum ctx_state {
> #define CT_DYNTICKS_MASK (~CT_STATE_MASK)
>
> struct context_tracking {
> +#if IS_ENABLED(CONFIG_KVM)
> + bool in_guest_run_loop;
> +#endif
> #ifdef CONFIG_CONTEXT_TRACKING_USER
> /*
> * When active is false, probes are unset in order
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index d9642dd06c25..303ae9ae1c53 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3937,8 +3937,13 @@ static int rcu_pending(int user)
> if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
> return 1;
>
> - /* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
> - if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> + /*
> + * Is this a nohz_full CPU in userspace, idle, or likely to enter a
> + * guest in the near future? (Ignore RCU if so.)
> + */
> + if ((user || rcu_is_cpu_rrupt_from_idle() ||
> + __this_cpu_read(context_tracking.in_guest_run_loop)) &&

In the case of (user || rcu_is_cpu_rrupt_from_idle()), this CPU was in
a quiescent state just before the current scheduling-clock interrupt and will
again be in a quiescent state right after return from this interrupt.
This means that the grace-period kthread will be able to remotely sense
this quiescent state, so that the current CPU need do nothing.

In contrast, it looks like context_tracking.in_guest_run_loop instead
means that when we return from this interrupt, this CPU will still be
in a non-quiescent state.

Now, in the nested-virtualization case, your point might be that the
lower-level hypervisor could preempt the vCPU in the interrupt handler
just as easily as in the .in_guest_run_loop code. Which is a good point.
But I don't know of a way to handle this other than heuristics and maybe
hinting to the hypervisor (which has been prototyped for RCU priority
boosting).

Maybe the time for such hinting has come?

> + rcu_nohz_full_cpu())

And rcu_nohz_full_cpu() has a one-second timeout, and has for quite
some time.

> return 0;
>
> /* Is the RCU core waiting for a quiescent state from this CPU? */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index bfb2b52a1416..5a7efc669a0f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
> {
> int cpu = get_cpu();
>
> + if (vcpu->wants_to_run)
> + context_tracking_guest_start_run_loop();

At this point, if this is a nohz_full CPU, it will no longer report
quiescent states until the grace period is at least one second old.

> +
> __this_cpu_write(kvm_running_vcpu, vcpu);
> preempt_notifier_register(&vcpu->preempt_notifier);
> kvm_arch_vcpu_load(vcpu, cpu);
> @@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
> kvm_arch_vcpu_put(vcpu);
> preempt_notifier_unregister(&vcpu->preempt_notifier);
> __this_cpu_write(kvm_running_vcpu, NULL);
> +

And also at this point, if this is a nohz_full CPU, it will no longer
report quiescent states until the grace period is at least one second old.

> + if (vcpu->wants_to_run)
> + context_tracking_guest_stop_run_loop();
> +
> preempt_enable();
> }
> EXPORT_SYMBOL_GPL(vcpu_put);
>
> base-commit: 619e56a3810c88b8d16d7b9553932ad05f0d4968

All of which might be OK. Just checking as to whether all of that was
in fact the intent.

Thanx, Paul

2024-04-08 17:16:45

by Sean Christopherson

Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> On Fri, Apr 05, 2024 at 07:42:35AM -0700, Sean Christopherson wrote:
> > On Fri, Apr 05, 2024, Marcelo Tosatti wrote:
> > > rcuc wakes up (which might exceed the allowed latency threshold
> > > for certain realtime apps).
> >
> > Isn't that a false negative? (RCU doesn't detect that a CPU is about to (re)enter
> > a guest) I was trying to ask about the case where RCU thinks a CPU is about to
> > enter a guest, but the CPU never does (at least, not in the immediate future).
> >
> > Or am I just not understanding how RCU's kthreads work?
>
> It is quite possible that the current rcu_pending() code needs help,
> given the possibility of vCPU preemption. I have heard of people doing
> nested KVM virtualization -- or is that no longer a thing?

Nested virtualization is still very much a thing, but I don't see how it is at
all unique with respect to RCU grace periods and quiescent states. More below.

> But the help might well involve RCU telling the hypervisor that a given
> vCPU needs to run. Not sure how that would go over, though it has been
> prototyped a couple times in the context of RCU priority boosting.
>
> > > > > 3 - It checks if the guest exit happened more than 1 second ago. This 1
> > > > > second value was copied from rcu_nohz_full_cpu() which checks if the
> > > > > grace period started more than a second ago. If this value is bad,
> > > > > I have no issue changing it.
> > > >
> > > > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> > > > of what magic time threshold is used.
> > >
> > > Why? It works for this particular purpose.
> >
> > Because maintaining magic numbers is no fun, AFAICT the heuristic doesn't guard
> > against edge cases, and I'm pretty sure we can do better with about the same amount
> > of effort/churn.
>
> Beyond a certain point, we have no choice. How long should RCU let
> a CPU run with preemption disabled before complaining? We choose 21
> seconds in mainline and some distros choose 60 seconds. Android chooses
> 20 milliseconds for synchronize_rcu_expedited() grace periods.

Issuing a warning based on an arbitrary time limit is wildly different than using
an arbitrary time window to make functional decisions. My objection to the "assume
the CPU will enter a quiescent state if it exited a KVM guest in the last second"
is that there are plenty of scenarios where that assumption falls apart, i.e. where
_that_ physical CPU will not re-enter the guest.

Off the top of my head:

- If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
will get false positives, and the *new* pCPU will get false negatives (though
the false negatives aren't all that problematic since the pCPU will enter a
quiescent state on the next VM-Enter).

- If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
won't re-enter the guest. And so the pCPU will get false positives until the
vCPU gets a wake event or the 1 second window expires.

- If the VM terminates, the pCPU will get false positives until the 1 second
window expires.

The false positives are solvable problems, by hooking vcpu_put() to reset
kvm_last_guest_exit. And to help with the false negatives when a vCPU task is
scheduled in on a different pCPU, KVM would hook vcpu_load().
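
Something like the following sketch (on top of the original patch's per-CPU
kvm_last_guest_exit; the helper names here are made up):

static inline void kvm_note_guest_run_loop_entry(void)
{
	/* Called from vcpu_load(): refresh the timestamp on this pCPU so a
	 * migrated-in vCPU doesn't inherit a stale value (false negative).
	 */
	__this_cpu_write(kvm_last_guest_exit, jiffies);
}

static inline void kvm_note_guest_run_loop_exit(void)
{
	/* Called from vcpu_put(): the vCPU task is being scheduled out
	 * (halt, VM shutdown, migration), so stop telling RCU that a
	 * VM-Enter is imminent (avoids the false positives above).
	 */
	__this_cpu_write(kvm_last_guest_exit, 0);
}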

> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index d9642dd06c25..303ae9ae1c53 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3937,8 +3937,13 @@ static int rcu_pending(int user)
> > if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
> > return 1;
> >
> > - /* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
> > - if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > + /*
> > + * Is this a nohz_full CPU in userspace, idle, or likely to enter a
> > + * guest in the near future? (Ignore RCU if so.)
> > + */
> > + if ((user || rcu_is_cpu_rrupt_from_idle() ||
> > + __this_cpu_read(context_tracking.in_guest_run_loop)) &&
>
> In the case of (user || rcu_is_cpu_rrupt_from_idle()), this CPU was in
> a quiescent state just before the current scheduling-clock interrupt and will
> again be in a quiescent state right after return from this interrupt.
> This means that the grace-period kthread will be able to remotely sense
> this quiescent state, so that the current CPU need do nothing.
>
> In contrast, it looks like context_tracking.in_guest_run_loop instead
> means that when we return from this interrupt, this CPU will still be
> in a non-quiescent state.
>
> Now, in the nested-virtualization case, your point might be that the
> lower-level hypervisor could preempt the vCPU in the interrupt handler
> just as easily as in the .in_guest_run_loop code. Which is a good point.
> But I don't know of a way to handle this other than heuristics and maybe
> hinting to the hypervisor (which has been prototyped for RCU priority
> boosting).

Regarding nested virtualization, what exactly is your concern? IIUC, you are
worried about this code running at L1, i.e. as a nested hypervisor, and L0, i.e.
the bare metal hypervisor, scheduling out the L1 CPU. And because the L1 CPU
doesn't get run "soon", it won't enter a quiescent state as expected by RCU.

But that's 100% the case with RCU in a VM in general. If an L1 CPU gets scheduled
out by L0, that L1 CPU won't participate in any RCU stuff until it gets scheduled
back in by L0.

E.g. throw away all of the special case checks for rcu_nohz_full_cpu() in
rcu_pending(), and the exact same problem exists. The L1 CPU could get scheduled
out while trying to run the RCU core kthread just as easily as it could get
scheduled out while trying to run the vCPU task. Or the L1 CPU could get scheduled
out while it's still in the IRQ handler, before it even completes its rcu_pending() check.

And FWIW, it's not just L0 scheduling that is problematic. If something in L0
prevents an L1 CPU (vCPU from L0's perspective) from making forward progress, e.g.
due to a bug in L0, or severe resource contention, from the L1 kernel's perspective,
the L1 CPU will appear stuck and trigger various warnings, e.g. soft-lockup,
need_resched, RCU stalls, etc.

> Maybe the time for such hinting has come?

That's a largely orthogonal discussion. As above, boosting the scheduling priority
of a vCPU because that vCPU is in critical section of some form is not at all
unique to nested virtualization (or RCU).

For basic functional correctness, the L0 hypervisor already has the "hint" it
needs. L0 knows that the L1 CPU wants to run by virtue of the L1 CPU being
runnable, i.e. not halted, not in WFS, etc.

> > + rcu_nohz_full_cpu())
>
> And rcu_nohz_full_cpu() has a one-second timeout, and has for quite
> some time.

That's not a good reason to use a suboptimal heuristic for determining whether
or not a CPU is likely to enter a KVM guest, it simply mitigates the worst case
scenario of a false positive.

> > return 0;
> >
> > /* Is the RCU core waiting for a quiescent state from this CPU? */
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index bfb2b52a1416..5a7efc669a0f 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
> > {
> > int cpu = get_cpu();
> >
> > + if (vcpu->wants_to_run)
> > + context_tracking_guest_start_run_loop();
>
> At this point, if this is a nohz_full CPU, it will no longer report
> quiescent states until the grace period is at least one second old.

I don't think I follow the "will no longer report quiescent states" issue. Are
you saying that this would prevent guest_context_enter_irqoff() from reporting
that the CPU is entering a quiescent state? If so, that's an issue that would
need to be resolved regardless of what heuristic we use to determine whether or
not a CPU is likely to enter a KVM guest.

> > __this_cpu_write(kvm_running_vcpu, vcpu);
> > preempt_notifier_register(&vcpu->preempt_notifier);
> > kvm_arch_vcpu_load(vcpu, cpu);
> > @@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
> > kvm_arch_vcpu_put(vcpu);
> > preempt_notifier_unregister(&vcpu->preempt_notifier);
> > __this_cpu_write(kvm_running_vcpu, NULL);
> > +
>
> And also at this point, if this is a nohz_full CPU, it will no longer
> report quiescent states until the grace period is at least one second old.
>
> > + if (vcpu->wants_to_run)
> > + context_tracking_guest_stop_run_loop();
> > +
> > preempt_enable();
> > }
> > EXPORT_SYMBOL_GPL(vcpu_put);
> >
> > base-commit: 619e56a3810c88b8d16d7b9553932ad05f0d4968
>
> All of which might be OK. Just checking as to whether all of that was
> in fact the intent.
>
> Thanx, Paul

2024-04-08 19:10:08

by Paul E. McKenney

Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > On Fri, Apr 05, 2024 at 07:42:35AM -0700, Sean Christopherson wrote:
> > > On Fri, Apr 05, 2024, Marcelo Tosatti wrote:
> > > > rcuc wakes up (which might exceed the allowed latency threshold
> > > > for certain realtime apps).
> > >
> > > Isn't that a false negative? (RCU doesn't detect that a CPU is about to (re)enter
> > > a guest) I was trying to ask about the case where RCU thinks a CPU is about to
> > > enter a guest, but the CPU never does (at least, not in the immediate future).
> > >
> > > Or am I just not understanding how RCU's kthreads work?
> >
> > It is quite possible that the current rcu_pending() code needs help,
> > given the possibility of vCPU preemption. I have heard of people doing
> > nested KVM virtualization -- or is that no longer a thing?
>
> Nested virtualization is still very much a thing, but I don't see how it is at
> all unique with respect to RCU grace periods and quiescent states. More below.

When the hypervisor runs on bare metal, the existing checks have
interrupts disabled. Yes, you can still get added delays from NMIs and
SMIs, but excessively long NMI/SMI handlers are either considered to be
bugs or happen when the system is already in trouble (backtrace NMIs,
for an example of the latter).

But if the hypervisor is running on top of another hypervisor, then
the scheduling-clock interrupt handler is subject to vCPU preemption,
which can unduly delay reporting of RCU quiescent states.

And no, this is not exactly new, but your patch reminded me of it.

> > But the help might well involve RCU telling the hypervisor that a given
> > vCPU needs to run. Not sure how that would go over, though it has been
> > prototyped a couple times in the context of RCU priority boosting.
> >
> > > > > > 3 - It checks if the guest exit happened more than 1 second ago. This 1
> > > > > > second value was copied from rcu_nohz_full_cpu() which checks if the
> > > > > > grace period started more than a second ago. If this value is bad,
> > > > > > I have no issue changing it.
> > > > >
> > > > > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> > > > > of what magic time threshold is used.
> > > >
> > > > Why? It works for this particular purpose.
> > >
> > > Because maintaining magic numbers is no fun, AFAICT the heuristic doesn't guard
> > > against edge cases, and I'm pretty sure we can do better with about the same amount
> > > of effort/churn.
> >
> > Beyond a certain point, we have no choice. How long should RCU let
> > a CPU run with preemption disabled before complaining? We choose 21
> > seconds in mainline and some distros choose 60 seconds. Android chooses
> > 20 milliseconds for synchronize_rcu_expedited() grace periods.
>
> Issuing a warning based on an arbitrary time limit is wildly different than using
> an arbitrary time window to make functional decisions. My objection to the "assume
> the CPU will enter a quiescent state if it exited a KVM guest in the last second"
> is that there are plenty of scenarios where that assumption falls apart, i.e. where
> _that_ physical CPU will not re-enter the guest.
>
> Off the top of my head:
>
> - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
> will get false positives, and the *new* pCPU will get false negatives (though
> the false negatives aren't all that problematic since the pCPU will enter a
> quiescent state on the next VM-Enter).
>
> - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
> won't re-enter the guest. And so the pCPU will get false positives until the
> vCPU gets a wake event or the 1 second window expires.
>
> - If the VM terminates, the pCPU will get false positives until the 1 second
> window expires.
>
> The false positives are solvable problems, by hooking vcpu_put() to reset
> kvm_last_guest_exit. And to help with the false negatives when a vCPU task is
> scheduled in on a different pCPU, KVM would hook vcpu_load().

Here you are arguing against the heuristic in the original patch, correct?
As opposed to the current RCU heuristic that ignores certain quiescent
states for nohz_full CPUs until the grace period reaches an age of
one second?

If so, no argument here. In fact, please consider my ack cancelled.

> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index d9642dd06c25..303ae9ae1c53 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3937,8 +3937,13 @@ static int rcu_pending(int user)
> > > if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
> > > return 1;
> > >
> > > - /* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
> > > - if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > > + /*
> > > + * Is this a nohz_full CPU in userspace, idle, or likely to enter a
> > > + * guest in the near future? (Ignore RCU if so.)
> > > + */
> > > + if ((user || rcu_is_cpu_rrupt_from_idle() ||
> > > + __this_cpu_read(context_tracking.in_guest_run_loop)) &&
> >
> > In the case of (user || rcu_is_cpu_rrupt_from_idle()), this CPU was in
> > a quiescent state just before the current scheduling-clock interrupt and will
> > again be in a quiescent state right after return from this interrupt.
> > This means that the grace-period kthread will be able to remotely sense
> > this quiescent state, so that the current CPU need do nothing.
> >
> > In contrast, it looks like context_tracking.in_guest_run_loop instead
> > means that when we return from this interrupt, this CPU will still be
> > in a non-quiescent state.
> >
> > Now, in the nested-virtualization case, your point might be that the
> > lower-level hypervisor could preempt the vCPU in the interrupt handler
> > just as easily as in the .in_guest_run_loop code. Which is a good point.
> > But I don't know of a way to handle this other than heuristics and maybe
> > hinting to the hypervisor (which has been prototyped for RCU priority
> > boosting).
>
> Regarding nested virtualization, what exactly is your concern? IIUC, you are
> worried about this code running at L1, i.e. as a nested hypervisor, and L0, i.e.
> the bare metal hypervisor, scheduling out the L1 CPU. And because the L1 CPU
> doesn't get run "soon", it won't enter a quiescent state as expected by RCU.

I don't believe that I have any additional concerns over and above
those for the current situation for nested virtualization. But see my
additional question on your patch below for non-nested virtualization.

> But that's 100% the case with RCU in a VM in general. If an L1 CPU gets scheduled
> out by L0, that L1 CPU won't participate in any RCU stuff until it gets scheduled
> back in by L0.
>
> E.g. throw away all of the special case checks for rcu_nohz_full_cpu() in
> rcu_pending(), and the exact same problem exists. The L1 CPU could get scheduled
> out while trying to run the RCU core kthread just as easily as it could get
> scheduled out while trying to run the vCPU task. Or the L1 CPU could get scheduled
> out while it's still in the IRQ handler, before it even completes its rcu_pending() check.
>
> And FWIW, it's not just L0 scheduling that is problematic. If something in L0
> prevents an L1 CPU (vCPU from L0's perspective) from making forward progress, e.g.
> due to a bug in L0, or severe resource contention, from the L1 kernel's perspective,
> the L1 CPU will appear stuck and trigger various warnings, e.g. soft-lockup,
> need_resched, RCU stalls, etc.

Indeed, there was a USENIX paper on some aspects of this topic some years back.
https://www.usenix.org/conference/atc17/technical-sessions/presentation/prasad

> > Maybe the time for such hinting has come?
>
> That's a largely orthogonal discussion. As above, boosting the scheduling priority
> of a vCPU because that vCPU is in critical section of some form is not at all
> unique to nested virtualization (or RCU).
>
> For basic functional correctness, the L0 hypervisor already has the "hint" it
> needs. L0 knows that the L1 CPU wants to run by virtue of the L1 CPU being
> runnable, i.e. not halted, not in WFS, etc.

And if the system is sufficiently lightly loaded, all will be well, as is
the case with my rcutorture usage. However, if the system is saturated,
that basic functional correctness might not be enough. I haven't heard
many complaints, other than research work, so I have been assuming that
we do not yet need hinting. But you guys tell me. ;-)

> > > + rcu_nohz_full_cpu())
> >
> > And rcu_nohz_full_cpu() has a one-second timeout, and has for quite
> > some time.
>
> That's not a good reason to use a suboptimal heuristic for determining whether
> or not a CPU is likely to enter a KVM guest, it simply mitigates the worst case
> scenario of a false positive.

Again, are you referring to the current RCU code, or the original patch
that started this email thread?

> > > return 0;
> > >
> > > /* Is the RCU core waiting for a quiescent state from this CPU? */
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index bfb2b52a1416..5a7efc669a0f 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
> > > {
> > > int cpu = get_cpu();
> > >
> > > + if (vcpu->wants_to_run)
> > > + context_tracking_guest_start_run_loop();
> >
> > At this point, if this is a nohz_full CPU, it will no longer report
> > quiescent states until the grace period is at least one second old.
>
> I don't think I follow the "will no longer report quiescent states" issue. Are
> you saying that this would prevent guest_context_enter_irqoff() from reporting
> that the CPU is entering a quiescent state? If so, that's an issue that would
> need to be resolved regardless of what heuristic we use to determine whether or
> not a CPU is likely to enter a KVM guest.

Please allow me to start over. Are interrupts disabled at this point,
and, if so, will they remain disabled until the transfer of control to
the guest has become visible to RCU via the context-tracking code?

Or has the context-tracking code already made the transfer of control
to the guest visible to RCU?

> > > __this_cpu_write(kvm_running_vcpu, vcpu);
> > > preempt_notifier_register(&vcpu->preempt_notifier);
> > > kvm_arch_vcpu_load(vcpu, cpu);
> > > @@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
> > > kvm_arch_vcpu_put(vcpu);
> > > preempt_notifier_unregister(&vcpu->preempt_notifier);
> > > __this_cpu_write(kvm_running_vcpu, NULL);
> > > +
> >
> > And also at this point, if this is a nohz_full CPU, it will no longer
> > report quiescent states until the grace period is at least one second old.

And here, are interrupts disabled at this point, and if so, have they
been disabled since the time that the exit from the guest become
visible to RCU via the context-tracking code?

Or will the context-tracking code make the transfer of control to the
guest visible to RCU at some later time?

Thanx, Paul

> > > + if (vcpu->wants_to_run)
> > > + context_tracking_guest_stop_run_loop();
> > > +
> > > preempt_enable();
> > > }
> > > EXPORT_SYMBOL_GPL(vcpu_put);
> > >
> > > base-commit: 619e56a3810c88b8d16d7b9553932ad05f0d4968
> >
> > All of which might be OK. Just checking as to whether all of that was
> > in fact the intent.
> >
> > Thanx, Paul

2024-04-08 21:02:49

by Paul E. McKenney

Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Mon, Apr 08, 2024 at 01:06:00PM -0700, Sean Christopherson wrote:
> On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> > On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> > > On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > > Issuing a warning based on an arbitrary time limit is wildly different than using
> > > an arbitrary time window to make functional decisions. My objection to the "assume
> > > the CPU will enter a quiescent state if it exited a KVM guest in the last second"
> > > is that there are plenty of scenarios where that assumption falls apart, i.e. where
> > > _that_ physical CPU will not re-enter the guest.
> > >
> > > Off the top of my head:
> > >
> > > - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
> > > will get false positives, and the *new* pCPU will get false negatives (though
> > > the false negatives aren't all that problematic since the pCPU will enter a
> > > quiescent state on the next VM-Enter).
> > >
> > > - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
> > > won't re-enter the guest. And so the pCPU will get false positives until the
> > > vCPU gets a wake event or the 1 second window expires.
> > >
> > > - If the VM terminates, the pCPU will get false positives until the 1 second
> > > window expires.
> > >
> > > The false positives are solvable problems, by hooking vcpu_put() to reset
> > > kvm_last_guest_exit. And to help with the false negatives when a vCPU task is
> > > scheduled in on a different pCPU, KVM would hook vcpu_load().
> >
> > Here you are arguing against the heuristic in the original patch, correct?
>
> Yep, correct.

Whew!!! ;-)

> > As opposed to the current RCU heuristic that ignores certain quiescent
> > states for nohz_full CPUs until the grace period reaches an age of
> > one second?
> >
> > If so, no argument here. In fact, please consider my ack cancelled.
>
> ...
>
> > > That's a largely orthogonal discussion. As above, boosting the scheduling priority
> > > of a vCPU because that vCPU is in critical section of some form is not at all
> > > unique to nested virtualization (or RCU).
> > >
> > > For basic functional correctness, the L0 hypervisor already has the "hint" it
> > > needs. L0 knows that the L1 CPU wants to run by virtue of the L1 CPU being
> > > runnable, i.e. not halted, not in WFS, etc.
> >
> > And if the system is sufficiently lightly loaded, all will be well, as is
> > the case with my rcutorture usage. However, if the system is saturated,
> > that basic functional correctness might not be enough. I haven't heard
> > many complaints, other than research work, so I have been assuming that
> > we do not yet need hinting. But you guys tell me. ;-)
>
> We should never use hinting for basic, *default* functionality. If the host is
> so overloaded that it can induce RCU stalls with the default threshold of 21
> seconds, then something in the host's domain is broken/misconfigured. E.g. it
> doesn't necessary have to be a host kernel/userspace bug, it could be an issue
> with VM scheduling at the control plane. But it's still a host issue, and under
> no circumstance should the host need a hint in order for the guest to not complain
> after 20+ seconds.
>
> And _if_ we were to push the default lower, e.g. all the way down to Android's
> aggressive 20 milliseconds, a boosting hint would still be the wrong way to go
> about it, because no sane hypervisor would ever back such a hint with strong
> guarantees for all scenarios.
>
> It's very much possible to achieve a 20ms deadline when running as a VM, but it
> would require strong guarantees about the VM's configuration and environment,
> e.g. that memory isn't overcommited, that each vCPU has a fully dedicated pCPU,
> etc.

Agreed, and again, you guys need to tell me what is necessary here.

> > > > > + rcu_nohz_full_cpu())
> > > >
> > > > And rcu_nohz_full_cpu() has a one-second timeout, and has for quite
> > > > some time.
> > >
> > > That's not a good reason to use a suboptimal heuristic for determining whether
> > > or not a CPU is likely to enter a KVM guest, it simply mitigates the worst case
> > > scenario of a false positive.
> >
> > Again, are you referring to the current RCU code, or the original patch
> > that started this email thread?
>
> Original patch.
>
> > > > > /* Is the RCU core waiting for a quiescent state from this CPU? */
> > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > > index bfb2b52a1416..5a7efc669a0f 100644
> > > > > --- a/virt/kvm/kvm_main.c
> > > > > +++ b/virt/kvm/kvm_main.c
> > > > > @@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
> > > > > {
> > > > > int cpu = get_cpu();
> > > > >
> > > > > + if (vcpu->wants_to_run)
> > > > > + context_tracking_guest_start_run_loop();
> > > >
> > > > At this point, if this is a nohz_full CPU, it will no longer report
> > > > quiescent states until the grace period is at least one second old.
> > >
> > > I don't think I follow the "will no longer report quiescent states" issue. Are
> > > you saying that this would prevent guest_context_enter_irqoff() from reporting
> > > that the CPU is entering a quiescent state? If so, that's an issue that would
> > > need to be resolved regardless of what heuristic we use to determine whether or
> > > not a CPU is likely to enter a KVM guest.
> >
> > Please allow me to start over. Are interrupts disabled at this point,
>
> Nope, IRQs are enabled.
>
> Oof, I'm glad you asked, because I was going to say that there's one exception,
> kvm_sched_in(), which is KVM's notifier for when a preempted task/vCPU is scheduled
> back in. But I forgot that kvm_sched_{in,out}() don't use vcpu_{load,put}(),
> i.e. would need explicit calls to context_tracking_guest_{stop,start}_run_loop().
>
> > and, if so, will they remain disabled until the transfer of control to
> > the guest has become visible to RCU via the context-tracking code?
> >
> > Or has the context-tracking code already made the transfer of control
> > to the guest visible to RCU?
>
> Nope. The call to __ct_user_enter(CONTEXT_GUEST) or rcu_virt_note_context_switch()
> happens later, just before the actual VM-Enter. And that call does happen with
> IRQs disabled (and IRQs stay disabled until the CPU enters the guest).

OK, then we can have difficulties with long-running interrupts hitting
this range of code. It is unfortunately not unheard-of for interrupts
plus trailing softirqs to run for tens of seconds, even minutes.

One counter-argument is that that softirq would take scheduling-clock
interrupts, and would eventually make rcu_core() run.

But does a rcu_sched_clock_irq() from a guest OS have its "user"
argument set?

> > > > > __this_cpu_write(kvm_running_vcpu, vcpu);
> > > > > preempt_notifier_register(&vcpu->preempt_notifier);
> > > > > kvm_arch_vcpu_load(vcpu, cpu);
> > > > > @@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
> > > > > kvm_arch_vcpu_put(vcpu);
> > > > > preempt_notifier_unregister(&vcpu->preempt_notifier);
> > > > > __this_cpu_write(kvm_running_vcpu, NULL);
> > > > > +
> > > >
> > > > And also at this point, if this is a nohz_full CPU, it will no longer
> > > > report quiescent states until the grace period is at least one second old.
> >
> > And here, are interrupts disabled at this point, and if so, have they
> > been disabled since the time that the exit from the guest become
> > visible to RCU via the context-tracking code?
>
> IRQs are enabled.
>
> The gist of my suggestion is:
>
> ioctl(KVM_RUN) {
>
>         context_tracking_guest_start_run_loop();
>
>         for (;;) {
>
>                 vcpu_run();
>
>                 if (<need to return to userspace>)
>                         break;
>         }
>
>         context_tracking_guest_stop_run_loop();
> }
>
> where vcpu_run() encompasses a fairly huge amount of code and functionality,
> including the logic to do world switches between host and guest.
>
> E.g. if a vCPU triggers a VM-Exit because it tried to access memory that has been
> swapped out by the host, KVM could end up way down in mm/ doing I/O to bring a
> page back into memory for the guest. Immediately after VM-Exit, before enabling
> IRQs, KVM will notify RCU that the CPU has exited the extended quiescent state
> (this is what happens today). But the "in KVM run loop" flag would stay set, and
> RCU would rely on rcu_nohz_full_cpu() for protection, e.g. in case faulting in
> memory somehow takes more than a second.
>
> But, barring something that triggers a return to userspace, KVM _will_ re-enter
> the guest as quickly as possible. So it's still a heuristic in the sense that
> the CPU isn't guaranteed to enter the guest, nor are there any enforceable SLOs
> on how quickly the CPU will enter the guest, but I think it's the best tradeoff
> between simplicity and functionality, especially since rcu_nohz_full_cpu() has
> a one second timeout to safeguard against some unforeseen hiccup that prevents
> KVM from re-entering the guest in a timely manner.
>
> Note, as above, my intent is that there would also be hooks in kvm_sched_{in,out}()
> to note that the guest run loop is starting/stopping if the vCPU task yields or
> is preempted.

Very good, same responses as for the context_tracking_guest_start_run_loop()
case.

Thanx, Paul

2024-04-08 21:56:55

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> On Mon, Apr 08, 2024 at 01:06:00PM -0700, Sean Christopherson wrote:
> > On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> > > > > > + if (vcpu->wants_to_run)
> > > > > > + context_tracking_guest_start_run_loop();
> > > > >
> > > > > At this point, if this is a nohz_full CPU, it will no longer report
> > > > > quiescent states until the grace period is at least one second old.
> > > >
> > > > I don't think I follow the "will no longer report quiescent states" issue. Are
> > > > you saying that this would prevent guest_context_enter_irqoff() from reporting
> > > > that the CPU is entering a quiescent state? If so, that's an issue that would
> > > > need to be resolved regardless of what heuristic we use to determine whether or
> > > > not a CPU is likely to enter a KVM guest.
> > >
> > > Please allow me to start over. Are interrupts disabled at this point,
> >
> > Nope, IRQs are enabled.
> >
> > Oof, I'm glad you asked, because I was going to say that there's one exception,
> > kvm_sched_in(), which is KVM's notifier for when a preempted task/vCPU is scheduled
> > back in. But I forgot that kvm_sched_{in,out}() don't use vcpu_{load,put}(),
> > i.e. would need explicit calls to context_tracking_guest_{stop,start}_run_loop().
> >
> > > and, if so, will they remain disabled until the transfer of control to
> > > the guest has become visible to RCU via the context-tracking code?
> > >
> > > Or has the context-tracking code already made the transfer of control
> > > to the guest visible to RCU?
> >
> > Nope. The call to __ct_user_enter(CONTEXT_GUEST) or rcu_virt_note_context_switch()
> > happens later, just before the actual VM-Enter. And that call does happen with
> > IRQs disabled (and IRQs stay disabled until the CPU enters the guest).
>
> OK, then we can have difficulties with long-running interrupts hitting
> this range of code. It is unfortunately not unheard-of for interrupts
> plus trailing softirqs to run for tens of seconds, even minutes.

Ah, and if that occurs, *and* KVM is slow to re-enter the guest, then there will
be a massive lag before the CPU gets back into a quiescent state.

> One counter-argument is that that softirq would take scheduling-clock
> interrupts, and would eventually make rcu_core() run.

Considering that this behavior would be unique to nohz_full CPUs, how much
responsibility does RCU have to ensure a sane setup? E.g. if a softirq runs for
multiple seconds on a nohz_full CPU whose primary role is to run a KVM vCPU, then
whatever real-time workload the vCPU is running is already doomed.

> But does a rcu_sched_clock_irq() from a guest OS have its "user"
> argument set?

No, and it shouldn't, at least not on x86 (I assume other architectures are
similar, but I don't actually know for sure).

On x86, the IRQ that the kernel sees looks like it comes from host kernel
code. And on AMD (SVM), the IRQ doesn't just "look" like it came from host kernel,
the IRQ really does get vectored/handled in the host kernel. Intel CPUs have a
performance optimization where the IRQ gets "eaten" as part of the VM-Exit, and
so KVM synthesizes a stack frame and does a manual CALL to invoke the IRQ handler.

And that's just for IRQs that actually arrive while the guest is running. IRQs
that arrive while KVM is active, e.g. running its large vcpu_run(), are "pure" host
IRQs.
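
To illustrate why "user" ends up false, here is roughly the tick path that leads
to rcu_sched_clock_irq() (paraphrased from memory of mainline, so treat the
details as approximate; tick_sched_handle() and update_process_times() are the
mainline helpers I have in mind, they are not part of this series):

        /* kernel/time/tick-sched.c, approximately */
        static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
        {
                ...
                /*
                 * "regs" is the interrupted context.  When the tick interrupts
                 * a vCPU, the interrupted context is KVM's host-side VM-Exit
                 * path, i.e. kernel mode, so user_mode(regs) is false.
                 */
                update_process_times(user_mode(regs));
        }

        /* kernel/time/timer.c, approximately */
        void update_process_times(int user_tick)
        {
                ...
                /* So "user" is false here even though a guest was interrupted. */
                rcu_sched_clock_irq(user_tick);
                ...
        }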

2024-04-08 22:35:34

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Mon, Apr 08, 2024 at 02:56:29PM -0700, Sean Christopherson wrote:
> On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> > On Mon, Apr 08, 2024 at 01:06:00PM -0700, Sean Christopherson wrote:
> > > On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> > > > > > > + if (vcpu->wants_to_run)
> > > > > > > + context_tracking_guest_start_run_loop();
> > > > > >
> > > > > > At this point, if this is a nohz_full CPU, it will no longer report
> > > > > > quiescent states until the grace period is at least one second old.
> > > > >
> > > > > I don't think I follow the "will no longer report quiescent states" issue. Are
> > > > > you saying that this would prevent guest_context_enter_irqoff() from reporting
> > > > > that the CPU is entering a quiescent state? If so, that's an issue that would
> > > > > need to be resolved regardless of what heuristic we use to determine whether or
> > > > > not a CPU is likely to enter a KVM guest.
> > > >
> > > > Please allow me to start over. Are interrupts disabled at this point,
> > >
> > > Nope, IRQs are enabled.
> > >
> > > Oof, I'm glad you asked, because I was going to say that there's one exception,
> > > kvm_sched_in(), which is KVM's notifier for when a preempted task/vCPU is scheduled
> > > back in. But I forgot that kvm_sched_{in,out}() don't use vcpu_{load,put}(),
> > > i.e. would need explicit calls to context_tracking_guest_{stop,start}_run_loop().
> > >
> > > > and, if so, will they remain disabled until the transfer of control to
> > > > the guest has become visible to RCU via the context-tracking code?
> > > >
> > > > Or has the context-tracking code already made the transfer of control
> > > > to the guest visible to RCU?
> > >
> > > Nope. The call to __ct_user_enter(CONTEXT_GUEST) or rcu_virt_note_context_switch()
> > > happens later, just before the actual VM-Enter. And that call does happen with
> > > IRQs disabled (and IRQs stay disabled until the CPU enters the guest).
> >
> > OK, then we can have difficulties with long-running interrupts hitting
> > this range of code. It is unfortunately not unheard-of for interrupts
> > plus trailing softirqs to run for tens of seconds, even minutes.
>
> Ah, and if that occurs, *and* KVM is slow to re-enter the guest, then there will
> be a massive lag before the CPU gets back into a quiescent state.

Exactly!

> > One counter-argument is that that softirq would take scheduling-clock
> > interrupts, and would eventually make rcu_core() run.
>
> Considering that this behavior would be unique to nohz_full CPUs, how much
> responsibility does RCU have to ensure a sane setup? E.g. if a softirq runs for
> multiple seconds on a nohz_full CPU whose primary role is to run a KVM vCPU, then
> whatever real-time workload the vCPU is running is already doomed.

True, but it is always good to be doing one's part.

> > But does a rcu_sched_clock_irq() from a guest OS have its "user"
> > argument set?
>
> No, and it shouldn't, at least not on x86 (I assume other architectures are
> similar, but I don't actually know for sure).
>
> On x86, the IRQ that the kernel sees looks like it comes from host kernel
> code. And on AMD (SVM), the IRQ doesn't just "look" like it came from host kernel,
> the IRQ really does get vectored/handled in the host kernel. Intel CPUs have a
> performance optimization where the IRQ gets "eaten" as part of the VM-Exit, and
> so KVM synthesizes a stack frame and does a manual CALL to invoke the IRQ handler.
>
> And that's just for IRQs that actually arrive while the guest is running. IRQs
> that arrive while KVM is active, e.g. running its large vcpu_run(), are "pure" host
> IRQs.

OK, then is it possible to get some other indication to the
rcu_sched_clock_irq() function that it has interrupted a guest OS?

Not an emergency, and maybe not even necessary, but it might well be
one hole that would be good to stop up.

Thanx, Paul

2024-04-08 23:20:24

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Mon, Apr 08, 2024 at 04:06:22PM -0700, Sean Christopherson wrote:
> On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> > On Mon, Apr 08, 2024 at 02:56:29PM -0700, Sean Christopherson wrote:
> > > > OK, then we can have difficulties with long-running interrupts hitting
> > > > this range of code. It is unfortunately not unheard-of for interrupts
> > > > plus trailing softirqs to run for tens of seconds, even minutes.
> > >
> > > Ah, and if that occurs, *and* KVM is slow to re-enter the guest, then there will
> > > be a massive lag before the CPU gets back into a quiescent state.
> >
> > Exactly!
>
> ...
>
> > OK, then is it possible to get some other indication to the
> > rcu_sched_clock_irq() function that it has interrupted a guest OS?
>
> It's certainly possible, but I don't think we want to go down that road.
>
> Any functionality built on that would be strictly limited to Intel CPUs, because
> AFAIK, only Intel VMX has the mode where an IRQ can be handled without enabling
> IRQs (which sounds stupid when I write it like that).
>
> E.g. on AMD SVM, if an IRQ interrupts the guest, KVM literally handles it by
> doing:
>
> local_irq_enable();
> ++vcpu->stat.exits;
> local_irq_disable();
>
> which means there's no way for KVM to guarantee that the IRQ that leads to
> rcu_sched_clock_irq() is the _only_ IRQ that is taken (or that what RCU sees was
> even the IRQ that interrupted the guest, though that probably doesn't matter much).
>
> Orthogonal to RCU, I do think it makes sense to have KVM VMX handle IRQs in its
> fastpath for VM-Exit, i.e. handle the IRQ VM-Exit and re-enter the guest without
> ever enabling IRQs. But that's purely a KVM optimization, e.g. to avoid useless
> work when the host has already done what it needed to do.
>
> But even then, to make it so RCU could safely skip invoke_rcu_core(), KVM would
> need to _guarantee_ re-entry to the guest, and I don't think we want to do that.
> E.g. if there is some work that needs to be done on the CPU, re-entering the guest
> is a huge waste of cycles, as KVM would need to do some shenanigans to immediately
> force a VM-Exit. It'd also require a moderate amount of complexity that I wouldn't
> want to maintain, particularly since it'd be Intel-only.

Thank you for the analysis!

It sounds like the current state, imperfect though it might be, is the
best of the known possible worlds at the moment.

But should anyone come up with something better, please do not keep it
a secret!

Thanx, Paul

> > Not an emergency, and maybe not even necessary, but it might well be
> > one hole that would be good to stop up.
> >
> > Thanx, Paul

2024-04-08 23:08:01

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> On Mon, Apr 08, 2024 at 02:56:29PM -0700, Sean Christopherson wrote:
> > > OK, then we can have difficulties with long-running interrupts hitting
> > > this range of code. It is unfortunately not unheard-of for interrupts
> > > plus trailing softirqs to run for tens of seconds, even minutes.
> >
> > Ah, and if that occurs, *and* KVM is slow to re-enter the guest, then there will
> > be a massive lag before the CPU gets back into a quiescent state.
>
> Exactly!

...

> OK, then is it possible to get some other indication to the
> rcu_sched_clock_irq() function that it has interrupted a guest OS?

It's certainly possible, but I don't think we want to go down that road.

Any functionality built on that would be strictly limited to Intel CPUs, because
AFAIK, only Intel VMX has the mode where an IRQ can be handled without enabling
IRQs (which sounds stupid when I write it like that).

E.g. on AMD SVM, if an IRQ interrupts the guest, KVM literally handles it by
doing:

        local_irq_enable();
        ++vcpu->stat.exits;
        local_irq_disable();

which means there's no way for KVM to guarantee that the IRQ that leads to
rcu_sched_clock_irq() is the _only_ IRQ that is taken (or that what RCU sees was
even the IRQ that interrupted the guest, though that probably doesn't matter much).

Orthogonal to RCU, I do think it makes sense to have KVM VMX handle IRQs in its
fastpath for VM-Exit, i.e. handle the IRQ VM-Exit and re-enter the guest without
ever enabling IRQs. But that's purely a KVM optimization, e.g. to avoid useless
work when the host has already done what it needed to do.

But even then, to make it so RCU could safely skip invoke_rcu_core(), KVM would
need to _guarantee_ re-entry to the guest, and I don't think we want to do that.
E.g. if there is some work that needs to be done on the CPU, re-entering the guest
is a huge waste of cycles, as KVM would need to do some shenanigans to immediately
force a VM-Exit. It'd also require a moderate amount of complexity that I wouldn't
want to maintain, particularly since it'd be Intel-only.

> Not an emergency, and maybe not even necessary, but it might well be
> one hole that would be good to stop up.
>
> Thanx, Paul

2024-04-08 20:11:32

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Mon, Apr 08, 2024, Paul E. McKenney wrote:
> On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> > On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > Issuing a warning based on an arbitrary time limit is wildly different than using
> > an arbitrary time window to make functional decisions. My objection to the "assume
> > the CPU will enter a quiescent state if it exited a KVM guest in the last second"
> > is that there are plenty of scenarios where that assumption falls apart, i.e. where
> > _that_ physical CPU will not re-enter the guest.
> >
> > Off the top of my head:
> >
> > - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
> > will get false positives, and the *new* pCPU will get false negatives (though
> > the false negatives aren't all that problematic since the pCPU will enter a
> > quiescent state on the next VM-Enter.
> >
> > - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
> > won't re-enter the guest. And so the pCPU will get false positives until the
> > vCPU gets a wake event or the 1 second window expires.
> >
> > - If the VM terminates, the pCPU will get false positives until the 1 second
> > window expires.
> >
> > The false positives are solvable problems, by hooking vcpu_put() to reset
> > kvm_last_guest_exit. And to help with the false negatives when a vCPU task is
> > scheduled in on a different pCPU, KVM would hook vcpu_load().
>
> Here you are arguing against the heuristic in the original patch, correct?

Yep, correct.

> As opposed to the current RCU heuristic that ignores certain quiescent
> states for nohz_full CPUs until the grace period reaches an age of
> one second?
>
> If so, no argument here. In fact, please consider my ack cancelled.

...

> > That's a largely orthogonal discussion. As above, boosting the scheduling priority
> > of a vCPU because that vCPU is in critical section of some form is not at all
> > unique to nested virtualization (or RCU).
> >
> > For basic functional correctness, the L0 hypervisor already has the "hint" it
> > needs. L0 knows that the L1 CPU wants to run by virtue of the L1 CPU being
> > runnable, i.e. not halted, not in WFS, etc.
>
> And if the system is sufficiently lightly loaded, all will be well, as is
> the case with my rcutorture usage. However, if the system is saturated,
> that basic functional correctness might not be enough. I haven't heard
> many complaints, other than research work, so I have been assuming that
> we do not yet need hinting. But you guys tell me. ;-)

We should never use hinting for basic, *default* functionality. If the host is
so overloaded that it can induce RCU stalls with the default threshold of 21
seconds, then something in the host's domain is broken/misconfigured. E.g. it
doesn't necessarily have to be a host kernel/userspace bug, it could be an issue
with VM scheduling at the control plane. But it's still a host issue, and under
no circumstance should the host need a hint in order for the guest to not complain
after 20+ seconds.

And _if_ we were to push the default lower, e.g. all the way down to Android's
aggressive 20 milliseconds, a boosting hint would still be the wrong way to go
about it, because no sane hypervisor would ever back such a hint with strong
guarantees for all scenarios.

It's very much possible to achieve a 20ms deadline when running as a VM, but it
would require strong guarantees about the VM's configuration and environment,
e.g. that memory isn't overcommited, that each vCPU has a fully dedicated pCPU,
etc.

> > > > + rcu_nohz_full_cpu())
> > >
> > > And rcu_nohz_full_cpu() has a one-second timeout, and has for quite
> > > some time.
> >
> > That's not a good reason to use a suboptimal heuristic for determining whether
> > or not a CPU is likely to enter a KVM guest, it simply mitigates the worst case
> > scenario of a false positive.
>
> Again, are you referring to the current RCU code, or the original patch
> that started this email thread?

Original patch.

> > > > /* Is the RCU core waiting for a quiescent state from this CPU? */
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index bfb2b52a1416..5a7efc669a0f 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
> > > > {
> > > > int cpu = get_cpu();
> > > >
> > > > + if (vcpu->wants_to_run)
> > > > + context_tracking_guest_start_run_loop();
> > >
> > > At this point, if this is a nohz_full CPU, it will no longer report
> > > quiescent states until the grace period is at least one second old.
> >
> > I don't think I follow the "will no longer report quiescent states" issue. Are
> > you saying that this would prevent guest_context_enter_irqoff() from reporting
> > that the CPU is entering a quiescent state? If so, that's an issue that would
> > need to be resolved regardless of what heuristic we use to determine whether or
> > not a CPU is likely to enter a KVM guest.
>
> Please allow me to start over. Are interrupts disabled at this point,

Nope, IRQs are enabled.

Oof, I'm glad you asked, because I was going to say that there's one exception,
kvm_sched_in(), which is KVM's notifier for when a preempted task/vCPU is scheduled
back in. But I forgot that kvm_sched_{in,out}() don't use vcpu_{load,put}(),
i.e. would need explicit calls to context_tracking_guest_{stop,start}_run_loop().

> and, if so, will they remain disabled until the transfer of control to
> the guest has become visible to RCU via the context-tracking code?
>
> Or has the context-tracking code already made the transfer of control
> to the guest visible to RCU?

Nope. The call to __ct_user_enter(CONTEXT_GUEST) or rcu_virt_note_context_switch()
happens later, just before the actual VM-Enter. And that call does happen with
IRQs disabled (and IRQs stay disabled until the CPU enters the guest).
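
For reference, the entry-side hook looks roughly like this (paraphrased from
include/linux/kvm_host.h with the instrumentation markers and long comment
trimmed, so don't treat it as an exact quote):

        static __always_inline void guest_context_enter_irqoff(void)
        {
                /*
                 * With context tracking enabled, this enters the GUEST extended
                 * quiescent state via __ct_user_enter(CONTEXT_GUEST); otherwise
                 * it just tells RCU to treat the VM-Enter as a context switch.
                 */
                if (!context_tracking_guest_enter())
                        rcu_virt_note_context_switch();
        }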

> > > > __this_cpu_write(kvm_running_vcpu, vcpu);
> > > > preempt_notifier_register(&vcpu->preempt_notifier);
> > > > kvm_arch_vcpu_load(vcpu, cpu);
> > > > @@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
> > > > kvm_arch_vcpu_put(vcpu);
> > > > preempt_notifier_unregister(&vcpu->preempt_notifier);
> > > > __this_cpu_write(kvm_running_vcpu, NULL);
> > > > +
> > >
> > > And also at this point, if this is a nohz_full CPU, it will no longer
> > > report quiescent states until the grace period is at least one second old.
>
> And here, are interrupts disabled at this point, and if so, have they
> been disabled since the time that the exit from the guest become
> visible to RCU via the context-tracking code?

IRQs are enabled.

The gist of my suggestion is:

ioctl(KVM_RUN) {

        context_tracking_guest_start_run_loop();

        for (;;) {

                vcpu_run();

                if (<need to return to userspace>)
                        break;
        }

        context_tracking_guest_stop_run_loop();
}

where vcpu_run() encompasses a fairly huge amount of code and functionality,
including the logic to do world switches between host and guest.

E.g. if a vCPU triggers a VM-Exit because it tried to access memory that has been
swapped out by the host, KVM could end up way down in mm/ doing I/O to bring a
page back into memory for the guest. Immediately after VM-Exit, before enabling
IRQs, KVM will notify RCU that the CPU has exited the extended quiescent state
(this is what happens today). But the "in KVM run loop" flag would stay set, and
RCU would rely on rcu_nohz_full_cpu() for protection, e.g. in case faulting in
memory somehow takes more than a second.

But, barring something that triggers a return to userspace, KVM _will_ re-enter
the guest as quickly as possible. So it's still a heuristic in the sense that
the CPU isn't guaranteed to enter the guest, nor are there any enforceable SLOs
on how quickly the CPU will enter the guest, but I think it's the best tradeoff
between simplicity and functionality, especially since rcu_nohz_full_cpu() has
a one second timeout to safeguard against some unforeseen hiccup that prevents
KVM from re-entering the guest in a timely manner.

Note, as above, my intent is that there would also be hooks in kvm_sched_{in,out}()
to note that the guest run loop is starting/stopping if the vCPU task yields or
is preempted.
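
Concretely, the sched_{in,out} side would be something like this (untested
sketch, not a real patch; context_tracking_guest_{start,stop}_run_loop() are the
proposed helpers from above, vcpu->wants_to_run is from the series referenced
earlier, and the existing bodies of the notifiers are elided):

        static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
        {
                struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

                /* vCPU task is being scheduled back in, VM-Enter is imminent again. */
                if (vcpu->wants_to_run)
                        context_tracking_guest_start_run_loop();
                ...
        }

        static void kvm_sched_out(struct preempt_notifier *pn, struct task_struct *next)
        {
                struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

                ...
                /* vCPU task is being preempted, stop telling RCU a VM-Enter is imminent. */
                if (vcpu->wants_to_run)
                        context_tracking_guest_stop_run_loop();
        }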

2024-04-11 17:41:59

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > On Fri, Apr 05, 2024 at 07:42:35AM -0700, Sean Christopherson wrote:
> > > On Fri, Apr 05, 2024, Marcelo Tosatti wrote:
> > > > rcuc wakes up (which might exceed the allowed latency threshold
> > > > for certain realtime apps).
> > >
> > > Isn't that a false negative? (RCU doesn't detect that a CPU is about to (re)enter
> > > a guest) I was trying to ask about the case where RCU thinks a CPU is about to
> > > enter a guest, but the CPU never does (at least, not in the immediate future).
> > >
> > > Or am I just not understanding how RCU's kthreads work?
> >
> > It is quite possible that the current rcu_pending() code needs help,
> > given the possibility of vCPU preemption. I have heard of people doing
> > nested KVM virtualization -- or is that no longer a thing?
>
> Nested virtualization is still very much a thing, but I don't see how it is at
> all unique with respect to RCU grace periods and quiescent states. More below.
>
> > But the help might well involve RCU telling the hypervisor that a given
> > vCPU needs to run. Not sure how that would go over, though it has been
> > prototyped a couple times in the context of RCU priority boosting.
> >
> > > > > > 3 - It checks if the guest exit happened over than 1 second ago. This 1
> > > > > > second value was copied from rcu_nohz_full_cpu() which checks if the
> > > > > > grace period started over than a second ago. If this value is bad,
> > > > > > I have no issue changing it.
> > > > >
> > > > > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> > > > > of what magic time threshold is used.
> > > >
> > > > Why? It works for this particular purpose.
> > >
> > > Because maintaining magic numbers is no fun, AFAICT the heuristic doesn't guard
> > > against edge cases, and I'm pretty sure we can do better with about the same amount
> > > of effort/churn.
> >
> > Beyond a certain point, we have no choice. How long should RCU let
> > a CPU run with preemption disabled before complaining? We choose 21
> > seconds in mainline and some distros choose 60 seconds. Android chooses
> > 20 milliseconds for synchronize_rcu_expedited() grace periods.
>
> Issuing a warning based on an arbitrary time limit is wildly different than using
> an arbitrary time window to make functional decisions. My objection to the "assume
> the CPU will enter a quiescent state if it exited a KVM guest in the last second"
> is that there are plenty of scenarios where that assumption falls apart, i.e. where
> _that_ physical CPU will not re-enter the guest.
>
> Off the top of my head:
>
> - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
> will get false positives, and the *new* pCPU will get false negatives (though
> the false negatives aren't all that problematic since the pCPU will enter a
> quiescent state on the next VM-Enter.
>
> - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
> won't re-enter the guest. And so the pCPU will get false positives until the
> vCPU gets a wake event or the 1 second window expires.
>
> - If the VM terminates, the pCPU will get false positives until the 1 second
> window expires.
>
> The false positives are solvable problems, by hooking vcpu_put() to reset
> kvm_last_guest_exit. And to help with the false negatives when a vCPU task is
> scheduled in on a different pCPU, KVM would hook vcpu_load().

Sean,

It seems that fixing the problems you pointed out above is a way to go.


2024-04-15 20:04:27

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > On Fri, Apr 05, 2024 at 07:42:35AM -0700, Sean Christopherson wrote:
> > > On Fri, Apr 05, 2024, Marcelo Tosatti wrote:
> > > > rcuc wakes up (which might exceed the allowed latency threshold
> > > > for certain realtime apps).
> > >
> > > Isn't that a false negative? (RCU doesn't detect that a CPU is about to (re)enter
> > > a guest) I was trying to ask about the case where RCU thinks a CPU is about to
> > > enter a guest, but the CPU never does (at least, not in the immediate future).
> > >
> > > Or am I just not understanding how RCU's kthreads work?
> >
> > It is quite possible that the current rcu_pending() code needs help,
> > given the possibility of vCPU preemption. I have heard of people doing
> > nested KVM virtualization -- or is that no longer a thing?
>
> Nested virtualization is still very much a thing, but I don't see how it is at
> all unique with respect to RCU grace periods and quiescent states. More below.
>
> > But the help might well involve RCU telling the hypervisor that a given
> > vCPU needs to run. Not sure how that would go over, though it has been
> > prototyped a couple times in the context of RCU priority boosting.
> >
> > > > > > 3 - It checks if the guest exit happened over than 1 second ago. This 1
> > > > > > second value was copied from rcu_nohz_full_cpu() which checks if the
> > > > > > grace period started over than a second ago. If this value is bad,
> > > > > > I have no issue changing it.
> > > > >
> > > > > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> > > > > of what magic time threshold is used.
> > > >
> > > > Why? It works for this particular purpose.
> > >
> > > Because maintaining magic numbers is no fun, AFAICT the heuristic doesn't guard
> > > against edge cases, and I'm pretty sure we can do better with about the same amount
> > > of effort/churn.
> >
> > Beyond a certain point, we have no choice. How long should RCU let
> > a CPU run with preemption disabled before complaining? We choose 21
> > seconds in mainline and some distros choose 60 seconds. Android chooses
> > 20 milliseconds for synchronize_rcu_expedited() grace periods.
>
> Issuing a warning based on an arbitrary time limit is wildly different than using
> an arbitrary time window to make functional decisions. My objection to the "assume
> the CPU will enter a quiescent state if it exited a KVM guest in the last second"
> is that there are plenty of scenarios where that assumption falls apart, i.e. where
> _that_ physical CPU will not re-enter the guest.
>
> Off the top of my head:
>
> - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
> will get false positives, and the *new* pCPU will get false negatives (though
> the false negatives aren't all that problematic since the pCPU will enter a
> quiescent state on the next VM-Enter.
>
> - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
> won't re-enter the guest. And so the pCPU will get false positives until the
> vCPU gets a wake event or the 1 second window expires.
>
> - If the VM terminates, the pCPU will get false positives until the 1 second
> window expires.
>
> The false positives are solvable problems, by hooking vcpu_put() to reset
> kvm_last_guest_exit. And to help with the false negatives when a vCPU task is
> scheduled in on a different pCPU, KVM would hook vcpu_load().

Hi Sean,

So this should deal with it? (untested, don't apply...).

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 48f31dcd318a..be90d83d631a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -477,6 +477,16 @@ static __always_inline void guest_state_enter_irqoff(void)
         lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
+DECLARE_PER_CPU(unsigned long, kvm_last_guest_exit);
+
+/*
+ * Returns time (jiffies) for the last guest exit in current cpu
+ */
+static inline unsigned long guest_exit_last_time(void)
+{
+        return this_cpu_read(kvm_last_guest_exit);
+}
+
 /*
  * Exit guest context and exit an RCU extended quiescent state.
  *
@@ -488,6 +498,9 @@ static __always_inline void guest_state_enter_irqoff(void)
 static __always_inline void guest_context_exit_irqoff(void)
 {
         context_tracking_guest_exit();
+
+        /* Keeps track of last guest exit */
+        this_cpu_write(kvm_last_guest_exit, jiffies);
 }
 
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fb49c2a60200..231d0e4d2cf1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -110,6 +110,9 @@ static struct kmem_cache *kvm_vcpu_cache;
 static __read_mostly struct preempt_ops kvm_preempt_ops;
 static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);
 
+DEFINE_PER_CPU(unsigned long, kvm_last_guest_exit);
+EXPORT_SYMBOL_GPL(kvm_last_guest_exit);
+
 struct dentry *kvm_debugfs_dir;
 EXPORT_SYMBOL_GPL(kvm_debugfs_dir);
 
@@ -210,6 +213,7 @@ void vcpu_load(struct kvm_vcpu *vcpu)
         int cpu = get_cpu();
 
         __this_cpu_write(kvm_running_vcpu, vcpu);
+        __this_cpu_write(kvm_last_guest_exit, 0);
         preempt_notifier_register(&vcpu->preempt_notifier);
         kvm_arch_vcpu_load(vcpu, cpu);
         put_cpu();
@@ -222,6 +226,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
         kvm_arch_vcpu_put(vcpu);
         preempt_notifier_unregister(&vcpu->preempt_notifier);
         __this_cpu_write(kvm_running_vcpu, NULL);
+        __this_cpu_write(kvm_last_guest_exit, 0);
         preempt_enable();
 }
 EXPORT_SYMBOL_GPL(vcpu_put);


2024-04-15 21:29:50

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Mon, Apr 15, 2024, Marcelo Tosatti wrote:
> On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> > On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > > Beyond a certain point, we have no choice. How long should RCU let
> > > a CPU run with preemption disabled before complaining? We choose 21
> > > seconds in mainline and some distros choose 60 seconds. Android chooses
> > > 20 milliseconds for synchronize_rcu_expedited() grace periods.
> >
> > Issuing a warning based on an arbitrary time limit is wildly different than using
> > an arbitrary time window to make functional decisions. My objection to the "assume
> > the CPU will enter a quiescent state if it exited a KVM guest in the last second"
> > is that there are plenty of scenarios where that assumption falls apart, i.e. where
> > _that_ physical CPU will not re-enter the guest.
> >
> > Off the top of my head:
> >
> > - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
> > will get false positives, and the *new* pCPU will get false negatives (though
> > the false negatives aren't all that problematic since the pCPU will enter a
> > quiescent state on the next VM-Enter.
> >
> > - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
> > won't re-enter the guest. And so the pCPU will get false positives until the
> > vCPU gets a wake event or the 1 second window expires.
> >
> > - If the VM terminates, the pCPU will get false positives until the 1 second
> > window expires.
> >
> > The false positives are solvable problems, by hooking vcpu_put() to reset
> > kvm_last_guest_exit. And to help with the false negatives when a vCPU task is
> > scheduled in on a different pCPU, KVM would hook vcpu_load().
>
> Hi Sean,
>
> So this should deal with it? (untested, don't apply...).

Not entirely. As I belatedly noted, hooking vcpu_put() doesn't handle the case
where the vCPU is preempted, i.e. kvm_sched_out() would also need to zero out
kvm_last_guest_exit to avoid a false positive. Going through the scheduler will
note the CPU is quiescent for the current grace period, but after that RCU will
still see a non-zero kvm_last_guest_exit even though the vCPU task isn't actively
running.

And snapshotting the VM-Exit time will get false negatives when the vCPU is about
to run, but for whatever reason has kvm_last_guest_exit=0, e.g. if a vCPU was
preempted and/or migrated to a different pCPU.

I don't understand the motivation for keeping the kvm_last_guest_exit logic. My
understanding is that RCU already has a timeout to avoid stalling RCU. I don't
see what is gained by effectively duplicating that timeout for KVM. Why not have
KVM provide a "this task is in KVM_RUN" flag, and then let the existing timeout
handle the (hopefully rare) case where KVM doesn't "immediately" re-enter the guest?
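
In code form, the RCU side of what I'm suggesting is roughly the following
(sketch only; context_tracking_guest_in_run_loop() is a placeholder name for
however the state ends up being exposed, not an existing API):

        /* Is this a nohz_full CPU in userspace, idle, or inside KVM's run loop? */
        if ((user || rcu_is_cpu_rrupt_from_idle() ||
             context_tracking_guest_in_run_loop()) && rcu_nohz_full_cpu())
                return 0;

with the underlying state flipped by context_tracking_guest_{start,stop}_run_loop()
calls in vcpu_load()/vcpu_put() and kvm_sched_{in,out}().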

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 48f31dcd318a..be90d83d631a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -477,6 +477,16 @@ static __always_inline void guest_state_enter_irqoff(void)
> lockdep_hardirqs_on(CALLER_ADDR0);
> }
>
> +DECLARE_PER_CPU(unsigned long, kvm_last_guest_exit);
> +
> +/*
> + * Returns time (jiffies) for the last guest exit in current cpu
> + */
> +static inline unsigned long guest_exit_last_time(void)
> +{
> + return this_cpu_read(kvm_last_guest_exit);
> +}
> +
> /*
> * Exit guest context and exit an RCU extended quiescent state.
> *
> @@ -488,6 +498,9 @@ static __always_inline void guest_state_enter_irqoff(void)
> static __always_inline void guest_context_exit_irqoff(void)
> {
> context_tracking_guest_exit();
> +
> + /* Keeps track of last guest exit */
> + this_cpu_write(kvm_last_guest_exit, jiffies);
> }
>
> /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fb49c2a60200..231d0e4d2cf1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -110,6 +110,9 @@ static struct kmem_cache *kvm_vcpu_cache;
> static __read_mostly struct preempt_ops kvm_preempt_ops;
> static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);
>
> +DEFINE_PER_CPU(unsigned long, kvm_last_guest_exit);
> +EXPORT_SYMBOL_GPL(kvm_last_guest_exit);
> +
> struct dentry *kvm_debugfs_dir;
> EXPORT_SYMBOL_GPL(kvm_debugfs_dir);
>
> @@ -210,6 +213,7 @@ void vcpu_load(struct kvm_vcpu *vcpu)
> int cpu = get_cpu();
>
> __this_cpu_write(kvm_running_vcpu, vcpu);
> + __this_cpu_write(kvm_last_guest_exit, 0);
> preempt_notifier_register(&vcpu->preempt_notifier);
> kvm_arch_vcpu_load(vcpu, cpu);
> put_cpu();
> @@ -222,6 +226,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
> kvm_arch_vcpu_put(vcpu);
> preempt_notifier_unregister(&vcpu->preempt_notifier);
> __this_cpu_write(kvm_running_vcpu, NULL);
> + __this_cpu_write(kvm_last_guest_exit, 0);
> preempt_enable();
> }
> EXPORT_SYMBOL_GPL(vcpu_put);
>

2024-04-16 12:37:50

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Mon, Apr 15, 2024 at 02:29:32PM -0700, Sean Christopherson wrote:
> On Mon, Apr 15, 2024, Marcelo Tosatti wrote:
> > On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> > > On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > > > Beyond a certain point, we have no choice. How long should RCU let
> > > > a CPU run with preemption disabled before complaining? We choose 21
> > > > seconds in mainline and some distros choose 60 seconds. Android chooses
> > > > 20 milliseconds for synchronize_rcu_expedited() grace periods.
> > >
> > > Issuing a warning based on an arbitrary time limit is wildly different than using
> > > an arbitrary time window to make functional decisions. My objection to the "assume
> > > the CPU will enter a quiescent state if it exited a KVM guest in the last second"
> > > is that there are plenty of scenarios where that assumption falls apart, i.e. where
> > > _that_ physical CPU will not re-enter the guest.
> > >
> > > Off the top of my head:
> > >
> > > - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
> > > will get false positives, and the *new* pCPU will get false negatives (though
> > > the false negatives aren't all that problematic since the pCPU will enter a
> > > quiescent state on the next VM-Enter.
> > >
> > > - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
> > > won't re-enter the guest. And so the pCPU will get false positives until the
> > > vCPU gets a wake event or the 1 second window expires.
> > >
> > > - If the VM terminates, the pCPU will get false positives until the 1 second
> > > window expires.
> > >
> > > The false positives are solvable problems, by hooking vcpu_put() to reset
> > > kvm_last_guest_exit. And to help with the false negatives when a vCPU task is
> > > scheduled in on a different pCPU, KVM would hook vcpu_load().
> >
> > Hi Sean,
> >
> > So this should deal with it? (untested, don't apply...).
>
> Not entirely. As I belatedly noted, hooking vcpu_put() doesn't handle the case
> where the vCPU is preempted, i.e. kvm_sched_out() would also need to zero out
> kvm_last_guest_exit to avoid a false positive.

True. Can fix that.

> Going through the scheduler will
> note the CPU is quiescent for the current grace period, but after that RCU will
> still see a non-zero kvm_last_guest_exit even though the vCPU task isn't actively
> running.

Right, can fix kvm_sched_out().

> And snapshotting the VM-Exit time will get false negatives when the vCPU is about
> to run, but for whatever reason has kvm_last_guest_exit=0, e.g. if a vCPU was
> preempted and/or migrated to a different pCPU.

Right, for the use-case where waking up rcuc is a problem, the pCPU is
isolated (there are no userspace processes and hopefully no kernel threads
executing there), and the vCPU is pinned to that pCPU.

So there should be no preemptions or migrations.

> I don't understand the motivation for keeping the kvm_last_guest_exit logic.

The motivation is to _avoid_ waking up rcuc to perform RCU core
processing, in case the vCPU runs on a nohz full CPU, since
entering the VM is an extended quiescent state.

The logic for userspace/idle extended quiescent states is:

This is called from the sched clock interrupt.

/*
 * This function is invoked from each scheduling-clock interrupt,
 * and checks to see if this CPU is in a non-context-switch quiescent
 * state, for example, user mode or idle loop. It also schedules RCU
 * core processing. If the current grace period has gone on too long,
 * it will ask the scheduler to manufacture a context switch for the sole
 * purpose of providing the needed quiescent state.
 */
void rcu_sched_clock_irq(int user)
{
        ...
        if (rcu_pending(user))
                invoke_rcu_core();
        ...
}

And, from rcu_pending:

        /* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
        if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
                return 0;

/*
 * Is this CPU a NO_HZ_FULL CPU that should ignore RCU so that the
 * grace-period kthread will do force_quiescent_state() processing?
 * The idea is to avoid waking up RCU core processing on such a
 * CPU unless the grace period has extended for too long.
 *
 * This code relies on the fact that all NO_HZ_FULL CPUs are also
 * RCU_NOCB_CPU CPUs.
 */
static bool rcu_nohz_full_cpu(void)
{
#ifdef CONFIG_NO_HZ_FULL
        if (tick_nohz_full_cpu(smp_processor_id()) &&
            (!rcu_gp_in_progress() ||
             time_before(jiffies, READ_ONCE(rcu_state.gp_start) + HZ)))
                return true;
#endif /* #ifdef CONFIG_NO_HZ_FULL */
        return false;
}

Does that make sense?

> My understanding is that RCU already has a timeout to avoid stalling RCU. I don't
> see what is gained by effectively duplicating that timeout for KVM.

The point is not to avoid stalling RCU. The point is to not perform RCU
core processing through rcuc thread (because that interrupts execution
of the vCPU thread), if it is known that an extended quiescent state
will occur "soon" anyway (via VM-entry).

If the extended quiescent state does not occur in 1 second, then rcuc
will be woken up (the time_before call in rcu_nohz_full_cpu function
above).
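
For concreteness, with the kvm_last_guest_exit snapshot from the patch I posted
earlier, the check would look something like this (untested sketch;
guest_exit_last_time() is the helper from that patch, and the exact condition
below is illustrative rather than a literal quote of the RFC):

        /* Is this a nohz_full CPU in userspace, idle, or fresh out of a guest? */
        if ((user || rcu_is_cpu_rrupt_from_idle() ||
             time_before(jiffies, guest_exit_last_time() + HZ)) &&
            rcu_nohz_full_cpu())
                return 0;

i.e. the same one second bound that rcu_nohz_full_cpu() uses, but measured from
the last VM-exit instead of from the start of the grace period.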

> Why not have
> KVM provide a "this task is in KVM_RUN" flag, and then let the existing timeout
> handle the (hopefully rare) case where KVM doesn't "immediately" re-enter the guest?

Do you mean something like:

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d9642dd06c25..0ca5a6a45025 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3938,7 +3938,7 @@ static int rcu_pending(int user)
                 return 1;
 
         /* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
-        if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
+        if ((user || rcu_is_cpu_rrupt_from_idle() || this_cpu->in_kvm_run) && rcu_nohz_full_cpu())
                 return 0;
 
         /* Is the RCU core waiting for a quiescent state from this CPU? */

The problem is:

1) You should only set that flag, in the VM-entry path, after the point
where no use of RCU is made: close to guest_state_enter_irqoff call.

2) While handling a VM-exit, a host timer interrupt can occur before that,
or after the point where "this_cpu->in_kvm_run" is set to false.

And a host timer interrupt calls rcu_sched_clock_irq which is going to
wake up rcuc.

Or am I missing something?

Thanks.

> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 48f31dcd318a..be90d83d631a 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -477,6 +477,16 @@ static __always_inline void guest_state_enter_irqoff(void)
> > lockdep_hardirqs_on(CALLER_ADDR0);
> > }
> >
> > +DECLARE_PER_CPU(unsigned long, kvm_last_guest_exit);
> > +
> > +/*
> > + * Returns time (jiffies) for the last guest exit in current cpu
> > + */
> > +static inline unsigned long guest_exit_last_time(void)
> > +{
> > + return this_cpu_read(kvm_last_guest_exit);
> > +}
> > +
> > /*
> > * Exit guest context and exit an RCU extended quiescent state.
> > *
> > @@ -488,6 +498,9 @@ static __always_inline void guest_state_enter_irqoff(void)
> > static __always_inline void guest_context_exit_irqoff(void)
> > {
> > context_tracking_guest_exit();
> > +
> > + /* Keeps track of last guest exit */
> > + this_cpu_write(kvm_last_guest_exit, jiffies);
> > }
> >
> > /*
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index fb49c2a60200..231d0e4d2cf1 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -110,6 +110,9 @@ static struct kmem_cache *kvm_vcpu_cache;
> > static __read_mostly struct preempt_ops kvm_preempt_ops;
> > static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);
> >
> > +DEFINE_PER_CPU(unsigned long, kvm_last_guest_exit);
> > +EXPORT_SYMBOL_GPL(kvm_last_guest_exit);
> > +
> > struct dentry *kvm_debugfs_dir;
> > EXPORT_SYMBOL_GPL(kvm_debugfs_dir);
> >
> > @@ -210,6 +213,7 @@ void vcpu_load(struct kvm_vcpu *vcpu)
> > int cpu = get_cpu();
> >
> > __this_cpu_write(kvm_running_vcpu, vcpu);
> > + __this_cpu_write(kvm_last_guest_exit, 0);
> > preempt_notifier_register(&vcpu->preempt_notifier);
> > kvm_arch_vcpu_load(vcpu, cpu);
> > put_cpu();
> > @@ -222,6 +226,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
> > kvm_arch_vcpu_put(vcpu);
> > preempt_notifier_unregister(&vcpu->preempt_notifier);
> > __this_cpu_write(kvm_running_vcpu, NULL);
> > + __this_cpu_write(kvm_last_guest_exit, 0);
> > preempt_enable();
> > }
> > EXPORT_SYMBOL_GPL(vcpu_put);
> >
>
>


2024-04-16 14:07:50

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Tue, Apr 16, 2024, Marcelo Tosatti wrote:
> On Mon, Apr 15, 2024 at 02:29:32PM -0700, Sean Christopherson wrote:
> > And snapshotting the VM-Exit time will get false negatives when the vCPU is about
> > to run, but for whatever reason has kvm_last_guest_exit=0, e.g. if a vCPU was
> > preempted and/or migrated to a different pCPU.
>
> Right, for the use-case where waking up rcuc is a problem, the pCPU is
> isolated (there are no userspace processes and hopefully no kernel threads
> executing there), vCPU pinned to that pCPU.
>
> So there should be no preemptions or migrations.

I understand that preemption/migration will not be problematic if the system is
configured "correctly", but we still need to play nice with other scenarios and/or
suboptimal setups. While false positives aren't fatal, KVM still should do its
best to avoid them, especially when it's relatively easy to do so.

> > My understanding is that RCU already has a timeout to avoid stalling RCU. I don't
> > see what is gained by effectively duplicating that timeout for KVM.
>
> The point is not to avoid stalling RCU. The point is to not perform RCU
> core processing through rcuc thread (because that interrupts execution
> of the vCPU thread), if it is known that an extended quiescent state
> will occur "soon" anyway (via VM-entry).

I know. My point is that, as you note below, RCU will wake-up rcuc after 1 second
even if KVM is still reporting a VM-Enter is imminent, i.e. there's a 1 second
timeout to avoid an RCU stall due to KVM never completing entry to the guest.

> If the extended quiescent state does not occur in 1 second, then rcuc
> will be woken up (the time_before call in rcu_nohz_full_cpu function
> above).
>
> > Why not have
> > KVM provide a "this task is in KVM_RUN" flag, and then let the existing timeout
> > handle the (hopefully rare) case where KVM doesn't "immediately" re-enter the guest?
>
> Do you mean something like:
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index d9642dd06c25..0ca5a6a45025 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3938,7 +3938,7 @@ static int rcu_pending(int user)
> return 1;
>
> /* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
> - if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> + if ((user || rcu_is_cpu_rrupt_from_idle() || this_cpu->in_kvm_run) && rcu_nohz_full_cpu())
> return 0;

Yes. This, https://lore.kernel.org/all/[email protected], plus logic
in kvm_sched_{in,out}().

> /* Is the RCU core waiting for a quiescent state from this CPU? */
>
> The problem is:
>
> 1) You should only set that flag, in the VM-entry path, after the point
> where no use of RCU is made: close to guest_state_enter_irqoff call.

Why? As established above, KVM essentially has 1 second to enter the guest after
setting in_guest_run_loop (or whatever we call it). In the vast majority of cases,
the time before KVM enters the guest can probably be measured in microseconds.

Snapshotting the exit time has the exact same problem of depending on KVM to
re-enter the guest soon-ish, so I don't understand why this would be considered
a problem with a flag to note the CPU is in KVM's run loop, but not with a
snapshot to say the CPU recently exited a KVM guest.

> 2) While handling a VM-exit, a host timer interrupt can occur before that,
> or after the point where "this_cpu->in_kvm_run" is set to false.
>
> And a host timer interrupt calls rcu_sched_clock_irq which is going to
> wake up rcuc.

If in_kvm_run is false when the IRQ is handled, then either KVM exited to userspace
or the vCPU was scheduled out. In the former case, rcuc won't be woken up if the
CPU is in userspace. And in the latter case, waking up rcuc is absolutely the
correct thing to do as VM-Enter is not imminent.

For exits to userspace, there would be a small window where an IRQ could arrive
between KVM putting the vCPU and the CPU actually returning to userspace, but
unless that's problematic in practice, I think it's a reasonable tradeoff.

2024-04-17 16:23:00

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Tue, Apr 16, 2024 at 07:07:32AM -0700, Sean Christopherson wrote:
> On Tue, Apr 16, 2024, Marcelo Tosatti wrote:
> > On Mon, Apr 15, 2024 at 02:29:32PM -0700, Sean Christopherson wrote:
> > > And snapshotting the VM-Exit time will get false negatives when the vCPU is about
> > > to run, but for whatever reason has kvm_last_guest_exit=0, e.g. if a vCPU was
> > > preempted and/or migrated to a different pCPU.
> >
> > Right, for the use-case where waking up rcuc is a problem, the pCPU is
> > isolated (there are no userspace processes and hopefully no kernel threads
> > executing there), vCPU pinned to that pCPU.
> >
> > So there should be no preemptions or migrations.
>
> I understand that preemption/migration will not be problematic if the system is
> configured "correctly", but we still need to play nice with other scenarios and/or
> suboptimal setups. While false positives aren't fatal, KVM still should do its
> best to avoid them, especially when it's relatively easy to do so.

Sure.

> > > My understanding is that RCU already has a timeout to avoid stalling RCU. I don't
> > > see what is gained by effectively duplicating that timeout for KVM.
> >
> > The point is not to avoid stalling RCU. The point is to not perform RCU
> > core processing through rcuc thread (because that interrupts execution
> > of the vCPU thread), if it is known that an extended quiescent state
> > will occur "soon" anyway (via VM-entry).
>
> I know. My point is that, as you note below, RCU will wake-up rcuc after 1 second
> even if KVM is still reporting a VM-Enter is imminent, i.e. there's a 1 second
> timeout to avoid an RCU stall due to KVM never completing entry to the guest.

Right.

So a reply to the sentence:

"My understanding is that RCU already has a timeout to avoid stalling RCU. I don't
see what is gained by effectively duplicating that timeout for KVM."

is that the current RCU timeout is not functional for KVM VM entries,
and therefore needs modification.

> > If the extended quiescent state does not occur in 1 second, then rcuc
> > will be woken up (the time_before call in rcu_nohz_full_cpu function
> > above).
> >
> > > Why not have
> > > KVM provide a "this task is in KVM_RUN" flag, and then let the existing timeout
> > > handle the (hopefully rare) case where KVM doesn't "immediately" re-enter the guest?
> >
> > Do you mean something like:
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index d9642dd06c25..0ca5a6a45025 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3938,7 +3938,7 @@ static int rcu_pending(int user)
> > return 1;
> >
> > /* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
> > - if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > + if ((user || rcu_is_cpu_rrupt_from_idle() || this_cpu->in_kvm_run) && rcu_nohz_full_cpu())
> > return 0;
>
> Yes. This, https://lore.kernel.org/all/[email protected], plus logic
> in kvm_sched_{in,out}().

Question: where is vcpu->wants_to_run set? (or, where is the full series
again?).

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index bfb2b52a1416..5a7efc669a0f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
 {
         int cpu = get_cpu();
 
+        if (vcpu->wants_to_run)
+                context_tracking_guest_start_run_loop();
+
         __this_cpu_write(kvm_running_vcpu, vcpu);
         preempt_notifier_register(&vcpu->preempt_notifier);
         kvm_arch_vcpu_load(vcpu, cpu);
@@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
         kvm_arch_vcpu_put(vcpu);
         preempt_notifier_unregister(&vcpu->preempt_notifier);
         __this_cpu_write(kvm_running_vcpu, NULL);
+
+        if (vcpu->wants_to_run)
+                context_tracking_guest_stop_run_loop();
+
         preempt_enable();
 }
 EXPORT_SYMBOL_GPL(vcpu_put);

A little worried about guest HLT:

/**
 * rcu_is_cpu_rrupt_from_idle - see if 'interrupted' from idle
 *
 * If the current CPU is idle and running at a first-level (not nested)
 * interrupt, or directly, from idle, return true.
 *
 * The caller must have at least disabled IRQs.
 */
static int rcu_is_cpu_rrupt_from_idle(void)
{
        long nesting;

        /*
         * Usually called from the tick; but also used from smp_function_call()
         * for expedited grace periods. This latter can result in running from
         * the idle task, instead of an actual IPI.
         */
        ...

        /* Does CPU appear to be idle from an RCU standpoint? */
        return ct_dynticks_nesting() == 0;
}

static __always_inline void ct_cpuidle_enter(void)
{
        lockdep_assert_irqs_disabled();
        /*
         * Idle is allowed to (temporary) enable IRQs. It
         * will return with IRQs disabled.
         *
         * Trace IRQs enable here, then switch off RCU, and have
         * arch_cpu_idle() use raw_local_irq_enable(). Note that
         * ct_idle_enter() relies on lockdep IRQ state, so switch that
         * last -- this is very similar to the entry code.
         */
        trace_hardirqs_on_prepare();
        lockdep_hardirqs_on_prepare();
        instrumentation_end();
        ct_idle_enter();
        lockdep_hardirqs_on(_RET_IP_);
}

So for guest HLT emulation, there is a window between

kvm_vcpu_block -> fire_sched_out_preempt_notifiers -> vcpu_put
and
the idle task's call to ct_cpuidle_enter(), where

ct_dynticks_nesting() != 0 and vcpu_put() has already executed.

Even for idle=poll, the race exists.
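
Spelling that window out against the modified rcu_pending() check (illustrative
helper only, not real code; "in_kvm_run" is a stand-in for whatever per-CPU
state ends up carrying the hint):

/*
 * What rcu_pending() would see if the timer IRQ lands in the window
 * above, i.e. after vcpu_put() but before the idle task reaches
 * ct_cpuidle_enter().
 */
static int rcu_pending_in_hlt_window(void)
{
	int user = 0;			/* IRQ interrupted kernel code */
	bool in_kvm_run = false;	/* already cleared by vcpu_put() */

	/* rcu_is_cpu_rrupt_from_idle() is false: ct_dynticks_nesting() != 0 */
	if ((user || rcu_is_cpu_rrupt_from_idle() || in_kvm_run) &&
	    rcu_nohz_full_cpu())
		return 0;

	/*
	 * Falls through to rcu_pending()'s remaining checks, which can
	 * return 1 and have invoke_rcu_core() wake rcuc.
	 */
	return 1;
}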

> > /* Is the RCU core waiting for a quiescent state from this CPU? */
> >
> > The problem is:
> >
> > 1) You should only set that flag, in the VM-entry path, after the point
> > where no use of RCU is made: close to guest_state_enter_irqoff call.
>
> Why? As established above, KVM essentially has 1 second to enter the guest after
> setting in_guest_run_loop (or whatever we call it). In the vast majority of cases,
> the time before KVM enters the guest can probably be measured in microseconds.

OK.

> Snapshotting the exit time has the exact same problem of depending on KVM to
> re-enter the guest soon-ish, so I don't understand why this would be considered
> a problem with a flag to note the CPU is in KVM's run loop, but not with a
> snapshot to say the CPU recently exited a KVM guest.

See the race above.

> > 2) While handling a VM-exit, a host timer interrupt can occur before that,
> > or after the point where "this_cpu->in_kvm_run" is set to false.
> >
> > And a host timer interrupt calls rcu_sched_clock_irq which is going to
> > wake up rcuc.
>
> If in_kvm_run is false when the IRQ is handled, then either KVM exited to userspace
> or the vCPU was scheduled out. In the former case, rcuc won't be woken up if the
> CPU is in userspace. And in the latter case, waking up rcuc is absolutely the
> correct thing to do as VM-Enter is not imminent.
>
> For exits to userspace, there would be a small window where an IRQ could arrive
> between KVM putting the vCPU and the CPU actually returning to userspace, but
> unless that's problematic in practice, I think it's a reasonable tradeoff.

OK, your proposal looks alright except for these races.

We don't want those races to occur in production (and they likely will).

Is there any way to fix the races? Perhaps cmpxchg?



2024-04-17 17:43:33

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

On Wed, Apr 17, 2024, Marcelo Tosatti wrote:
> On Tue, Apr 16, 2024 at 07:07:32AM -0700, Sean Christopherson wrote:
> > On Tue, Apr 16, 2024, Marcelo Tosatti wrote:
> > > > Why not have
> > > > KVM provide a "this task is in KVM_RUN" flag, and then let the existing timeout
> > > > handle the (hopefully rare) case where KVM doesn't "immediately" re-enter the guest?
> > >
> > > Do you mean something like:
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index d9642dd06c25..0ca5a6a45025 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3938,7 +3938,7 @@ static int rcu_pending(int user)
> > > return 1;
> > >
> > > /* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
> > > - if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > > + if ((user || rcu_is_cpu_rrupt_from_idle() || this_cpu->in_kvm_run) && rcu_nohz_full_cpu())
> > > return 0;
> >
> > Yes. This, https://lore.kernel.org/all/[email protected], plus logic
> > in kvm_sched_{in,out}().
>
> Question: where is vcpu->wants_to_run set? (or, where is the full series
> again?).

Precisely around the call to kvm_arch_vcpu_ioctl_run(). I am planning on applying
the patch that introduces the code for 6.10[*]; I just haven't yet, for a variety
of reasons.

[*] https://lore.kernel.org/all/[email protected]
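
Roughly, in kvm_vcpu_ioctl()'s KVM_RUN handling (sketch of the idea only, the
link above has the actual patch):

	case KVM_RUN:
		/* surrounding checks elided */
		vcpu->wants_to_run = !READ_ONCE(vcpu->run->immediate_exit);
		r = kvm_arch_vcpu_ioctl_run(vcpu);
		vcpu->wants_to_run = false;
		break;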

> So for guest HLT emulation, there is a window between
>
> kvm_vcpu_block -> fire_sched_out_preempt_notifiers -> vcpu_put
> and the idle task's call to ct_cpuidle_enter(), where
>
> ct_dynticks_nesting() != 0 and vcpu_put() has already executed.
>
> Even for idle=poll, the race exists.

Is waking rcuc actually problematic? I agree it's not ideal, but it's a smallish
window, i.e. it's unlikely to happen frequently, and if rcuc is awakened, it will
effectively steal cycles from the idle thread, not the vCPU thread. If the vCPU
gets a wake event before rcuc completes, then the vCPU could experience jitter,
but that could also happen if the CPU ends up in a deep C-state.

And that race exists in general, i.e. any IRQ that arrives just as the idle task
is being scheduled in will unnecessarily wake up rcuc.

> > > /* Is the RCU core waiting for a quiescent state from this CPU? */
> > >
> > > The problem is:
> > >
> > > 1) You should only set that flag, in the VM-entry path, after the point
> > > where no use of RCU is made: close to guest_state_enter_irqoff call.
> >
> > Why? As established above, KVM essentially has 1 second to enter the guest after
> > setting in_guest_run_loop (or whatever we call it). In the vast majority of cases,
> > the time before KVM enters the guest can probably be measured in microseconds.
>
> OK.
>
> > Snapshotting the exit time has the exact same problem of depending on KVM to
> > re-enter the guest soon-ish, so I don't understand why this would be considered
> > a problem with a flag to note the CPU is in KVM's run loop, but not with a
> > snapshot to say the CPU recently exited a KVM guest.
>
> See the race above.

Ya, but if kvm_last_guest_exit is zeroed in kvm_sched_out(), then the snapshot
approach ends up with the same race. And not zeroing kvm_last_guest_exit is
arguably much more problematic as encountering a false positive doesn't require
hitting a small window.
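
I.e. the hypothetical "fixed" snapshot variant would look something like this
(not what the RFC posted, just for comparison):

static void kvm_sched_out(struct preempt_notifier *pn,
			  struct task_struct *next)
{
	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

	kvm_arch_vcpu_put(vcpu);
	__this_cpu_write(kvm_running_vcpu, NULL);

	/*
	 * Drop the "just left the guest" hint when the vCPU task is
	 * scheduled out -- which reopens exactly the HLT window described
	 * above, same as clearing a run-loop flag would.
	 */
	__this_cpu_write(kvm_last_guest_exit, 0);
}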

> > > 2) While handling a VM-exit, a host timer interrupt can occur before that,
> > > or after the point where "this_cpu->in_kvm_run" is set to false.
> > >
> > > And a host timer interrupt calls rcu_sched_clock_irq which is going to
> > > wake up rcuc.
> >
> > If in_kvm_run is false when the IRQ is handled, then either KVM exited to userspace
> > or the vCPU was scheduled out. In the former case, rcuc won't be woken up if the
> > CPU is in userspace. And in the latter case, waking up rcuc is absolutely the
> > correct thing to do as VM-Enter is not imminent.
> >
> > For exits to userspace, there would be a small window where an IRQ could arrive
> > between KVM putting the vCPU and the CPU actually returning to userspace, but
> > unless that's problematic in practice, I think it's a reasonable tradeoff.
>
> OK, your proposal looks alright except for these races.
>
> We don't want those races to occur in production (and they likely will).
>
> Is there any way to fix the races? Perhaps cmpxchg?

I don't think an atomic switch from the vCPU task to the idle task is feasible,
e.g. KVM would somehow have to know that the idle task is going to run next.
This seems like something that needs a generic solution, e.g. to prevent waking
rcuc if the idle task is in the process of being scheduled in.