2022-01-11 15:35:52

by Mark Rutland

Subject: [PATCH 0/5] kvm: fix latent guest entry/exit bugs

Several architectures have latent bugs around guest entry/exit, most
notably:

1) Several architectures enable interrupts between guest_enter() and
guest_exit(). As this period is an RCU extended quiescent state (EQS),
this is unsound unless the irq entry code explicitly wakes RCU, which most
architectures only do for entry from userspace or idle (see the sketch
after this list).

I believe this affects: arm64, riscv, s390

I am not sure about powerpc.

2) Several architectures permit instrumentation of code between
guest_enter() and guest_exit(), e.g. KASAN, KCOV, KCSAN, etc. As
instrumentation may directly or indirectly use RCU, this has the same
problems as with interrupts.

I believe this affects: arm64, mips, powerpc, riscv, s390

3) Several architectures do not inform lockdep and tracing that
interrupts are enabled during the execution of the guest, or do so in
an incorrect order. Generally this means that logs will report IRQs
being masked for much longer than is actually the case, which is not
ideal for debugging. I don't know whether this affects the correctness
of lockdep.

I believe this affects: arm64, mips, powerpc, riscv, s390
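
For illustration, the broken pattern behind (1) looks roughly like this
(a condensed sketch based on the pre-series arm64/riscv flows described
in the patches below; arch_run_vcpu() is a hypothetical stand-in for the
real world-switch code):

	guest_enter_irqoff();		/* enter RCU EQS */

	ret = arch_run_vcpu(vcpu);	/* hypothetical world switch */

	local_irq_enable();		/* IRQ handlers can run here, still
					 * inside the EQS, i.e. without RCU
					 * watching */
	...
	guest_exit();			/* exit RCU EQS */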

This was previously fixed for x86 specifically in a series of commits:

87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")

But other architectures were left broken, and the infrastructure for
handling this correctly is x86-specific.

This series introduces generic helper functions which can be used to
handle the problems above, and migrates architectures over to these,
fixing the latent issues.

I wasn't able to figure my way around powerpc and s390, so I have not
altered these. I'd appreciate it if anyone could take a look at those
cases, and either have a go at patches or provide some feedback as to
any alternative approaches which would work better there.

I have build-tested the arm64, mips, riscv, and x86 cases, but I don't
have a suitable HW setup to test these, so any review and/or testing
would be much appreciated.

I've pushed the series (based on v5.16) to my kvm/entry-rework branch:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kvm/entry-rework
git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git kvm/entry-rework

... also tagged as kvm-entry-rework-20210111

Thanks,
Mark.

Mark Rutland (5):
kvm: add exit_to_guest_mode() and enter_from_guest_mode()
kvm/arm64: rework guest entry logic
kvm/mips: rework guest entry logic
kvm/riscv: rework guest entry logic
kvm/x86: rework guest entry logic

arch/arm64/kvm/arm.c | 51 +++++++++++-------
arch/mips/kvm/mips.c | 37 ++++++++++++--
arch/riscv/kvm/vcpu.c | 44 ++++++++++------
arch/x86/kvm/svm/svm.c | 4 +-
arch/x86/kvm/vmx/vmx.c | 4 +-
arch/x86/kvm/x86.c | 4 +-
arch/x86/kvm/x86.h | 45 ----------------
include/linux/kvm_host.h | 108 +++++++++++++++++++++++++++++++++++++--
8 files changed, 206 insertions(+), 91 deletions(-)

--
2.30.2



2022-01-11 15:36:20

by Mark Rutland

Subject: [PATCH 1/5] kvm: add exit_to_guest_mode() and enter_from_guest_mode()

When transitioning to/from guest mode, it is necessary to inform
lockdep, tracing, and RCU in a specific order, similar to the
requirements for transitions to/from user mode. Additionally, it is
necessary to perform vtime accounting for a window around running the
guest, with RCU enabled, such that timer interrupts taken from the guest
can be accounted as guest time.

Most architectures don't handle all the necessary pieces, and have a
number of common bugs, including unsafe usage of RCU during the window
between guest_enter() and guest_exit().

On x86, this was dealt with across commits:

87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")

... but those fixes are specific to x86, and as the resulting logic
(while correct) is split across generic helper functions and
x86-specific helper functions, it is difficult to see that the
entry/exit accounting is balanced.

This patch adds generic helpers which architectures can use to handle
guest entry/exit consistently and correctly. The guest_{enter,exit}()
helpers are split into guest_timing_{enter,exit}() to perform vtime
accounting, and guest_context_{enter,exit}() to perform the necessary
context tracking and RCU management. The existing guest_{enter,exit}()
helpers are left as wrappers of these.

Atop this, new exit_to_guest_mode() and enter_from_guest_mode() helpers
are added to handle the ordering of lockdep, tracing, and RCU management.
These are named to align with exit_to_user_mode() and
enter_from_user_mode().

Subsequent patches will migrate architectures over to the new helpers,
following a sequence:

guest_timing_enter_irqoff();

exit_to_guest_mode();
< run the vcpu >
enter_from_guest_mode();

< take any pending IRQs >

guest_timing_exit_irqoff();

This sequence handles all of the above correctly, and more clearly
balances the entry and exit portions, making it easier to understand.
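
For illustration, a condensed sketch of how an architecture might wire
this up (arch_vcpu_run() is a hypothetical hook standing in for the
real world-switch code):

	/*
	 * Must be noinstr, as instrumentation may use RCU and that is
	 * unsafe within the EQS.
	 */
	static int noinstr arch_vcpu_enter_exit(struct kvm_vcpu *vcpu)
	{
		int ret;

		exit_to_guest_mode();
		ret = arch_vcpu_run(vcpu);
		enter_from_guest_mode();

		return ret;
	}

	...
	guest_timing_enter_irqoff();

	ret = arch_vcpu_enter_exit(vcpu);

	/* transiently unmask IRQs so ticks are accounted as guest time */
	local_irq_enable();
	local_irq_disable();

	guest_timing_exit_irqoff();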

The existing helpers are marked as deprecated, and will be removed once
all architectures have been converted.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <[email protected]>
---
include/linux/kvm_host.h | 108 +++++++++++++++++++++++++++++++++++++--
1 file changed, 105 insertions(+), 3 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c310648cc8f1..13fcf7979880 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -29,6 +29,8 @@
#include <linux/refcount.h>
#include <linux/nospec.h>
#include <linux/notifier.h>
+#include <linux/ftrace.h>
+#include <linux/instrumentation.h>
#include <asm/signal.h>

#include <linux/kvm.h>
@@ -362,8 +364,11 @@ struct kvm_vcpu {
int last_used_slot;
};

-/* must be called with irqs disabled */
-static __always_inline void guest_enter_irqoff(void)
+/*
+ * Start accounting time towards a guest.
+ * Must be called before entering guest context.
+ */
+static __always_inline void guest_timing_enter_irqoff(void)
{
/*
* This is running in ioctl context so its safe to assume that it's the
@@ -372,7 +377,17 @@ static __always_inline void guest_enter_irqoff(void)
instrumentation_begin();
vtime_account_guest_enter();
instrumentation_end();
+}

+/*
+ * Enter guest context and enter an RCU extended quiescent state.
+ *
+ * This should be the last thing called before entering the guest, and must be
+ * called after any potential use of RCU (including any potentially
+ * instrumented code).
+ */
+static __always_inline void guest_context_enter_irqoff(void)
+{
/*
* KVM does not hold any references to rcu protected data when it
* switches CPU into a guest mode. In fact switching to a guest mode
@@ -388,16 +403,77 @@ static __always_inline void guest_enter_irqoff(void)
}
}

-static __always_inline void guest_exit_irqoff(void)
+/*
+ * Deprecated. Architectures should move to guest_timing_enter_irqoff() and
+ * exit_to_guest_mode().
+ */
+static __always_inline void guest_enter_irqoff(void)
+{
+ guest_timing_enter_irqoff();
+ guest_context_enter_irqoff();
+}
+
+/**
+ * exit_to_guest_mode - Fixup state when exiting to guest mode
+ *
+ * This is analogous to exit_to_user_mode(), and ensures we perform the
+ * following in order:
+ *
+ * 1) Trace interrupts on state
+ * 2) Invoke context tracking if enabled to adjust RCU state
+ * 3) Tell lockdep that interrupts are enabled
+ *
+ * Invoked from architecture specific code as the last step before entering a
+ * guest. Must be invoked with interrupts disabled and the caller must be
+ * non-instrumentable.
+ *
+ * This must be called after guest_timing_enter_irqoff().
+ */
+static __always_inline void exit_to_guest_mode(void)
+{
+ instrumentation_begin();
+ trace_hardirqs_on_prepare();
+ lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+ instrumentation_end();
+
+ guest_context_enter_irqoff();
+ lockdep_hardirqs_on(CALLER_ADDR0);
+}
+
+/*
+ * Exit guest context and exit an RCU extended quiescent state.
+ *
+ * This should be the first thing called after exiting the guest, and must be
+ * called before any potential use of RCU (including any potentially
+ * instrumented code).
+ */
+static __always_inline void guest_context_exit_irqoff(void)
{
context_tracking_guest_exit();
+}

+/*
+ * Stop accounting time towards a guest.
+ * Must be called after exiting guest context.
+ */
+static __always_inline void guest_timing_exit_irqoff(void)
+{
instrumentation_begin();
/* Flush the guest cputime we spent on the guest */
vtime_account_guest_exit();
instrumentation_end();
}

+/*
+ * Deprecated. Architectures should move to enter_from_guest_mode() and
+ * guest_timing_exit_irqoff().
+ */
+static __always_inline void guest_exit_irqoff(void)
+{
+ guest_context_exit_irqoff();
+ guest_timing_exit_irqoff();
+}
+
static inline void guest_exit(void)
{
unsigned long flags;
@@ -407,6 +483,32 @@ static inline void guest_exit(void)
local_irq_restore(flags);
}

+/**
+ * enter_from_guest_mode - Establish state when returning from guest mode
+ *
+ * This is analogous to enter_from_user_mode(), and ensures we perform the
+ * following in order:
+ *
+ * 1) Tell lockdep that interrupts are disabled
+ * 2) Invoke context tracking if enabled to reactivate RCU
+ * 3) Trace interrupts off state
+ *
+ * Invoked from architecture specific code as the first step after exiting a
+ * guest. Must be invoked with interrupts disabled and the caller must be
+ * non-instrumentable.
+ *
+ * This must be called before guest_timing_exit_irqoff().
+ */
+static __always_inline void enter_from_guest_mode(void)
+{
+ lockdep_hardirqs_off(CALLER_ADDR0);
+ guest_context_exit_irqoff();
+
+ instrumentation_begin();
+ trace_hardirqs_off_finish();
+ instrumentation_end();
+}
+
static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
{
/*
--
2.30.2


2022-01-11 15:36:30

by Mark Rutland

Subject: [PATCH 2/5] kvm/arm64: rework guest entry logic

In kvm_arch_vcpu_ioctl_run() we enter an RCU extended quiescent state
(EQS) by calling guest_enter_irqoff(), and unmask IRQs prior to
exiting the EQS by calling guest_exit(). As the IRQ entry code will not
wake RCU in this case, we may run the core IRQ code and IRQ handler
without RCU watching, leading to various potential problems.

Additionally, we do not inform lockdep or tracing that interrupts will
be enabled during guest execution, which can lead to misleading traces
and warnings that interrupts have been enabled for overly-long periods.

This patch fixes these issues by using the new timing and context
entry/exit helpers to ensure that interrupts are handled during guest
vtime but with RCU watching, with a sequence:

guest_timing_enter_irqoff();

exit_to_guest_mode();
< run the vcpu >
enter_from_guest_mode();

< take any pending IRQs >

guest_timing_exit_irqoff();

Since instrumentation may make use of RCU, we must also ensure that no
instrumented code is run during the EQS. I've split out the critical
section into a new kvm_arm_vcpu_enter_exit() helper which is marked
noinstr.

Fixes: 1b3d546daf85ed2b ("arm/arm64: KVM: Properly account for guest CPU time")
Reported-by: Nicolas Saenz Julienne <[email protected]>
Signed-off-by: Mark Rutland <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: James Morse <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: Will Deacon <[email protected]>
---
arch/arm64/kvm/arm.c | 51 ++++++++++++++++++++++++++++----------------
1 file changed, 33 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e4727dc771bf..1721df2522c8 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -764,6 +764,24 @@ static bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu, int *ret)
xfer_to_guest_mode_work_pending();
}

+/*
+ * Actually run the vCPU, entering an RCU extended quiescent state (EQS) while
+ * the vCPU is running.
+ *
+ * This must be noinstr as instrumentation may make use of RCU, and this is not
+ * safe during the EQS.
+ */
+static int noinstr kvm_arm_vcpu_enter_exit(struct kvm_vcpu *vcpu)
+{
+ int ret;
+
+ exit_to_guest_mode();
+ ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
+ enter_from_guest_mode();
+
+ return ret;
+}
+
/**
* kvm_arch_vcpu_ioctl_run - the main VCPU run function to execute guest code
* @vcpu: The VCPU pointer
@@ -854,9 +872,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
* Enter the guest
*/
trace_kvm_entry(*vcpu_pc(vcpu));
- guest_enter_irqoff();
+ guest_timing_enter_irqoff();

- ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
+ ret = kvm_arm_vcpu_enter_exit(vcpu);

vcpu->mode = OUTSIDE_GUEST_MODE;
vcpu->stat.exits++;
@@ -891,26 +909,23 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
kvm_arch_vcpu_ctxsync_fp(vcpu);

/*
- * We may have taken a host interrupt in HYP mode (ie
- * while executing the guest). This interrupt is still
- * pending, as we haven't serviced it yet!
+ * We must ensure that any pending interrupts are taken before
+ * we exit guest timing so that timer ticks are accounted as
+ * guest time. Transiently unmask interrupts so that any
+ * pending interrupts are taken.
*
- * We're now back in SVC mode, with interrupts
- * disabled. Enabling the interrupts now will have
- * the effect of taking the interrupt again, in SVC
- * mode this time.
+ * Per ARM DDI 0487G.b section D1.13.4, an ISB (or other
+ * context synchronization event) is necessary to ensure that
+ * pending interrupts are taken.
*/
local_irq_enable();
+ isb();
+ local_irq_disable();
+
+ guest_timing_exit_irqoff();
+
+ local_irq_enable();

- /*
- * We do local_irq_enable() before calling guest_exit() so
- * that if a timer interrupt hits while running the guest we
- * account that tick as being spent in the guest. We enable
- * preemption after calling guest_exit() so that if we get
- * preempted we make sure ticks after that is not counted as
- * guest time.
- */
- guest_exit();
trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));

/* Exit types that need handling before we can be preempted */
--
2.30.2


2022-01-11 15:36:34

by Mark Rutland

Subject: [PATCH 3/5] kvm/mips: rework guest entry logic

In kvm_arch_vcpu_ioctl_run() we use guest_enter_irqoff() and
guest_exit_irqoff() directly, with interrupts masked between these. As
we don't handle any timer ticks during this window, we will not account
time spent within the guest as guest time, which is unfortunate.

Additionally, we do not inform lockdep or tracing that interrupts will
be enabled during guest execution, which can lead to misleading traces
and warnings that interrupts have been enabled for overly-long periods.

This patch fixes these issues by using the new timing and context
entry/exit helpers to ensure that interrupts are handled during guest
vtime but with RCU watching, with a sequence:

guest_timing_enter_irqoff();

exit_to_guest_mode();
< run the vcpu >
enter_from_guest_mode();

< take any pending IRQs >

guest_timing_exit_irqoff();

Since instrumentation may make use of RCU, we must also ensure that no
instrumented code is run during the EQS. I've split out the critical
section into a new kvm_mips_vcpu_enter_exit() helper which is marked
noinstr.

Signed-off-by: Mark Rutland <[email protected]>
Cc: Aleksandar Markovic <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
---
arch/mips/kvm/mips.c | 37 ++++++++++++++++++++++++++++++++++---
1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index aa20d074d388..f18a3f39163f 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -438,6 +438,24 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
return -ENOIOCTLCMD;
}

+/*
+ * Actually run the vCPU, entering an RCU extended quiescent state (EQS) while
+ * the vCPU is running.
+ *
+ * This must be noinstr as instrumentation may make use of RCU, and this is not
+ * safe during the EQS.
+ */
+static int noinstr kvm_mips_vcpu_enter_exit(struct kvm_vcpu *vcpu)
+{
+ int ret;
+
+ exit_to_guest_mode();
+ ret = kvm_mips_callbacks->vcpu_run(vcpu);
+ enter_from_guest_mode();
+
+ return ret;
+}
+
int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
{
int r = -EINTR;
@@ -458,7 +476,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
lose_fpu(1);

local_irq_disable();
- guest_enter_irqoff();
+ guest_timing_enter_irqoff();
trace_kvm_enter(vcpu);

/*
@@ -469,10 +487,23 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
*/
smp_store_mb(vcpu->mode, IN_GUEST_MODE);

- r = kvm_mips_callbacks->vcpu_run(vcpu);
+ r = kvm_mips_vcpu_enter_exit(vcpu);
+
+ /*
+ * We must ensure that any pending interrupts are taken before
+ * we exit guest timing so that timer ticks are accounted as
+ * guest time. Transiently unmask interrupts so that any
+ * pending interrupts are taken.
+ *
+ * TODO: is there a barrier which ensures that pending interrupts are
+ * recognised? Currently this just hopes that the CPU takes any pending
+ * interrupts between the enable and disable.
+ */
+ local_irq_enable();
+ local_irq_disable();

trace_kvm_out(vcpu);
- guest_exit_irqoff();
+ guest_timing_exit_irqoff();
local_irq_enable();

out:
--
2.30.2


2022-01-11 15:36:41

by Mark Rutland

Subject: [PATCH 4/5] kvm/riscv: rework guest entry logic

In kvm_arch_vcpu_ioctl_run() we enter an RCU extended quiescent state
(EQS) by calling guest_enter_irqoff(), and unmask IRQs prior to exiting
the EQS by calling guest_exit(). As the IRQ entry code will not wake RCU
in this case, we may run the core IRQ code and IRQ handler without RCU
watching, leading to various potential problems.

Additionally, we do not inform lockdep or tracing that interrupts will
be enabled during guest execution, which can lead to misleading traces
and warnings that interrupts have been enabled for overly-long periods.

This patch fixes these issues by using the new timing and context
entry/exit helpers to ensure that interrupts are handled during guest
vtime but with RCU watching, with a sequence:

guest_timing_enter_irqoff();

exit_to_guest_mode();
< run the vcpu >
enter_from_guest_mode();

< take any pending IRQs >

guest_timing_exit_irqoff();

Since instrumentation may make use of RCU, we must also ensure that no
instrumented code is run during the EQS. I've split out the critical
section into a new kvm_riscv_vcpu_enter_exit() helper which is marked
noinstr.

Fixes: 99cdc6c18c2d815e ("RISC-V: Add initial skeletal KVM support")
Signed-off-by: Mark Rutland <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Anup Patel <[email protected]>
Cc: Atish Patra <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Walmsley <[email protected]>
---
arch/riscv/kvm/vcpu.c | 44 ++++++++++++++++++++++++++-----------------
1 file changed, 27 insertions(+), 17 deletions(-)

diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index fb84619df012..0b524b26ee54 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -675,6 +675,20 @@ static void kvm_riscv_update_hvip(struct kvm_vcpu *vcpu)
csr_write(CSR_HVIP, csr->hvip);
}

+/*
+ * Actually run the vCPU, entering an RCU extended quiescent state (EQS) while
+ * the vCPU is running.
+ *
+ * This must be noinstr as instrumentation may make use of RCU, and this is not
+ * safe during the EQS.
+ */
+static void noinstr kvm_riscv_vcpu_enter_exit(struct kvm_vcpu *vcpu)
+{
+ exit_to_guest_mode();
+ __kvm_riscv_switch_to(&vcpu->arch);
+ enter_from_guest_mode();
+}
+
int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
{
int ret;
@@ -766,9 +780,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
continue;
}

- guest_enter_irqoff();
+ guest_timing_enter_irqoff();

- __kvm_riscv_switch_to(&vcpu->arch);
+ kvm_riscv_vcpu_enter_exit(vcpu);

vcpu->mode = OUTSIDE_GUEST_MODE;
vcpu->stat.exits++;
@@ -788,25 +802,21 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
kvm_riscv_vcpu_sync_interrupts(vcpu);

/*
- * We may have taken a host interrupt in VS/VU-mode (i.e.
- * while executing the guest). This interrupt is still
- * pending, as we haven't serviced it yet!
+ * We must ensure that any pending interrupts are taken before
+ * we exit guest timing so that timer ticks are accounted as
+ * guest time. Transiently unmask interrupts so that any
+ * pending interrupts are taken.
*
- * We're now back in HS-mode with interrupts disabled
- * so enabling the interrupts now will have the effect
- * of taking the interrupt again, in HS-mode this time.
+ * There's no barrier which ensures that pending interrupts are
+ * recognised, so we just hope that the CPU takes any pending
+ * interrupts between the enable and disable.
*/
local_irq_enable();
+ local_irq_disable();

- /*
- * We do local_irq_enable() before calling guest_exit() so
- * that if a timer interrupt hits while running the guest
- * we account that tick as being spent in the guest. We
- * enable preemption after calling guest_exit() so that if
- * we get preempted we make sure ticks after that is not
- * counted as guest time.
- */
- guest_exit();
+ guest_timing_exit_irqoff();
+
+ local_irq_enable();

preempt_enable();

--
2.30.2


2022-01-11 15:36:44

by Mark Rutland

Subject: [PATCH 5/5] kvm/x86: rework guest entry logic

For consistency and clarity, migrate x86 over to the generic helpers for
guest timing and lockdep/RCU/tracing management, and remove the
x86-specific helpers.

Prior to this patch, the guest timing was entered in
kvm_guest_enter_irqoff() (called by vmx_vcpu_enter_exit() and
svm_vcpu_enter_exit()), and was exited by the call to
vtime_account_guest_exit() within vcpu_enter_guest().

To minimize duplication and to more clearly balance entry and exit, both
entry and exit of guest timing are placed in vcpu_enter_guest(), using
the new guest_timing_{enter,exit}_irqoff() helpers. This may result in a
small amount of additional time being accounted towards guests.

Other than this, there should be no functional change as a result of
this patch.
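
For reference, the resulting ordering (condensed from the hunks below,
with unrelated steps elided) is:

	/* in vcpu_enter_guest() */
	guest_timing_enter_irqoff();
	...
	/* in {vmx,svm}_vcpu_enter_exit(), both noinstr */
	exit_to_guest_mode();
	< run the vcpu >
	enter_from_guest_mode();
	...
	/* back in vcpu_enter_guest() */
	< take any pending IRQs >
	guest_timing_exit_irqoff();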

Signed-off-by: Mark Rutland <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jim Mattson <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Wanpeng Li <[email protected]>
---
arch/x86/kvm/svm/svm.c | 4 ++--
arch/x86/kvm/vmx/vmx.c | 4 ++--
arch/x86/kvm/x86.c | 4 +++-
arch/x86/kvm/x86.h | 45 ------------------------------------------
4 files changed, 7 insertions(+), 50 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 5151efa424ac..af5d90de243f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3814,7 +3814,7 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu)
struct vcpu_svm *svm = to_svm(vcpu);
unsigned long vmcb_pa = svm->current_vmcb->pa;

- kvm_guest_enter_irqoff();
+ exit_to_guest_mode();

if (sev_es_guest(vcpu->kvm)) {
__svm_sev_es_vcpu_run(vmcb_pa);
@@ -3834,7 +3834,7 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu)
vmload(__sme_page_pa(sd->save_area));
}

- kvm_guest_exit_irqoff();
+ enter_from_guest_mode();
}

static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0dbf94eb954f..3dd240ef6414 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6593,7 +6593,7 @@ static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
struct vcpu_vmx *vmx)
{
- kvm_guest_enter_irqoff();
+ exit_to_guest_mode();

/* L1D Flush includes CPU buffer clear to mitigate MDS */
if (static_branch_unlikely(&vmx_l1d_should_flush))
@@ -6609,7 +6609,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,

vcpu->arch.cr2 = native_read_cr2();

- kvm_guest_exit_irqoff();
+ enter_from_guest_mode();
}

static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e50e97ac4408..bd3873b90889 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9876,6 +9876,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
set_debugreg(0, 7);
}

+ guest_timing_enter_irqoff();
+
for (;;) {
/*
* Assert that vCPU vs. VM APICv state is consistent. An APICv
@@ -9949,7 +9951,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
* of accounting via context tracking, but the loss of accuracy is
* acceptable for all known use cases.
*/
- vtime_account_guest_exit();
+ guest_timing_exit_irqoff();

if (lapic_in_kernel(vcpu)) {
s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 4abcd8d9836d..8e50645ac740 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -10,51 +10,6 @@

void kvm_spurious_fault(void);

-static __always_inline void kvm_guest_enter_irqoff(void)
-{
- /*
- * VMENTER enables interrupts (host state), but the kernel state is
- * interrupts disabled when this is invoked. Also tell RCU about
- * it. This is the same logic as for exit_to_user_mode().
- *
- * This ensures that e.g. latency analysis on the host observes
- * guest mode as interrupt enabled.
- *
- * guest_enter_irqoff() informs context tracking about the
- * transition to guest mode and if enabled adjusts RCU state
- * accordingly.
- */
- instrumentation_begin();
- trace_hardirqs_on_prepare();
- lockdep_hardirqs_on_prepare(CALLER_ADDR0);
- instrumentation_end();
-
- guest_enter_irqoff();
- lockdep_hardirqs_on(CALLER_ADDR0);
-}
-
-static __always_inline void kvm_guest_exit_irqoff(void)
-{
- /*
- * VMEXIT disables interrupts (host state), but tracing and lockdep
- * have them in state 'on' as recorded before entering guest mode.
- * Same as enter_from_user_mode().
- *
- * context_tracking_guest_exit() restores host context and reinstates
- * RCU if enabled and required.
- *
- * This needs to be done immediately after VM-Exit, before any code
- * that might contain tracepoints or call out to the greater world,
- * e.g. before x86_spec_ctrl_restore_host().
- */
- lockdep_hardirqs_off(CALLER_ADDR0);
- context_tracking_guest_exit();
-
- instrumentation_begin();
- trace_hardirqs_off_finish();
- instrumentation_end();
-}
-
#define KVM_NESTED_VMENTER_CONSISTENCY_CHECK(consistency_check) \
({ \
bool failed = (consistency_check); \
--
2.30.2


2022-01-11 17:55:08

by Marc Zyngier

Subject: Re: [PATCH 1/5] kvm: add exit_to_guest_mode() and enter_from_guest_mode()

Hi Mark,

On Tue, 11 Jan 2022 15:35:35 +0000,
Mark Rutland <[email protected]> wrote:
>
> When transitioning to/from guest mode, it is necessary to inform
> lockdep, tracing, and RCU in a specific order, similar to the
> requirements for transitions to/from user mode. Additionally, it is
> necessary to perform vtime accounting for a window around running the
> guest, with RCU enabled, such that timer interrupts taken from the guest
> can be accounted as guest time.
>
> Most architectures don't handle all the necessary pieces, and have a
> number of common bugs, including unsafe usage of RCU during the window
> between guest_enter() and guest_exit().
>
> On x86, this was dealt with across commits:
>
> 87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
> 0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
> 9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
> 3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
> 135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
> 160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
> bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
>
> ... but those fixes are specific to x86, and as the resulting logic
> (while correct) is split across generic helper functions and
> x86-specific helper functions, it is difficult to see that the
> entry/exit accounting is balanced.
>
> This patch adds generic helpers which architectures can use to handle
> guest entry/exit consistently and correctly. The guest_{enter,exit}()
> helpers are split into guest_timing_{enter,exit}() to perform vtime
> accounting, and guest_context_{enter,exit}() to perform the necessary
> context tracking and RCU management. The existing guest_{enter,exit}()
> helpers are left as wrappers of these.
>
> Atop this, new exit_to_guest_mode() and enter_from_guest_mode() helpers
> are added to handle the ordering of lockdep, tracing, and RCU management.
> These are named to align with exit_to_user_mode() and
> enter_from_user_mode().
>
> Subsequent patches will migrate architectures over to the new helpers,
> following a sequence:
>
> guest_timing_enter_irqoff();
>
> exit_to_guest_mode();
> < run the vcpu >
> enter_from_guest_mode();
>
> < take any pending IRQs >
>
> guest_timing_exit_irqoff();
>
> This sequence handles all of the above correctly, and more clearly
> balances the entry and exit portions, making it easier to understand.
>
> The existing helpers are marked as deprecated, and will be removed once
> all architectures have been converted.
>
> There should be no functional change as a result of this patch.
>
> Signed-off-by: Mark Rutland <[email protected]>

Thanks a lot for looking into this and writing this up. I have a
couple of comments below, but that's pretty much cosmetic and is only
there to ensure that I actually understand this stuff. FWIW:

Reviewed-by: Marc Zyngier <[email protected]>

> ---
> include/linux/kvm_host.h | 108 +++++++++++++++++++++++++++++++++++++--
> 1 file changed, 105 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c310648cc8f1..13fcf7979880 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -29,6 +29,8 @@
> #include <linux/refcount.h>
> #include <linux/nospec.h>
> #include <linux/notifier.h>
> +#include <linux/ftrace.h>
> +#include <linux/instrumentation.h>
> #include <asm/signal.h>
>
> #include <linux/kvm.h>
> @@ -362,8 +364,11 @@ struct kvm_vcpu {
> int last_used_slot;
> };
>
> -/* must be called with irqs disabled */
> -static __always_inline void guest_enter_irqoff(void)
> +/*
> + * Start accounting time towards a guest.
> + * Must be called before entering guest context.
> + */
> +static __always_inline void guest_timing_enter_irqoff(void)
> {
> /*
> * This is running in ioctl context so its safe to assume that it's the
> @@ -372,7 +377,17 @@ static __always_inline void guest_enter_irqoff(void)
> instrumentation_begin();
> vtime_account_guest_enter();
> instrumentation_end();
> +}
>
> +/*
> + * Enter guest context and enter an RCU extended quiescent state.
> + *
> + * This should be the last thing called before entering the guest, and must be
> + * called after any potential use of RCU (including any potentially
> + * instrumented code).

nit: "the last thing called" is terribly ambiguous. Any architecture
obviously calls a ****load of stuff after this point. Should this be
'the last thing involving RCU' instead?

> + */
> +static __always_inline void guest_context_enter_irqoff(void)
> +{
> /*
> * KVM does not hold any references to rcu protected data when it
> * switches CPU into a guest mode. In fact switching to a guest mode
> @@ -388,16 +403,77 @@ static __always_inline void guest_enter_irqoff(void)
> }
> }
>
> -static __always_inline void guest_exit_irqoff(void)
> +/*
> + * Deprecated. Architectures should move to guest_timing_enter_irqoff() and
> + * exit_to_guest_mode().
> + */
> +static __always_inline void guest_enter_irqoff(void)
> +{
> + guest_timing_enter_irqoff();
> + guest_context_enter_irqoff();
> +}
> +
> +/**
> + * exit_to_guest_mode - Fixup state when exiting to guest mode
> + *
> > + * This is analogous to exit_to_user_mode(), and ensures we perform the
> + * following in order:
> + *
> + * 1) Trace interrupts on state
> + * 2) Invoke context tracking if enabled to adjust RCU state
> + * 3) Tell lockdep that interrupts are enabled

nit: or rather, are about to be enabled? Certainly on arm64, the
enable happens much later, right at the point where we enter the guest
for real.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2022-01-11 17:56:27

by Marc Zyngier

Subject: Re: [PATCH 2/5] kvm/arm64: rework guest entry logic

On Tue, 11 Jan 2022 15:35:36 +0000,
Mark Rutland <[email protected]> wrote:
>
> In kvm_arch_vcpu_ioctl_run() we enter an RCU extended quiescent state
> (EQS) by calling guest_enter_irqoff(), and unmask IRQs prior to
> exiting the EQS by calling guest_exit(). As the IRQ entry code will not
> wake RCU in this case, we may run the core IRQ code and IRQ handler
> without RCU watching, leading to various potential problems.
>
> Additionally, we do not inform lockdep or tracing that interrupts will
> be enabled during guest execution, which can lead to misleading traces
> and warnings that interrupts have been enabled for overly-long periods.
>
> This patch fixes these issues by using the new timing and context
> entry/exit helpers to ensure that interrupts are handled during guest
> vtime but with RCU watching, with a sequence:
>
> guest_timing_enter_irqoff();
>
> exit_to_guest_mode();
> < run the vcpu >
> enter_from_guest_mode();
>
> < take any pending IRQs >
>
> guest_timing_exit_irqoff();
>
> Since instrumentation may make use of RCU, we must also ensure that no
> instrumented code is run during the EQS. I've split out the critical
> section into a new kvm_arm_vcpu_enter_exit() helper which is marked
> noinstr.
>
> Fixes: 1b3d546daf85ed2b ("arm/arm64: KVM: Properly account for guest CPU time")
> Reported-by: Nicolas Saenz Julienne <[email protected]>
> Signed-off-by: Mark Rutland <[email protected]>
> Cc: Alexandru Elisei <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: James Morse <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Suzuki K Poulose <[email protected]>
> Cc: Will Deacon <[email protected]>
> ---
> arch/arm64/kvm/arm.c | 51 ++++++++++++++++++++++++++++----------------
> 1 file changed, 33 insertions(+), 18 deletions(-)
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index e4727dc771bf..1721df2522c8 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -764,6 +764,24 @@ static bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu, int *ret)
> xfer_to_guest_mode_work_pending();
> }
>
> +/*
> + * Actually run the vCPU, entering an RCU extended quiescent state (EQS) while
> + * the vCPU is running.
> + *
> + * This must be noinstr as instrumentation may make use of RCU, and this is not
> + * safe during the EQS.
> + */
> +static int noinstr kvm_arm_vcpu_enter_exit(struct kvm_vcpu *vcpu)
> +{
> + int ret;
> +
> + exit_to_guest_mode();
> + ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
> + enter_from_guest_mode();
> +
> + return ret;
> +}
> +
> /**
> * kvm_arch_vcpu_ioctl_run - the main VCPU run function to execute guest code
> * @vcpu: The VCPU pointer
> @@ -854,9 +872,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> * Enter the guest
> */
> trace_kvm_entry(*vcpu_pc(vcpu));
> - guest_enter_irqoff();
> + guest_timing_enter_irqoff();
>
> - ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
> + ret = kvm_arm_vcpu_enter_exit(vcpu);
>
> vcpu->mode = OUTSIDE_GUEST_MODE;
> vcpu->stat.exits++;
> @@ -891,26 +909,23 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> kvm_arch_vcpu_ctxsync_fp(vcpu);
>
> /*
> - * We may have taken a host interrupt in HYP mode (ie
> - * while executing the guest). This interrupt is still
> - * pending, as we haven't serviced it yet!
> + * We must ensure that any pending interrupts are taken before
> + * we exit guest timing so that timer ticks are accounted as
> + * guest time. Transiently unmask interrupts so that any
> + * pending interrupts are taken.
> *
> - * We're now back in SVC mode, with interrupts
> - * disabled. Enabling the interrupts now will have
> - * the effect of taking the interrupt again, in SVC
> - * mode this time.
> + * Per ARM DDI 0487G.b section D1.13.4, an ISB (or other
> + * context synchronization event) is necessary to ensure that
> + * pending interrupts are taken.
> */
> local_irq_enable();
> + isb();
> + local_irq_disable();

Small nit: we may be able to elide this enable/isb/disable dance if a
read of ISR_EL1 returns 0.

> +
> + guest_timing_exit_irqoff();
> +
> + local_irq_enable();
>
> - /*
> - * We do local_irq_enable() before calling guest_exit() so
> - * that if a timer interrupt hits while running the guest we
> - * account that tick as being spent in the guest. We enable
> - * preemption after calling guest_exit() so that if we get
> - * preempted we make sure ticks after that is not counted as
> - * guest time.
> - */
> - guest_exit();
> trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
>
> /* Exit types that need handling before we can be preempted */

Reviewed-by: Marc Zyngier <[email protected]>

M.

--
Without deviation from the norm, progress is not possible.

2022-01-11 18:47:42

by Palmer Dabbelt

Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs

On Tue, 11 Jan 2022 07:35:34 PST (-0800), [email protected] wrote:
> Several architectures have latent bugs around guest entry/exit, most
> notably:
>
> 1) Several architectures enable interrupts between guest_enter() and
> guest_exit(). As this period is an RCU extended quiescent state (EQS),
> this is unsound unless the irq entry code explicitly wakes RCU, which most
> architectures only do for entry from userspace or idle (see the sketch
> after this list).
>
> I believe this affects: arm64, riscv, s390
>
> I am not sure about powerpc.
>
> 2) Several architectures permit instrumentation of code between
> guest_enter() and guest_exit(), e.g. KASAN, KCOV, KCSAN, etc. As
> instrumentation may directly or indirectly use RCU, this has the same
> problems as with interrupts.
>
> I believe this affects: arm64, mips, powerpc, riscv, s390

Moving to Atish and Anup's new email addresses, looks like MAINTAINERS
hasn't been updated yet. I thought I remembered seeing patches getting
picked up for these, but LMK if you guys were expecting me to send them
along -- sorry if I misunderstood!

>
> 3) Several architectures do not inform lockdep and tracing that
> interrupts are enabled during the execution of the guest, or do so in
> an incorrect order. Generally this means that logs will report IRQs
> being masked for much longer than is actually the case, which is not
> ideal for debugging. I don't know whether this affects the correctness
> of lockdep.
>
> I believe this affects: arm64, mips, powerpc, riscv, s390
>
> This was previously fixed for x86 specifically in a series of commits:
>
> 87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
> 0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
> 9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
> 3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
> 135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
> 160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
> bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
>
> But other architectures were left broken, and the infrastructure for
> handling this correctly is x86-specific.
>
> This series introduces generic helper functions which can be used to
> handle the problems above, and migrates architectures over to these,
> fixing the latent issues.
>
> I wasn't able to figure my way around powerpc and s390, so I have not
> altered these. I'd appreciate it if anyone could take a look at those
> cases, and either have a go at patches or provide some feedback as to
> any alternative approaches which would work better there.
>
> I have build-tested the arm64, mips, riscv, and x86 cases, but I don't
> have a suitable HW setup to test these, so any review and/or testing
> would be much appreciated.
>
> I've pushed the series (based on v5.16) to my kvm/entry-rework branch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kvm/entry-rework
> git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git kvm/entry-rework
>
> ... also tagged as kvm-entry-rework-20210111
>
> Thanks,
> Mark.
>
> Mark Rutland (5):
> kvm: add exit_to_guest_mode() and enter_from_guest_mode()
> kvm/arm64: rework guest entry logic
> kvm/mips: rework guest entry logic
> kvm/riscv: rework guest entry logic
> kvm/x86: rework guest entry logic
>
> arch/arm64/kvm/arm.c | 51 +++++++++++-------
> arch/mips/kvm/mips.c | 37 ++++++++++++--
> arch/riscv/kvm/vcpu.c | 44 ++++++++++------
> arch/x86/kvm/svm/svm.c | 4 +-
> arch/x86/kvm/vmx/vmx.c | 4 +-
> arch/x86/kvm/x86.c | 4 +-
> arch/x86/kvm/x86.h | 45 ----------------
> include/linux/kvm_host.h | 108 +++++++++++++++++++++++++++++++++++++--
> 8 files changed, 206 insertions(+), 91 deletions(-)

2022-01-13 11:01:48

by Mark Rutland

Subject: Re: [PATCH 1/5] kvm: add exit_to_guest_mode() and enter_from_guest_mode()

On Tue, Jan 11, 2022 at 05:54:59PM +0000, Marc Zyngier wrote:
> Hi Mark,
>
> On Tue, 11 Jan 2022 15:35:35 +0000,
> Mark Rutland <[email protected]> wrote:
> >
> > When transitioning to/from guest mode, it is necessary to inform
> > lockdep, tracing, and RCU in a specific order, similar to the
> > requirements for transitions to/from user mode. Additionally, it is
> > necessary to perform vtime accounting for a window around running the
> > guest, with RCU enabled, such that timer interrupts taken from the guest
> > can be accounted as guest time.
> >
> > Most architectures don't handle all the necessary pieces, and have a
> > number of common bugs, including unsafe usage of RCU during the window
> > between guest_enter() and guest_exit().
> >
> > On x86, this was dealt with across commits:
> >
> > 87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
> > 0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
> > 9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
> > 3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
> > 135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
> > 160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
> > bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
> >
> > ... but those fixes are specific to x86, and as the resulting logic
> > (while correct) is split across generic helper functions and
> > x86-specific helper functions, it is difficult to see that the
> > entry/exit accounting is balanced.
> >
> > This patch adds generic helpers which architectures can use to handle
> > guest entry/exit consistently and correctly. The guest_{enter,exit}()
> > helpers are split into guest_timing_{enter,exit}() to perform vtime
> > accounting, and guest_context_{enter,exit}() to perform the necessary
> > context tracking and RCU management. The existing guest_{enter,exit}()
> > helpers are left as wrappers of these.
> >
> > Atop this, new exit_to_guest_mode() and enter_from_guest_mode() helpers
> > are added to handle the ordering of lockdep, tracing, and RCU management.
> > These are named to align with exit_to_user_mode() and
> > enter_from_user_mode().
> >
> > Subsequent patches will migrate architectures over to the new helpers,
> > following a sequence:
> >
> > guest_timing_enter_irqoff();
> >
> > exit_to_guest_mode();
> > < run the vcpu >
> > enter_from_guest_mode();
> >
> > < take any pending IRQs >
> >
> > guest_timing_exit_irqoff();
> >
> > This sequence handles all of the above correctly, and more clearly
> > balances the entry and exit portions, making it easier to understand.
> >
> > The existing helpers are marked as deprecated, and will be removed once
> > all architectures have been converted.
> >
> > There should be no functional change as a result of this patch.
> >
> > Signed-off-by: Mark Rutland <[email protected]>
>
> Thanks a lot for looking into this and writing this up. I have a
> couple of comments below, but that's pretty much cosmetic and is only
> there to ensure that I actually understand this stuff. FWIW:
>
> Reviewed-by: Marc Zyngier <[email protected]>

Thanks!

> > ---
> > include/linux/kvm_host.h | 108 +++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 105 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index c310648cc8f1..13fcf7979880 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -29,6 +29,8 @@
> > #include <linux/refcount.h>
> > #include <linux/nospec.h>
> > #include <linux/notifier.h>
> > +#include <linux/ftrace.h>
> > +#include <linux/instrumentation.h>
> > #include <asm/signal.h>
> >
> > #include <linux/kvm.h>
> > @@ -362,8 +364,11 @@ struct kvm_vcpu {
> > int last_used_slot;
> > };
> >
> > -/* must be called with irqs disabled */
> > -static __always_inline void guest_enter_irqoff(void)
> > +/*
> > + * Start accounting time towards a guest.
> > + * Must be called before entering guest context.
> > + */
> > +static __always_inline void guest_timing_enter_irqoff(void)
> > {
> > /*
> > * This is running in ioctl context so its safe to assume that it's the
> > @@ -372,7 +377,17 @@ static __always_inline void guest_enter_irqoff(void)
> > instrumentation_begin();
> > vtime_account_guest_enter();
> > instrumentation_end();
> > +}
> >
> > +/*
> > + * Enter guest context and enter an RCU extended quiescent state.
> > + *
> > + * This should be the last thing called before entering the guest, and must be
> > + * called after any potential use of RCU (including any potentially
> > + * instrumented code).
>
> nit: "the last thing called" is terribly ambiguous. Any architecture
> obviously calls a ****load of stuff after this point. Should this be
> 'the last thing involving RCU' instead?

I agree this is unclear and I struggled to find good wording for this. Is the
following any better?

/*
 * Enter guest context and enter an RCU extended quiescent state.
 *
 * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
 * unsafe to use any code which may directly or indirectly use RCU, tracing
 * (including IRQ flag tracing), or lockdep. All code in this period must be
 * non-instrumentable.
 */

If that's good I can add similar to guest_context_exit_irqoff().

[...]

> > +/**
> > + * exit_to_guest_mode - Fixup state when exiting to guest mode
> > + *
> > > + * This is analogous to exit_to_user_mode(), and ensures we perform the
> > + * following in order:
> > + *
> > + * 1) Trace interrupts on state
> > + * 2) Invoke context tracking if enabled to adjust RCU state
> > + * 3) Tell lockdep that interrupts are enabled
>
> nit: or rather, are about to be enabled? Certainly on arm64, the
> enable happens much later, right at the point where we enter the guest
> for real.

True; I'd cribbed the wording from the comment block above exit_to_user_mode(),
but I stripped the context that made that clear. I'll make that:

/**
 * exit_to_guest_mode - Fixup state when exiting to guest mode
 *
 * Entry to a guest will enable interrupts, but the kernel state is
 * interrupts disabled when this is invoked. Also tell RCU about it.
 *
 * 1) Trace interrupts on state
 * 2) Invoke context tracking if enabled to adjust RCU state
 * 3) Tell lockdep that interrupts are enabled
 *
 * Invoked from architecture specific code before entering a guest.
 * Must be called with interrupts disabled and the caller must be
 * non-instrumentable.
 * The caller has to invoke guest_timing_enter_irqoff() before this.
 *
 * Note: this is analogous to exit_to_user_mode().
 */

... and likewise for enter_from_guest_mode(), if that's clear enough?

FWIW, the comment block for exit_to_user_mode() in
include/linux/entry-common.h says:

/**
 * exit_to_user_mode - Fixup state when exiting to user mode
 *
 * Syscall/interrupt exit enables interrupts, but the kernel state is
 * interrupts disabled when this is invoked. Also tell RCU about it.
 *
 * 1) Trace interrupts on state
 * 2) Invoke context tracking if enabled to adjust RCU state
 * 3) Invoke architecture specific last minute exit code, e.g. speculation
 *    mitigations, etc.: arch_exit_to_user_mode()
 * 4) Tell lockdep that interrupts are enabled
 *
 * Invoked from architecture specific code when syscall_exit_to_user_mode()
 * is not suitable as the last step before returning to userspace. Must be
 * invoked with interrupts disabled and the caller must be
 * non-instrumentable.
 * The caller has to invoke syscall_exit_to_user_mode_work() before this.
 */

Thanks,
Mark.

2022-01-13 11:18:07

by Mark Rutland

Subject: Re: [PATCH 2/5] kvm/arm64: rework guest entry logic

On Tue, Jan 11, 2022 at 05:55:20PM +0000, Marc Zyngier wrote:
> On Tue, 11 Jan 2022 15:35:36 +0000,
> Mark Rutland <[email protected]> wrote:
> >
> > In kvm_arch_vcpu_ioctl_run() we enter an RCU extended quiescent state
> > (EQS) by calling guest_enter_irqoff(), and unmask IRQs prior to
> > exiting the EQS by calling guest_exit(). As the IRQ entry code will not
> > wake RCU in this case, we may run the core IRQ code and IRQ handler
> > without RCU watching, leading to various potential problems.
> >
> > Additionally, we do not inform lockdep or tracing that interrupts will
> > be enabled during guest execution, which can lead to misleading traces
> > and warnings that interrupts have been enabled for overly-long periods.
> >
> > This patch fixes these issues by using the new timing and context
> > entry/exit helpers to ensure that interrupts are handled during guest
> > vtime but with RCU watching, with a sequence:
> >
> > guest_timing_enter_irqoff();
> >
> > exit_to_guest_mode();
> > < run the vcpu >
> > enter_from_guest_mode();
> >
> > < take any pending IRQs >
> >
> > guest_timing_exit_irqoff();
> >
> > Since instrumentation may make use of RCU, we must also ensure that no
> > instrumented code is run during the EQS. I've split out the critical
> > section into a new kvm_arm_vcpu_enter_exit() helper which is marked
> > noinstr.
> >
> > Fixes: 1b3d546daf85ed2b ("arm/arm64: KVM: Properly account for guest CPU time")
> > Reported-by: Nicolas Saenz Julienne <[email protected]>
> > Signed-off-by: Mark Rutland <[email protected]>
> > Cc: Alexandru Elisei <[email protected]>
> > Cc: Catalin Marinas <[email protected]>
> > Cc: Frederic Weisbecker <[email protected]>
> > Cc: James Morse <[email protected]>
> > Cc: Marc Zyngier <[email protected]>
> > Cc: Paolo Bonzini <[email protected]>
> > Cc: Paul E. McKenney <[email protected]>
> > Cc: Suzuki K Poulose <[email protected]>
> > Cc: Will Deacon <[email protected]>
> > ---
> > arch/arm64/kvm/arm.c | 51 ++++++++++++++++++++++++++++----------------
> > 1 file changed, 33 insertions(+), 18 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index e4727dc771bf..1721df2522c8 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -764,6 +764,24 @@ static bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu, int *ret)
> > xfer_to_guest_mode_work_pending();
> > }
> >
> > +/*
> > + * Actually run the vCPU, entering an RCU extended quiescent state (EQS) while
> > + * the vCPU is running.
> > + *
> > + * This must be noinstr as instrumentation may make use of RCU, and this is not
> > + * safe during the EQS.
> > + */
> > +static int noinstr kvm_arm_vcpu_enter_exit(struct kvm_vcpu *vcpu)
> > +{
> > + int ret;
> > +
> > + exit_to_guest_mode();
> > + ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
> > + enter_from_guest_mode();
> > +
> > + return ret;
> > +}
> > +
> > /**
> > * kvm_arch_vcpu_ioctl_run - the main VCPU run function to execute guest code
> > * @vcpu: The VCPU pointer
> > @@ -854,9 +872,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> > * Enter the guest
> > */
> > trace_kvm_entry(*vcpu_pc(vcpu));
> > - guest_enter_irqoff();
> > + guest_timing_enter_irqoff();
> >
> > - ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
> > + ret = kvm_arm_vcpu_enter_exit(vcpu);
> >
> > vcpu->mode = OUTSIDE_GUEST_MODE;
> > vcpu->stat.exits++;
> > @@ -891,26 +909,23 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> > kvm_arch_vcpu_ctxsync_fp(vcpu);
> >
> > /*
> > - * We may have taken a host interrupt in HYP mode (ie
> > - * while executing the guest). This interrupt is still
> > - * pending, as we haven't serviced it yet!
> > + * We must ensure that any pending interrupts are taken before
> > + * we exit guest timing so that timer ticks are accounted as
> > + * guest time. Transiently unmask interrupts so that any
> > + * pending interrupts are taken.
> > *
> > - * We're now back in SVC mode, with interrupts
> > - * disabled. Enabling the interrupts now will have
> > - * the effect of taking the interrupt again, in SVC
> > - * mode this time.
> > + * Per ARM DDI 0487G.b section D1.13.4, an ISB (or other
> > + * context synchronization event) is necessary to ensure that
> > + * pending interrupts are taken.
> > */
> > local_irq_enable();
> > + isb();
> > + local_irq_disable();
>
> Small nit: we may be able to elide this enable/isb/disable dance if a
> read of ISR_EL1 returns 0.

Wouldn't that be broken when using GIC priority masking, since that can prevent
IRQs being signalled to the PE?

I'm happy to rework this, but I'll need to think a bit harder about it. Would
you be happy if we did that as a follow-up?

I suspect we'll want to split that out into a helper, e.g.

static __always_inline void handle_pending_host_irqs(void)
{
	/*
	 * TODO: explain PMR masking / signalling here
	 */
	if (!system_uses_irq_prio_masking() &&
	    !read_sysreg(isr_el1))
		return;

	local_irq_enable();
	isb();
	local_irq_disable();
}

>
> > +
> > + guest_timing_exit_irqoff();
> > +
> > + local_irq_enable();
> >
> > - /*
> > - * We do local_irq_enable() before calling guest_exit() so
> > - * that if a timer interrupt hits while running the guest we
> > - * account that tick as being spent in the guest. We enable
> > - * preemption after calling guest_exit() so that if we get
> > - * preempted we make sure ticks after that is not counted as
> > - * guest time.
> > - */
> > - guest_exit();
> > trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
> >
> > /* Exit types that need handling before we can be preempted */
>
> Reviewed-by: Marc Zyngier <[email protected]>

Thanks!

Mark.

2022-01-13 11:43:47

by Marc Zyngier

Subject: Re: [PATCH 2/5] kvm/arm64: rework guest entry logic

On Thu, 13 Jan 2022 11:17:53 +0000,
Mark Rutland <[email protected]> wrote:
>
> On Tue, Jan 11, 2022 at 05:55:20PM +0000, Marc Zyngier wrote:
> > On Tue, 11 Jan 2022 15:35:36 +0000,
> > Mark Rutland <[email protected]> wrote:

[...]

> > > local_irq_enable();
> > > + isb();
> > > + local_irq_disable();
> >
> > Small nit: we may be able to elide this enable/isb/disable dance if a
> > read of ISR_EL1 returns 0.
>
> Wouldn't that be broken when using GIC priority masking, since that
> can prevent IRQs being signalled to the PE?

You're right. But this can be made even simpler. We already know if
we've exited the guest because of an IRQ (ret tells us that), and
that's true whether we're using priority masking or not. It could be
as simple as:

if (ARM_EXCEPTION_CODE(ret) == ARM_EXCEPTION_IRQ) {
// We exited because of an interrupt. Let's take
// it now to account timer ticks to the guest.
local_irq_enable();
isb();
local_irq_disable();
}

and that would avoid accounting the interrupt to the guest if it fired
after the exit took place.

> I'm happy to rework this, but I'll need to think a bit harder about
> it. Would you be happy if we did that as a follow-up?

Oh, absolutely. I want the flow to be correct before we make it
fast(-ish).

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2022-01-13 11:55:18

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 1/5] kvm: add exit_to_guest_mode() and enter_from_guest_mode()

On Thu, 13 Jan 2022 11:01:30 +0000,
Mark Rutland <[email protected]> wrote:
>
> On Tue, Jan 11, 2022 at 05:54:59PM +0000, Marc Zyngier wrote:
> > Hi Mark,
> >
> > On Tue, 11 Jan 2022 15:35:35 +0000,
> > Mark Rutland <[email protected]> wrote:

[...]

> > > +/*
> > > + * Enter guest context and enter an RCU extended quiescent state.
> > > + *
> > > + * This should be the last thing called before entering the guest, and must be
> > > + * called after any potential use of RCU (including any potentially
> > > + * instrumented code).
> >
> > nit: "the last thing called" is terribly ambiguous. Any architecture
> > obviously calls a ****load of stuff after this point. Should this be
> > 'the last thing involving RCU' instead?
>
I agree this is unclear and I struggled to find good wording for this. Is the
> following any better?
>
> /*
> * Enter guest context and enter an RCU extended quiescent state.
> *
> * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
> * unsafe to use any code which may directly or indirectly use RCU, tracing
> * (including IRQ flag tracing), or lockdep. All code in this period must be
> * non-instrumentable.
> */
>
> If that's good I can add similar to guest_context_exit_irqoff().

Yes, that's much clearer, thanks.

>
> [...]
>
> > > +/**
> > > + * exit_to_guest_mode - Fixup state when exiting to guest mode
> > > + *
> > > + * This is analagous to exit_to_user_mode(), and ensures we perform the
> > > + * following in order:
> > > + *
> > > + * 1) Trace interrupts on state
> > > + * 2) Invoke context tracking if enabled to adjust RCU state
> > > + * 3) Tell lockdep that interrupts are enabled
> >
> > nit: or rather, are about to be enabled? Certainly on arm64, the
> > enable happens much later, right at the point where we enter the guest
> > for real.
>
> True; I'd cribbed the wording from the comment block above exit_to_user_mode(),
> but I stripped the context that made that clear. I'll make that:
>
> /**
> * exit_to_guest_mode - Fixup state when exiting to guest mode
> *
> * Entry to a guest will enable interrupts, but the kernel state is
> * interrupts disabled when this is invoked. Also tell RCU about it.
> *
> * 1) Trace interrupts on state
> * 2) Invoke context tracking if enabled to adjust RCU state
> * 3) Tell lockdep that interrupts are enabled
> *
> * Invoked from architecture specific code before entering a guest.
> * Must be called with interrupts disabled and the caller must be
> * non-instrumentable.
> * The caller has to invoke guest_timing_enter_irqoff() before this.
> *
> * Note: this is analagous to exit_to_user_mode().

nit: analogous

> */
>
> ... with likewise for enter_from_guest_mode(), if that's clear enough?

Yes, that's great.

Thanks again,

M.

--
Without deviation from the norm, progress is not possible.

2022-01-13 12:58:15

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 2/5] kvm/arm64: rework guest entry logic

On Thu, Jan 13, 2022 at 11:43:30AM +0000, Marc Zyngier wrote:
> On Thu, 13 Jan 2022 11:17:53 +0000,
> Mark Rutland <[email protected]> wrote:
> >
> > On Tue, Jan 11, 2022 at 05:55:20PM +0000, Marc Zyngier wrote:
> > > On Tue, 11 Jan 2022 15:35:36 +0000,
> > > Mark Rutland <[email protected]> wrote:
>
> [...]
>
> > > > local_irq_enable();
> > > > + isb();
> > > > + local_irq_disable();
> > >
> > > Small nit: we may be able to elide this enable/isb/disable dance if a
> > > read of ISR_EL1 returns 0.
> >
> > Wouldn't that be broken when using GIC priority masking, since that
> > can prevent IRQs being signalled to the PE?
>
> You're right. But this can be made even simpler. We already know if
> we've exited the guest because of an IRQ (ret tells us that), and
> that's true whether we're using priority masking or not. It could be
> as simple as:
>
> if (ARM_EXCEPTION_CODE(ret) == ARM_EXCEPTION_IRQ) {
> // We exited because of an interrupt. Let's take
> // it now to account timer ticks to the guest.
> local_irq_enable();
> isb();
> local_irq_disable();
> }
>
> and that would avoid accounting the interrupt to the guest if it fired
> after the exit took place.
>
> > I'm happy to rework this, but I'll need to think a bit harder about
> > it. Would you be happy if we did that as a follow-up?
>
> Oh, absolutely. I want the flow to be correct before we make it
> fast(-ish).

Cool; I'll leave that for now on the assumption we'll address that with a
follow-up patch, though your suggestion above looks "obviously" correct to me.

Thanks,
Mark.

2022-01-13 13:01:20

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 1/5] kvm: add exit_to_guest_mode() and enter_from_guest_mode()

On Thu, Jan 13, 2022 at 11:55:11AM +0000, Marc Zyngier wrote:
> On Thu, 13 Jan 2022 11:01:30 +0000,
> Mark Rutland <[email protected]> wrote:
> >
> > On Tue, Jan 11, 2022 at 05:54:59PM +0000, Marc Zyngier wrote:
> > > Hi Mark,
> > >
> > > On Tue, 11 Jan 2022 15:35:35 +0000,
> > > Mark Rutland <[email protected]> wrote:
>
> [...]
>
[...]
> >
> > If that's good I can add similar to guest_context_exit_irqoff().
>
> Yes, that's much clearer, thanks.
>
> >
> > [...]
> >
[...]
> > * Note: this is analagous to exit_to_user_mode().
>
> nit: analogous
>
> > */
> >
> > ... with likewise for enter_from_guest_mode(), if that's clear enough?
>
> Yes, that's great.

Thanks; I've pushed out an updated branch with those changes (including the
typo fixes). I'll wait until next week before sending out a v2 since I don't
think that meaningfully affects the arch bits for other architectures.

Mark.

2022-01-13 15:22:05

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs



On 11.01.22 at 16:35, Mark Rutland wrote:
[...]
>
> I wasn't able to figure my way around powerpc and s390, so I have not

I think 2 later patches have moved the guest_enter/exit a bit out.
Does this make the s390 code clearer?

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 577f1ead6a51..5859207c2cc0 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -4145,10 +4145,6 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
* As PF_VCPU will be used in fault handler, between
* guest_enter and guest_exit should be no uaccess.
*/
- local_irq_disable();
- guest_enter_irqoff();
- __disable_cpu_timer_accounting(vcpu);
- local_irq_enable();
if (kvm_s390_pv_cpu_is_protected(vcpu)) {
memcpy(sie_page->pv_grregs,
vcpu->run->s.regs.gprs,
@@ -4156,8 +4152,16 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
}
if (test_cpu_flag(CIF_FPU))
load_fpu_regs();
+ local_irq_disable();
+ __disable_cpu_timer_accounting(vcpu);
+ guest_enter_irqoff();
+ local_irq_enable();
exit_reason = sie64a(vcpu->arch.sie_block,
vcpu->run->s.regs.gprs);
+ local_irq_disable();
+ guest_exit_irqoff();
+ __enable_cpu_timer_accounting(vcpu);
+ local_irq_enable();
if (kvm_s390_pv_cpu_is_protected(vcpu)) {
memcpy(vcpu->run->s.regs.gprs,
sie_page->pv_grregs,
@@ -4173,10 +4177,6 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
vcpu->arch.sie_block->gpsw.mask &= ~PSW_INT_MASK;
}
}
- local_irq_disable();
- __enable_cpu_timer_accounting(vcpu);
- guest_exit_irqoff();
- local_irq_enable();
vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);

rc = vcpu_post_run(vcpu, exit_reason);

2022-01-13 20:32:29

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/5] kvm: add exit_to_guest_mode() and enter_from_guest_mode()

On Tue, Jan 11, 2022, Mark Rutland wrote:
> Atop this, new exit_to_guest_mode() and enter_from_guest_mode() helpers
> are added to handle the ordering of lockdep, tracing, and RCU management.
> These are named to align with exit_to_user_mode() and
> enter_from_user_mode().
>
> Subsequent patches will migrate architectures over to the new helpers,
> following a sequence:
>
> guest_timing_enter_irqoff();
>
> exit_to_guest_mode();

I'm not a fan of this nomenclature. First and foremost, virtualization refers to
transfers to guest mode as VM-Enter, and transfers from guest mode as VM-Exit.
It's really, really confusing to read this code from a virtualization perspective.
The functions themselves are contradictory as the "enter" helper calls functions
with "exit" in their name, and vice versa.

We settled on xfer_to_guest_mode_work() for a similar conundrum in the past, though
I don't love using xfer_to/from_guest_mode() as that makes it sound like those
helpers handle the actual transition into guest mode, i.e. run the vCPU.

To avoid too much bikeshedding, what about reusing the names we all compromised
on when we did this for x86 and call them kvm_guest_enter/exit_irqoff()? If x86
is converted in the first patch then we could even avoid temporary #ifdefs.

> < run the vcpu >
> enter_from_guest_mode();
> < take any pending IRQs >
>
> guest_timing_exit_irqoff();
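
(For reference: fleshed out, that sequence becomes roughly the run-loop
skeleton below. This is only a sketch; arch_run_vcpu() is a hypothetical
stand-in for the arch-specific world switch, not a real function.)

	local_irq_disable();
	guest_timing_enter_irqoff();

	exit_to_guest_mode();		/* lockdep/tracing/RCU: enter guest EQS */
	ret = arch_run_vcpu(vcpu);	/* hypothetical arch-specific entry */
	enter_from_guest_mode();	/* lockdep/tracing/RCU: exit guest EQS */

	local_irq_enable();		/* take pending IRQs as guest time */
	local_irq_disable();

	guest_timing_exit_irqoff();
	local_irq_enable();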

2022-01-13 20:50:08

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 5/5] kvm/x86: rework guest entry logic

On Tue, Jan 11, 2022, Mark Rutland wrote:
> For consistency and clarity, migrate x86 over to the generic helpers for
> guest timing and lockdep/RCU/tracing management, and remove the
> x86-specific helpers.
>
> Prior to this patch, the guest timing was entered in
> kvm_guest_enter_irqoff() (called by svm_vcpu_enter_exit() and
> vmx_vcpu_enter_exit()), and was exited by the call to
> vtime_account_guest_exit() within vcpu_enter_guest().
>
> To minimize duplication and to more clearly balance entry and exit, both
> entry and exit of guest timing are placed in vcpu_enter_guest(), using
> the new guest_timing_{enter,exit}_irqoff() helpers. This may result in a
> small amount of additional time being accounted towards guests.

This can be further qualified to state that it only affects time accounting when
using context tracking; tick-based accounting is unaffected because IRQs are
disabled the entire time.

And this might actually be a (benign?) bug fix for context tracking accounting in
the EXIT_FASTPATH_REENTER_GUEST case (commits ae95f566b3d2 ("KVM: X86: TSCDEADLINE
MSR emulation fastpath") and 26efe2fd92e5 ("KVM: VMX: Handle preemption timer
fastpath")). In those cases, KVM will enter the guest multiple times without
bouncing through vtime_account_guest_exit(). That means vtime_guest_enter() will
be called when the CPU is already "in guest", and call vtime_account_system()
when it really should call vtime_account_guest(). account_system_time() does
check PF_VCPU and redirect to account_guest_time(), so it appears to be benign,
but it's at least odd.
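
Roughly, from memory (so treat this as a sketch rather than the exact code),
the reentry shape in vcpu_enter_guest() is:

	for (;;) {
		/* guest context entry (and vtime_guest_enter()) happens in here */
		exit_fastpath = static_call(kvm_x86_run)(vcpu);
		if (likely(exit_fastpath != EXIT_FASTPATH_REENTER_GUEST))
			break;
		/* reenters the guest without vtime_account_guest_exit() */
	}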

> Other than this, there should be no functional change as a result of
> this patch.

...

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e50e97ac4408..bd3873b90889 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9876,6 +9876,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> set_debugreg(0, 7);
> }
>
> + guest_timing_enter_irqoff();
> +
> for (;;) {
> /*
> * Assert that vCPU vs. VM APICv state is consistent. An APICv
> @@ -9949,7 +9951,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> * of accounting via context tracking, but the loss of accuracy is
> * acceptable for all known use cases.
> */
> - vtime_account_guest_exit();
> + guest_timing_exit_irqoff();
>
> if (lapic_in_kernel(vcpu)) {
> s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;

2022-01-14 11:48:19

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 1/5] kvm: add exit_to_guest_mode() and enter_from_guest_mode()

On Thu, Jan 13, 2022 at 08:32:20PM +0000, Sean Christopherson wrote:
> On Tue, Jan 11, 2022, Mark Rutland wrote:
> > Atop this, new exit_to_guest_mode() and enter_from_guest_mode() helpers
> > are added to handle the ordering of lockdep, tracing, and RCU management.
> > These are named to align with exit_to_user_mode() and
> > enter_from_user_mode().
> >
> > Subsequent patches will migrate architectures over to the new helpers,
> > following a sequence:
> >
> > guest_timing_enter_irqoff();
> >
> > exit_to_guest_mode();
>
> I'm not a fan of this nomenclature. First and foremost, virtualization refers to
> transfers to guest mode as VM-Enter, and transfers from guest mode as VM-Exit.
> It's really, really confusing to read this code from a virtualization perspective.
> The functions themselves are contradictory as the "enter" helper calls functions
> with "exit" in their name, and vice versa.

Sure; FWIW I wasn't happy with the naming either, but I couldn't find anything
that was entirely clear, because it depends on whether you consider this an
entry..exit of guest context or an exit..entry of regular kernel context. I
went with exit_to_guest_mode() and enter_from_guest_mode() because that clearly
corresponded to exit_to_user_mode() and enter_from_user_mode(), and the
convention in the common entry code is to talk in terms of the regular kernel
context.

While I was working on this, I had guest_context_enter_irqoff() for
kernel->guest and guest_context_exit_irqoff() for guest->kernel, which also
matched the style of guest_timing_{enter,exit}_irqoff().

I'm happy to change to that, if that works for you?

> We settled on xfer_to_guest_mode_work() for a similar conundrum in the past, though
> I don't love using xfer_to/from_guest_mode() as that makes it sound like those
> helpers handle the actual transition into guest mode, i.e. runs the vCPU.
>
> To avoid too much bikeshedding, what about reusing the names we all compromised
> on when we did this for x86 and call them kvm_guest_enter/exit_irqoff()? If x86
> is converted in the first patch then we could even avoid temporary #ifdefs.

I'd like to keep this somewhat orthogonal to the x86 changes (e.g. as other
architectures will need backports to stable at least for the RCU bug fix), so
I'd rather use a name that isn't immediately coupled with x86 changes.

Does the guest_context_{enter,exit}_irqoff() naming above work for you?

Thanks,
Mark.

2022-01-14 12:05:47

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 5/5] kvm/x86: rework guest entry logic

On Thu, Jan 13, 2022 at 08:50:00PM +0000, Sean Christopherson wrote:
> On Tue, Jan 11, 2022, Mark Rutland wrote:
> > For consistency and clarity, migrate x86 over to the generic helpers for
> > guest timing and lockdep/RCU/tracing management, and remove the
> > x86-specific helpers.
> >
> > Prior to this patch, the guest timing was entered in
> > kvm_guest_enter_irqoff() (called by svm_vcpu_enter_exit() and
> > vmx_vcpu_enter_exit()), and was exited by the call to
> > vtime_account_guest_exit() within vcpu_enter_guest().
> >
> > To minimize duplication and to more clearly balance entry and exit, both
> > entry and exit of guest timing are placed in vcpu_enter_guest(), using
> > the new guest_timing_{enter,exit}_irqoff() helpers. This may result in a
> > small amount of additional time being accounted towards guests.
>
> This can be further qualified to state that it only affects time accounting when
> using context tracking; tick-based accounting is unaffected because IRQs are
> disabled the entire time.

Ok. I'll replace that last sentence with:

When context tracking is used, a small amount of additional time will be
accounted towards guests; tick-based accounting is unaffected as IRQs are
disabled at this point and not enabled until after the return from the guest.

>
> And this might actually be a (benign?) bug fix for context tracking accounting in
> the EXIT_FASTPATH_REENTER_GUEST case (commits ae95f566b3d2 "KVM: X86: TSCDEADLINE
> MSR emulation fastpath" and 26efe2fd92e5, "KVM: VMX: Handle preemption timer
> fastpath"). In those cases, KVM will enter the guest multiple times without
> bouncing through vtime_account_guest_exit(). That means vtime_guest_enter() will
> be called when the CPU is already "in guest", and call vtime_account_system()
> when it really should call vtime_account_guest(). account_system_time() does
> check PF_VCPU and redirect to account_guest_time(), so it appears to be benign,
> but it's at least odd.
>
> > Other than this, there should be no functional change as a result of
> > this patch.

I've added wording:

This also corrects (benign) mis-balanced context tracking accounting
introduced in commits:

ae95f566b3d22ade ("KVM: X86: TSCDEADLINE MSR emulation fastpath")
26efe2fd92e50822 ("KVM: VMX: Handle preemption timer fastpath")

Where KVM can enter a guest multiple times, calling vtime_guest_enter()
without a corresponding call to vtime_account_guest_exit(), and with
vtime_account_system() called when vtime_account_guest() should be used.
As account_system_time() checks PF_VCPU and calls account_guest_time(),
this doesn't result in any functional problem, but is unnecessarily
confusing.

... and deleted the "no functional change" line for now.

I assume that other than the naming of the entry/exit functions you're happy
with this patch?

Thanks,
Mark.


2022-01-14 12:19:46

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs

On Thu, Jan 13, 2022 at 04:20:07PM +0100, Christian Borntraeger wrote:
>
>
> On 11.01.22 at 16:35, Mark Rutland wrote:
[...]
> >
> > I wasn't able to figure my way around powerpc and s390, so I have not
>
> I think 2 later patches have moved the guest_enter/exit a bit out.
> Does this make the s390 code clearer?

Yes; that's much simpler to follow!

One major thing I wasn't sure about for s390 is the sequence:

guest_enter_irqoff(); // Enters an RCU EQS
...
local_irq_enable();
...
sie64a(...);
...
local_irq_disable();
...
guest_exit_irqoff(); // Exits an RCU EQS

... since if an IRQ is taken between local_irq_{enable,disable}(), RCU won't be
watching, and I couldn't spot whether your regular IRQ entry logic would wake
RCU in this case, or whether there was something else I'm missing that saves
you here.

For other architectures, including x86 and arm64, we enter the guest with IRQs
masked and return from the guest with IRQs masked, and don't actually take IRQs
until we unmask them in the host, after the guest_exit_*() logic has woken RCU
and so on.

I wasn't able to find documentation on the semantics of SIE, so I couldn't spot
whether the local_irq_{enable,disable}() calls were necessary, or could be
removed.

Thanks,
Mark.


2022-01-14 12:31:35

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs



On 14.01.22 at 13:19, Mark Rutland wrote:
> On Thu, Jan 13, 2022 at 04:20:07PM +0100, Christian Borntraeger wrote:
>>
>>
>> On 11.01.22 at 16:35, Mark Rutland wrote:
[...]
>>>
>>> I wasn't able to figure my way around powerpc and s390, so I have not
>>
>> I think 2 later patches have moved the guest_enter/exit a bit out.
>> Does this make the s390 code clearer?
>
> Yes; that's much simpler to follow!
>
> One major thing I wasn't sure about for s390 is the sequence:
>
> guest_enter_irqoff(); // Enters an RCU EQS
> ...
> local_irq_enable();
> ...
> sie64a(...);
> ...
> local_irq_disable();
> ...
> guest_exit_irqoff(); // Exits an RCU EQS
>
> ... since if an IRQ is taken between local_irq_{enable,disable}(), RCU won't be
> watching, and I couldn't spot whether your regular IRQ entry logic would wake
> RCU in this case, or whether there was something else I'm missing that saves
> you here.
>
> For other architectures, including x86 and arm64, we enter the guest with IRQs
> masked and return from the guest with IRQs masked, and don't actually take IRQs
> until we unmask them in the host, after the guest_exit_*() logic has woken RCU
> and so on.
>
> I wasn't able to find documentation on the semantics of SIE, so I couldn't spot
> whether the local_irq_{enable,disable}() calls were necessary, or could be
> removed.

We run the SIE instruction with interrupts enabled. SIE is interruptible.
The disable/enable pairs are just because guest_enter/exit_irqoff() require them.
One thing to be aware of: in our entry.S - after an interrupt - we leave SIE by
setting the return address of the interrupt after the sie instruction so that we
get back into this __vcpu_run loop to check for signals and so on.



2022-01-14 13:32:57

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs

On Fri, Jan 14, 2022 at 01:29:46PM +0100, Christian Borntraeger wrote:
>
>
> On 14.01.22 at 13:19, Mark Rutland wrote:
> > On Thu, Jan 13, 2022 at 04:20:07PM +0100, Christian Borntraeger wrote:
> > > On 11.01.22 at 16:35, Mark Rutland wrote:
[...]
> > > >
> > > > I wasn't able to figure my way around powerpc and s390, so I have not
> > >
> > > I think 2 later patches have moved the guest_enter/exit a bit out.
> > > Does this make the s390 code clearer?
> >
> > Yes; that's much simpler to follow!
> >
> > One major thing I wasn't sure about for s390 is the sequence:
> >
> > guest_enter_irqoff(); // Enters an RCU EQS
> > ...
> > local_irq_enable();
> > ...
> > sie64a(...);
> > ...
> > local_irq_disable();
> > ...
> > guest_exit_irqoff(); // Exits an RCU EQS
> >
> > ... since if an IRQ is taken between local_irq_{enable,disable}(), RCU won't be
> > watching, and I couldn't spot whether your regular IRQ entry logic would wake
> > RCU in this case, or whether there was something else I'm missing that saves
> > you here.
> >
> > For other architectures, including x86 and arm64, we enter the guest with IRQs
> > masked and return from the guest with IRQs masked, and don't actually take IRQs
> > until we unmask them in the host, after the guest_exit_*() logic has woken RCU
> > and so on.
> >
> > I wasn't able to find documentation on the semantics of SIE, so I couldn't spot
> > whether the local_irq_{enable,disable}() calls were necessary, or could be
> > removed.
>
> We run the SIE instruction with interrupts enabled. SIE is interruptible.
> The disable/enable pairs are just because guest_enter/exit_irqoff() require them.

What I was trying to figure out was when an interrupt is taken between
guest_enter_irqoff() and guest_exit_irqoff(), where is RCU woken? I couldn't
spot that in the s390 entry code (probably simply because I'm not familiar with
it), and so AFAICT that means IRQ code could run without RCU watching, which
would cause things to explode.

On other architectures that problem is avoided because IRQs asserted during the
guest cause a specific guest exit rather than a regular IRQ exception, and the
HW enables/disables IRQs when entering/exiting the guest, so the host can leave
IRQs masked across guest_enter_irqoff()..guest_exit_irqoff().
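
i.e. on those architectures the flow is effectively:

	local_irq_disable();
	guest_enter_irqoff();	/* enter the RCU EQS */
	< HW world switch: host IRQs stay masked; an IRQ forces a guest exit >
	guest_exit_irqoff();	/* exit the RCU EQS */
	local_irq_enable();	/* pending IRQs taken here, with RCU watching */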

Am I right in understanding that SIE itself won't enable (host) interrupts
while running the guest, and so it *needs* to be run with interrupts already
enabled?

> One thing to be aware of: in our entry.S - after an interrupt - we leave SIE by
> setting the return address of the interrupt after the sie instruction so that we
> get back into this __vcpu_run loop to check for signals and so on.

Just to check, that's after the IRQ handler runs, right?

Thanks,
Mark.

2022-01-14 13:53:16

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs



On 14.01.22 at 14:32, Mark Rutland wrote:
> On Fri, Jan 14, 2022 at 01:29:46PM +0100, Christian Borntraeger wrote:
>>
>>
>> On 14.01.22 at 13:19, Mark Rutland wrote:
>>> On Thu, Jan 13, 2022 at 04:20:07PM +0100, Christian Borntraeger wrote:
>>>> On 11.01.22 at 16:35, Mark Rutland wrote:
[...]
>>>>>
>>>>> I wasn't able to figure my way around powerpc and s390, so I have not
>>>>
>>>> I think 2 later patches have moved the guest_enter/exit a bit out.
>>>> Does this make the s390 code clearer?
>>>
>>> Yes; that's much simpler to follow!
>>>
>>> One major thing I wasn't sure about for s390 is the sequence:
>>>
>>> guest_enter_irqoff(); // Enters an RCU EQS
>>> ...
>>> local_irq_enable();
>>> ...
>>> sie64a(...);
>>> ...
>>> local_irq_disable();
>>> ...
>>> guest_exit_irqoff(); // Exits an RCU EQS
>>>
>>> ... since if an IRQ is taken between local_irq_{enable,disable}(), RCU won't be
>>> watching, and I couldn't spot whether your regular IRQ entry logic would wake
>>> RCU in this case, or whether there was something else I'm missing that saves
>>> you here.
>>>
>>> For other architectures, including x86 and arm64, we enter the guest with IRQs
>>> masked and return from the guest with IRQs masked, and don't actually take IRQs
>>> until we unmask them in the host, after the guest_exit_*() logic has woken RCU
>>> and so on.
>>>
>>> I wasn't able to find documentation on the semantics of SIE, so I couldn't spot
>>> whether the local_irq_{enable,disable}() calls were necessary, or could be
>>> removed.
>>
>> We run the SIE instruction with interrupts enabled. SIE is interruptible.
>> The disable/enable pairs are just because guest_enter/exit_irqoff() require them.
>
> What I was trying to figure out was when an interrupt is taken between
> guest_enter_irqoff() and guest_exit_irqoff(), where is RCU woken? I couldn't
> spot that in the s390 entry code (probably simply because I'm not familiar with
> it), and so AFAICT that means IRQ code could run without RCU watching, which
> would cause things to explode.
>
> On other architectures that problem is avoided because IRQs asserted during the
> guest cause a specific guest exit rather than a regular IRQ exception, and the
> HW enables/disables IRQs when entering/exiting the guest, so the host can leave
> IRQs masked across guest_enter_irqoff()..guest_exit_irqoff().
>
> Am I right in understanding that SIE itself won't enable (host) interrupts
> while running the guest, and so it *needs* to be run with interrupts already
> enabled?

yes

>
>> One thing to be aware of: in our entry.S - after an interrupt - we leave SIE by
>> setting the return address of the interrupt after the sie instruction so that we
>> get back into this __vcpu_run loop to check for signals and so on.
>
> Just to check, that's after the IRQ handler runs, right?

and yes.

2022-01-14 22:38:55

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs

On Fri, Jan 14, 2022 at 02:51:38PM +0100, Christian Borntraeger wrote:
> On 14.01.22 at 14:32, Mark Rutland wrote:
> > On Fri, Jan 14, 2022 at 01:29:46PM +0100, Christian Borntraeger wrote:
> > > On 14.01.22 at 13:19, Mark Rutland wrote:
> > > > On Thu, Jan 13, 2022 at 04:20:07PM +0100, Christian Borntraeger wrote:
> > > > > On 11.01.22 at 16:35, Mark Rutland wrote:

[...]

> > > > One major thing I wasn't sure about for s390 is the sequence:
> > > >
> > > > guest_enter_irqoff(); // Enters an RCU EQS
> > > > ...
> > > > local_irq_enable();
> > > > ...
> > > > sie64a(...);
> > > > ...
> > > > local_irq_disable();
> > > > ...
> > > > guest_exit_irqoff(); // Exits an RCU EQS
> > > >
> > > > ... since if an IRQ is taken between local_irq_{enable,disable}(), RCU won't be
> > > > watching, and I couldn't spot whether your regular IRQ entry logic would wake
> > > > RCU in this case, or whether there was something else I'm missing that saves
> > > > you here.
> > > >
> > > > For other architectures, including x86 and arm64, we enter the guest with IRQs
> > > > masked and return from the guest with IRQs masked, and don't actually take IRQs
> > > > until we unmask them in the host, after the guest_exit_*() logic has woken RCU
> > > > and so on.
> > > >
> > > > I wasn't able to find documentation on the semantics of SIE, so I couldn't spot
> > > > whether the local_irq_{enable,disable}() calls were necessary, or could be
> > > > removed.
> > >
> > > We run the SIE instruction with interrupts enabled. SIE is interruptible.
> > > The disable/enable pairs are just because guest_enter/exit_irqoff() require them.
> >
> > What I was trying to figure out was when an interrupt is taken between
> > guest_enter_irqoff() and guest_exit_irqoff(), where is RCU woken? I couldn't
> > spot that in the s390 entry code (probably simply because I'm not familiar with
> > it), and so AFAICT that means IRQ code could run without RCU watching, which
> > would cause things to explode.
> >
> > On other architectures that problem is avoided because IRQs asserted during the
> > guest cause a specific guest exit rather than a regular IRQ exception, and the
> > HW enables/disables IRQs when entering/exiting the guest, so the host can leave
> > IRQs masked across guest_enter_irqoff()..guest_exit_irqoff().
> >
> > Am I right in understanding that SIE itself won't enable (host) interrupts
> > while running the guest, and so it *needs* to be run with interrupts already
> > enabled?
>
> yes
>
> > > One thing to be aware of: in our entry.S - after an interrupt - we leave SIE by
> > > setting the return address of the interrupt after the sie instruction so that we
> > > get back into this __vcpu_run loop to check for signals and so on.
> >
> > Just to check, that's after the IRQ handler runs, right?
>
> and yes.

Thanks for confirming!

IIUC as above, that means there's a latent RCU bug on s390, and to fix that
we'll need to add something to the IRQ entry logic to wake RCU for any IRQ
taken in the EQS between guest_enter_irqoff() and guest_exit_irqoff(), similar
to what is done for IRQs taken from an idle EQS.

I see s390 uses the common irqentry_{enter,exit}(), so perhaps we could extend
the logic there to check something in addition to is_idle_task()? e.g. add a
noinstr helper to check kvm_running_vcpu, or add a thread flag that says we're
in this guest EQS.

I also think there is another issue here. When an IRQ is taken from SIE, will
user_mode(regs) always be false, or could it be true if the guest userspace is
running? If it can be true I think the context tracking checks can complain,
and it *might* be possible to trigger a panic().

In irqentry_enter(), if user_mode(regs) == true, we call
irqentry_enter_from_user_mode -> __enter_from_user_mode(). There we check that
the context is CONTEXT_USER, but IIUC that will be CONTEXT_GUEST at this point.
We also call arch_check_user_regs(), and IIUC this might permit a malicious
guest to trigger a host panic by way of debug_user_asce(), but I may have
misunderstood and that might not be possible.
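
From memory (so a rough sketch rather than the exact code), the relevant
checks in __enter_from_user_mode() are along the lines of:

	static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
	{
		arch_check_user_regs(regs);
		lockdep_hardirqs_off(CALLER_ADDR0);
		CT_WARN_ON(ct_state() != CONTEXT_USER);	/* would trip in CONTEXT_GUEST */
		user_exit_irqoff();
		...
	}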

Thanks,
Mark.

2022-01-14 22:46:47

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/5] kvm: add exit_to_guest_mode() and enter_from_guest_mode()

On Fri, Jan 14, 2022, Mark Rutland wrote:
> I'd like to keep this somewhat orthogonal to the x86 changes (e.g. as other
> architectures will need backports to stable at least for the RCU bug fix), so
> I'd rather use a name that isn't immediately coupled with x86 changes.

Ah, gotcha.

> Does the guest_context_{enter,exit}_irqoff() naming above work for you?

Yep, thanks!

2022-01-14 22:54:16

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 5/5] kvm/x86: rework guest entry logic

On Fri, Jan 14, 2022, Mark Rutland wrote:
> I assume that other than the naming of the entry/exit functions you're happy
> with this patch?

Yep!

2022-01-18 02:54:30

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs

On 1/14/22 16:19, Mark Rutland wrote:
> I also think there is another issue here. When an IRQ is taken from SIE, will
> user_mode(regs) always be false, or could it be true if the guest userspace is
> running? If it can be true I think the context tracking checks can complain,
> and it *might* be possible to trigger a panic().

I think that it would be false, because the guest PSW is in the SIE
block and switched on SIE entry and exit, but I might be incorrect.

Paolo

> In irqentry_enter(), if user_mode(regs) == true, we call
> irqentry_enter_from_user_mode -> __enter_from_user_mode(). There we check that
> the context is CONTEXT_USER, but IIUC that will be CONTEXT_GUEST at this point.
> We also call arch_check_user_regs(), and IIUC this might permit a malicious
> guest to trigger a host panic by way of debug_user_asce(), but I may have
> misunderstood and that might not be possible.

2022-01-20 08:49:57

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs

On Mon, Jan 17, 2022 at 06:45:36PM +0100, Paolo Bonzini wrote:
> On 1/14/22 16:19, Mark Rutland wrote:
> > I also think there is another issue here. When an IRQ is taken from SIE, will
> > user_mode(regs) always be false, or could it be true if the guest userspace is
> > running? If it can be true I think the context tracking checks can complain,
> > and it *might* be possible to trigger a panic().
>
> I think that it would be false, because the guest PSW is in the SIE block
> and switched on SIE entry and exit, but I might be incorrect.

Ah; that's the crux of my confusion: I had thought the guest PSW would
be placed in the regular lowcore *_old_psw slots. From looking at the
entry asm it looks like the host PSW (around the invocation of SIE) is
stored there, since that's what the OUTSIDE + SIEEXIT handling is
checking for.

Assuming that's correct, I agree this problem doesn't exist, and there's
only the common RCU/tracing/lockdep management to fix.

Sorry for the noise, and thanks for the pointer!

Thanks,
Mark.

2022-01-20 08:52:21

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs



On 18.01.22 at 13:02, Mark Rutland wrote:
> On Mon, Jan 17, 2022 at 06:45:36PM +0100, Paolo Bonzini wrote:
>> On 1/14/22 16:19, Mark Rutland wrote:
>>> I also think there is another issue here. When an IRQ is taken from SIE, will
>>> user_mode(regs) always be false, or could it be true if the guest userspace is
>>> running? If it can be true I think the context tracking checks can complain,
>>> and it *might* be possible to trigger a panic().
>>
>> I think that it would be false, because the guest PSW is in the SIE block
>> and switched on SIE entry and exit, but I might be incorrect.
>
> Ah; that's the crux of my confusion: I had thought the guest PSW would
> be placed in the regular lowcore *_old_psw slots. From looking at the
> entry asm it looks like the host PSW (around the invocation of SIE) is
> stored there, since that's what the OUTSIDE + SIEEXIT handling is
> checking for.

Yes, Paolo's observation is correct.
>
> Assuming that's correct, I agree this problem doesn't exist, and there's
> only the common RCU/tracing/lockdep management to fix.
>
> Sorry for the noise, and thanks for the pointer!
>
> Thanks,
> Mark.

2022-01-20 09:10:22

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs



On 18.01.22 at 13:02, Mark Rutland wrote:
> On Mon, Jan 17, 2022 at 06:45:36PM +0100, Paolo Bonzini wrote:
>> On 1/14/22 16:19, Mark Rutland wrote:
>>> I also think there is another issue here. When an IRQ is taken from SIE, will
>>> user_mode(regs) always be false, or could it be true if the guest userspace is
>>> running? If it can be true I think the context tracking checks can complain,
>>> and it *might* be possible to trigger a panic().
>>
>> I think that it would be false, because the guest PSW is in the SIE block
>> and switched on SIE entry and exit, but I might be incorrect.
>
> Ah; that's the crux of my confusion: I had thought the guest PSW would
> be placed in the regular lowcore *_old_psw slots. From looking at the
> entry asm it looks like the host PSW (around the invocation of SIE) is
> stored there, since that's what the OUTSIDE + SIEEXIT handling is
> checking for.
>
> Assuming that's correct, I agree this problem doesn't exist, and there's
> only the common RCU/tracing/lockdep management to fix.

Will you provide an s390 patch in your next iteration or shall we then do
one as soon as there is a v2? We also need to look into vsie.c where we
also call sie64a.

2022-01-20 09:11:34

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 1/5] kvm: add exit_to_guest_mode() and enter_from_guest_mode()

On Fri, Jan 14, 2022 at 04:11:06PM +0000, Sean Christopherson wrote:
> On Fri, Jan 14, 2022, Mark Rutland wrote:
> > I'd like to keep this somewhat orthogonal to the x86 changes (e.g. as other
> > architectures will need backports to stable at least for the RCU bug fix), so
> > I'd rather use a name that isn't immediately coupled with x86 changes.
>
> Ah, gotcha.
>
> > Does the guest_context_{enter,exit}_irqoff() naming above work for you?
>
> Yep, thanks!

I just realised that I already have guest_context_{enter,exit}_irqoff()
for the context-tracking bits alone, and so I'll need to use a different
name. For bisectability I can't use guest_{enter,exit}_irqoff()
immediately, so for now I'll go with guest_state_{enter,exit}_irqoff().

Once the conversion is complete and the deprecated bits are removed we
can rename those to guest_{enter,exit}_irqoff().
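
i.e. for now the sequence from patch 1 becomes:

	guest_timing_enter_irqoff();

	guest_state_enter_irqoff();
	< run the vcpu >
	guest_state_exit_irqoff();

	< take any pending IRQs >

	guest_timing_exit_irqoff();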

Thanks,
Mark.

2022-01-20 10:28:36

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs

On Tue, Jan 18, 2022 at 01:42:26PM +0100, Christian Borntraeger wrote:
>
>
> On 18.01.22 at 13:02, Mark Rutland wrote:
> > On Mon, Jan 17, 2022 at 06:45:36PM +0100, Paolo Bonzini wrote:
> > > On 1/14/22 16:19, Mark Rutland wrote:
> > > > I also think there is another issue here. When an IRQ is taken from SIE, will
> > > > user_mode(regs) always be false, or could it be true if the guest userspace is
> > > > running? If it can be true I think the context tracking checks can complain,
> > > > and it *might* be possible to trigger a panic().
> > >
> > > I think that it would be false, because the guest PSW is in the SIE block
> > > and switched on SIE entry and exit, but I might be incorrect.
> >
> > Ah; that's the crux of my confusion: I had thought the guest PSW would
> > be placed in the regular lowcore *_old_psw slots. From looking at the
> > entry asm it looks like the host PSW (around the invocation of SIE) is
> > stored there, since that's what the OUTSIDE + SIEEXIT handling is
> > checking for.
> >
> > Assuming that's correct, I agree this problem doesn't exist, and there's
> > only the common RCU/tracing/lockdep management to fix.
>
> Will you provide an s390 patch in your next iteration, or shall we do
> one as soon as there is a v2? We also need to look into vsie.c, where we
> also call sie64a.

I'm having a go at that now; my plan is to try to have an s390 patch as
part of v2 in the next day or so.

Now that I have a rough idea of how SIE and exception handling work on
s390, I think the structural changes to kvm-s390.c:__vcpu_run() and
vsie.c:do_vsie_run() are fairly simple.

The only open bit is exactly how/where to identify when the interrupt
entry code needs to wake RCU. I can add a per-cpu variable or thread
flag to indicate that we're inside that EQS, or I could move the irq
enable/disable into the sie64a asm and identify that with the OUTSIDE
macro in the entry asm.
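
For the per-cpu variable option, I'm picturing something like the below
around the existing sie64a() call in __vcpu_run() (entirely untested,
and the kvm_s390_in_guest_eqs name is made up for illustration):

DEFINE_PER_CPU(bool, kvm_s390_in_guest_eqs);

	/* in __vcpu_run(), with IRQs disabled: */
	guest_state_enter_irqoff();
	__this_cpu_write(kvm_s390_in_guest_eqs, true);

	local_irq_enable();
	exit_reason = sie64a(vcpu->arch.sie_block, vcpu->run->s.regs.gprs);
	local_irq_disable();

	__this_cpu_write(kvm_s390_in_guest_eqs, false);
	guest_state_exit_irqoff();

... with the interrupt entry code waking RCU whenever it sees that flag
set (via __this_cpu_read()).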

Thanks,
Mark.

2022-01-20 12:01:04

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs



On 18.01.22 at 14:12, Mark Rutland wrote:
> On Tue, Jan 18, 2022 at 01:42:26PM +0100, Christian Borntraeger wrote:
>>
>>
>> On 18.01.22 at 13:02, Mark Rutland wrote:
>>> On Mon, Jan 17, 2022 at 06:45:36PM +0100, Paolo Bonzini wrote:
>>>> On 1/14/22 16:19, Mark Rutland wrote:
>>>>> I also think there is another issue here. When an IRQ is taken from SIE, will
>>>>> user_mode(regs) always be false, or could it be true if the guest userspace is
>>>>> running? If it can be true I think the context tracking checks can complain,
>>>>> and it *might* be possible to trigger a panic().
>>>>
>>>> I think that it would be false, because the guest PSW is in the SIE block
>>>> and switched on SIE entry and exit, but I might be incorrect.
>>>
>>> Ah; that's the crux of my confusion: I had thought the guest PSW would
>>> be placed in the regular lowcore *_old_psw slots. From looking at the
>>> entry asm it looks like the host PSW (around the invocation of SIE) is
>>> stored there, since that's what the OUTSIDE + SIEEXIT handling is
>>> checking for.
>>>
>>> Assuming that's correct, I agree this problem doesn't exist, and there's
>>> only the common RCU/tracing/lockdep management to fix.
>>
>> Will you provide an s390 patch in your next iteration, or shall we do
>> one as soon as there is a v2? We also need to look into vsie.c, where we
>> also call sie64a.
>
> I'm having a go at that now; my plan is to try to have an s390 patch as
> part of v2 in the next day or so.
>
> Now that I have a rough idea of how SIE and exception handling work on
> s390, I think the structural changes to kvm-s390.c:__vcpu_run() and
> vsie.c:do_vsie_run() are fairly simple.
>
> The only open bit is exactly how/where to identify when the interrupt
> entry code needs to wake RCU. I can add a per-cpu variable or thread
> flag to indicate that we're inside that EQS, or I could move the irq
> enable/disable into the sie64a asm and identify that with the OUTSIDE
> macro in the entry asm.
What exactly would the low-level interrupt handler need to do?

CC Sven, Heiko for the entry.S changes.

2022-01-20 14:41:07

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs

On Tue, Jan 18, 2022 at 03:15:51PM +0100, Christian Borntraeger wrote:
> On 18.01.22 at 14:12, Mark Rutland wrote:
> > On Tue, Jan 18, 2022 at 01:42:26PM +0100, Christian Borntraeger wrote:
> > >
> > >
> > > On 18.01.22 at 13:02, Mark Rutland wrote:
> > > > On Mon, Jan 17, 2022 at 06:45:36PM +0100, Paolo Bonzini wrote:
> > > > > On 1/14/22 16:19, Mark Rutland wrote:
> > > > > > I also think there is another issue here. When an IRQ is taken from SIE, will
> > > > > > user_mode(regs) always be false, or could it be true if the guest userspace is
> > > > > > running? If it can be true I think the context tracking checks can complain,
> > > > > > and it *might* be possible to trigger a panic().
> > > > >
> > > > > I think that it would be false, because the guest PSW is in the SIE block
> > > > > and switched on SIE entry and exit, but I might be incorrect.
> > > >
> > > > Ah; that's the crux of my confusion: I had thought the guest PSW would
> > > > be placed in the regular lowcore *_old_psw slots. From looking at the
> > > > entry asm it looks like the host PSW (around the invocation of SIE) is
> > > > stored there, since that's what the OUTSIDE + SIEEXIT handling is
> > > > checking for.
> > > >
> > > > Assuming that's correct, I agree this problem doesn't exist, and there's
> > > > only the common RCU/tracing/lockdep management to fix.
> > >
> > > Will you provide an s390 patch in your next iteration, or shall we do
> > > one as soon as there is a v2? We also need to look into vsie.c, where we
> > > also call sie64a.
> >
> > I'm having a go at that now; my plan is to try to have an s390 patch as
> > part of v2 in the next day or so.
> >
> > Now that I have a rough idea of how SIE and exception handling work on
> > s390, I think the structural changes to kvm-s390.c:__vcpu_run() and
> > vsie.c:do_vsie_run() are fairly simple.
> >
> > The only open bit is exactly how/where to identify when the interrupt
> > entry code needs to wake RCU. I can add a per-cpu variable or thread
> > flag to indicate that we're inside that EQS, or I could move the irq
> > enable/disable into the sie64a asm and identify that with the OUTSIDE
> > macro in the entry asm.
> What exactly would the low-level interrupt handler need to do?

Having looked around a bit, I think the best bet is to have
irqentry_enter() check PF_VCPU in addition to PF_IDLE (which it checks
via is_idle_task()), at which point nothing needs to change in the s390
entry code.
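
Concretely, something like the following in irqentry_enter(), mirroring
the existing idle-task special case (simplified sketch):

	/*
	 * An entry from the idle task *or* from a guest EQS (indicated
	 * by PF_VCPU) must wake RCU, whether RCU is watching or not.
	 */
	if (is_idle_task(current) || (current->flags & PF_VCPU)) {
		lockdep_hardirqs_off(CALLER_ADDR0);
		rcu_irq_enter();
		instrumentation_begin();
		trace_hardirqs_off_finish();
		instrumentation_end();

		ret.exit_rcu = true;
		return ret;
	}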

I'm currently implementing that, let me have a go, and then we can see
if that looks ok or whether we should do something else.

> CC Sven, Heiko for the entry.S changes.

I'll make sure you're all Cc'd when I send out v2 with s390 patches.

Thanks,
Mark.

2022-01-20 16:57:49

by Sven Schnelle

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs

Hi Mark,

Mark Rutland <[email protected]> writes:

> On Tue, Jan 18, 2022 at 01:42:26PM +0100, Christian Borntraeger wrote:
>>
>>
>> On 18.01.22 at 13:02, Mark Rutland wrote:
>> > On Mon, Jan 17, 2022 at 06:45:36PM +0100, Paolo Bonzini wrote:
>> > > On 1/14/22 16:19, Mark Rutland wrote:
>> > > > I also think there is another issue here. When an IRQ is taken from SIE, will
>> > > > user_mode(regs) always be false, or could it be true if the guest userspace is
>> > > > running? If it can be true I think the context tracking checks can complain,
>> > > > and it *might* be possible to trigger a panic().
>> > >
>> > > I think that it would be false, because the guest PSW is in the SIE block
>> > > and switched on SIE entry and exit, but I might be incorrect.
>> >
>> > Ah; that's the crux of my confusion: I had thought the guest PSW would
>> > be placed in the regular lowcore *_old_psw slots. From looking at the
>> > entry asm it looks like the host PSW (around the invocation of SIE) is
>> > stored there, since that's what the OUTSIDE + SIEEXIT handling is
>> > checking for.
>> >
>> > Assuming that's correct, I agree this problem doesn't exist, and there's
>> > only the common RCU/tracing/lockdep management to fix.
>>
>> Will you provide an s390 patch in your next iteration, or shall we do
>> one as soon as there is a v2? We also need to look into vsie.c, where we
>> also call sie64a.
>
> I'm having a go at that now; my plan is to try to have an s390 patch as
> part of v2 in the next day or so.
>
> Now that I have a rough idea of how SIE and exception handling work on
> s390, I think the structural changes to kvm-s390.c:__vcpu_run() and
> vsie.c:do_vsie_run() are fairly simple.
>
> The only open bit is exactly how/where to identify when the interrupt
> entry code needs to wake RCU. I can add a per-cpu variable or thread
> flag to indicate that we're inside that EQS, or I could move the irq
> enable/disable into the sie64a asm and identify that with the OUTSIDE
> macro in the entry asm.

I wonder whether the code in irqentry_enter() should call a function
is_eqs() instead of is_idle_task(). The default implementation would
be just a

#ifndef is_eqs
#define is_eqs is_idle_task
#endif

and if an architecture has special requirements, it could just define
is_eqs() and do the required checks there. This way the architecture
could define whether it's a percpu bit, a cpu flag or something else.
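
i.e. in irqentry_enter() the idle check would become just (modulo the
exact shape of the condition there):

-	if (is_idle_task(current)) {
+	if (is_eqs(current)) {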

/Sven

2022-01-20 21:22:43

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs

On Tue, Jan 18, 2022 at 05:09:25PM +0100, Sven Schnelle wrote:
> Hi Mark,

Hi Sven,

> Mark Rutland <[email protected]> writes:
> > On Tue, Jan 18, 2022 at 01:42:26PM +0100, Christian Borntraeger wrote:
> >> Will you provide an s390 patch in your next iteration, or shall we do
> >> one as soon as there is a v2? We also need to look into vsie.c, where we
> >> also call sie64a.
> >
> > I'm having a go at that now; my plan is to try to have an s390 patch as
> > part of v2 in the next day or so.
> >
> > Now that I have a rough idea of how SIE and exception handling work on
> > s390, I think the structural changes to kvm-s390.c:__vcpu_run() and
> > vsie.c:do_vsie_run() are fairly simple.
> >
> > The only open bit is exactly how/where to identify when the interrupt
> > entry code needs to wake RCU. I can add a per-cpu variable or thread
> > flag to indicate that we're inside that EQS, or I could move the irq
> > enable/disable into the sie64a asm and identify that with the OUTSIDE
> > macro in the entry asm.
>
> I wonder whether the code in irqentry_enter() should call a function
> is_eqs() instead of is_idle_task(). The default implementation would
> be just a
>
> #ifndef is_eqs
> #define is_eqs is_idle_task
> #endif
>
> and if an architecture has special requirements, it could just define
> is_eqs() and do the required checks there. This way the architecture
> could define whether it's a percpu bit, a cpu flag or something else.

I had come to almost the same approach: I've added an arch_in_rcu_eqs()
which is checked in addition to the existing is_idle_task() check.

In the case of checking both is_idle_task() and PF_VCPU, I'm
assuming the compiler can merge the loads of current->flags, and there's
little gain by making this entirely architecture specific, but we can
always check that and/or reconsider in future.
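
For reference, the generic part is roughly (sketch; naming and
placement may still change):

/* overridable per-architecture, with a generic default: */
#ifndef arch_in_rcu_eqs
static __always_inline bool arch_in_rcu_eqs(void)
{
	return false;
}
#endif

... with irqentry_enter() checking is_idle_task(current) ||
arch_in_rcu_eqs() where it checks is_idle_task(current) alone today.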

Thanks,
Mark.

2022-01-20 21:27:25

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs

On Tue, Jan 18, 2022 at 05:50:51PM +0000, Mark Rutland wrote:
> On Tue, Jan 18, 2022 at 05:09:25PM +0100, Sven Schnelle wrote:
> > I wonder whether the code in irqentry_enter() should call a function
> > is_eqs() instead of is_idle_task(). The default implementation would
> > be just a
> >
> > #ifndef is_eqs
> > #define is_eqs is_idle_task
> > #endif
> >
> > and if an architecture has special requirements, it could just define
> > is_eqs() and do the required checks there. This way the architecture
> > could define whether it's a percpu bit, a cpu flag or something else.
>
> I had come to almost the same approach: I've added an arch_in_rcu_eqs()
> which is checked in addition to the existing is_idle_task() check.
>
> In the case of checking both is_idle_task() and PF_VCPU, I'm
> assuming the compiler can merge the loads of current->flags, and there's
> little gain by making this entirely architecture specific, but we can
> always check that and/or reconsider in future.

FWIW, I've pushed out my WIP to:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kvm/entry-rework

... and I intend to clean that up and get it out on the list tomorrow.

The new entry/exit helpers are:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=kvm/entry-rework&id=df292ecabba50145849d8c8888cec9153267b31d

The arch_in_rcu_eqs() bit is:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=kvm/entry-rework&id=6e24c5ed7558ee7a4c95dfe62891dfdc51e6c6c4

The s390 changes are:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=kvm/entry-rework&id=ca8daba1809b6e4f1be425ca93f6373a2ea0af6b

I need to clean up the commit messages (including typos, TODOs, and
deleting some stale gunk), and there are some comments to write, but by
and large I think the structure is about right.

Thanks,
Mark.

2022-01-21 16:59:02

by Sven Schnelle

[permalink] [raw]
Subject: Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs

Hi Mark,

Mark Rutland <[email protected]> writes:

> On Tue, Jan 18, 2022 at 05:09:25PM +0100, Sven Schnelle wrote:
>> Mark Rutland <[email protected]> writes:
>> > On Tue, Jan 18, 2022 at 01:42:26PM +0100, Christian Borntraeger wrote:
>> >> Will you provide an s390 patch in your next iteration, or shall we do
>> >> one as soon as there is a v2? We also need to look into vsie.c, where we
>> >> also call sie64a.
>> >
>> > I'm having a go at that now; my plan is to try to have an s390 patch as
>> > part of v2 in the next day or so.
>> >
>> > Now that I have a rough idea of how SIE and exception handling work on
>> > s390, I think the structural changes to kvm-s390.c:__vcpu_run() and
>> > vsie.c:do_vsie_run() are fairly simple.
>> >
>> > The only open bit is exactly how/where to identify when the interrupt
>> > entry code needs to wake RCU. I can add a per-cpu variable or thread
>> > flag to indicate that we're inside that EQS, or I could move the irq
>> > enable/disable into the sie64a asm and identify that with the OUTSIDE
>> > macro in the entry asm.
>>
>> I wonder whether the code in irqentry_enter() should call a function
>> is_eqs() instead of is_idle_task(). The default implementation would
>> be just a
>>
>> #ifndef is_eqs
>> #define is_eqs is_idle_task
>> #endif
>>
>> and if an architecture has special requirements, it could just define
>> is_eqs() and do the required checks there. This way the architecture
>> could define whether it's a percpu bit, a cpu flag or something else.
>
> I had come to almost the same approach: I've added an arch_in_rcu_eqs()
> which is checked in addition to the existing is_idle_task() check.
>

Sounds good, thanks!

/Sven