Several architectures have latent bugs around guest entry/exit. This
series addresses those for:
arm64, mips, riscv, s390, x86
However, I'm not sure how to address powerpc and could do with some help
there. I have build-tested the arm64, mips, riscv, s390, and x86 cases,
but I don't have a suitable HW setup to test these, so any review and/or
testing would be much appreciated.
Issues include:
1) Several architectures enable interrupts between guest_enter() and
guest_exit(). As this period is an RCU extended quiescent state (EQS)
this is unsound unless the irq entry code explicitly wakes RCU, which
most architectures only do for entry from userspace or idle (a sketch of
the problematic pattern follows this list).
I believe this affects: arm64, riscv, s390
I am not sure about powerpc.
2) Several architectures permit instrumentation of code between
guest_enter() and guest_exit(), e.g. KASAN, KCOV, KCSAN, etc. As
instrumentation may directly or indirectly use RCU, this has the same
problems as with interrupts.
I believe this affects: arm64, mips, powerpc, riscv, s390
3) Several architectures do not inform lockdep and tracing that
interrupts are enabled during the execution of the guest, or do so in
an incorrect order. Generally this means that logs will report IRQs
being masked for much longer than is actually the case, which is not
ideal for debugging. I don't know whether this affects the
correctness of lockdep.
I believe this affects: arm64, mips, powerpc, riscv, s390
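To illustrate issue (1), the problematic shape on the affected
architectures is roughly the following (an illustrative sketch only, not
any architecture's exact code):

  local_irq_disable();
  guest_enter_irqoff();           /* enter an RCU EQS */
  local_irq_enable();

  < run the guest; any IRQ taken here runs without RCU watching >

  local_irq_disable();
  guest_exit_irqoff();            /* exit the RCU EQS */
  local_irq_enable();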
This was previously fixed for x86 specifically in a series of commits:
87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
But other architectures were left broken, and the infrastructure for
handling this correctly is x86-specific.
This series introduces generic helper functions which can be used to
handle the problems above, and migrates architectures over to these,
fixing the latent issues. For s390, where the KVM guest EQS is
interruptible, I've added infrastructure to wake RCU during this EQS.
Since v1 [1]:
* Add arch_in_rcu_eqs()
* Convert s390
* Rename exit_to_guest_mode() -> guest_state_enter_irqoff()
* Rename enter_from_guest_mode() -> guest_state_exit_irqoff()
* Various commit message cleanups
I've pushed the series (based on v5.16) to my kvm/entry-rework branch:
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kvm/entry-rework
git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git kvm/entry-rework
... with this version tagged as kvm-entry-rework-20210119.
[1] https://lore.kernel.org/r/[email protected]/
Thanks,
Mark.
Mark Rutland (7):
entry: add arch_in_rcu_eqs()
kvm: add guest_state_{enter,exit}_irqoff()
kvm/arm64: rework guest entry logic
kvm/mips: rework guest entry logic
kvm/riscv: rework guest entry logic
kvm/s390: rework guest entry logic
kvm/x86: rework guest entry logic
arch/arm64/kvm/arm.c | 51 +++++++-----
arch/mips/kvm/mips.c | 37 ++++++++-
arch/riscv/kvm/vcpu.c | 44 +++++++----
arch/s390/include/asm/entry-common.h | 10 +++
arch/s390/include/asm/kvm_host.h | 3 +
arch/s390/kvm/kvm-s390.c | 49 +++++++++---
arch/s390/kvm/vsie.c | 17 ++--
arch/x86/kvm/svm/svm.c | 4 +-
arch/x86/kvm/vmx/vmx.c | 4 +-
arch/x86/kvm/x86.c | 4 +-
arch/x86/kvm/x86.h | 45 -----------
include/linux/entry-common.h | 16 ++++
include/linux/kvm_host.h | 112 ++++++++++++++++++++++++++-
kernel/entry/common.c | 3 +-
14 files changed, 286 insertions(+), 113 deletions(-)
--
2.30.2
When transitioning to/from guest mode, it is necessary to inform
lockdep, tracing, and RCU in a specific order, similar to the
requirements for transitions to/from user mode. Additionally, it is
necessary to perform vtime accounting for a window around running the
guest, with RCU enabled, such that timer interrupts taken from the guest
can be accounted as guest time.
Most architectures don't handle all the necessary pieces, and have a
number of common bugs, including unsafe usage of RCU during the window
between guest_enter() and guest_exit().
On x86, this was dealt with across commits:
87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
... but those fixes are specific to x86, and as the resulting logic
(while correct) is split across generic helper functions and
x86-specific helper functions, it is difficult to see that the
entry/exit accounting is balanced.
This patch adds generic helpers which architectures can use to handle
guest entry/exit consistently and correctly. The guest_{enter,exit}()
helpers are split into guest_timing_{enter,exit}() to perform vtime
accounting, and guest_context_{enter,exit}() to perform the necessary
context tracking and RCU management. The existing guest_{enter,exit}()
helpers are left as wrappers of these.
Atop this, new guest_state_enter_irqoff() and guest_state_exit_irqoff()
helpers are added to handle the ordering of lockdep, tracing, and RCU
management. These are intended to mirror exit_to_user_mode() and
enter_from_user_mode().
Subsequent patches will migrate architectures over to the new helpers,
following a sequence:
guest_timing_enter_irqoff();
guest_state_enter_irqoff();
< run the vcpu >
guest_state_exit_irqoff();
< take any pending IRQs >
guest_timing_exit_irqoff();
This sequence handles all of the above correctly, and more clearly
balances the entry and exit portions, making it easier to understand.
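As a rough illustration, a converted architecture ends up with something
like the following (a minimal sketch; kvm_arch_run_vcpu() is a
hypothetical stand-in for the architecture's low-level entry code, which
must live in a noinstr helper):

  local_irq_disable();

  guest_timing_enter_irqoff();

  /* noinstr section: no RCU, tracing, lockdep, or instrumentation */
  guest_state_enter_irqoff();
  ret = kvm_arch_run_vcpu(vcpu);
  guest_state_exit_irqoff();

  /* transiently unmask IRQs so pending ticks are accounted to the guest */
  local_irq_enable();
  local_irq_disable();

  guest_timing_exit_irqoff();

  local_irq_enable();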
The existing helpers are marked as deprecated, and will be removed once
all architectures have been converted.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <[email protected]>
Reviewed-by: Marc Zyngier <[email protected]>
Cc: Paolo Bonzini <[email protected]>
---
include/linux/kvm_host.h | 112 +++++++++++++++++++++++++++++++++++++--
1 file changed, 109 insertions(+), 3 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c310648cc8f1a..774a3b9e9bd8d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -29,6 +29,8 @@
#include <linux/refcount.h>
#include <linux/nospec.h>
#include <linux/notifier.h>
+#include <linux/ftrace.h>
+#include <linux/instrumentation.h>
#include <asm/signal.h>
#include <linux/kvm.h>
@@ -362,8 +364,11 @@ struct kvm_vcpu {
int last_used_slot;
};
-/* must be called with irqs disabled */
-static __always_inline void guest_enter_irqoff(void)
+/*
+ * Start accounting time towards a guest.
+ * Must be called before entering guest context.
+ */
+static __always_inline void guest_timing_enter_irqoff(void)
{
/*
* This is running in ioctl context so its safe to assume that it's the
@@ -372,7 +377,18 @@ static __always_inline void guest_enter_irqoff(void)
instrumentation_begin();
vtime_account_guest_enter();
instrumentation_end();
+}
+/*
+ * Enter guest context and enter an RCU extended quiescent state.
+ *
+ * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
+ * unsafe to use any code which may directly or indirectly use RCU, tracing
+ * (including IRQ flag tracing), or lockdep. All code in this period must be
+ * non-instrumentable.
+ */
+static __always_inline void guest_context_enter_irqoff(void)
+{
/*
* KVM does not hold any references to rcu protected data when it
* switches CPU into a guest mode. In fact switching to a guest mode
@@ -388,16 +404,79 @@ static __always_inline void guest_enter_irqoff(void)
}
}
-static __always_inline void guest_exit_irqoff(void)
+/*
+ * Deprecated. Architectures should move to guest_timing_enter_irqoff() and
+ * guest_state_enter_irqoff().
+ */
+static __always_inline void guest_enter_irqoff(void)
+{
+ guest_timing_enter_irqoff();
+ guest_context_enter_irqoff();
+}
+
+/**
+ * guest_state_enter_irqoff - Fixup state when entering a guest
+ *
+ * Entry to a guest will enable interrupts, but the kernel state is interrupts
+ * disabled when this is invoked. Also tell RCU about it.
+ *
+ * 1) Trace interrupts on state
+ * 2) Invoke context tracking if enabled to adjust RCU state
+ * 3) Tell lockdep that interrupts are enabled
+ *
+ * Invoked from architecture specific code before entering a guest.
+ * Must be called with interrupts disabled and the caller must be
+ * non-instrumentable.
+ * The caller has to invoke guest_timing_enter_irqoff() before this.
+ *
+ * Note: this is analogous to exit_to_user_mode().
+ */
+static __always_inline void guest_state_enter_irqoff(void)
+{
+ instrumentation_begin();
+ trace_hardirqs_on_prepare();
+ lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+ instrumentation_end();
+
+ guest_context_enter_irqoff();
+ lockdep_hardirqs_on(CALLER_ADDR0);
+}
+
+/*
+ * Exit guest context and exit an RCU extended quiescent state.
+ *
+ * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
+ * unsafe to use any code which may directly or indirectly use RCU, tracing
+ * (including IRQ flag tracing), or lockdep. All code in this period must be
+ * non-instrumentable.
+ */
+static __always_inline void guest_context_exit_irqoff(void)
{
context_tracking_guest_exit();
+}
+/*
+ * Stop accounting time towards a guest.
+ * Must be called after exiting guest context.
+ */
+static __always_inline void guest_timing_exit_irqoff(void)
+{
instrumentation_begin();
/* Flush the guest cputime we spent on the guest */
vtime_account_guest_exit();
instrumentation_end();
}
+/*
+ * Deprecated. Architectures should move to guest_state_exit_irqoff() and
+ * guest_timing_exit_irqoff().
+ */
+static __always_inline void guest_exit_irqoff(void)
+{
+ guest_context_exit_irqoff();
+ guest_timing_exit_irqoff();
+}
+
static inline void guest_exit(void)
{
unsigned long flags;
@@ -407,6 +486,33 @@ static inline void guest_exit(void)
local_irq_restore(flags);
}
+/**
+ * guest_state_exit_irqoff - Establish state when returning from guest mode
+ *
+ * Entry from a guest disables interrupts, but guest mode is traced as
+ * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
+ *
+ * 1) Tell lockdep that interrupts are disabled
+ * 2) Invoke context tracking if enabled to reactivate RCU
+ * 3) Trace interrupts off state
+ *
+ * Invoked from architecture specific code after exiting a guest.
+ * Must be invoked with interrupts disabled and the caller must be
+ * non-instrumentable.
+ * The caller has to invoke guest_timing_exit_irqoff() after this.
+ *
+ * Note: this is analogous to enter_from_user_mode().
+ */
+static __always_inline void guest_state_exit_irqoff(void)
+{
+ lockdep_hardirqs_off(CALLER_ADDR0);
+ guest_context_exit_irqoff();
+
+ instrumentation_begin();
+ trace_hardirqs_off_finish();
+ instrumentation_end();
+}
+
static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
{
/*
--
2.30.2
In kvm_arch_vcpu_ioctl_run() we use guest_enter_irqoff() and
guest_exit_irqoff() directly, with interrupts masked between these. As
we don't handle any timer ticks during this window, we will not account
time spent within the guest as guest time, which is unfortunate.
Additionally, we do not inform lockdep or tracing that interrupts will
be enabled during guest execution, which can lead to misleading traces
and warnings that interrupts have been enabled for overly-long periods.
This patch fixes these issues by using the new timing and context
entry/exit helpers to ensure that interrupts are handled during guest
vtime but with RCU watching, with a sequence:
guest_timing_enter_irqoff();
guest_state_enter_irqoff();
< run the vcpu >
guest_state_exit_irqoff();
< take any pending IRQs >
guest_timing_exit_irqoff();
Since instrumentation may make use of RCU, we must also ensure that no
instrumented code is run during the EQS. I've split out the critical
section into a new kvm_mips_vcpu_enter_exit() helper which is marked
noinstr.
Signed-off-by: Mark Rutland <[email protected]>
Cc: Aleksandar Markovic <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
---
arch/mips/kvm/mips.c | 37 ++++++++++++++++++++++++++++++++++---
1 file changed, 34 insertions(+), 3 deletions(-)
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index aa20d074d3883..1a961c2434fee 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -438,6 +438,24 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
return -ENOIOCTLCMD;
}
+/*
+ * Actually run the vCPU, entering an RCU extended quiescent state (EQS) while
+ * the vCPU is running.
+ *
+ * This must be noinstr as instrumentation may make use of RCU, and this is not
+ * safe during the EQS.
+ */
+static int noinstr kvm_mips_vcpu_enter_exit(struct kvm_vcpu *vcpu)
+{
+ int ret;
+
+ guest_state_enter_irqoff();
+ ret = kvm_mips_callbacks->vcpu_run(vcpu);
+ guest_state_exit_irqoff();
+
+ return ret;
+}
+
int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
{
int r = -EINTR;
@@ -458,7 +476,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
lose_fpu(1);
local_irq_disable();
- guest_enter_irqoff();
+ guest_timing_enter_irqoff();
trace_kvm_enter(vcpu);
/*
@@ -469,10 +487,23 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
*/
smp_store_mb(vcpu->mode, IN_GUEST_MODE);
- r = kvm_mips_callbacks->vcpu_run(vcpu);
+ r = kvm_mips_vcpu_enter_exit(vcpu);
+
+ /*
+ * We must ensure that any pending interrupts are taken before
+ * we exit guest timing so that timer ticks are accounted as
+ * guest time. Transiently unmask interrupts so that any
+ * pending interrupts are taken.
+ *
+ * TODO: is there a barrier which ensures that pending interrupts are
+ * recognised? Currently this just hopes that the CPU takes any pending
+ * interrupts between the enable and disable.
+ */
+ local_irq_enable();
+ local_irq_disable();
trace_kvm_out(vcpu);
- guest_exit_irqoff();
+ guest_timing_exit_irqoff();
local_irq_enable();
out:
--
2.30.2
In kvm_arch_vcpu_ioctl_run() we enter an RCU extended quiescent state
(EQS) by calling guest_enter_irqoff(), and unmask IRQs prior to
exiting the EQS by calling guest_exit(). As the IRQ entry code will not
wake RCU in this case, we may run the core IRQ code and IRQ handler
without RCU watching, leading to various potential problems.
Additionally, we do not inform lockdep or tracing that interrupts will
be enabled during guest execution, which can lead to misleading traces
and warnings that interrupts have been enabled for overly-long periods.
This patch fixes these issues by using the new timing and context
entry/exit helpers to ensure that interrupts are handled during guest
vtime but with RCU watching, with a sequence:
guest_timing_enter_irqoff();
guest_state_enter_irqoff();
< run the vcpu >
guest_state_exit_irqoff();
< take any pending IRQs >
guest_timing_exit_irqoff();
Since instrumentation may make use of RCU, we must also ensure that no
instrumented code is run during the EQS. I've split out the critical
section into a new kvm_arm_vcpu_enter_exit() helper which is marked
noinstr.
Fixes: 1b3d546daf85ed2b ("arm/arm64: KVM: Properly account for guest CPU time")
Reported-by: Nicolas Saenz Julienne <[email protected]>
Signed-off-by: Mark Rutland <[email protected]>
Reviewed-by: Marc Zyngier <[email protected]>
Cc: Alexandru Elisei <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: James Morse <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: Will Deacon <[email protected]>
---
arch/arm64/kvm/arm.c | 51 ++++++++++++++++++++++++++++----------------
1 file changed, 33 insertions(+), 18 deletions(-)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e4727dc771bf3..b2222d8eb0b55 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -764,6 +764,24 @@ static bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu, int *ret)
xfer_to_guest_mode_work_pending();
}
+/*
+ * Actually run the vCPU, entering an RCU extended quiescent state (EQS) while
+ * the vCPU is running.
+ *
+ * This must be noinstr as instrumentation may make use of RCU, and this is not
+ * safe during the EQS.
+ */
+static int noinstr kvm_arm_vcpu_enter_exit(struct kvm_vcpu *vcpu)
+{
+ int ret;
+
+ guest_state_enter_irqoff();
+ ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
+ guest_state_exit_irqoff();
+
+ return ret;
+}
+
/**
* kvm_arch_vcpu_ioctl_run - the main VCPU run function to execute guest code
* @vcpu: The VCPU pointer
@@ -854,9 +872,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
* Enter the guest
*/
trace_kvm_entry(*vcpu_pc(vcpu));
- guest_enter_irqoff();
+ guest_timing_enter_irqoff();
- ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
+ ret = kvm_arm_vcpu_enter_exit(vcpu);
vcpu->mode = OUTSIDE_GUEST_MODE;
vcpu->stat.exits++;
@@ -891,26 +909,23 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
kvm_arch_vcpu_ctxsync_fp(vcpu);
/*
- * We may have taken a host interrupt in HYP mode (ie
- * while executing the guest). This interrupt is still
- * pending, as we haven't serviced it yet!
+ * We must ensure that any pending interrupts are taken before
+ * we exit guest timing so that timer ticks are accounted as
+ * guest time. Transiently unmask interrupts so that any
+ * pending interrupts are taken.
*
- * We're now back in SVC mode, with interrupts
- * disabled. Enabling the interrupts now will have
- * the effect of taking the interrupt again, in SVC
- * mode this time.
+ * Per ARM DDI 0487G.b section D1.13.4, an ISB (or other
+ * context synchronization event) is necessary to ensure that
+ * pending interrupts are taken.
*/
local_irq_enable();
+ isb();
+ local_irq_disable();
+
+ guest_timing_exit_irqoff();
+
+ local_irq_enable();
- /*
- * We do local_irq_enable() before calling guest_exit() so
- * that if a timer interrupt hits while running the guest we
- * account that tick as being spent in the guest. We enable
- * preemption after calling guest_exit() so that if we get
- * preempted we make sure ticks after that is not counted as
- * guest time.
- */
- guest_exit();
trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
/* Exit types that need handling before we can be preempted */
--
2.30.2
In __vcpu_run() and do_vsie_run(), we enter an RCU extended quiescent
state (EQS) by calling guest_enter_irqoff(), which lasts until
__vcpu_run() calls guest_exit_irqoff(). However, between the two we
enable interrupts and may handle interrupts during the EQS. As the IRQ
entry code will not wake RCU in this case, we may run the core IRQ code
and IRQ handler without RCU watching, leading to various potential
problems.
It is necessary to unmask (host) interrupts around entering the guest,
as entering the guest via SIE will not automatically unmask these. When
a host interrupt is taken from a guest, it is taken via its regular
host IRQ handler rather than being treated as a direct exit from SIE.
Due to this, we cannot simply mask interrupts around guest entry, and
must handle interrupts during this window, waking RCU as required.
Additionally, between guest_enter_irqoff() and guest_exit_irqoff(), we
use local_irq_enable() and local_irq_disable() to unmask interrupts,
violating the ordering requirements for RCU/lockdep/tracing around
entry/exit sequences. Further, since this occurs in an instrumentable
function, it's possible that instrumented code runs during this window,
with potential usage of RCU, etc.
To fix the RCU wakeup problem, an s390 implementation of
arch_in_rcu_eqs() is added which checks for PF_VCPU in current->flags.
PF_VCPU is set/cleared by guest_timing_{enter,exit}_irqoff(), which
surround the actual guest entry.
To fix the remaining issues, the lower-level guest entry logic is moved
into a shared noinstr helper function using the
guest_state_{enter,exit}_irqoff() helpers. These perform all the
lockdep/RCU/tracing manipulation necessary, but as sie64a() does not
enable/disable interrupts, we must do this explicitly with the
non-instrumented arch_local_irq_{enable,disable}() helpers:
guest_state_enter_irqoff()
arch_local_irq_enable();
sie64a(...);
arch_local_irq_disable();
guest_state_exit_irqoff();
Signed-off-by: Mark Rutland <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Janosch Frank <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Vasily Gorbik <[email protected]>
---
arch/s390/include/asm/entry-common.h | 10 ++++++
arch/s390/include/asm/kvm_host.h | 3 ++
arch/s390/kvm/kvm-s390.c | 49 +++++++++++++++++++++-------
arch/s390/kvm/vsie.c | 17 ++++------
4 files changed, 58 insertions(+), 21 deletions(-)
diff --git a/arch/s390/include/asm/entry-common.h b/arch/s390/include/asm/entry-common.h
index 17aead80aadba..e69a2ab28b847 100644
--- a/arch/s390/include/asm/entry-common.h
+++ b/arch/s390/include/asm/entry-common.h
@@ -57,6 +57,16 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
#define arch_exit_to_user_mode_prepare arch_exit_to_user_mode_prepare
+static __always_inline bool arch_in_rcu_eqs(void)
+{
+ if (IS_ENABLED(CONFIG_KVM))
+ return current->flags & PF_VCPU;
+
+ return false;
+}
+
+#define arch_in_rcu_eqs arch_in_rcu_eqs
+
static inline bool on_thread_stack(void)
{
return !(((unsigned long)(current->stack) ^ current_stack_pointer()) & ~(THREAD_SIZE - 1));
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index a604d51acfc83..bf7efd990039b 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -995,6 +995,9 @@ void kvm_arch_crypto_set_masks(struct kvm *kvm, unsigned long *apm,
extern int sie64a(struct kvm_s390_sie_block *, u64 *);
extern char sie_exit;
+extern int kvm_s390_enter_exit_sie(struct kvm_s390_sie_block *scb,
+ u64 *gprs);
+
extern int kvm_s390_gisc_register(struct kvm *kvm, u32 gisc);
extern int kvm_s390_gisc_unregister(struct kvm *kvm, u32 gisc);
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 14a18ba5ff2c8..d13401bf6a5a2 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -4169,6 +4169,30 @@ static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
return vcpu_post_run_fault_in_sie(vcpu);
}
+int noinstr kvm_s390_enter_exit_sie(struct kvm_s390_sie_block *scb,
+ u64 *gprs)
+{
+ int ret;
+
+ guest_state_enter_irqoff();
+
+ /*
+ * The guest_state_{enter,exit}_irqoff() functions inform lockdep and
+ * tracing that entry to the guest will enable host IRQs, and exit from
+ * the guest will disable host IRQs.
+ *
+ * We must not use lockdep/tracing/RCU in this critical section, so we
+ * use the low-level arch_local_irq_*() helpers to enable/disable IRQs.
+ */
+ arch_local_irq_enable();
+ ret = sie64a(scb, gprs);
+ arch_local_irq_disable();
+
+ guest_state_exit_irqoff();
+
+ return ret;
+}
+
#define PSW_INT_MASK (PSW_MASK_EXT | PSW_MASK_IO | PSW_MASK_MCHECK)
static int __vcpu_run(struct kvm_vcpu *vcpu)
{
@@ -4189,12 +4213,9 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
/*
* As PF_VCPU will be used in fault handler, between
- * guest_enter and guest_exit should be no uaccess.
+ * guest_timing_enter_irqoff and guest_timing_exit_irqoff
+ * should be no uaccess.
*/
- local_irq_disable();
- guest_enter_irqoff();
- __disable_cpu_timer_accounting(vcpu);
- local_irq_enable();
if (kvm_s390_pv_cpu_is_protected(vcpu)) {
memcpy(sie_page->pv_grregs,
vcpu->run->s.regs.gprs,
@@ -4202,8 +4223,18 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
}
if (test_cpu_flag(CIF_FPU))
load_fpu_regs();
- exit_reason = sie64a(vcpu->arch.sie_block,
- vcpu->run->s.regs.gprs);
+
+ local_irq_disable();
+ guest_timing_enter_irqoff();
+ __disable_cpu_timer_accounting(vcpu);
+
+ exit_reason = kvm_s390_enter_exit_sie(vcpu->arch.sie_block,
+ vcpu->run->s.regs.gprs);
+
+ __enable_cpu_timer_accounting(vcpu);
+ guest_timing_exit_irqoff();
+ local_irq_enable();
+
if (kvm_s390_pv_cpu_is_protected(vcpu)) {
memcpy(vcpu->run->s.regs.gprs,
sie_page->pv_grregs,
@@ -4219,10 +4250,6 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
vcpu->arch.sie_block->gpsw.mask &= ~PSW_INT_MASK;
}
}
- local_irq_disable();
- __enable_cpu_timer_accounting(vcpu);
- guest_exit_irqoff();
- local_irq_enable();
vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
rc = vcpu_post_run(vcpu, exit_reason);
diff --git a/arch/s390/kvm/vsie.c b/arch/s390/kvm/vsie.c
index acda4b6fc8518..e9b0b2d04e1e3 100644
--- a/arch/s390/kvm/vsie.c
+++ b/arch/s390/kvm/vsie.c
@@ -1106,10 +1106,6 @@ static int do_vsie_run(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
vcpu->arch.sie_block->fpf & FPF_BPBC)
set_thread_flag(TIF_ISOLATE_BP_GUEST);
- local_irq_disable();
- guest_enter_irqoff();
- local_irq_enable();
-
/*
* Simulate a SIE entry of the VCPU (see sie64a), so VCPU blocking
* and VCPU requests also hinder the vSIE from running and lead
@@ -1120,15 +1116,16 @@ static int do_vsie_run(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
barrier();
if (test_cpu_flag(CIF_FPU))
load_fpu_regs();
- if (!kvm_s390_vcpu_sie_inhibited(vcpu))
- rc = sie64a(scb_s, vcpu->run->s.regs.gprs);
+ if (!kvm_s390_vcpu_sie_inhibited(vcpu)) {
+ local_irq_disable();
+ guest_timing_enter_irqoff();
+ rc = kvm_s390_enter_exit_sie(scb_s, vcpu->run->s.regs.gprs);
+ guest_timing_exit_irqoff();
+ local_irq_enable();
+ }
barrier();
vcpu->arch.sie_block->prog0c &= ~PROG_IN_SIE;
- local_irq_disable();
- guest_exit_irqoff();
- local_irq_enable();
-
/* restore guest state for bp isolation override */
if (!guest_bp_isolation)
clear_thread_flag(TIF_ISOLATE_BP_GUEST);
--
2.30.2
On 19.01.22 at 11:58, Mark Rutland wrote:
CCing new emails for Anup and Atish so that they are aware of this thread.
> Several architectures have latent bugs around guest entry/exit. This
> series addresses those for:
>
> arm64, mips, riscv, s390, x86
>
> However, I'm not sure how to address powerpc and could do with some help
> there. I have build-tested the arm64, mips, riscv, s390, and x86 cases,
> but I don't have a suitable HW setup to test these, so any review and/or
> testing would be much appreciated.
>
> Issues include:
>
> 1) Several architectures enable interrupts between guest_enter() and
> guest_exit(). As this period is an RCU extended quiescent state (EQS)
> this is unsound unless the irq entry code explicitly wakes RCU, which
> most architectures only do for entry from userspace or idle.
>
> I believe this affects: arm64, riscv, s390
>
> I am not sure about powerpc.
>
> 2) Several architectures permit instrumentation of code between
> guest_enter() and guest_exit(), e.g. KASAN, KCOV, KCSAN, etc. As
> instrumentation may directly or indirectly use RCU, this has the same
> problems as with interrupts.
>
> I believe this affects: arm64, mips, powerpc, riscv, s390
>
> 3) Several architectures do not inform lockdep and tracing that
> interrupts are enabled during the execution of the guest, or do so in
> an incorrect order. Generally this means that logs will report IRQs
> being masked for much longer than is actually the case, which is not
> ideal for debugging. I don't know whether this affects the
> correctness of lockdep.
>
> I believe this affects: arm64, mips, powerpc, riscv, s390
>
> This was previously fixed for x86 specifically in a series of commits:
>
> 87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
> 0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
> 9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
> 3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
> 135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
> 160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
> bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
>
> But other architectures were left broken, and the infrastructure for
> handling this correctly is x86-specific.
>
> This series introduces generic helper functions which can be used to
> handle the problems above, and migrates architectures over to these,
> fixing the latent issues. For s390, where the KVM guest EQS is
> interruptible, I've added infrastructure to wake RCU during this EQS.
>
> Since v1 [1]:
> * Add arch_in_rcu_eqs()
> * Convert s390
> * Rename exit_to_guest_mode() -> guest_state_enter_irqoff()
> * Rename enter_from_guest_mode() -> guest_state_exit_irqoff()
> * Various commit message cleanups
>
> I've pushed the series (based on v5.16) to my kvm/entry-rework branch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kvm/entry-rework
> git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git kvm/entry-rework
>
> ... with this version tagged as kvm-entry-rework-20210119.
>
> [1] https://lore.kernel.org/r/[email protected]/
>
> Thanks,
> Mark.
>
> Mark Rutland (7):
> entry: add arch_in_rcu_eqs()
> kvm: add guest_state_{enter,exit}_irqoff()
> kvm/arm64: rework guest entry logic
> kvm/mips: rework guest entry logic
> kvm/riscv: rework guest entry logic
> kvm/s390: rework guest entry logic
> kvm/x86: rework guest entry logic
>
> arch/arm64/kvm/arm.c | 51 +++++++-----
> arch/mips/kvm/mips.c | 37 ++++++++-
> arch/riscv/kvm/vcpu.c | 44 +++++++----
> arch/s390/include/asm/entry-common.h | 10 +++
> arch/s390/include/asm/kvm_host.h | 3 +
> arch/s390/kvm/kvm-s390.c | 49 +++++++++---
> arch/s390/kvm/vsie.c | 17 ++--
> arch/x86/kvm/svm/svm.c | 4 +-
> arch/x86/kvm/vmx/vmx.c | 4 +-
> arch/x86/kvm/x86.c | 4 +-
> arch/x86/kvm/x86.h | 45 -----------
> include/linux/entry-common.h | 16 ++++
> include/linux/kvm_host.h | 112 ++++++++++++++++++++++++++-
> kernel/entry/common.c | 3 +-
> 14 files changed, 286 insertions(+), 113 deletions(-)
>
I just gave this a spin on s390 with debugging on and I got the following:
[ 457.151295] ------------[ cut here ]------------
[ 457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118
[ 457.151324] Modules linked in: vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs s390_trng ib_core genwqe_card crc_itu_t vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables mlx5_core ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
[ 457.151422] CPU: 14 PID: 0 Comm: swapper/14 Not tainted 5.16.0-00007-g89e9021389e2 #3
[ 457.151428] Hardware name: IBM 3906 M04 704 (LPAR)
[ 457.151432] Krnl PSW : 0404d00180000000 00000000a7c0495c (rcu_eqs_enter.constprop.0+0xfc/0x118)
[ 457.151440] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[ 457.151445] Krnl GPRS: ffffffffebd81d31 4000000000000000 0000000000000070 00000000a7fd7024
[ 457.151450] 0000000000000000 0000000000000001 0000000000000000 0000000000000000
[ 457.151454] 000000000000000e 000000000000000e 00000000a84d3a88 0000001fd8645c00
[ 457.151458] 0000000000000000 0000000000000000 00000000a7c04882 0000038000653dc0
[ 457.151468] Krnl Code: 00000000a7c0494c: ebaff0a00004 lmg %r10,%r15,160(%r15)
00000000a7c04952: c0f4fffffef7 brcl 15,00000000a7c04740
#00000000a7c04958: af000000 mc 0,0
>00000000a7c0495c: a7f4ffa3 brc 15,00000000a7c048a2
00000000a7c04960: c0e500003f70 brasl %r14,00000000a7c0c840
00000000a7c04966: a7f4ffcd brc 15,00000000a7c04900
00000000a7c0496a: c0e500003f6b brasl %r14,00000000a7c0c840
00000000a7c04970: a7f4ffde brc 15,00000000a7c0492c
[ 457.151527] Call Trace:
[ 457.151530] [<00000000a7c0495c>] rcu_eqs_enter.constprop.0+0xfc/0x118
[ 457.151536] ([<00000000a7c04882>] rcu_eqs_enter.constprop.0+0x22/0x118)
[ 457.151540] [<00000000a7c14cd2>] default_idle_call+0x62/0xd8
[ 457.151545] [<00000000a6f816c6>] do_idle+0xf6/0x1b0
[ 457.151553] [<00000000a6f81a06>] cpu_startup_entry+0x36/0x40
[ 457.151558] [<00000000a7c16abe>] restart_int_handler+0x6e/0x90
[ 457.151563] no locks held by swapper/14/0.
[ 457.151567] Last Breaking-Event-Address:
[ 457.151570] [<00000000a7c0489e>] rcu_eqs_enter.constprop.0+0x3e/0x118
[ 457.151574] irq event stamp: 608654
[ 457.151578] hardirqs last enabled at (608653): [<00000000a70190d8>] tick_nohz_idle_enter+0xb0/0x130
[ 457.151584] hardirqs last disabled at (608654): [<00000000a6f8173e>] do_idle+0x16e/0x1b0
[ 457.151589] softirqs last enabled at (608586): [<00000000a7c1861a>] __do_softirq+0x4ba/0x668
[ 457.151594] softirqs last disabled at (608581): [<00000000a6f367c6>] __irq_exit_rcu+0x13e/0x170
[ 457.151600] ---[ end trace 2ae2154f9724de86 ]---
I cannot see right now what's wrong, your patches look sane.
> [ 457.151295] ------------[ cut here ]------------
> [ 457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118
> [ 457.151324] Modules linked in: vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs s390_trng ib_core genwqe_card crc_itu_t vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables mlx5_core ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
> [ 457.151422] CPU: 14 PID: 0 Comm: swapper/14 Not tainted 5.16.0-00007-g89e9021389e2 #3
> [ 457.151428] Hardware name: IBM 3906 M04 704 (LPAR)
> [ 457.151432] Krnl PSW : 0404d00180000000 00000000a7c0495c (rcu_eqs_enter.constprop.0+0xfc/0x118)
> [ 457.151440] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> [ 457.151445] Krnl GPRS: ffffffffebd81d31 4000000000000000 0000000000000070 00000000a7fd7024
> [ 457.151450] 0000000000000000 0000000000000001 0000000000000000 0000000000000000
> [ 457.151454] 000000000000000e 000000000000000e 00000000a84d3a88 0000001fd8645c00
> [ 457.151458] 0000000000000000 0000000000000000 00000000a7c04882 0000038000653dc0
> [ 457.151468] Krnl Code: 00000000a7c0494c: ebaff0a00004 lmg %r10,%r15,160(%r15)
> 00000000a7c04952: c0f4fffffef7 brcl 15,00000000a7c04740
> #00000000a7c04958: af000000 mc 0,0
> >00000000a7c0495c: a7f4ffa3 brc 15,00000000a7c048a2
> 00000000a7c04960: c0e500003f70 brasl %r14,00000000a7c0c840
> 00000000a7c04966: a7f4ffcd brc 15,00000000a7c04900
> 00000000a7c0496a: c0e500003f6b brasl %r14,00000000a7c0c840
> 00000000a7c04970: a7f4ffde brc 15,00000000a7c0492c
> [ 457.151527] Call Trace:
> [ 457.151530] [<00000000a7c0495c>] rcu_eqs_enter.constprop.0+0xfc/0x118
> [ 457.151536] ([<00000000a7c04882>] rcu_eqs_enter.constprop.0+0x22/0x118)
> [ 457.151540] [<00000000a7c14cd2>] default_idle_call+0x62/0xd8
> [ 457.151545] [<00000000a6f816c6>] do_idle+0xf6/0x1b0
> [ 457.151553] [<00000000a6f81a06>] cpu_startup_entry+0x36/0x40
> [ 457.151558] [<00000000a7c16abe>] restart_int_handler+0x6e/0x90
> [ 457.151563] no locks held by swapper/14/0.
> [ 457.151567] Last Breaking-Event-Address:
> [ 457.151570] [<00000000a7c0489e>] rcu_eqs_enter.constprop.0+0x3e/0x118
> [ 457.151574] irq event stamp: 608654
> [ 457.151578] hardirqs last enabled at (608653): [<00000000a70190d8>] tick_nohz_idle_enter+0xb0/0x130
> [ 457.151584] hardirqs last disabled at (608654): [<00000000a6f8173e>] do_idle+0x16e/0x1b0
> [ 457.151589] softirqs last enabled at (608586): [<00000000a7c1861a>] __do_softirq+0x4ba/0x668
> [ 457.151594] softirqs last disabled at (608581): [<00000000a6f367c6>] __irq_exit_rcu+0x13e/0x170
> [ 457.151600] ---[ end trace 2ae2154f9724de86 ]---
>
> I can not see right now whats wrong, your patches look sane.
Now followed by this (might be a follow-up to the error above):
[ 574.301736] =============================
[ 574.301741] WARNING: suspicious RCU usage
[ 574.301746] 5.16.0-00007-g89e9021389e2 #3 Tainted: G W
[ 574.301751] -----------------------------
[ 574.301755] kernel/rcu/tree.c:821 Bad RCU dynticks_nmi_nesting counter
!
[ 574.301760]
other info that might help us debug this:
[ 574.301764]
rcu_scheduler_active = 2, debug_locks = 1
[ 574.301769] 2 locks held by CPU 0/KVM/8327:
[ 574.301773] #0: 00000001521956b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x9a/0xa40 [kvm]
[ 574.301843] #1: 000000016e92c6a8 (&kvm->srcu){....}-{0:0}, at: __vcpu_run+0x332/0x5b8 [kvm]
[ 574.301873]
stack backtrace:
[ 574.301878] CPU: 46 PID: 8327 Comm: CPU 0/KVM Tainted: G W 5.16.0-00007-g89e9021389e2 #3
[ 574.301884] Hardware name: IBM 3906 M04 704 (LPAR)
[ 574.301888] Call Trace:
[ 574.301892] [<00000000a7c001c6>] dump_stack_lvl+0x8e/0xc8
[ 574.301903] [<00000000a6fee70e>] rcu_irq_exit_check_preempt+0x136/0x1c8
[ 574.301913] [<00000000a6ff9d1a>] irqentry_exit_cond_resched+0x32/0x80
[ 574.301921] [<00000000a7c0521c>] irqentry_exit+0xac/0x130
[ 574.301927] [<00000000a7c16626>] ext_int_handler+0xe6/0x120
[ 574.301933] [<00000000a6fc2482>] lock_acquire.part.0+0x12a/0x258
[ 574.301939] ([<00000000a6fc2450>] lock_acquire.part.0+0xf8/0x258)
[ 574.301944] [<00000000a6fc2660>] lock_acquire+0xb0/0x200
[ 574.302053] [<000003ff807092ce>] __vcpu_run+0x376/0x5b8 [kvm]
[ 574.302076] [<000003ff807099ba>] kvm_arch_vcpu_ioctl_run+0x10a/0x270 [kvm]
[ 574.302098] [<000003ff806f002c>] kvm_vcpu_ioctl+0x27c/0xa40 [kvm]
[ 574.302120] [<00000000a728b5b6>] __s390x_sys_ioctl+0xbe/0x100
[ 574.302129] [<00000000a7c038ea>] __do_syscall+0x1da/0x208
[ 574.302133] [<00000000a7c16352>] system_call+0x82/0xb0
[ 574.302138] 2 locks held by CPU 0/KVM/8327:
[ 574.302143] #0: 00000001521956b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x9a/0xa40 [kvm]
[ 574.302172] #1: 000000016e92c6a8 (&kvm->srcu){....}-{0:0}, at: __vcpu_run+0x332/0x5b8 [kvm]
On Wed, Jan 19, 2022 at 07:25:20PM +0100, Christian Borntraeger wrote:
> On 19.01.22 at 11:58, Mark Rutland wrote:
>
>
> CCing new emails for Anup and Atish so that they are aware of this thread.
Ah; whoops. I'd meant to fix the Ccs on the patches.
Thanks!
[...]
> I just gave this a spin on s390 with debugging on and I got the following:
>
> [ 457.151295] ------------[ cut here ]------------
> [ 457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118
Hmm, so IIUC that's:
WARN_ON_ONCE(rdp->dynticks_nmi_nesting != DYNTICK_IRQ_NONIDLE);
... and we're clearly in the idle thread here.
I wonder, is the s390 guest entry/exit *preemptible* ?
If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
things before a ctx-switch to the idle thread, which would then be able
to hit this.
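If so, the suspected interleaving is roughly (my reading of the report,
not yet verified):

  vCPU task: enters the guest EQS, unmasks IRQs
    -> host timer IRQ taken
       -> irq entry wakes RCU (via arch_in_rcu_eqs())
       -> irq exit: irqentry_exit_cond_resched() preempts the vCPU task
          -> context switch to the idle thread with the EQS accounting
             still unbalanced
  idle thread: rcu_eqs_enter() hits the WARN_ON_ONCE() above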
I'll need to go audit the other architectures for similar.
Thanks,
Mark.
> [ 457.151324] Modules linked in: vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs s390_trng ib_core genwqe_card crc_itu_t vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables mlx5_core ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
> [ 457.151422] CPU: 14 PID: 0 Comm: swapper/14 Not tainted 5.16.0-00007-g89e9021389e2 #3
> [ 457.151428] Hardware name: IBM 3906 M04 704 (LPAR)
> [ 457.151432] Krnl PSW : 0404d00180000000 00000000a7c0495c (rcu_eqs_enter.constprop.0+0xfc/0x118)
> [ 457.151440] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> [ 457.151445] Krnl GPRS: ffffffffebd81d31 4000000000000000 0000000000000070 00000000a7fd7024
> [ 457.151450] 0000000000000000 0000000000000001 0000000000000000 0000000000000000
> [ 457.151454] 000000000000000e 000000000000000e 00000000a84d3a88 0000001fd8645c00
> [ 457.151458] 0000000000000000 0000000000000000 00000000a7c04882 0000038000653dc0
> [ 457.151468] Krnl Code: 00000000a7c0494c: ebaff0a00004 lmg %r10,%r15,160(%r15)
> 00000000a7c04952: c0f4fffffef7 brcl 15,00000000a7c04740
> #00000000a7c04958: af000000 mc 0,0
> >00000000a7c0495c: a7f4ffa3 brc 15,00000000a7c048a2
> 00000000a7c04960: c0e500003f70 brasl %r14,00000000a7c0c840
> 00000000a7c04966: a7f4ffcd brc 15,00000000a7c04900
> 00000000a7c0496a: c0e500003f6b brasl %r14,00000000a7c0c840
> 00000000a7c04970: a7f4ffde brc 15,00000000a7c0492c
> [ 457.151527] Call Trace:
> [ 457.151530] [<00000000a7c0495c>] rcu_eqs_enter.constprop.0+0xfc/0x118
> [ 457.151536] ([<00000000a7c04882>] rcu_eqs_enter.constprop.0+0x22/0x118)
> [ 457.151540] [<00000000a7c14cd2>] default_idle_call+0x62/0xd8
> [ 457.151545] [<00000000a6f816c6>] do_idle+0xf6/0x1b0
> [ 457.151553] [<00000000a6f81a06>] cpu_startup_entry+0x36/0x40
> [ 457.151558] [<00000000a7c16abe>] restart_int_handler+0x6e/0x90
> [ 457.151563] no locks held by swapper/14/0.
> [ 457.151567] Last Breaking-Event-Address:
> [ 457.151570] [<00000000a7c0489e>] rcu_eqs_enter.constprop.0+0x3e/0x118
> [ 457.151574] irq event stamp: 608654
> [ 457.151578] hardirqs last enabled at (608653): [<00000000a70190d8>] tick_nohz_idle_enter+0xb0/0x130
> [ 457.151584] hardirqs last disabled at (608654): [<00000000a6f8173e>] do_idle+0x16e/0x1b0
> [ 457.151589] softirqs last enabled at (608586): [<00000000a7c1861a>] __do_softirq+0x4ba/0x668
> [ 457.151594] softirqs last disabled at (608581): [<00000000a6f367c6>] __irq_exit_rcu+0x13e/0x170
> [ 457.151600] ---[ end trace 2ae2154f9724de86 ]---
>
> I can not see right now whats wrong, your patches look sane.
On 19.01.22 at 20:22, Mark Rutland wrote:
> On Wed, Jan 19, 2022 at 07:25:20PM +0100, Christian Borntraeger wrote:
>> On 19.01.22 at 11:58, Mark Rutland wrote:
>>
>>
>> CCing new emails for Anup and Atish so that they are aware of this thread.
>
> Ah; whoops. I'd meant to fix the Ccs on the patches.
>
> Thanks!
>
> [...]
>
>> I just gave this a spin on s390 with debugging on and I got the following:
>>
>> [ 457.151295] ------------[ cut here ]------------
>> [ 457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118
>
> Hmm, so IIUC that's:
>
> WARN_ON_ONCE(rdp->dynticks_nmi_nesting != DYNTICK_IRQ_NONIDLE);
>
> ... and we're clearly in the idle thread here.
>
> I wonder, is the s390 guest entry/exit *preemptible* ?
Looks like debug_defconfig is indeed using preemption:
CONFIG_PREEMPT_BUILD=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_DEBUG_PREEMPT=y
CONFIG_PREEMPTIRQ_TRACEPOINTS=y
CONFIG_TRACE_PREEMPT_TOGGLE=y
CONFIG_PREEMPT_TRACER=y
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
>
> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
> things before a ctx-switch to the idle thread, which would then be able
> to hit this.
>
> I'll need to go audit the other architectures for similar.
>
> Thanks,
> Mark.
>
>> [ 457.151324] Modules linked in: vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs s390_trng ib_core genwqe_card crc_itu_t vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables mlx5_core ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
>> [ 457.151422] CPU: 14 PID: 0 Comm: swapper/14 Not tainted 5.16.0-00007-g89e9021389e2 #3
>> [ 457.151428] Hardware name: IBM 3906 M04 704 (LPAR)
>> [ 457.151432] Krnl PSW : 0404d00180000000 00000000a7c0495c (rcu_eqs_enter.constprop.0+0xfc/0x118)
>> [ 457.151440] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
>> [ 457.151445] Krnl GPRS: ffffffffebd81d31 4000000000000000 0000000000000070 00000000a7fd7024
>> [ 457.151450] 0000000000000000 0000000000000001 0000000000000000 0000000000000000
>> [ 457.151454] 000000000000000e 000000000000000e 00000000a84d3a88 0000001fd8645c00
>> [ 457.151458] 0000000000000000 0000000000000000 00000000a7c04882 0000038000653dc0
>> [ 457.151468] Krnl Code: 00000000a7c0494c: ebaff0a00004 lmg %r10,%r15,160(%r15)
>> 00000000a7c04952: c0f4fffffef7 brcl 15,00000000a7c04740
>> #00000000a7c04958: af000000 mc 0,0
>> >00000000a7c0495c: a7f4ffa3 brc 15,00000000a7c048a2
>> 00000000a7c04960: c0e500003f70 brasl %r14,00000000a7c0c840
>> 00000000a7c04966: a7f4ffcd brc 15,00000000a7c04900
>> 00000000a7c0496a: c0e500003f6b brasl %r14,00000000a7c0c840
>> 00000000a7c04970: a7f4ffde brc 15,00000000a7c0492c
>> [ 457.151527] Call Trace:
>> [ 457.151530] [<00000000a7c0495c>] rcu_eqs_enter.constprop.0+0xfc/0x118
>> [ 457.151536] ([<00000000a7c04882>] rcu_eqs_enter.constprop.0+0x22/0x118)
>> [ 457.151540] [<00000000a7c14cd2>] default_idle_call+0x62/0xd8
>> [ 457.151545] [<00000000a6f816c6>] do_idle+0xf6/0x1b0
>> [ 457.151553] [<00000000a6f81a06>] cpu_startup_entry+0x36/0x40
>> [ 457.151558] [<00000000a7c16abe>] restart_int_handler+0x6e/0x90
>> [ 457.151563] no locks held by swapper/14/0.
>> [ 457.151567] Last Breaking-Event-Address:
>> [ 457.151570] [<00000000a7c0489e>] rcu_eqs_enter.constprop.0+0x3e/0x118
>> [ 457.151574] irq event stamp: 608654
>> [ 457.151578] hardirqs last enabled at (608653): [<00000000a70190d8>] tick_nohz_idle_enter+0xb0/0x130
>> [ 457.151584] hardirqs last disabled at (608654): [<00000000a6f8173e>] do_idle+0x16e/0x1b0
>> [ 457.151589] softirqs last enabled at (608586): [<00000000a7c1861a>] __do_softirq+0x4ba/0x668
>> [ 457.151594] softirqs last disabled at (608581): [<00000000a6f367c6>] __irq_exit_rcu+0x13e/0x170
>> [ 457.151600] ---[ end trace 2ae2154f9724de86 ]---
>>
>> I can not see right now whats wrong, your patches look sane.
On 1/19/22 11:58, Mark Rutland wrote:
> When transitioning to/from guest mode, it is necessary to inform
> lockdep, tracing, and RCU in a specific order, similar to the
> requirements for transitions to/from user mode. Additionally, it is
> necessary to perform vtime accounting for a window around running the
> guest, with RCU enabled, such that timer interrupts taken from the guest
> can be accounted as guest time.
>
> Most architectures don't handle all the necessary pieces, and have a
> number of common bugs, including unsafe usage of RCU during the window
> between guest_enter() and guest_exit().
>
> On x86, this was dealt with across commits:
>
> 87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
> 0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
> 9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
> 3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
> 135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
> 160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
> bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
>
> ... but those fixes are specific to x86, and as the resulting logic
> (while correct) is split across generic helper functions and
> x86-specific helper functions, it is difficult to see that the
> entry/exit accounting is balanced.
>
> This patch adds generic helpers which architectures can use to handle
> guest entry/exit consistently and correctly. The guest_{enter,exit}()
> helpers are split into guest_timing_{enter,exit}() to perform vtime
> accounting, and guest_context_{enter,exit}() to perform the necessary
> context tracking and RCU management. The existing guest_{enter,exit}()
> helpers are left as wrappers of these.
>
> Atop this, new guest_state_enter_irqoff() and guest_state_exit_irqoff()
> helpers are added to handle the ordering of lockdep, tracing, and RCU
> management. These are intended to mirror exit_to_user_mode() and
> enter_from_user_mode().
>
> Subsequent patches will migrate architectures over to the new helpers,
> following a sequence:
>
> guest_timing_enter_irqoff();
>
> guest_state_enter_irqoff();
> < run the vcpu >
> guest_state_exit_irqoff();
>
> < take any pending IRQs >
>
> guest_timing_exit_irqoff();
>
> This sequence handles all of the above correctly, and more clearly
> balances the entry and exit portions, making it easier to understand.
>
> The existing helpers are marked as deprecated, and will be removed once
> all architectures have been converted.
>
> There should be no functional change as a result of this patch.
>
> Signed-off-by: Mark Rutland <[email protected]>
> Reviewed-by: Marc Zyngier <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> ---
> include/linux/kvm_host.h | 112 +++++++++++++++++++++++++++++++++++++--
> 1 file changed, 109 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c310648cc8f1a..774a3b9e9bd8d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -29,6 +29,8 @@
> #include <linux/refcount.h>
> #include <linux/nospec.h>
> #include <linux/notifier.h>
> +#include <linux/ftrace.h>
> +#include <linux/instrumentation.h>
> #include <asm/signal.h>
>
> #include <linux/kvm.h>
> @@ -362,8 +364,11 @@ struct kvm_vcpu {
> int last_used_slot;
> };
>
> -/* must be called with irqs disabled */
> -static __always_inline void guest_enter_irqoff(void)
> +/*
> + * Start accounting time towards a guest.
> + * Must be called before entering guest context.
> + */
> +static __always_inline void guest_timing_enter_irqoff(void)
> {
> /*
> * This is running in ioctl context so its safe to assume that it's the
> @@ -372,7 +377,18 @@ static __always_inline void guest_enter_irqoff(void)
> instrumentation_begin();
> vtime_account_guest_enter();
> instrumentation_end();
> +}
>
> +/*
> + * Enter guest context and enter an RCU extended quiescent state.
> + *
> + * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
> + * unsafe to use any code which may directly or indirectly use RCU, tracing
> + * (including IRQ flag tracing), or lockdep. All code in this period must be
> + * non-instrumentable.
> + */
> +static __always_inline void guest_context_enter_irqoff(void)
> +{
> /*
> * KVM does not hold any references to rcu protected data when it
> * switches CPU into a guest mode. In fact switching to a guest mode
> @@ -388,16 +404,79 @@ static __always_inline void guest_enter_irqoff(void)
> }
> }
>
> -static __always_inline void guest_exit_irqoff(void)
> +/*
> + * Deprecated. Architectures should move to guest_timing_enter_irqoff() and
> + * guest_state_enter_irqoff().
> + */
> +static __always_inline void guest_enter_irqoff(void)
> +{
> + guest_timing_enter_irqoff();
> + guest_context_enter_irqoff();
> +}
> +
> +/**
> + * guest_state_enter_irqoff - Fixup state when entering a guest
> + *
> + * Entry to a guest will enable interrupts, but the kernel state is interrupts
> + * disabled when this is invoked. Also tell RCU about it.
> + *
> + * 1) Trace interrupts on state
> + * 2) Invoke context tracking if enabled to adjust RCU state
> + * 3) Tell lockdep that interrupts are enabled
> + *
> + * Invoked from architecture specific code before entering a guest.
> + * Must be called with interrupts disabled and the caller must be
> + * non-instrumentable.
> + * The caller has to invoke guest_timing_enter_irqoff() before this.
> + *
> + * Note: this is analogous to exit_to_user_mode().
> + */
> +static __always_inline void guest_state_enter_irqoff(void)
> +{
> + instrumentation_begin();
> + trace_hardirqs_on_prepare();
> + lockdep_hardirqs_on_prepare(CALLER_ADDR0);
> + instrumentation_end();
> +
> + guest_context_enter_irqoff();
> + lockdep_hardirqs_on(CALLER_ADDR0);
> +}
> +
> +/*
> + * Exit guest context and exit an RCU extended quiescent state.
> + *
> + * Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
> + * unsafe to use any code which may directly or indirectly use RCU, tracing
> + * (including IRQ flag tracing), or lockdep. All code in this period must be
> + * non-instrumentable.
> + */
> +static __always_inline void guest_context_exit_irqoff(void)
> {
> context_tracking_guest_exit();
> +}
>
> +/*
> + * Stop accounting time towards a guest.
> + * Must be called after exiting guest context.
> + */
> +static __always_inline void guest_timing_exit_irqoff(void)
> +{
> instrumentation_begin();
> /* Flush the guest cputime we spent on the guest */
> vtime_account_guest_exit();
> instrumentation_end();
> }
>
> +/*
> + * Deprecated. Architectures should move to guest_state_exit_irqoff() and
> + * guest_timing_exit_irqoff().
> + */
> +static __always_inline void guest_exit_irqoff(void)
> +{
> + guest_context_exit_irqoff();
> + guest_timing_exit_irqoff();
> +}
> +
> static inline void guest_exit(void)
> {
> unsigned long flags;
> @@ -407,6 +486,33 @@ static inline void guest_exit(void)
> local_irq_restore(flags);
> }
>
> +/**
> + * guest_state_exit_irqoff - Establish state when returning from guest mode
> + *
> + * Entry from a guest disables interrupts, but guest mode is traced as
> + * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
> + *
> + * 1) Tell lockdep that interrupts are disabled
> + * 2) Invoke context tracking if enabled to reactivate RCU
> + * 3) Trace interrupts off state
> + *
> + * Invoked from architecture specific code after exiting a guest.
> + * Must be invoked with interrupts disabled and the caller must be
> + * non-instrumentable.
> + * The caller has to invoke guest_timing_exit_irqoff() after this.
> + *
> + * Note: this is analogous to enter_from_user_mode().
> + */
> +static __always_inline void guest_state_exit_irqoff(void)
> +{
> + lockdep_hardirqs_off(CALLER_ADDR0);
> + guest_context_exit_irqoff();
> +
> + instrumentation_begin();
> + trace_hardirqs_off_finish();
> + instrumentation_end();
> +}
> +
> static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> {
> /*
Reviewed-by: Paolo Bonzini <[email protected]>
On 1/19/22 11:58, Mark Rutland wrote:
> + * TODO: is there a barrier which ensures that pending interrupts are
> + * recognised? Currently this just hopes that the CPU takes any pending
> + * interrupts between the enable and disable.
> + */
> + local_irq_enable();
> + local_irq_disable();
>
It's okay, there is irq_enable_hazard() but it's automatically included
in arch_local_irq_enable().
Paolo
On 1/19/22 20:22, Mark Rutland wrote:
> I wonder, is the s390 guest entry/exit *preemptible* ?
>
> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
> things before a ctx-switch to the idle thread, which would then be able
> to hit this.
>
> I'll need to go audit the other architectures for similar.
They don't enable interrupts in the entry/exit path so they should be
okay. RISC-V and x86 have an explicit preempt_disable/enable, while
MIPS only has local_irq_disable/enable.
(MIPS is a mess of dead code, I have patches to clean it up).
Paolo
On Wed, Jan 19, 2022 at 08:30:17PM +0100, Christian Borntraeger wrote:
>
>
> On 19.01.22 at 20:22, Mark Rutland wrote:
> > On Wed, Jan 19, 2022 at 07:25:20PM +0100, Christian Borntraeger wrote:
> > > On 19.01.22 at 11:58, Mark Rutland wrote:
> > >
> > >
> > > CCing new emails for Anup and Atish so that they are aware of this thread.
> >
> > Ah; whoops. I'd meant to fix the Ccs on the patches.
> >
> > Thanks!
> >
> > [...]
> >
> > > I just gave this a spin on s390 with debugging on and I got the following:
> > >
> > > [ 457.151295] ------------[ cut here ]------------
> > > [ 457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118
> >
> > Hmm, so IIUC that's:
> >
> > WARN_ON_ONCE(rdp->dynticks_nmi_nesting != DYNTICK_IRQ_NONIDLE);
> >
> > ... and we're clearly in the idle thread here.
> >
> > I wonder, is the s390 guest entry/exit *preemptible* ?
>
> Looks like debug_defconfig is indeed using preemption:
>
> CONFIG_PREEMPT_BUILD=y
> # CONFIG_PREEMPT_NONE is not set
> # CONFIG_PREEMPT_VOLUNTARY is not set
> CONFIG_PREEMPT=y
> CONFIG_PREEMPT_COUNT=y
> CONFIG_PREEMPTION=y
> CONFIG_PREEMPT_RCU=y
> CONFIG_PREEMPT_NOTIFIERS=y
> CONFIG_DEBUG_PREEMPT=y
> CONFIG_PREEMPTIRQ_TRACEPOINTS=y
> CONFIG_TRACE_PREEMPT_TOGGLE=y
> CONFIG_PREEMPT_TRACER=y
> # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
Thanks for confirming!
Could you try with CONFIG_PROVE_RCU=y ? That can't be selected directly, but
selecting PROVE_LOCKING=y will enable it.
If I'm right, with that we should get a splat out of
rcu_irq_exit_check_preempt().
If so, I think we can solve this with preempt_{disable,enable}() around the
guest_timing_{enter,exit}_irqoff() calls. We'll also need to add some more
comments around arch_in_rcu_eqs() that arch-specific EQSs should be
non-preemptible.
Thanks,
Mark.
On 20.01.22 at 12:57, Mark Rutland wrote:
> On Wed, Jan 19, 2022 at 08:30:17PM +0100, Christian Borntraeger wrote:
>>
>>
>> On 19.01.22 at 20:22, Mark Rutland wrote:
>>> On Wed, Jan 19, 2022 at 07:25:20PM +0100, Christian Borntraeger wrote:
>>>> On 19.01.22 at 11:58, Mark Rutland wrote:
>>>>
>>>>
>>>> CCing new emails for Anup and Atish so that they are aware of this thread.
>>>
>>> Ah; whoops. I'd meant to fix the Ccs on the patches.
>>>
>>> Thanks!
>>>
>>> [...]
>>>
>>>> I just gave this a spin on s390 with debugging on and I got the following:
>>>>
>>>> [ 457.151295] ------------[ cut here ]------------
>>>> [ 457.151311] WARNING: CPU: 14 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter.constprop.0+0xf8/0x118
>>>
>>> Hmm, so IIUC that's:
>>>
>>> WARN_ON_ONCE(rdp->dynticks_nmi_nesting != DYNTICK_IRQ_NONIDLE);
>>>
>>> ... and we're clearly in the idle thread here.
>>>
>>> I wonder, is the s390 guest entry/exit *preemptible* ?
>>
>> Looks like debug_defconfig is indeed using preemption:
>>
>> CONFIG_PREEMPT_BUILD=y
>> # CONFIG_PREEMPT_NONE is not set
>> # CONFIG_PREEMPT_VOLUNTARY is not set
>> CONFIG_PREEMPT=y
>> CONFIG_PREEMPT_COUNT=y
>> CONFIG_PREEMPTION=y
>> CONFIG_PREEMPT_RCU=y
>> CONFIG_PREEMPT_NOTIFIERS=y
>> CONFIG_DEBUG_PREEMPT=y
>> CONFIG_PREEMPTIRQ_TRACEPOINTS=y
>> CONFIG_TRACE_PREEMPT_TOGGLE=y
>> CONFIG_PREEMPT_TRACER=y
>> # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
>
> Thanks for confirming!
>
> Could you try with CONFIG_PROVE_RCU=y ? That can't be selected directly, but
> selecting PROVE_LOCKING=y will enable it.
PROVE_LOCKING was enabled in my runs as well as PROVE_RCU.
On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
> On 1/19/22 20:22, Mark Rutland wrote:
> > I wonder, is the s390 guest entry/exit *preemptible* ?
> >
> > If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
> > things before a ctx-switch to the idle thread, which would then be able
> > to hit this.
> >
> > I'll need to go audit the other architectures for similar.
>
> They don't enable interrupts in the entry/exit path so they should be okay.
True.
So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
right thing to do. I'll add that and explanatory commentary.
> RISC-V and x86 have an explicit preempt_disable/enable, while MIPS only has
> local_irq_disable/enable.
>
> (MIPS is a mess of dead code, I have patches to clean it up).
Sure; I haven't wrapped my head around ppc yet, but I assume they keep
interrupts disabled as with the other simple cases.
Thanks,
Mark.
On Thu, Jan 20, 2022 at 12:10:22PM +0100, Paolo Bonzini wrote:
> On 1/19/22 11:58, Mark Rutland wrote:
> > + * TODO: is there a barrier which ensures that pending interrupts are
> > + * recognised? Currently this just hopes that the CPU takes any pending
> > + * interrupts between the enable and disable.
> > + */
> > + local_irq_enable();
> > + local_irq_disable();
>
> It's okay, there is irq_enable_hazard() but it's automatically included in
> arch_local_irq_enable().
As with the riscv case, I'm not sure whether that ensures that a pending
IRQ is actually recognized and taken.
Since there's also an irq_disable_hazard() it looks like that's there to
ensure the IRQ mask is updated in program order, rather than
guaranteeing that a pending IRQ is necessarily taken while IRQs are
unmasked.
In practice, I suspect it probably does, but it'd be good if someone
from the MIPS side could say something either way.
Thanks,
Mark.
On 20.01.22 at 13:03, Mark Rutland wrote:
> On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
>> On 1/19/22 20:22, Mark Rutland wrote:
>>> I wonder, is the s390 guest entry/exit *preemptible* ?
>>>
>>> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
>>> things before a ctx-switch to the idle thread, which would then be able
>>> to hit this.
>>>
>>> I'll need to go audit the other architectures for similar.
>>
>> They don't enable interrupts in the entry/exit path so they should be okay.
>
> True.
>
> So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
> right thing to do. I'll add that and explanatory commentary.
That would not be trivial I guess. We do allow (and need) page faults on sie for guest
demand paging and
this piece of arch/s390/mm/fault.c
case GMAP_FAULT:
if (faulthandler_disabled() || !mm)
goto out;
break;
}
would no longer work since faulthandler_disabled checks for the preempt count.
On 1/20/22 17:44, Mark Rutland wrote:
> On Wed, Jan 19, 2022 at 10:58:51AM +0000, Mark Rutland wrote:
>> In kvm_arch_vcpu_ioctl_run() we use guest_enter_irqoff() and
>> guest_exit_irqoff() directly, with interrupts masked between these. As
>> we don't handle any timer ticks during this window, we will not account
>> time spent within the guest as guest time, which is unfortunate.
>>
>> Additionally, we do not inform lockdep or tracing that interrupts will
> be enabled during guest execution, which can lead to misleading traces
>> and warnings that interrupts have been enabled for overly-long periods.
>>
>> This patch fixes these issues by using the new timing and context
>> entry/exit helpers to ensure that interrupts are handled during guest
>> vtime but with RCU watching, with a sequence:
>>
>> guest_timing_enter_irqoff();
>>
>> guest_state_enter_irqoff();
>> < run the vcpu >
>> guest_state_exit_irqoff();
>>
>> < take any pending IRQs >
>>
>> guest_timing_exit_irqoff();
>
> Looking again, this patch isn't sufficient.
>
> On MIPS a guest exit will be handled by kvm_mips_handle_exit() *before*
> returning into the "< run the vcpu >" step above, so we won't call
> guest_state_exit_irqoff() before using RCU, etc.
Indeed. kvm_mips_handle_exit has a weird mutual recursion through runtime-assembled code, but then this is MIPS...
This should do it:
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index e59cb6246f76..6f2291f017f5 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -1192,6 +1192,7 @@ int kvm_mips_handle_exit(struct kvm_vcpu *vcpu)
kvm_mips_set_c0_status();
local_irq_enable();
+ guest_timing_exit_irqoff();
kvm_debug("kvm_mips_handle_exit: cause: %#x, PC: %p, kvm_run: %p, kvm_vcpu: %p\n",
cause, opc, run, vcpu);
@@ -1325,6 +1326,7 @@ int kvm_mips_handle_exit(struct kvm_vcpu *vcpu)
}
if (ret == RESUME_GUEST) {
+ guest_timing_enter_irqoff();
trace_kvm_reenter(vcpu);
/*
Paolo
On Wed, Jan 19, 2022 at 10:58:51AM +0000, Mark Rutland wrote:
> In kvm_arch_vcpu_ioctl_run() we use guest_enter_irqoff() and
> guest_exit_irqoff() directly, with interrupts masked between these. As
> we don't handle any timer ticks during this window, we will not account
> time spent within the guest as guest time, which is unfortunate.
>
> Additionally, we do not inform lockdep or tracing that interrupts will
> be enabled during guest execution, which can lead to misleading traces
> and warnings that interrupts have been enabled for overly-long periods.
>
> This patch fixes these issues by using the new timing and context
> entry/exit helpers to ensure that interrupts are handled during guest
> vtime but with RCU watching, with a sequence:
>
> guest_timing_enter_irqoff();
>
> guest_state_enter_irqoff();
> < run the vcpu >
> guest_state_exit_irqoff();
>
> < take any pending IRQs >
>
> guest_timing_exit_irqoff();
Looking again, this patch isn't sufficient.
On MIPS a guest exit will be handled by kvm_mips_handle_exit() *before*
returning into the "< run the vcpu >" step above, so we won't call
guest_state_exit_irqoff() before using RCU, etc.
This'll need some more thought...
Mark.
> Since instrumentation may make use of RCU, we must also ensure that no
> instrumented code is run during the EQS. I've split out the critical
> section into a new kvm_mips_vcpu_enter_exit() helper which is marked
> noinstr.
>
> Signed-off-by: Mark Rutland <[email protected]>
> Cc: Aleksandar Markovic <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Huacai Chen <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Thomas Bogendoerfer <[email protected]>
> ---
> arch/mips/kvm/mips.c | 37 ++++++++++++++++++++++++++++++++++---
> 1 file changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index aa20d074d3883..1a961c2434fee 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -438,6 +438,24 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
> return -ENOIOCTLCMD;
> }
>
> +/*
> + * Actually run the vCPU, entering an RCU extended quiescent state (EQS) while
> + * the vCPU is running.
> + *
> + * This must be noinstr as instrumentation may make use of RCU, and this is not
> + * safe during the EQS.
> + */
> +static int noinstr kvm_mips_vcpu_enter_exit(struct kvm_vcpu *vcpu)
> +{
> + int ret;
> +
> + guest_state_enter_irqoff();
> + ret = kvm_mips_callbacks->vcpu_run(vcpu);
> + guest_state_exit_irqoff();
> +
> + return ret;
> +}
> +
> int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> {
> int r = -EINTR;
> @@ -458,7 +476,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> lose_fpu(1);
>
> local_irq_disable();
> - guest_enter_irqoff();
> + guest_timing_enter_irqoff();
> trace_kvm_enter(vcpu);
>
> /*
> @@ -469,10 +487,23 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> */
> smp_store_mb(vcpu->mode, IN_GUEST_MODE);
>
> - r = kvm_mips_callbacks->vcpu_run(vcpu);
> + r = kvm_mips_vcpu_enter_exit(vcpu);
> +
> + /*
> + * We must ensure that any pending interrupts are taken before
> + * we exit guest timing so that timer ticks are accounted as
> + * guest time. Transiently unmask interrupts so that any
> + * pending interrupts are taken.
> + *
> + * TODO: is there a barrier which ensures that pending interrupts are
> + * recognised? Currently this just hopes that the CPU takes any pending
> + * interrupts between the enable and disable.
> + */
> + local_irq_enable();
> + local_irq_disable();
>
> trace_kvm_out(vcpu);
> - guest_exit_irqoff();
> + guest_timing_exit_irqoff();
> local_irq_enable();
>
> out:
> --
> 2.30.2
>
>
On Thu, Jan 20, 2022, Mark Rutland wrote:
> Longer-term MIPS should move to a loop like everyone else has:
>
> for (;;) {
> status = kvm_mips_enter_exit_vcpu();
>
> if (handle_exit(status))
> break;
>
> ...
> }
>
> ... which is far easier to manage.
I don't suppose we can just remove MIPS "support"? And PPC while we're at it? :-D
On 1/20/22 18:15, Mark Rutland wrote:
> As above, we'll also need the guest_state_{enter,exit}() calls
> surrounding this (e.g. before that local_irq_enable() at the start of
> kvm_mips_handle_exit(),
Oh, indeed. And there is also an interrupt-enabled area similar to
s390's, in both vcpu_run and the exception handler entry point (which
falls through to the exit handler created by kvm_mips_build_exit). For
example:
/* Setup status register for running guest in UM */
uasm_i_ori(&p, V1, V1, ST0_EXL | KSU_USER | ST0_IE);
UASM_i_LA(&p, AT, ~(ST0_CU0 | ST0_MX | ST0_SX | ST0_UX));
uasm_i_and(&p, V1, V1, AT);
uasm_i_mtc0(&p, V1, C0_STATUS);
uasm_i_ehb(&p);
I'd rather get rid altogether of the EQS for MIPS.
> and that needs to happen in noinstr code, etc.
There are bigger problems with instrumentation, because the
runtime-generated code as far as I can tell is not noinstr.
Paolo
On Thu, Jan 20, 2022 at 05:57:05PM +0100, Paolo Bonzini wrote:
> On 1/20/22 17:44, Mark Rutland wrote:
> > On Wed, Jan 19, 2022 at 10:58:51AM +0000, Mark Rutland wrote:
> > > In kvm_arch_vcpu_ioctl_run() we use guest_enter_irqoff() and
> > > guest_exit_irqoff() directly, with interrupts masked between these. As
> > > we don't handle any timer ticks during this window, we will not account
> > > time spent within the guest as guest time, which is unfortunate.
> > >
> > > Additionally, we do not inform lockdep or tracing that interrupts will
> > > be enabled during guest execution, which can lead to misleading traces
> > > and warnings that interrupts have been enabled for overly-long periods.
> > >
> > > This patch fixes these issues by using the new timing and context
> > > entry/exit helpers to ensure that interrupts are handled during guest
> > > vtime but with RCU watching, with a sequence:
> > >
> > > guest_timing_enter_irqoff();
> > >
> > > guest_state_enter_irqoff();
> > > < run the vcpu >
> > > guest_state_exit_irqoff();
> > >
> > > < take any pending IRQs >
> > >
> > > guest_timing_exit_irqoff();
> >
> > Looking again, this patch isn't sufficient.
> >
> > On MIPS a guest exit will be handled by kvm_mips_handle_exit() *before*
> > returning into the "< run the vcpu >" step above, so we won't call
> > guest_state_exit_irqoff() before using RCU, etc.
>
> Indeed. kvm_mips_handle_exit has a weird mutual recursion through runtime-assembled code, but then this is MIPS...
>
> This should do it:
>
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index e59cb6246f76..6f2291f017f5 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -1192,6 +1192,7 @@ int kvm_mips_handle_exit(struct kvm_vcpu *vcpu)
> kvm_mips_set_c0_status();
> local_irq_enable();
> + guest_timing_exit_irqoff();
> kvm_debug("kvm_mips_handle_exit: cause: %#x, PC: %p, kvm_run: %p, kvm_vcpu: %p\n",
> cause, opc, run, vcpu);
> @@ -1325,6 +1326,7 @@ int kvm_mips_handle_exit(struct kvm_vcpu *vcpu)
> }
> if (ret == RESUME_GUEST) {
> + guest_timing_enter_irqoff();
> trace_kvm_reenter(vcpu);
> /*
As above, we'll also need the guest_state_{enter,exit}() calls
surrounding this (e.g. before that local_irq_enable() at the start of
kvm_mips_handle_exit(), and that needs to happen in noinstr code, etc.
It looks like the simplest thing to do is to rename
kvm_mips_handle_exit() to __kvm_mips_handle_exit() and add a
kvm_mips_handle_exit() wrapper that handles that (with the return path
conditional on whether __kvm_mips_handle_exit() returns RESUME_GUEST).
I'll have a go at that tomorrow when I regain the capability to think.
Longer-term MIPS should move to a loop like everyone else has:
for (;;) {
status = kvm_mips_enter_exit_vcpu();
if (handle_exit(status))
break;
...
}
... which is far easier to manage.
Thanks,
Mark.
On 20.01.22 at 16:14, Christian Borntraeger wrote:
>
>
> On 20.01.22 at 13:03, Mark Rutland wrote:
>> On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
>>> On 1/19/22 20:22, Mark Rutland wrote:
>>>> I wonder, is the s390 guest entry/exit *preemptible* ?
>>>>
>>>> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
>>>> things before a ctx-switch to the idle thread, which would then be able
>>>> to hit this.
>>>>
>>>> I'll need to go audit the other architectures for similar.
>>>
>>> They don't enable interrupts in the entry/exit path so they should be okay.
>>
>> True.
>>
>> So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
>> right thing to do. I'll add that and explanatory commentary.
>
> That would not be trivial I guess. We do allow (and need) page faults on sie for guest
> demand paging and
>
> this piece of arch/s390/mm/fault.c
>
> case GMAP_FAULT:
> if (faulthandler_disabled() || !mm)
> goto out;
> break;
> }
>
> would no longer work since faulthandler_disabled checks for the preempt count.
>
Something like this
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index d30f5986fa85..1c7d45346e12 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
return 0;
goto out;
case USER_FAULT:
- case GMAP_FAULT:
if (faulthandler_disabled() || !mm)
goto out;
break;
+ /*
+ * We know that we interrupted SIE and we are not in a IRQ.
+ * preemption might be disabled thus checking for in_atomic
+ * would result in failures
+ */
+ case GMAP_FAULT:
+ if (pagefault_disabled() || !mm)
+ goto out;
+ break;
}
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
seems to work with preemption disabled around sie. Not sure yet if this is correct.
On Thu, Jan 20, 2022 at 06:29:25PM +0100, Paolo Bonzini wrote:
> On 1/20/22 18:15, Mark Rutland wrote:
> > As above, we'll also need the guest_state_{enter,exit}() calls
> > surrounding this (e.g. before that local_irq_enable() at the start of
> > kvm_mips_handle_exit(),
>
> Oh, indeed. And there is also an interrupt-enabled area similar to s390's,
> in both vcpu_run and the exception handler entry point (which falls through
> to the exit handler created by kvm_mips_build_exit). For example:
>
> /* Setup status register for running guest in UM */
> uasm_i_ori(&p, V1, V1, ST0_EXL | KSU_USER | ST0_IE);
> UASM_i_LA(&p, AT, ~(ST0_CU0 | ST0_MX | ST0_SX | ST0_UX));
> uasm_i_and(&p, V1, V1, AT);
> uasm_i_mtc0(&p, V1, C0_STATUS);
> uasm_i_ehb(&p);
>
> I'd rather get rid altogether of the EQS for MIPS.
Ok; I'm not immediately sure how to do that without invasive changes around the
context tracking bits.
Did you have a specific approach in mind, or was that just a general statement?
> > and that needs to happen in noinstr code, etc.
>
> There are bigger problems with instrumentation, because the
> runtime-generated code as far as I can tell is not noinstr.
The generated sequences themselves are not a problem -- they're not
compiler-instrumented, and kprobes will reject them since they live in a
kzalloc()'d buffer which is outside of kernel text.
Those call tlbmiss_handler_setup_pgd(), but that itself is runtime-generated,
and AFAICT doesn't call anything. It is placed within the kernel text, but it
could be blacklisted from kprobes.
Have I missed something there?
Thanks,
Mark.
On 21.01.22 at 10:53, Christian Borntraeger wrote:
> On 20.01.22 at 16:14, Christian Borntraeger wrote:
>>
>>
>> On 20.01.22 at 13:03, Mark Rutland wrote:
>>> On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
>>>> On 1/19/22 20:22, Mark Rutland wrote:
>>>>> I wonder, is the s390 guest entry/exit *preemptible* ?
>>>>>
>>>>> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
>>>>> things before a ctx-switch to the idle thread, which would then be able
>>>>> to hit this.
>>>>>
>>>>> I'll need to go audit the other architectures for similar.
>>>>
>>>> They don't enable interrupts in the entry/exit path so they should be okay.
>>>
>>> True.
>>>
>>> So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
>>> right thing to do. I'll add that and explanatory commentary.
>>
>> That would not be trivial I guess. We do allow (and need) page faults on sie for guest
>> demand paging and
>>
>> this piece of arch/s390/mm/fault.c
>>
>> case GMAP_FAULT:
>> if (faulthandler_disabled() || !mm)
>> goto out;
>> break;
>> }
>>
>> would no longer work since faulthandler_disabled checks for the preempt count.
>>
>
>
> Something like this
>
>
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index d30f5986fa85..1c7d45346e12 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> return 0;
> goto out;
> case USER_FAULT:
> - case GMAP_FAULT:
> if (faulthandler_disabled() || !mm)
> goto out;
> break;
> + /*
> + * We know that we interrupted SIE and we are not in a IRQ.
> + * preemption might be disabled thus checking for in_atomic
> + * would result in failures
> + */
> + case GMAP_FAULT:
> + if (pagefault_disabled() || !mm)
> + goto out;
> + break;
> }
>
> perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>
> seems to work with preemption disabled around sie. Not sure yet if this is correct.
No it does not work. scheduling while preemption is disabled.
[ 1880.448663] BUG: scheduling while atomic: qemu-system-s39/1806/0x00000002
[ 1880.448674] INFO: lockdep is turned off.
[ 1880.448676] Modules linked in: kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs s390_trng ib_core genwqe_card crc_itu_t vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables mlx5_core ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
[ 1880.448730] Preemption disabled at:
[ 1880.448731] [<000003ff8070da88>] ckc_irq_pending+0x30/0xe0 [kvm]
[ 1880.448778] CPU: 47 PID: 1806 Comm: qemu-system-s39 Tainted: G W 5.16.0-00007-g89e9021389e2-dirty #15
[ 1880.448782] Hardware name: IBM 3906 M04 704 (LPAR)
[ 1880.448784] Call Trace:
[ 1880.448785] [<000000000bf001d6>] dump_stack_lvl+0x8e/0xc8
[ 1880.448794] [<000000000b26e08a>] __schedule_bug+0xe2/0xf8
[ 1880.448801] [<000000000b26e212>] schedule_debug+0x172/0x1a8
[ 1880.448804] [<000000000bf0bcae>] __schedule+0x56/0x8b0
[ 1880.448808] [<000000000bf0c570>] schedule+0x68/0x110
[ 1880.448811] [<000000000bf13e76>] schedule_timeout+0x106/0x160
[ 1880.448815] [<000000000bf0ddf2>] wait_for_completion+0xc2/0x110
[ 1880.448818] [<000000000b258674>] __flush_work+0xd4/0x118
[ 1880.448823] [<000000000b4e3c88>] __drain_all_pages+0x218/0x308
[ 1880.448829] [<000000000b4ec3bc>] __alloc_pages_slowpath.constprop.0+0x5bc/0xc98
[ 1880.448832] [<000000000b4ece5c>] __alloc_pages+0x3c4/0x448
[ 1880.448835] [<000000000b5143cc>] alloc_pages_vma+0x9c/0x360
[ 1880.448841] [<000000000b4c0d6e>] do_swap_page+0x66e/0xca0
[ 1880.448845] [<000000000b4c3012>] __handle_mm_fault+0x29a/0x4b0
[ 1880.448869] [<000000000b4c33ac>] handle_mm_fault+0x184/0x3a8
[ 1880.448872] [<000000000b2062ce>] do_exception+0x136/0x490
[ 1880.448877] [<000000000b206b9a>] do_dat_exception+0x2a/0x50
[ 1880.448880] [<000000000bf03650>] __do_pgm_check+0x120/0x1f0
[ 1880.448882] [<000000000bf164ee>] pgm_check_handler+0x11e/0x180
[ 1880.448885] [<000000000bf16298>] sie_exit+0x0/0x48
[ 1880.448888] ([<000003ff8071e954>] kvm_s390_enter_exit_sie+0x64/0x98 [kvm])
[ 1880.448910] [<000003ff807061fa>] __vcpu_run+0x2a2/0x5b8 [kvm]
[ 1880.448931] [<000003ff807069ba>] kvm_arch_vcpu_ioctl_run+0x10a/0x270 [kvm]
[ 1880.448953] [<000003ff806ed02c>] kvm_vcpu_ioctl+0x27c/0xa40 [kvm]
[ 1880.448975] [<000000000b58b5c6>] __s390x_sys_ioctl+0xbe/0x100
[ 1880.448982] [<000000000bf038fa>] __do_syscall+0x1da/0x208
[ 1880.448984] [<000000000bf16362>] system_call+0x82/0xb0
On Fri, Jan 21, 2022 at 03:17:01PM +0100, Christian Borntraeger wrote:
>
>
> On 21.01.22 at 10:53, Christian Borntraeger wrote:
> > On 20.01.22 at 16:14, Christian Borntraeger wrote:
> > >
> > >
> > > On 20.01.22 at 13:03, Mark Rutland wrote:
> > > > On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
> > > > > On 1/19/22 20:22, Mark Rutland wrote:
> > > > > > I wonder, is the s390 guest entry/exit *preemptible* ?
> > > > > >
> > > > > > If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
> > > > > > things before a ctx-switch to the idle thread, which would then be able
> > > > > > to hit this.
> > > > > >
> > > > > > I'll need to go audit the other architectures for similar.
> > > > >
> > > > > They don't enable interrupts in the entry/exit path so they should be okay.
> > > >
> > > > True.
> > > >
> > > > So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
> > > > right thing to do. I'll add that and explanatory commentary.
> > >
> > > That would not be trivial I guess. We do allow (and need) page faults on sie for guest
> > > demand paging and
> > >
> > > this piece of arch/s390/mm/fault.c
> > >
> > > case GMAP_FAULT:
> > > if (faulthandler_disabled() || !mm)
> > > goto out;
> > > break;
> > > }
> > >
> > > would no longer work since faulthandler_disabled checks for the preempt count.
> > >
> >
> >
> > Something like this
> >
> >
> > diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> > index d30f5986fa85..1c7d45346e12 100644
> > --- a/arch/s390/mm/fault.c
> > +++ b/arch/s390/mm/fault.c
> > @@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> > return 0;
> > goto out;
> > case USER_FAULT:
> > - case GMAP_FAULT:
> > if (faulthandler_disabled() || !mm)
> > goto out;
> > break;
> > + /*
> > + * We know that we interrupted SIE and we are not in a IRQ.
> > + * preemption might be disabled thus checking for in_atomic
> > + * would result in failures
> > + */
> > + case GMAP_FAULT:
> > + if (pagefault_disabled() || !mm)
> > + goto out;
> > + break;
> > }
> >
> > perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> >
> > seems to work with preemption disabled around sie. Not sure yet if this is correct.
>
>
> No it does not work. scheduling while preemption is disabled.
Hmm... which exceptions do we expect to take from a guest?
I wonder if we can handle those more like other architectures by getting those
to immediately return from the sie64a() call with some status code that we can
handle outside of the guest_state_{enter,exit}_irqoff() critical section.
On arm64 in VHE mode, we swap our exception vectors when entering/exiting the
guest to allow us to do that; I wonder if we could do similar here?
Thanks,
Mark.
On 21.01.22 at 15:30, Mark Rutland wrote:
> On Fri, Jan 21, 2022 at 03:17:01PM +0100, Christian Borntraeger wrote:
>>
>>
>> On 21.01.22 at 10:53, Christian Borntraeger wrote:
>>> On 20.01.22 at 16:14, Christian Borntraeger wrote:
>>>>
>>>>
>>>> On 20.01.22 at 13:03, Mark Rutland wrote:
>>>>> On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
>>>>>> On 1/19/22 20:22, Mark Rutland wrote:
>>>>>>> I wonder, is the s390 guest entry/exit *preemptible* ?
>>>>>>>
>>>>>>> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
>>>>>>> things before a ctx-switch to the idle thread, which would then be able
>>>>>>> to hit this.
>>>>>>>
>>>>>>> I'll need to go audit the other architectures for similar.
>>>>>>
>>>>>> They don't enable interrupts in the entry/exit path so they should be okay.
>>>>>
>>>>> True.
>>>>>
>>>>> So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
>>>>> right thing to do. I'll add that and explanatory commentary.
>>>>
>>>> That would not be trivial I guess. We do allow (and need) page faults on sie for guest
>>>> demand paging and
>>>>
>>>> this piece of arch/s390/mm/fault.c
>>>>
>>>> case GMAP_FAULT:
>>>> if (faulthandler_disabled() || !mm)
>>>> goto out;
>>>> break;
>>>> }
>>>>
>>>> would no longer work since faulthandler_disabled checks for the preempt count.
>>>>
>>>
>>>
>>> Something like this
>>>
>>>
>>> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
>>> index d30f5986fa85..1c7d45346e12 100644
>>> --- a/arch/s390/mm/fault.c
>>> +++ b/arch/s390/mm/fault.c
>>> @@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>>> return 0;
>>> goto out;
>>> case USER_FAULT:
>>> - case GMAP_FAULT:
>>> if (faulthandler_disabled() || !mm)
>>> goto out;
>>> break;
>>> + /*
>>> + * We know that we interrupted SIE and we are not in a IRQ.
>>> + * preemption might be disabled thus checking for in_atomic
>>> + * would result in failures
>>> + */
>>> + case GMAP_FAULT:
>>> + if (pagefault_disabled() || !mm)
>>> + goto out;
>>> + break;
>>> }
>>>
>>> perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>>>
>>> seems to work with preemption disabled around sie. Not sure yet if this is correct.
>>
>>
>> No it does not work. scheduling while preemption is disabled.
>
> Hmm... which exceptions do we expect to take from a guest?
>
> I wonder if we can handle those more like other architectures by getting those
> to immediately return from the sie64a() call with some status code that we can
> handle outside of the guest_state_{enter,exit}_irqoff() critical section.
We take all kinds of page faults (invalid->valid, ro->rw) on the sie instruction and
run them in the context of the pgm_check handler, just like for userspace. The pgm_check
handler does a sie_exit (similar to the interrupt handlers) by setting the return IA.
> On arm64 in VHE mode, we swap our exception vectors when entering/exiting the
> guest to allow us to do that; I wonder if we could do similar here?
On Fri, Jan 21, 2022 at 03:42:48PM +0100, Christian Borntraeger wrote:
> Am 21.01.22 um 15:30 schrieb Mark Rutland:
> > On Fri, Jan 21, 2022 at 03:17:01PM +0100, Christian Borntraeger wrote:
> > > On 21.01.22 at 10:53, Christian Borntraeger wrote:
> > > > On 20.01.22 at 16:14, Christian Borntraeger wrote:
> > > > > On 20.01.22 at 13:03, Mark Rutland wrote:
> > > > > > On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
> > > > > > > On 1/19/22 20:22, Mark Rutland wrote:
> > > > > > > > I wonder, is the s390 guest entry/exit *preemptible* ?
> > > > > > > >
> > > > > > > > If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
> > > > > > > > things before a ctx-switch to the idle thread, which would then be able
> > > > > > > > to hit this.
> > > > > > > >
> > > > > > > > I'll need to go audit the other architectures for similar.
> > > > > > >
> > > > > > > They don't enable interrupts in the entry/exit path so they should be okay.
> > > > > >
> > > > > > True.
> > > > > >
> > > > > > So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
> > > > > > right thing to do. I'll add that and explanatory commentary.
> > > > >
> > > > > That would not be trivial I guess. We do allow (and need) page faults on sie for guest
> > > > > demand paging and
> > > > >
> > > > > this piece of arch/s390/mm/fault.c
> > > > >
> > > > > case GMAP_FAULT:
> > > > > if (faulthandler_disabled() || !mm)
> > > > > goto out;
> > > > > break;
> > > > > }
> > > > >
> > > > > would no longer work since faulthandler_disabled checks for the preempt count.
> > > > >
> > > >
> > > >
> > > > Something like this
> > > >
> > > >
> > > > diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> > > > index d30f5986fa85..1c7d45346e12 100644
> > > > --- a/arch/s390/mm/fault.c
> > > > +++ b/arch/s390/mm/fault.c
> > > > @@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> > > > return 0;
> > > > goto out;
> > > > case USER_FAULT:
> > > > - case GMAP_FAULT:
> > > > if (faulthandler_disabled() || !mm)
> > > > goto out;
> > > > break;
> > > > + /*
> > > > + * We know that we interrupted SIE and we are not in a IRQ.
> > > > + * preemption might be disabled thus checking for in_atomic
> > > > + * would result in failures
> > > > + */
> > > > + case GMAP_FAULT:
> > > > + if (pagefault_disabled() || !mm)
> > > > + goto out;
> > > > + break;
> > > > }
> > > >
> > > > perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> > > >
> > > > seems to work with preemption disabled around sie. Not sure yet if this is correct.
> > >
> > >
> > > No it does not work. scheduling while preemption is disabled.
> >
> > Hmm... which exceptions do we expect to take from a guest?
> >
> > I wonder if we can handle those more like other architectures by getting those
> > to immediately return from the sie64a() call with some status code that we can
> > handle outside of the guest_state_{enter,exit}_irqoff() critical section.
>
> We take all kind of page faults (invalid->valid, ro->rw) on the sie instruction and
> run that in the context of the pgm_check handler just like for userspace.
Do we only expect to take faults from a guest (which IIUC are the GMAP_FAULT
cases in the bit above), or are there other exceptions we expect to take from
the middle of a SIE?
> the pgm_check handler does a sie_exit (similar to the interrupt handlers) by
> setting the return IA.
Sure, but that sie_exit happens *after* the handler runs, whereas I'm asking
whether we can structure this like all the other architectures and turn all
exceptions into returns from sie64a() that we can handle after having called
guest_state_exit_irqoff().
> > On arm64 in VHE mode, we swap our exception vectors when entering/exiting the
> > guest to allow us to do that; I wonder if we could do similar here?
Does this idea sound at all plausible?
Thanks,
Mark.
On 21.01.22 at 16:29, Mark Rutland wrote:
> On Fri, Jan 21, 2022 at 03:42:48PM +0100, Christian Borntraeger wrote:
>> On 21.01.22 at 15:30, Mark Rutland wrote:
>>> On Fri, Jan 21, 2022 at 03:17:01PM +0100, Christian Borntraeger wrote:
>>>> On 21.01.22 at 10:53, Christian Borntraeger wrote:
>>>>> On 20.01.22 at 16:14, Christian Borntraeger wrote:
>>>>>> On 20.01.22 at 13:03, Mark Rutland wrote:
>>>>>>> On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
>>>>>>>> On 1/19/22 20:22, Mark Rutland wrote:
>>>>>>>>> I wonder, is the s390 guest entry/exit *preemptible* ?
>>>>>>>>>
>>>>>>>>> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
>>>>>>>>> things before a ctx-switch to the idle thread, which would then be able
>>>>>>>>> to hit this.
>>>>>>>>>
>>>>>>>>> I'll need to go audit the other architectures for similar.
>>>>>>>>
>>>>>>>> They don't enable interrupts in the entry/exit path so they should be okay.
>>>>>>>
>>>>>>> True.
>>>>>>>
>>>>>>> So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
>>>>>>> right thing to do. I'll add that and explanatory commentary.
>>>>>>
>>>>>> That would not be trivial I guess. We do allow (and need) page faults on sie for guest
>>>>>> demand paging and
>>>>>>
>>>>>> this piece of arch/s390/mm/fault.c
>>>>>>
>>>>>> case GMAP_FAULT:
>>>>>> if (faulthandler_disabled() || !mm)
>>>>>> goto out;
>>>>>> break;
>>>>>> }
>>>>>>
>>>>>> would no longer work since faulthandler_disabled checks for the preempt count.
>>>>>>
>>>>>
>>>>>
>>>>> Something like this
>>>>>
>>>>>
>>>>> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
>>>>> index d30f5986fa85..1c7d45346e12 100644
>>>>> --- a/arch/s390/mm/fault.c
>>>>> +++ b/arch/s390/mm/fault.c
>>>>> @@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>>>>> return 0;
>>>>> goto out;
>>>>> case USER_FAULT:
>>>>> - case GMAP_FAULT:
>>>>> if (faulthandler_disabled() || !mm)
>>>>> goto out;
>>>>> break;
>>>>> + /*
>>>>> + * We know that we interrupted SIE and we are not in a IRQ.
>>>>> + * preemption might be disabled thus checking for in_atomic
>>>>> + * would result in failures
>>>>> + */
>>>>> + case GMAP_FAULT:
>>>>> + if (pagefault_disabled() || !mm)
>>>>> + goto out;
>>>>> + break;
>>>>> }
>>>>>
>>>>> perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>>>>>
>>>>> seems to work with preemption disabled around sie. Not sure yet if this is correct.
>>>>
>>>>
>>>> No it does not work. scheduling while preemption is disabled.
>>>
>>> Hmm... which exceptions do we expect to take from a guest?
>>>
>>> I wonder if we can handle those more like other architectures by getting those
>>> to immediately return from the sie64a() call with some status code that we can
>>> handle outside of the guest_state_{enter,exit}_irqoff() critical section.
>>
>> We take all kind of page faults (invalid->valid, ro->rw) on the sie instruction and
>> run that in the context of the pgm_check handler just like for userspace.
>
> Do we only expect to take faults from a guest (which IIUC are the GMAP_FAULT
> cases in the bit above), or are there other exceptions we expect to take from
> the middle of a SIE?
>
>> the pgm_check handler does a sie_exit (similar to the interrupt handlers) by
>> setting the return IA.
>
> Sure, but that sie_exit happens *after* the handler runs, whereas I'm asking
> whether we can structure this like all the other architectures and turn all
> exceptions into returns from sie64a() that we can handle after having called
> guest_state_exit_irqoff().
>
>>> On arm64 in VHE mode, we swap our exception vectors when entering/exiting the
>>> guest to allow us to do that; I wonder if we could do similar here?
>
> Does this idea sound at all plausible?
Maybe. We already have something like that for async_pf (see kvm_arch_setup_async_pf)
where we handle the pagefault later on after first kicking off the resolution for major
page faults (via FAULT_FLAG_RETRY_NOWAIT) in the low level handler.
Let me think about it.
On Wed, 2022-01-19 at 10:58 +0000, Mark Rutland wrote:
> In kvm_arch_vcpu_ioctl_run() we enter an RCU extended quiescent state
> (EQS) by calling guest_enter_irqoff(), and unmasked IRQs prior to
> exiting the EQS by calling guest_exit(). As the IRQ entry code will not
> wake RCU in this case, we may run the core IRQ code and IRQ handler
> without RCU watching, leading to various potential problems.
>
> Additionally, we do not inform lockdep or tracing that interrupts will
> be enabled during guest execution, which can lead to misleading traces
> and warnings that interrupts have been enabled for overly-long periods.
>
> This patch fixes these issues by using the new timing and context
> entry/exit helpers to ensure that interrupts are handled during guest
> vtime but with RCU watching, with a sequence:
>
> guest_timing_enter_irqoff();
>
> guest_state_enter_irqoff();
> < run the vcpu >
> guest_state_exit_irqoff();
>
> < take any pending IRQs >
>
> guest_timing_exit_irqoff();
>
> Since instrumentation may make use of RCU, we must also ensure that no
> instrumented code is run during the EQS. I've split out the critical
> section into a new kvm_arm_vcpu_enter_exit() helper which is marked
> noinstr.
>
> Fixes: 1b3d546daf85ed2b ("arm/arm64: KVM: Properly account for guest CPU time")
> Reported-by: Nicolas Saenz Julienne <[email protected]>
> Signed-off-by: Mark Rutland <[email protected]>
> Reviewed-by: Marc Zyngier <[email protected]>
> Cc: Alexandru Elisei <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: James Morse <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Suzuki K Poulose <[email protected]>
> Cc: Will Deacon <[email protected]>
> ---
Reviewed-by: Nicolas Saenz Julienne <[email protected]>
Thanks,
--
Nicolás Sáenz
On Wed, 2022-01-19 at 10:58 +0000, Mark Rutland wrote:
> When transitioning to/from guest mode, it is necessary to inform
> lockdep, tracing, and RCU in a specific order, similar to the
> requirements for transitions to/from user mode. Additionally, it is
> necessary to perform vtime accounting for a window around running the
> guest, with RCU enabled, such that timer interrupts taken from the guest
> can be accounted as guest time.
>
> Most architectures don't handle all the necessary pieces, and have a
> number of common bugs, including unsafe usage of RCU during the window
> between guest_enter() and guest_exit().
>
> On x86, this was dealt with across commits:
>
> 87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
> 0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
> 9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
> 3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
> 135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
> 160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
> bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
>
> ... but those fixes are specific to x86, and as the resulting logic
> (while correct) is split across generic helper functions and
> x86-specific helper functions, it is difficult to see that the
> entry/exit accounting is balanced.
>
> This patch adds generic helpers which architectures can use to handle
> guest entry/exit consistently and correctly. The guest_{enter,exit}()
> helpers are split into guest_timing_{enter,exit}() to perform vtime
> accounting, and guest_context_{enter,exit}() to perform the necessary
> context tracking and RCU management. The existing guest_{enter,exit}()
> helpers are left as wrappers of these.
>
> Atop this, new guest_state_enter_irqoff() and guest_state_exit_irqoff()
> helpers are added to handle the ordering of lockdep, tracing, and RCU
> management. These are intended to mirror exit_to_user_mode() and
> enter_from_user_mode().
>
> Subsequent patches will migrate architectures over to the new helpers,
> following a sequence:
>
> guest_timing_enter_irqoff();
>
> guest_state_enter_irqoff();
> < run the vcpu >
> guest_state_exit_irqoff();
>
> < take any pending IRQs >
>
> guest_timing_exit_irqoff();
>
> This sequence handles all of the above correctly, and more clearly
> balances the entry and exit portions, making it easier to understand.
>
> The existing helpers are marked as deprecated, and will be removed once
> all architectures have been converted.
>
> There should be no functional change as a result of this patch.
>
> Signed-off-by: Mark Rutland <[email protected]>
> Reviewed-by: Marc Zyngier <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> ---
Reviewed-by: Nicolas Saenz Julienne <[email protected]>
Thanks,
--
Nicolás Sáenz