2021-04-14 15:51:39

by Vitaly Kuznetsov

Subject: [PATCH 0/5] x86/kvm: Refactor KVM PV features teardown and fix restore from hibernation

This series is a successor of Lenny's "[PATCH] x86/kvmclock: Stop kvmclocks
for hibernate restore". While reviewing his patch I realized that the PV
feature teardown we have is a bit messy: it is scattered across kvm.c
and kvmclock.c, and not all features are shut down on all paths.
This series unifies all teardown paths in kvm.c and makes sure all
features are disabled when needed.

Vitaly Kuznetsov (5):
x86/kvm: Fix pr_info() for async PF setup/teardown
x86/kvm: Teardown PV features on boot CPU as well
x86/kvm: Disable kvmclock on all CPUs on shutdown
x86/kvm: Disable all PV features on crash
x86/kvm: Unify kvm_pv_guest_cpu_reboot() with kvm_guest_cpu_offline()

arch/x86/include/asm/kvm_para.h | 10 +--
arch/x86/kernel/kvm.c | 113 +++++++++++++++++++++-----------
arch/x86/kernel/kvmclock.c | 26 +-------
3 files changed, 78 insertions(+), 71 deletions(-)

--
2.30.2


2021-04-14 15:51:45

by Vitaly Kuznetsov

Subject: [PATCH 1/5] x86/kvm: Fix pr_info() for async PF setup/teardown

'pr_fmt' already has 'kvm-guest: ', so the 'KVM' prefix is redundant.
"Unregister pv shared memory" is very ambiguous: it's hard to
say which particular PV feature it relates to.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/kernel/kvm.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 78bb0fae3982..79dddcc178e3 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -345,7 +345,7 @@ static void kvm_guest_cpu_init(void)

wrmsrl(MSR_KVM_ASYNC_PF_EN, pa);
__this_cpu_write(apf_reason.enabled, 1);
- pr_info("KVM setup async PF for cpu %d\n", smp_processor_id());
+ pr_info("setup async PF for cpu %d\n", smp_processor_id());
}

if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
@@ -371,7 +371,7 @@ static void kvm_pv_disable_apf(void)
wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);
__this_cpu_write(apf_reason.enabled, 0);

- pr_info("Unregister pv shared memory for cpu %d\n", smp_processor_id());
+ pr_info("disable async PF for cpu %d\n", smp_processor_id());
}

static void kvm_pv_guest_cpu_reboot(void *unused)
--
2.30.2

2021-04-14 15:51:50

by Vitaly Kuznetsov

Subject: [PATCH 3/5] x86/kvm: Disable kvmclock on all CPUs on shutdown

Currently, we disable kvmclock from the machine_shutdown() hook and this
only happens for the boot CPU. We need to disable it for all CPUs to
guard against memory corruption, e.g. on restore from hibernation.

Note, writing '0' to the kvmclock MSR doesn't clear the memory location, it
just prevents the hypervisor from updating the location, so for the short
while after the write and while the CPU is still alive, the clock remains
usable and correct and we don't need to switch to some other clocksource.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 4 ++--
arch/x86/kernel/kvm.c | 1 +
arch/x86/kernel/kvmclock.c | 5 +----
3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..9c56e0defd45 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -7,8 +7,6 @@
#include <linux/interrupt.h>
#include <uapi/asm/kvm_para.h>

-extern void kvmclock_init(void);
-
#ifdef CONFIG_KVM_GUEST
bool kvm_check_and_clear_guest_paused(void);
#else
@@ -86,6 +84,8 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
}

#ifdef CONFIG_KVM_GUEST
+void kvmclock_init(void);
+void kvmclock_disable(void);
bool kvm_para_available(void);
unsigned int kvm_arch_para_features(void);
unsigned int kvm_arch_para_hints(void);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 6b16a9bb4ecd..df00d44f7424 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -595,6 +595,7 @@ static void kvm_guest_cpu_offline(void)
wrmsrl(MSR_KVM_PV_EOI_EN, 0);
kvm_pv_disable_apf();
apf_task_wake_all();
+ kvmclock_disable();
}

static int kvm_cpu_online(unsigned int cpu)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 1fc0962c89c0..cf869de98eec 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -220,11 +220,9 @@ static void kvm_crash_shutdown(struct pt_regs *regs)
}
#endif

-static void kvm_shutdown(void)
+void kvmclock_disable(void)
{
native_write_msr(msr_kvm_system_time, 0, 0);
- kvm_disable_steal_time();
- native_machine_shutdown();
}

static void __init kvmclock_init_mem(void)
@@ -351,7 +349,6 @@ void __init kvmclock_init(void)
#endif
x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
- machine_ops.shutdown = kvm_shutdown;
#ifdef CONFIG_KEXEC_CORE
machine_ops.crash_shutdown = kvm_crash_shutdown;
#endif
--
2.30.2
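
Editorial illustration (not part of the patch): the note in the commit
message above says that writing 0 to the kvmclock MSR only stops the
hypervisor from updating the registered location. The sketch below is a
userspace model of that behaviour, not kernel code. Every name in it
(model_register, model_host_update, and so on) is hypothetical, and a NULL
pointer stands in for "MSR written with 0".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the MSR: where the host writes, or NULL. */
static uint64_t *model_registered_area;

static void model_register(uint64_t *area) { model_registered_area = area; }
static void model_disable(void) { model_registered_area = NULL; }

/* Host-side periodic update: writes only while an area is registered. */
static void model_host_update(uint64_t now)
{
	if (model_registered_area)
		*model_registered_area = now;
}

/* Disable first: the last value stays readable and is not overwritten. */
static uint64_t model_value_after_disable(void)
{
	uint64_t area = 0;

	model_register(&area);
	model_host_update(100);
	model_disable();
	model_host_update(200);	/* ignored, nothing is registered */
	return area;
}

/* No disable: the host keeps writing after the memory has been reused. */
static uint64_t model_value_without_disable(void)
{
	uint64_t area = 0;
	uint64_t seen;

	model_register(&area);
	model_host_update(100);
	area = 42;		/* memory now holds unrelated data */
	model_host_update(200);	/* host clobbers it */
	seen = area;
	model_disable();	/* do not leave a dangling pointer behind */
	return seen;
}
```

In this model, model_value_after_disable() returns the last value the host
wrote before the disable, while model_value_without_disable() returns the
host's later write instead of the reused data. The second case is the kind
of corruption the patch guards against.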

2021-04-14 15:54:36

by Vitaly Kuznetsov

Subject: [PATCH 4/5] x86/kvm: Disable all PV features on crash

The crash shutdown handler only disables kvmclock and steal time; other PV
features remain active, so we risk corrupting memory or getting
side effects in the kdump kernel. Move the crash handler to kvm.c and
unify it with CPU offline.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 6 -----
arch/x86/kernel/kvm.c | 44 ++++++++++++++++++++++++---------
arch/x86/kernel/kvmclock.c | 21 ----------------
3 files changed, 32 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 9c56e0defd45..69299878b200 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -92,7 +92,6 @@ unsigned int kvm_arch_para_hints(void);
void kvm_async_pf_task_wait_schedule(u32 token);
void kvm_async_pf_task_wake(u32 token);
u32 kvm_read_and_reset_apf_flags(void);
-void kvm_disable_steal_time(void);
bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token);

DECLARE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
@@ -137,11 +136,6 @@ static inline u32 kvm_read_and_reset_apf_flags(void)
return 0;
}

-static inline void kvm_disable_steal_time(void)
-{
- return;
-}
-
static __always_inline bool kvm_handle_async_pf(struct pt_regs *regs, u32 token)
{
return false;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index df00d44f7424..1754b7c3f754 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -38,6 +38,7 @@
#include <asm/tlb.h>
#include <asm/cpuidle_haltpoll.h>
#include <asm/ptrace.h>
+#include <asm/reboot.h>
#include <asm/svm.h>

DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
@@ -375,6 +376,14 @@ static void kvm_pv_disable_apf(void)
pr_info("disable async PF for cpu %d\n", smp_processor_id());
}

+static void kvm_disable_steal_time(void)
+{
+ if (!has_steal_clock)
+ return;
+
+ wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
+}
+
static void kvm_pv_guest_cpu_reboot(void *unused)
{
/*
@@ -417,14 +426,6 @@ static u64 kvm_steal_clock(int cpu)
return steal;
}

-void kvm_disable_steal_time(void)
-{
- if (!has_steal_clock)
- return;
-
- wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
-}
-
static inline void __set_percpu_decrypted(void *ptr, unsigned long size)
{
early_set_memory_decrypted((unsigned long) ptr, size);
@@ -588,13 +589,14 @@ static void __init kvm_smp_prepare_boot_cpu(void)
kvm_spinlock_init();
}

-static void kvm_guest_cpu_offline(void)
+static void kvm_guest_cpu_offline(bool shutdown)
{
kvm_disable_steal_time();
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
wrmsrl(MSR_KVM_PV_EOI_EN, 0);
kvm_pv_disable_apf();
- apf_task_wake_all();
+ if (!shutdown)
+ apf_task_wake_all();
kvmclock_disable();
}

@@ -613,7 +615,7 @@ static int kvm_cpu_down_prepare(unsigned int cpu)
unsigned long flags;

local_irq_save(flags);
- kvm_guest_cpu_offline();
+ kvm_guest_cpu_offline(false);
local_irq_restore(flags);
return 0;
}
@@ -647,7 +649,7 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,

static int kvm_suspend(void)
{
- kvm_guest_cpu_offline();
+ kvm_guest_cpu_offline(false);

return 0;
}
@@ -662,6 +664,20 @@ static struct syscore_ops kvm_syscore_ops = {
.resume = kvm_resume,
};

+/*
+ * After a PV feature is registered, the host will keep writing to the
+ * registered memory location. If the guest happens to shutdown, this memory
+ * won't be valid. In cases like kexec, in which you install a new kernel, this
+ * means a random memory location will be kept being written.
+ */
+#ifdef CONFIG_KEXEC_CORE
+static void kvm_crash_shutdown(struct pt_regs *regs)
+{
+ kvm_guest_cpu_offline(true);
+ native_machine_crash_shutdown(regs);
+}
+#endif
+
static void __init kvm_guest_init(void)
{
int i;
@@ -704,6 +720,10 @@ static void __init kvm_guest_init(void)
kvm_guest_cpu_init();
#endif

+#ifdef CONFIG_KEXEC_CORE
+ machine_ops.crash_shutdown = kvm_crash_shutdown;
+#endif
+
register_syscore_ops(&kvm_syscore_ops);

/*
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index cf869de98eec..b825c87c12ef 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -20,7 +20,6 @@
#include <asm/hypervisor.h>
#include <asm/mem_encrypt.h>
#include <asm/x86_init.h>
-#include <asm/reboot.h>
#include <asm/kvmclock.h>

static int kvmclock __initdata = 1;
@@ -203,23 +202,6 @@ static void kvm_setup_secondary_clock(void)
}
#endif

-/*
- * After the clock is registered, the host will keep writing to the
- * registered memory location. If the guest happens to shutdown, this memory
- * won't be valid. In cases like kexec, in which you install a new kernel, this
- * means a random memory location will be kept being written. So before any
- * kind of shutdown from our side, we unregister the clock by writing anything
- * that does not have the 'enable' bit set in the msr
- */
-#ifdef CONFIG_KEXEC_CORE
-static void kvm_crash_shutdown(struct pt_regs *regs)
-{
- native_write_msr(msr_kvm_system_time, 0, 0);
- kvm_disable_steal_time();
- native_machine_crash_shutdown(regs);
-}
-#endif
-
void kvmclock_disable(void)
{
native_write_msr(msr_kvm_system_time, 0, 0);
@@ -349,9 +331,6 @@ void __init kvmclock_init(void)
#endif
x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
-#ifdef CONFIG_KEXEC_CORE
- machine_ops.crash_shutdown = kvm_crash_shutdown;
-#endif
kvm_get_preset_lpj();

/*
--
2.30.2
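
Editorial illustration (not part of the patch): the diff above gives
kvm_guest_cpu_offline() a 'shutdown' argument so that one routine serves
both the offline/suspend callers and the crash handler, with the async PF
wake-up step skipped when 'shutdown' is true. The sketch below is a
userspace model of that control flow only. The enum and the function name
are hypothetical, and each step is reduced to a bit in a mask.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the teardown steps seen in the diff. */
enum model_step {
	STEP_STEAL_TIME = 1 << 0,
	STEP_PV_EOI     = 1 << 1,
	STEP_ASYNC_PF   = 1 << 2,
	STEP_WAKE_TASKS = 1 << 3,
	STEP_KVMCLOCK   = 1 << 4,
};

/* Returns a mask of the steps performed, in the order the diff lists them. */
static unsigned int model_cpu_offline(bool shutdown)
{
	unsigned int done = 0;

	done |= STEP_STEAL_TIME;
	done |= STEP_PV_EOI;
	done |= STEP_ASYNC_PF;
	if (!shutdown)
		done |= STEP_WAKE_TASKS;	/* skipped on the crash path */
	done |= STEP_KVMCLOCK;
	return done;
}
```

Calling model_cpu_offline(false) sets all five bits, and
model_cpu_offline(true) sets every bit except STEP_WAKE_TASKS. The model
assumes every feature is present; the real function checks feature
availability before touching each MSR.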

2021-04-15 00:22:54

by Vitaly Kuznetsov

Subject: [PATCH 2/5] x86/kvm: Teardown PV features on boot CPU as well

Various PV features (Async PF, PV EOI, steal time) work through memory
shared with the hypervisor, and when we restore from hibernation we must
properly tear down all these features to make sure the hypervisor doesn't
write to stale locations after we jump to the previously hibernated kernel
(which can try to place anything there). For secondary CPUs the job is
already done by kvm_cpu_down_prepare(); register syscore ops to do
the same for the boot CPU.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/kernel/kvm.c | 32 ++++++++++++++++++++++++++++----
1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 79dddcc178e3..6b16a9bb4ecd 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -26,6 +26,7 @@
#include <linux/kprobes.h>
#include <linux/nmi.h>
#include <linux/swait.h>
+#include <linux/syscore_ops.h>
#include <asm/timer.h>
#include <asm/cpu.h>
#include <asm/traps.h>
@@ -598,17 +599,21 @@ static void kvm_guest_cpu_offline(void)

static int kvm_cpu_online(unsigned int cpu)
{
- local_irq_disable();
+ unsigned long flags;
+
+ local_irq_save(flags);
kvm_guest_cpu_init();
- local_irq_enable();
+ local_irq_restore(flags);
return 0;
}

static int kvm_cpu_down_prepare(unsigned int cpu)
{
- local_irq_disable();
+ unsigned long flags;
+
+ local_irq_save(flags);
kvm_guest_cpu_offline();
- local_irq_enable();
+ local_irq_restore(flags);
return 0;
}
#endif
@@ -639,6 +644,23 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
native_flush_tlb_others(flushmask, info);
}

+static int kvm_suspend(void)
+{
+ kvm_guest_cpu_offline();
+
+ return 0;
+}
+
+static void kvm_resume(void)
+{
+ kvm_cpu_online(raw_smp_processor_id());
+}
+
+static struct syscore_ops kvm_syscore_ops = {
+ .suspend = kvm_suspend,
+ .resume = kvm_resume,
+};
+
static void __init kvm_guest_init(void)
{
int i;
@@ -681,6 +703,8 @@ static void __init kvm_guest_init(void)
kvm_guest_cpu_init();
#endif

+ register_syscore_ops(&kvm_syscore_ops);
+
/*
* Hard lockup detection is enabled by default. Disable it, as guests
* can get false positives too easily, for example if the host is
--
2.30.2
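
Editorial illustration (not part of the patch): the diff above switches
kvm_cpu_online() and kvm_cpu_down_prepare() from local_irq_disable() and
local_irq_enable() to local_irq_save() and local_irq_restore(). A likely
reason, stated here as an assumption, is that kvm_resume() now calls
kvm_cpu_online() from a syscore callback, and those callbacks run with
interrupts already disabled, so an unconditional enable on exit would be
wrong. The sketch below is a userspace model with hypothetical names; a
plain bool stands in for the CPU interrupt flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the CPU interrupt-enable flag. */
static bool model_irqs_enabled = true;

static void model_irq_disable(void) { model_irqs_enabled = false; }
static void model_irq_enable(void)  { model_irqs_enabled = true; }

/* Remember the current state, then disable: models local_irq_save(). */
static bool model_irq_save(void)
{
	bool flags = model_irqs_enabled;

	model_irqs_enabled = false;
	return flags;
}

/* Put back whatever was saved: models local_irq_restore(). */
static void model_irq_restore(bool flags) { model_irqs_enabled = flags; }

/* Old pattern: always leaves interrupts enabled on exit. */
static void model_online_old(void)
{
	model_irq_disable();
	/* per-CPU PV feature setup would run here */
	model_irq_enable();
}

/* New pattern: leaves interrupts in the state the caller had. */
static void model_online_new(void)
{
	bool flags = model_irq_save();

	/* per-CPU PV feature setup would run here */
	model_irq_restore(flags);
}

/* Calls fn with interrupts off and reports the state afterwards. */
static bool model_irqs_after_call_with_irqs_off(void (*fn)(void))
{
	model_irqs_enabled = false;
	fn();
	return model_irqs_enabled;
}
```

With interrupts off on entry, the old pattern returns with them on, while
the save/restore pattern returns with them still off, which is what a
caller that runs with interrupts disabled needs.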

2021-04-15 00:23:49

by Vitaly Kuznetsov

Subject: [PATCH 5/5] x86/kvm: Unify kvm_pv_guest_cpu_reboot() with kvm_guest_cpu_offline()

Simplify the code by making PV features shutdown happen in one place.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/kernel/kvm.c | 42 +++++++++++++++++-------------------------
1 file changed, 17 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 1754b7c3f754..7da7bea96745 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -384,31 +384,6 @@ static void kvm_disable_steal_time(void)
wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
}

-static void kvm_pv_guest_cpu_reboot(void *unused)
-{
- /*
- * We disable PV EOI before we load a new kernel by kexec,
- * since MSR_KVM_PV_EOI_EN stores a pointer into old kernel's memory.
- * New kernel can re-enable when it boots.
- */
- if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
- wrmsrl(MSR_KVM_PV_EOI_EN, 0);
- kvm_pv_disable_apf();
- kvm_disable_steal_time();
-}
-
-static int kvm_pv_reboot_notify(struct notifier_block *nb,
- unsigned long code, void *unused)
-{
- if (code == SYS_RESTART)
- on_each_cpu(kvm_pv_guest_cpu_reboot, NULL, 1);
- return NOTIFY_DONE;
-}
-
-static struct notifier_block kvm_pv_reboot_nb = {
- .notifier_call = kvm_pv_reboot_notify,
-};
-
static u64 kvm_steal_clock(int cpu)
{
u64 steal;
@@ -664,6 +639,23 @@ static struct syscore_ops kvm_syscore_ops = {
.resume = kvm_resume,
};

+static void kvm_pv_guest_cpu_reboot(void *unused)
+{
+ kvm_guest_cpu_offline(true);
+}
+
+static int kvm_pv_reboot_notify(struct notifier_block *nb,
+ unsigned long code, void *unused)
+{
+ if (code == SYS_RESTART)
+ on_each_cpu(kvm_pv_guest_cpu_reboot, NULL, 1);
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block kvm_pv_reboot_nb = {
+ .notifier_call = kvm_pv_reboot_notify,
+};
+
/*
* After a PV feature is registered, the host will keep writing to the
* registered memory location. If the guest happens to shutdown, this memory
--
2.30.2

2021-05-03 18:10:18

by Paolo Bonzini

Subject: Re: [PATCH 0/5] x86/kvm: Refactor KVM PV features teardown and fix restore from hibernation

On 14/04/21 14:35, Vitaly Kuznetsov wrote:
> This series is a successor of Lenny's "[PATCH] x86/kvmclock: Stop kvmclocks
> for hibernate restore". While reviewing his patch I realized that the PV
> feature teardown we have is a bit messy: it is scattered across kvm.c
> and kvmclock.c, and not all features are shut down on all paths.
> This series unifies all teardown paths in kvm.c and makes sure all
> features are disabled when needed.
>
> Vitaly Kuznetsov (5):
> x86/kvm: Fix pr_info() for async PF setup/teardown
> x86/kvm: Teardown PV features on boot CPU as well
> x86/kvm: Disable kvmclock on all CPUs on shutdown
> x86/kvm: Disable all PV features on crash
> x86/kvm: Unify kvm_pv_guest_cpu_reboot() with kvm_guest_cpu_offline()
>
> arch/x86/include/asm/kvm_para.h | 10 +--
> arch/x86/kernel/kvm.c | 113 +++++++++++++++++++++-----------
> arch/x86/kernel/kvmclock.c | 26 +-------
> 3 files changed, 78 insertions(+), 71 deletions(-)
>

Queuing this patch, thanks.

If the Amazon folks want to provide their Tested-by (since they looked
at it before and tested Lenny's first attempt at using syscore_ops),
they're still in time!

Paolo

2021-05-05 17:37:05

by Aboubakr, Mohamed

Subject: Re: [PATCH 0/5] x86/kvm: Refactor KVM PV features teardown and fix restore from hibernation

Hello,

Confirmed: the issue on c5.18xlarge and c5a.18xlarge has been fixed by this patch.

Thanks
Mohamed Aboubakr
> On May 3, 2021, at 8:26 AM, Paolo Bonzini <[email protected]> wrote:
>
>> On 14/04/21 14:35, Vitaly Kuznetsov wrote:
>> This series is a successor of Lenny's "[PATCH] x86/kvmclock: Stop kvmclocks
>> for hibernate restore". While reviewing his patch I realized that the PV
>> feature teardown we have is a bit messy: it is scattered across kvm.c
>> and kvmclock.c, and not all features are shut down on all paths.
>> This series unifies all teardown paths in kvm.c and makes sure all
>> features are disabled when needed.
>>
>> Vitaly Kuznetsov (5):
>> x86/kvm: Fix pr_info() for async PF setup/teardown
>> x86/kvm: Teardown PV features on boot CPU as well
>> x86/kvm: Disable kvmclock on all CPUs on shutdown
>> x86/kvm: Disable all PV features on crash
>> x86/kvm: Unify kvm_pv_guest_cpu_reboot() with kvm_guest_cpu_offline()
>>
>> arch/x86/include/asm/kvm_para.h | 10 +--
>> arch/x86/kernel/kvm.c | 113 +++++++++++++++++++++-----------
>> arch/x86/kernel/kvmclock.c | 26 +-------
>> 3 files changed, 78 insertions(+), 71 deletions(-)
>>
>
> Queuing this patch, thanks.
>
> If the Amazon folks want to provide their Tested-by (since they looked
> at it before and tested Lenny's first attempt at using syscore_ops),
> they're still in time!
>
> Paolo
>