2020-01-30 12:25:57

by Xiaoyao Li

[permalink] [raw]
Subject: [PATCH 0/2] kvm: split_lock: Fix emulator and extend #AC handler

As kernel split lock patch[1] merged into tip tree, kvm emulator needs to be
fixed and vmx's #AC handler needs to be extended.

Patch 1 fixes x86/emulator to emualte split lock access as write to avoid
malicous guest[2] exploiting it to populate host kernel log.

Patch 2 extends vmx's #AC handler that we can make old guestes (has split_lock
buges) survive on certain cases.

[1] https://lore.kernel.org/lkml/158031147976.396.8941798847364718785.tip-bot2@tip-bot2/
[2] https://lore.kernel.org/lkml/[email protected]/

Xiaoyao Li (2):
KVM: x86: Emulate split-lock access as a write
KVM: VMX: Extend VMX's #AC handding

arch/x86/include/asm/cpu.h | 13 ++++++++++++
arch/x86/kernel/cpu/intel.c | 18 ++++++++++------
arch/x86/kvm/vmx/vmx.c | 42 ++++++++++++++++++++++++++++++++++---
arch/x86/kvm/vmx/vmx.h | 3 +++
arch/x86/kvm/x86.c | 11 ++++++++++
5 files changed, 78 insertions(+), 9 deletions(-)

--
2.23.0


2020-01-30 12:26:06

by Xiaoyao Li

[permalink] [raw]
Subject: [PATCH 2/2] KVM: VMX: Extend VMX's #AC handding

There are two types of #AC can be generated in Intel CPUs:
1. legacy alignment check #AC;
2. split lock #AC;

Legacy alignment check #AC can be injected to guest if guest has enabled
alignemnet check.

When host enables split lock detection, i.e., split_lock_detect!=off,
guest will receive an unexpected #AC when there is a split_lock happens in
guest since KVM doesn't virtualize this feature to guest.

Since the old guests lack split_lock #AC handler and may have split lock
buges. To make guest survive from split lock, applying the similar policy
as host's split lock detect configuration:
- host split lock detect is sld_warn:
warning the split lock happened in guest, and disabling split lock
detect around VM-enter;
- host split lock detect is sld_fatal:
forwarding #AC to userspace. (Usually userspace dump the #AC
exception and kill the guest).

Note, if sld_warn and SMT is enabled, the split lock in guest's vcpu
leads the disabling of split lock detect on the sibling CPU thread during
the vcpu is running.

Signed-off-by: Xiaoyao Li <[email protected]>
---
arch/x86/include/asm/cpu.h | 1 +
arch/x86/kernel/cpu/intel.c | 6 ++++++
arch/x86/kvm/vmx/vmx.c | 42 ++++++++++++++++++++++++++++++++++---
arch/x86/kvm/vmx/vmx.h | 3 +++
4 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index 167d0539e0ad..b46262afa6c1 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -52,6 +52,7 @@ extern enum split_lock_detect_state get_split_lock_detect_state(void);
extern void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c);
extern void switch_to_sld(unsigned long tifn);
extern bool handle_user_split_lock(struct pt_regs *regs, long error_code);
+extern void split_lock_detect_set(bool on);
#else
static inline enum split_lock_detect_state get_split_lock_detect_state(void)
{
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 2f9c48e91caf..889469b54b5a 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1124,6 +1124,12 @@ void switch_to_sld(unsigned long tifn)
__sld_msr_set(!(tifn & _TIF_SLD));
}

+void split_lock_detect_set(bool on)
+{
+ __sld_msr_set(on);
+}
+EXPORT_SYMBOL_GPL(split_lock_detect_set);
+
#define SPLIT_LOCK_CPU(model) {X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY}

/*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index cdb4bf50ee14..402a9152c6ee 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4553,6 +4553,12 @@ static int handle_machine_check(struct kvm_vcpu *vcpu)
return 1;
}

+static bool guest_cpu_alignment_check_enabled(struct kvm_vcpu *vcpu)
+{
+ return vmx_get_cpl(vcpu) == 3 && kvm_read_cr0_bits(vcpu, X86_CR0_AM) &&
+ (kvm_get_rflags(vcpu) & X86_EFLAGS_AC);
+}
+
static int handle_exception_nmi(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -4618,9 +4624,6 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
return handle_rmode_exception(vcpu, ex_no, error_code);

switch (ex_no) {
- case AC_VECTOR:
- kvm_queue_exception_e(vcpu, AC_VECTOR, error_code);
- return 1;
case DB_VECTOR:
dr6 = vmcs_readl(EXIT_QUALIFICATION);
if (!(vcpu->guest_debug &
@@ -4649,6 +4652,29 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
kvm_run->debug.arch.pc = vmcs_readl(GUEST_CS_BASE) + rip;
kvm_run->debug.arch.exception = ex_no;
break;
+ case AC_VECTOR:
+ /*
+ * Inject #AC back to guest only when legacy alignment check
+ * enabled.
+ * Otherwise, it must be a split-lock #AC.
+ * - If sld_state == sld_warn, it can let guest survive by
+ * setting the vcpu's diasble_split_lock_detect to true so
+ * that it will toggle MSR_TEST.SPLIT_LOCK_DETECT bit during
+ * every following VM Entry and Exit;
+ * - If sld_state == sld_fatal, it forwards #AC to userspace;
+ */
+ if (guest_cpu_alignment_check_enabled(vcpu) ||
+ WARN_ON(get_split_lock_detect_state() == sld_off)) {
+ kvm_queue_exception_e(vcpu, AC_VECTOR, error_code);
+ return 1;
+ }
+ if (get_split_lock_detect_state() == sld_warn) {
+ pr_warn("kvm: split lock #AC happened in %s [%d]\n",
+ current->comm, current->pid);
+ vmx->disable_split_lock_detect = true;
+ return 1;
+ }
+ /* fall through*/
default:
kvm_run->exit_reason = KVM_EXIT_EXCEPTION;
kvm_run->ex.exception = ex_no;
@@ -6511,6 +6537,11 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
*/
x86_spec_ctrl_set_guest(vmx->spec_ctrl, 0);

+ if (static_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) &&
+ !test_tsk_thread_flag(current, TIF_SLD) &&
+ unlikely(vmx->disable_split_lock_detect))
+ split_lock_detect_set(false);
+
/* L1D Flush includes CPU buffer clear to mitigate MDS */
if (static_branch_unlikely(&vmx_l1d_should_flush))
vmx_l1d_flush(vcpu);
@@ -6545,6 +6576,11 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)

x86_spec_ctrl_restore_host(vmx->spec_ctrl, 0);

+ if (static_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) &&
+ !test_tsk_thread_flag(current, TIF_SLD) &&
+ unlikely(vmx->disable_split_lock_detect))
+ split_lock_detect_set(true);
+
/* All fields are clean at this point */
if (static_branch_unlikely(&enable_evmcs))
current_evmcs->hv_clean_fields |=
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 7f42cf3dcd70..912eba66c5d5 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -274,6 +274,9 @@ struct vcpu_vmx {

bool req_immediate_exit;

+ /* Disable split-lock detection when running the vCPU */
+ bool disable_split_lock_detect;
+
/* Support for PML */
#define PML_ENTITY_NUM 512
struct page *pml_pg;
--
2.23.0

2020-01-30 12:26:59

by Xiaoyao Li

[permalink] [raw]
Subject: [PATCH 1/2] KVM: x86: Emulate split-lock access as a write

If split lock detect is enabled (warn/fatal), #AC handler calls die()
when split lock happens in kernel.

A sane guest should never tigger emulation on a split-lock access, but
it cannot prevent malicous guest from doing this. So just emulating the
access as a write if it's a split-lock access to avoid malicous guest
polluting the kernel log.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
---
arch/x86/include/asm/cpu.h | 12 ++++++++++++
arch/x86/kernel/cpu/intel.c | 12 ++++++------
arch/x86/kvm/x86.c | 11 +++++++++++
3 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index ff6f3ca649b3..167d0539e0ad 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -40,11 +40,23 @@ int mwait_usable(const struct cpuinfo_x86 *);
unsigned int x86_family(unsigned int sig);
unsigned int x86_model(unsigned int sig);
unsigned int x86_stepping(unsigned int sig);
+
+enum split_lock_detect_state {
+ sld_off = 0,
+ sld_warn,
+ sld_fatal,
+};
+
#ifdef CONFIG_CPU_SUP_INTEL
+extern enum split_lock_detect_state get_split_lock_detect_state(void);
extern void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c);
extern void switch_to_sld(unsigned long tifn);
extern bool handle_user_split_lock(struct pt_regs *regs, long error_code);
#else
+static inline enum split_lock_detect_state get_split_lock_detect_state(void)
+{
+ return sld_off;
+}
static inline void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) {}
static inline void switch_to_sld(unsigned long tifn) {}
static inline bool handle_user_split_lock(struct pt_regs *regs, long error_code)
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 5d92e381fd91..2f9c48e91caf 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -33,12 +33,6 @@
#include <asm/apic.h>
#endif

-enum split_lock_detect_state {
- sld_off = 0,
- sld_warn,
- sld_fatal,
-};
-
/*
* Default to sld_off because most systems do not support split lock detection
* split_lock_setup() will switch this to sld_warn on systems that support
@@ -1004,6 +998,12 @@ cpu_dev_register(intel_cpu_dev);
#undef pr_fmt
#define pr_fmt(fmt) "x86/split lock detection: " fmt

+enum split_lock_detect_state get_split_lock_detect_state(void)
+{
+ return sld_state;
+}
+EXPORT_SYMBOL_GPL(get_split_lock_detect_state);
+
static const struct {
const char *option;
enum split_lock_detect_state state;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e6d4e4dcd11c..7d9303c303d9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5800,6 +5800,13 @@ static int emulator_write_emulated(struct x86_emulate_ctxt *ctxt,
(cmpxchg64((u64 *)(ptr), *(u64 *)(old), *(u64 *)(new)) == *(u64 *)(old))
#endif

+static inline bool is_split_lock_access(gpa_t gpa, unsigned int bytes)
+{
+ unsigned int cache_line_size = cache_line_size();
+
+ return (gpa & (cache_line_size - 1)) + bytes > cache_line_size;
+}
+
static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
unsigned long addr,
const void *old,
@@ -5826,6 +5833,10 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
if (((gpa + bytes - 1) & PAGE_MASK) != (gpa & PAGE_MASK))
goto emul_write;

+ if (get_split_lock_detect_state() != sld_off &&
+ is_split_lock_access(gpa, bytes))
+ goto emul_write;
+
if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map))
goto emul_write;

--
2.23.0

2020-01-30 12:34:20

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 1/2] KVM: x86: Emulate split-lock access as a write

From: Xiaoyao Li
> Sent: 30 January 2020 12:20
> If split lock detect is enabled (warn/fatal), #AC handler calls die()
> when split lock happens in kernel.
>
> A sane guest should never tigger emulation on a split-lock access, but
> it cannot prevent malicous guest from doing this. So just emulating the
> access as a write if it's a split-lock access to avoid malicous guest
> polluting the kernel log.

That doesn't seem right if, for example, the locked access is addx.
ISTM it would be better to force an immediate fatal error of some
kind than just corrupt the guest memory.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2020-01-30 15:19:12

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86: Emulate split-lock access as a write



> On Jan 30, 2020, at 4:31 AM, David Laight <[email protected]> wrote:
>
> From: Xiaoyao Li
>> Sent: 30 January 2020 12:20
>> If split lock detect is enabled (warn/fatal), #AC handler calls die()
>> when split lock happens in kernel.
>>
>> A sane guest should never tigger emulation on a split-lock access, but
>> it cannot prevent malicous guest from doing this. So just emulating the
>> access as a write if it's a split-lock access to avoid malicous guest
>> polluting the kernel log.
>
> That doesn't seem right if, for example, the locked access is addx.
> ISTM it would be better to force an immediate fatal error of some
> kind than just corrupt the guest memory.
>
>

The existing page-spanning case is just as wrong.

2020-01-30 15:19:49

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: VMX: Extend VMX's #AC handding



> On Jan 30, 2020, at 4:24 AM, Xiaoyao Li <[email protected]> wrote:
>
> There are two types of #AC can be generated in Intel CPUs:
> 1. legacy alignment check #AC;
> 2. split lock #AC;
>
> Legacy alignment check #AC can be injected to guest if guest has enabled
> alignemnet check.
>
> When host enables split lock detection, i.e., split_lock_detect!=off,
> guest will receive an unexpected #AC when there is a split_lock happens in
> guest since KVM doesn't virtualize this feature to guest.
>
> Since the old guests lack split_lock #AC handler and may have split lock
> buges. To make guest survive from split lock, applying the similar policy
> as host's split lock detect configuration:
> - host split lock detect is sld_warn:
> warning the split lock happened in guest, and disabling split lock
> detect around VM-enter;
> - host split lock detect is sld_fatal:
> forwarding #AC to userspace. (Usually userspace dump the #AC
> exception and kill the guest).

A correct userspace implementation should, with a modern guest kernel, forward the exception. Otherwise you’re introducing a DoS into the guest if the guest kernel is fine but guest userspace is buggy.

What’s the intended behavior here?

2020-01-30 16:31:10

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: VMX: Extend VMX's #AC handding

On 1/30/2020 11:18 PM, Andy Lutomirski wrote:
>
>
>> On Jan 30, 2020, at 4:24 AM, Xiaoyao Li <[email protected]> wrote:
>>
>> There are two types of #AC can be generated in Intel CPUs:
>> 1. legacy alignment check #AC;
>> 2. split lock #AC;
>>
>> Legacy alignment check #AC can be injected to guest if guest has enabled
>> alignemnet check.
>>
>> When host enables split lock detection, i.e., split_lock_detect!=off,
>> guest will receive an unexpected #AC when there is a split_lock happens in
>> guest since KVM doesn't virtualize this feature to guest.
>>
>> Since the old guests lack split_lock #AC handler and may have split lock
>> buges. To make guest survive from split lock, applying the similar policy
>> as host's split lock detect configuration:
>> - host split lock detect is sld_warn:
>> warning the split lock happened in guest, and disabling split lock
>> detect around VM-enter;
>> - host split lock detect is sld_fatal:
>> forwarding #AC to userspace. (Usually userspace dump the #AC
>> exception and kill the guest).
>
> A correct userspace implementation should, with a modern guest kernel, forward the exception. Otherwise you’re introducing a DoS into the guest if the guest kernel is fine but guest userspace is buggy.

To prevent DoS in guest, the better solution is virtualizing and
advertising this feature to guest, so guest can explicitly enable it by
setting split_lock_detect=fatal, if it's a latest linux guest.

However, it's another topic, I'll send out the patches later.

> What’s the intended behavior here?
>
It's for old guests. Below I quote what Paolo said in
https://lore.kernel.org/kvm/[email protected]/

"So for an old guest, as soon as the guest kernel happens to do a split
lock, it gets an unexpected #AC and crashes and burns. And then, after
much googling and gnashing of teeth, people proceed to disable split
lock detection.

(Old guests are the common case: you're a cloud provider and your
customers run old stuff; it's a workstation and you want to play that
game that requires an old version of Windows; etc.).

To save them the googling and gnashing of teeth, I guess we can do a
pr_warn_ratelimited on the first split lock encountered by a guest. (It
has to be ratelimited because userspace could create an arbitrary amount
of guests to spam the kernel logs). But the end result is the same,
split lock detection is disabled by the user."



2020-01-30 17:18:12

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: VMX: Extend VMX's #AC handding



> On Jan 30, 2020, at 8:30 AM, Xiaoyao Li <[email protected]> wrote:
>
> On 1/30/2020 11:18 PM, Andy Lutomirski wrote:
>>>> On Jan 30, 2020, at 4:24 AM, Xiaoyao Li <[email protected]> wrote:
>>>
>>> There are two types of #AC can be generated in Intel CPUs:
>>> 1. legacy alignment check #AC;
>>> 2. split lock #AC;
>>>
>>> Legacy alignment check #AC can be injected to guest if guest has enabled
>>> alignemnet check.
>>>
>>> When host enables split lock detection, i.e., split_lock_detect!=off,
>>> guest will receive an unexpected #AC when there is a split_lock happens in
>>> guest since KVM doesn't virtualize this feature to guest.
>>>
>>> Since the old guests lack split_lock #AC handler and may have split lock
>>> buges. To make guest survive from split lock, applying the similar policy
>>> as host's split lock detect configuration:
>>> - host split lock detect is sld_warn:
>>> warning the split lock happened in guest, and disabling split lock
>>> detect around VM-enter;
>>> - host split lock detect is sld_fatal:
>>> forwarding #AC to userspace. (Usually userspace dump the #AC
>>> exception and kill the guest).
>> A correct userspace implementation should, with a modern guest kernel, forward the exception. Otherwise you’re introducing a DoS into the guest if the guest kernel is fine but guest userspace is buggy.
>
> To prevent DoS in guest, the better solution is virtualizing and advertising this feature to guest, so guest can explicitly enable it by setting split_lock_detect=fatal, if it's a latest linux guest.
>
> However, it's another topic, I'll send out the patches later.
>

Can we get a credible description of how this would work? I suggest:

Intel adds and documents a new CPUID bit or core capability bit that means “split lock detection is forced on”. If this bit is set, the MSR bit controlling split lock detection is still writable, but split lock detection is on regardless of the value. Operating systems are expected to set the bit to 1 to indicate to a hypervisor, if present, that they understand that split lock detection is on.

This would be an SDM-only change, but it would also be a commitment to certain behavior for future CPUs that don’t implement split locks.

Can one of you Intel folks ask the architecture team about this?

2020-01-31 07:24:03

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: VMX: Extend VMX's #AC handding

On 1/31/2020 1:16 AM, Andy Lutomirski wrote:
>
>
>> On Jan 30, 2020, at 8:30 AM, Xiaoyao Li <[email protected]> wrote:
>>
>> On 1/30/2020 11:18 PM, Andy Lutomirski wrote:
>>>>> On Jan 30, 2020, at 4:24 AM, Xiaoyao Li <[email protected]> wrote:
>>>>
>>>> There are two types of #AC can be generated in Intel CPUs:
>>>> 1. legacy alignment check #AC;
>>>> 2. split lock #AC;
>>>>
>>>> Legacy alignment check #AC can be injected to guest if guest has enabled
>>>> alignemnet check.
>>>>
>>>> When host enables split lock detection, i.e., split_lock_detect!=off,
>>>> guest will receive an unexpected #AC when there is a split_lock happens in
>>>> guest since KVM doesn't virtualize this feature to guest.
>>>>
>>>> Since the old guests lack split_lock #AC handler and may have split lock
>>>> buges. To make guest survive from split lock, applying the similar policy
>>>> as host's split lock detect configuration:
>>>> - host split lock detect is sld_warn:
>>>> warning the split lock happened in guest, and disabling split lock
>>>> detect around VM-enter;
>>>> - host split lock detect is sld_fatal:
>>>> forwarding #AC to userspace. (Usually userspace dump the #AC
>>>> exception and kill the guest).
>>> A correct userspace implementation should, with a modern guest kernel, forward the exception. Otherwise you’re introducing a DoS into the guest if the guest kernel is fine but guest userspace is buggy.
>>
>> To prevent DoS in guest, the better solution is virtualizing and advertising this feature to guest, so guest can explicitly enable it by setting split_lock_detect=fatal, if it's a latest linux guest.
>>
>> However, it's another topic, I'll send out the patches later.
>>
>
> Can we get a credible description of how this would work? I suggest:
>
> Intel adds and documents a new CPUID bit or core capability bit that means “split lock detection is forced on”. If this bit is set, the MSR bit controlling split lock detection is still writable, but split lock detection is on regardless of the value. Operating systems are expected to set the bit to 1 to indicate to a hypervisor, if present, that they understand that split lock detection is on.
>
> This would be an SDM-only change, but it would also be a commitment to certain behavior for future CPUs that don’t implement split locks.

It sounds a PV solution for virtualization that it doesn't need to be
defined in Intel-SDM but in KVM document.

As you suggested, we can define new bit in KVM_CPUID_FEATURES
(0x40000001) as KVM_FEATURE_SLD_FORCED and reuse MSR_TEST_CTL or use a
new virtualized MSR for guest to tell hypervisor it understand split
lock detection is forced on.

> Can one of you Intel folks ask the architecture team about this?
>

2020-01-31 15:39:30

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: VMX: Extend VMX's #AC handding



> On Jan 30, 2020, at 11:22 PM, Xiaoyao Li <[email protected]> wrote:
>
> On 1/31/2020 1:16 AM, Andy Lutomirski wrote:
>>>> On Jan 30, 2020, at 8:30 AM, Xiaoyao Li <[email protected]> wrote:
>>>
>>> On 1/30/2020 11:18 PM, Andy Lutomirski wrote:
>>>>>> On Jan 30, 2020, at 4:24 AM, Xiaoyao Li <[email protected]> wrote:
>>>>>
>>>>> There are two types of #AC can be generated in Intel CPUs:
>>>>> 1. legacy alignment check #AC;
>>>>> 2. split lock #AC;
>>>>>
>>>>> Legacy alignment check #AC can be injected to guest if guest has enabled
>>>>> alignemnet check.
>>>>>
>>>>> When host enables split lock detection, i.e., split_lock_detect!=off,
>>>>> guest will receive an unexpected #AC when there is a split_lock happens in
>>>>> guest since KVM doesn't virtualize this feature to guest.
>>>>>
>>>>> Since the old guests lack split_lock #AC handler and may have split lock
>>>>> buges. To make guest survive from split lock, applying the similar policy
>>>>> as host's split lock detect configuration:
>>>>> - host split lock detect is sld_warn:
>>>>> warning the split lock happened in guest, and disabling split lock
>>>>> detect around VM-enter;
>>>>> - host split lock detect is sld_fatal:
>>>>> forwarding #AC to userspace. (Usually userspace dump the #AC
>>>>> exception and kill the guest).
>>>> A correct userspace implementation should, with a modern guest kernel, forward the exception. Otherwise you’re introducing a DoS into the guest if the guest kernel is fine but guest userspace is buggy.
>>>
>>> To prevent DoS in guest, the better solution is virtualizing and advertising this feature to guest, so guest can explicitly enable it by setting split_lock_detect=fatal, if it's a latest linux guest.
>>>
>>> However, it's another topic, I'll send out the patches later.
>>>
>> Can we get a credible description of how this would work? I suggest:
>> Intel adds and documents a new CPUID bit or core capability bit that means “split lock detection is forced on”. If this bit is set, the MSR bit controlling split lock detection is still writable, but split lock detection is on regardless of the value. Operating systems are expected to set the bit to 1 to indicate to a hypervisor, if present, that they understand that split lock detection is on.
>> This would be an SDM-only change, but it would also be a commitment to certain behavior for future CPUs that don’t implement split locks.
>
> It sounds a PV solution for virtualization that it doesn't need to be defined in Intel-SDM but in KVM document.
>
> As you suggested, we can define new bit in KVM_CPUID_FEATURES (0x40000001) as KVM_FEATURE_SLD_FORCED and reuse MSR_TEST_CTL or use a new virtualized MSR for guest to tell hypervisor it understand split lock detection is forced on.

Of course KVM can do this. But this missed the point. Intel added a new CPU feature, complete with an enumeration mechanism, that cannot be correctly used if a hypervisor is present. As it stands, without specific hypervisor and guest support of a non-Intel interface, it is *impossible* to give architecturally correct behavior to a guest. If KVM implements your suggestion, *Windows* guests will still malfunction on Linux.

This is Intel’s mess. Intel should have some responsibility for cleaning it up. If Intel puts something like my suggestion into the SDM, all the OSes and hypervisors will implement it the *same* way and will be compatible. Sure, old OSes will still be broken, but at least new guests will work correctly. Without something like this, even new guests are busted.

>
>> Can one of you Intel folks ask the architecture team about this?
>

2020-01-31 17:49:24

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: VMX: Extend VMX's #AC handding

On 1/31/2020 11:37 PM, Andy Lutomirski wrote:
>
>
>> On Jan 30, 2020, at 11:22 PM, Xiaoyao Li <[email protected]> wrote:
>>
>> On 1/31/2020 1:16 AM, Andy Lutomirski wrote:
>>>>> On Jan 30, 2020, at 8:30 AM, Xiaoyao Li <[email protected]> wrote:
>>>>
>>>> On 1/30/2020 11:18 PM, Andy Lutomirski wrote:
>>>>>>> On Jan 30, 2020, at 4:24 AM, Xiaoyao Li <[email protected]> wrote:
>>>>>>
>>>>>> There are two types of #AC can be generated in Intel CPUs:
>>>>>> 1. legacy alignment check #AC;
>>>>>> 2. split lock #AC;
>>>>>>
>>>>>> Legacy alignment check #AC can be injected to guest if guest has enabled
>>>>>> alignemnet check.
>>>>>>
>>>>>> When host enables split lock detection, i.e., split_lock_detect!=off,
>>>>>> guest will receive an unexpected #AC when there is a split_lock happens in
>>>>>> guest since KVM doesn't virtualize this feature to guest.
>>>>>>
>>>>>> Since the old guests lack split_lock #AC handler and may have split lock
>>>>>> buges. To make guest survive from split lock, applying the similar policy
>>>>>> as host's split lock detect configuration:
>>>>>> - host split lock detect is sld_warn:
>>>>>> warning the split lock happened in guest, and disabling split lock
>>>>>> detect around VM-enter;
>>>>>> - host split lock detect is sld_fatal:
>>>>>> forwarding #AC to userspace. (Usually userspace dump the #AC
>>>>>> exception and kill the guest).
>>>>> A correct userspace implementation should, with a modern guest kernel, forward the exception. Otherwise you’re introducing a DoS into the guest if the guest kernel is fine but guest userspace is buggy.
>>>>
>>>> To prevent DoS in guest, the better solution is virtualizing and advertising this feature to guest, so guest can explicitly enable it by setting split_lock_detect=fatal, if it's a latest linux guest.
>>>>
>>>> However, it's another topic, I'll send out the patches later.
>>>>
>>> Can we get a credible description of how this would work? I suggest:
>>> Intel adds and documents a new CPUID bit or core capability bit that means “split lock detection is forced on”. If this bit is set, the MSR bit controlling split lock detection is still writable, but split lock detection is on regardless of the value. Operating systems are expected to set the bit to 1 to indicate to a hypervisor, if present, that they understand that split lock detection is on.
>>> This would be an SDM-only change, but it would also be a commitment to certain behavior for future CPUs that don’t implement split locks.
>>
>> It sounds a PV solution for virtualization that it doesn't need to be defined in Intel-SDM but in KVM document.
>>
>> As you suggested, we can define new bit in KVM_CPUID_FEATURES (0x40000001) as KVM_FEATURE_SLD_FORCED and reuse MSR_TEST_CTL or use a new virtualized MSR for guest to tell hypervisor it understand split lock detection is forced on.
>
> Of course KVM can do this. But this missed the point. Intel added a new CPU feature, complete with an enumeration mechanism, that cannot be correctly used if a hypervisor is present.

Why it cannot be correctly used if a hypervisor is present?
Because it needs to disable split lock detection when running a vcpu for
guest as this patch wants to do?

> As it stands, without specific hypervisor and guest support of a non-Intel interface, it is *impossible* to give architecturally correct behavior to a guest. If KVM implements your suggestion, *Windows* guests will still malfunction on Linux.

Actually, KVM don't need to implement my suggestion. It can just
virtualize and expose this feature (MSR_IA32_CORE_CAPABILITIES and
MSR_TEST_CTRL) to guest, (but it may have some requirement that HT is
disabled and host is sld_off) then guest can use it architecturally.

If we don't virtualize and expose this feature to guest as now, both old
and modern guest should think there is no feature split lock detection,
and of course there is no #AC when it executes a split lock.

However, the problem here is the scope of host split lock detection,
i.e., whether or not host's enabling of split lock detection takes
effect on guest.

>
> This is Intel’s mess. Intel should have some responsibility for cleaning it up. If Intel puts something like my suggestion into the SDM, all the OSes and hypervisors will implement it the *same* way and will be compatible. Sure, old OSes will still be broken, but at least new guests will work correctly. Without something like this, even new guests are busted.
>
>>
>>> Can one of you Intel folks ask the architecture team about this?
>>

2020-01-31 20:04:25

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86: Emulate split-lock access as a write

On Thu, Jan 30, 2020 at 07:16:24AM -0800, Andy Lutomirski wrote:
>
> > On Jan 30, 2020, at 4:31 AM, David Laight <[email protected]> wrote:
> >
> >> If split lock detect is enabled (warn/fatal), #AC handler calls die()
> >> when split lock happens in kernel.
> >>
> >> A sane guest should never tigger emulation on a split-lock access, but
> >> it cannot prevent malicous guest from doing this. So just emulating the
> >> access as a write if it's a split-lock access to avoid malicous guest
> >> polluting the kernel log.
> >
> > That doesn't seem right if, for example, the locked access is addx.
> > ISTM it would be better to force an immediate fatal error of some
> > kind than just corrupt the guest memory.
>
> The existing page-spanning case is just as wrong.

Yes, it's a deliberate shortcut to handle a corner case that no real world
workload will ever trigger[*]. The split-lock #AC case is the same.
Actually, it's significantly less likely than the page-split case.

With a sane, non-malicious guest, the emulator code in question only gets
triggered if unrestricted guest is supported. Without unrestricted guest,
there are certain modes, e.g. Big Real Mode, where VM-Enter will fail, in
which case KVM needs to emulate the entire guest code stream until the
guest transitions back to a valid mode (from VMX perspective). When
unrestricted guest is enabled, the emulator is only invoked for MMIO,
I/O strings, and for some instructions that are emulated on #UD to allow
migrating VMs between hosts without heterogenous CPU capabilities.

Unrestricted guest is supported on all Intel CPUs since Westmere, and will
be supported on all CPUs that support split-lock #AC and VMX. Except for
a few esoteric use cases where using shadow paging is more performant than
using EPT, there is zero benefit to disabling unrestricted guest, whereas
enabling unrestricted guest provides additional performance and security.

In other words, the odds of a sane, non-malicious guest executing a split-
lock instruction that needs to be emulated by KVM are basically zilch.

The reason the emulator needs to handle this case at all is because a
malicious guest could play TLB games to trick KVM into emulating a split-
lock instruction, e.g. get the guest's translation for RIP pointing at a
string I/O instruction to trigger VM-Exit, but have the host translation
point at a completely different instruction. With split-lock #AC enable
in the host, that would cause a kernel split-lock #AC and panic the whole
system.

Exiting to host userspace with "emulation failed" is the other reasonable
alternative, but that's basically the same as killing the guest. We're
arguing that, in the extremely unlikely event that there is a workload out
there that hits this, it's preferable to *maybe* corrupt guest memory and
log the anomaly in the kernel log, as opposed to outright killing the guest
with a generic "emulation failed".

Looking forward, the other reason for taking this shortcut is to easily
handle the case where KVM adds support for exposing split-lock #AC to the
guest. With this approach, we don't have to teach the emulator how to
query for split-lock #AC enabling in the guest. Again, in the interest of
not adding code to the emulator that is effectively useless.

[*] https://lkml.kernel.org/r/[email protected]

2020-01-31 20:21:49

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: VMX: Extend VMX's #AC handding

On Sat, Feb 01, 2020 at 01:47:10AM +0800, Xiaoyao Li wrote:
> On 1/31/2020 11:37 PM, Andy Lutomirski wrote:
> >
> >>On Jan 30, 2020, at 11:22 PM, Xiaoyao Li <[email protected]> wrote:
> >>
> >> On 1/31/2020 1:16 AM, Andy Lutomirski wrote:

...

> >>>Can we get a credible description of how this would work? I suggest: Intel
> >>>adds and documents a new CPUID bit or core capability bit that means
> >>>“split lock detection is forced on”. If this bit is set, the MSR bit
> >>>controlling split lock detection is still writable, but split lock
> >>>detection is on regardless of the value. Operating systems are expected
> >>>to set the bit to 1 to indicate to a hypervisor, if present, that they
> >>>understand that split lock detection is on. This would be an SDM-only
> >>>change, but it would also be a commitment to certain behavior for future
> >>>CPUs that don’t implement split locks.
> >>
> >>It sounds a PV solution for virtualization that it doesn't need to be
> >>defined in Intel-SDM but in KVM document.
> >>
> >>As you suggested, we can define new bit in KVM_CPUID_FEATURES (0x40000001)
> >>as KVM_FEATURE_SLD_FORCED and reuse MSR_TEST_CTL or use a new virtualized
> >>MSR for guest to tell hypervisor it understand split lock detection is
> >>forced on.
> >
> >Of course KVM can do this. But this missed the point. Intel added a new CPU
> >feature, complete with an enumeration mechanism, that cannot be correctly
> >used if a hypervisor is present.
>
> Why it cannot be correctly used if a hypervisor is present? Because it needs
> to disable split lock detection when running a vcpu for guest as this patch
> wants to do?

Because SMT. Unless vCPUs are pinned 1:1 with pCPUs, and the guest is
given an accurate topology, disabling/enabling split-lock #AC may (or may
not) also disable/enable split-lock #AC on a random vCPU in the guest.

> >As it stands, without specific hypervisor and guest support of a non-Intel
> >interface, it is *impossible* to give architecturally correct behavior to a
> >guest. If KVM implements your suggestion, *Windows* guests will still
> >malfunction on Linux.
>
> Actually, KVM don't need to implement my suggestion. It can just virtualize
> and expose this feature (MSR_IA32_CORE_CAPABILITIES and MSR_TEST_CTRL) to
> guest, (but it may have some requirement that HT is disabled and host is
> sld_off) then guest can use it architecturally.

This is essentially what I proposed a while back. KVM would allow enabling
split-lock #AC in the guest if and only if SMT is disabled or the enable bit
is per-thread, *or* the host is in "warn" mode (can live with split-lock #AC
being randomly disabled/enabled) and userspace has communicated to KVM that
it is pinning vCPUs.

2020-01-31 20:59:29

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: VMX: Extend VMX's #AC handding




> On Jan 31, 2020, at 12:18 PM, Sean Christopherson <[email protected]> wrote:
>
> On Sat, Feb 01, 2020 at 01:47:10AM +0800, Xiaoyao Li wrote:
>>> On 1/31/2020 11:37 PM, Andy Lutomirski wrote:
>>>
>>>> On Jan 30, 2020, at 11:22 PM, Xiaoyao Li <[email protected]> wrote:
>>>>
>>>> On 1/31/2020 1:16 AM, Andy Lutomirski wrote:
>
> ...
>
>>>>> Can we get a credible description of how this would work? I suggest: Intel
>>>>> adds and documents a new CPUID bit or core capability bit that means
>>>>> “split lock detection is forced on”. If this bit is set, the MSR bit
>>>>> controlling split lock detection is still writable, but split lock
>>>>> detection is on regardless of the value. Operating systems are expected
>>>>> to set the bit to 1 to indicate to a hypervisor, if present, that they
>>>>> understand that split lock detection is on. This would be an SDM-only
>>>>> change, but it would also be a commitment to certain behavior for future
>>>>> CPUs that don’t implement split locks.
>>>>
>>>> It sounds a PV solution for virtualization that it doesn't need to be
>>>> defined in Intel-SDM but in KVM document.
>>>>
>>>> As you suggested, we can define new bit in KVM_CPUID_FEATURES (0x40000001)
>>>> as KVM_FEATURE_SLD_FORCED and reuse MSR_TEST_CTL or use a new virtualized
>>>> MSR for guest to tell hypervisor it understand split lock detection is
>>>> forced on.
>>>
>>> Of course KVM can do this. But this missed the point. Intel added a new CPU
>>> feature, complete with an enumeration mechanism, that cannot be correctly
>>> used if a hypervisor is present.
>>
>> Why it cannot be correctly used if a hypervisor is present? Because it needs
>> to disable split lock detection when running a vcpu for guest as this patch
>> wants to do?
>
> Because SMT. Unless vCPUs are pinned 1:1 with pCPUs, and the guest is
> given an accurate topology, disabling/enabling split-lock #AC may (or may
> not) also disable/enable split-lock #AC on a random vCPU in the guest.
>
>>> As it stands, without specific hypervisor and guest support of a non-Intel
>>> interface, it is *impossible* to give architecturally correct behavior to a
>>> guest. If KVM implements your suggestion, *Windows* guests will still
>>> malfunction on Linux.
>>
>> Actually, KVM don't need to implement my suggestion. It can just virtualize
>> and expose this feature (MSR_IA32_CORE_CAPABILITIES and MSR_TEST_CTRL) to
>> guest, (but it may have some requirement that HT is disabled and host is
>> sld_off) then guest can use it architecturally.
>
> This is essentially what I proposed a while back. KVM would allow enabling
> split-lock #AC in the guest if and only if SMT is disabled or the enable bit
> is per-thread, *or* the host is in "warn" mode (can live with split-lock #AC
> being randomly disabled/enabled) and userspace has communicated to KVM that
> it is pinning vCPUs.

How about covering the actual sensible case: host is set to fatal? In this mode, the guest gets split lock detection whether it wants it or not. How do we communicate this to the guest?

2020-01-31 21:05:46

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: VMX: Extend VMX's #AC handding

On Fri, Jan 31, 2020 at 12:57:51PM -0800, Andy Lutomirski wrote:
>
> > On Jan 31, 2020, at 12:18 PM, Sean Christopherson <[email protected]> wrote:
> >
> > This is essentially what I proposed a while back. KVM would allow enabling
> > split-lock #AC in the guest if and only if SMT is disabled or the enable bit
> > is per-thread, *or* the host is in "warn" mode (can live with split-lock #AC
> > being randomly disabled/enabled) and userspace has communicated to KVM that
> > it is pinning vCPUs.
>
> How about covering the actual sensible case: host is set to fatal? In this
> mode, the guest gets split lock detection whether it wants it or not. How do
> we communicate this to the guest?

KVM doesn't advertise split-lock #AC to the guest and returns -EFAULT to the
userspace VMM if the guest triggers a split-lock #AC.

Effectively the same behavior as any other userspace process, just that KVM
explicitly returns -EFAULT instead of the process getting a SIGBUS.

2020-01-31 21:34:28

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: VMX: Extend VMX's #AC handding



> On Jan 31, 2020, at 1:04 PM, Sean Christopherson <[email protected]> wrote:
>
> On Fri, Jan 31, 2020 at 12:57:51PM -0800, Andy Lutomirski wrote:
>>
>>>> On Jan 31, 2020, at 12:18 PM, Sean Christopherson <[email protected]> wrote:
>>>
>>> This is essentially what I proposed a while back. KVM would allow enabling
>>> split-lock #AC in the guest if and only if SMT is disabled or the enable bit
>>> is per-thread, *or* the host is in "warn" mode (can live with split-lock #AC
>>> being randomly disabled/enabled) and userspace has communicated to KVM that
>>> it is pinning vCPUs.
>>
>> How about covering the actual sensible case: host is set to fatal? In this
>> mode, the guest gets split lock detection whether it wants it or not. How do
>> we communicate this to the guest?
>
> KVM doesn't advertise split-lock #AC to the guest and returns -EFAULT to the
> userspace VMM if the guest triggers a split-lock #AC.
>
> Effectively the same behavior as any other userspace process, just that KVM
> explicitly returns -EFAULT instead of the process getting a SIGBUS.


Which helps how if the guest is actually SLD-aware?

I suppose we could make the argument that, if an SLD-aware guest gets #AC at CPL0, it’s a bug, but it still seems rather nicer to forward the #AC to the guest instead of summarily killing it.

ISTM, on an SLD-fatal host with an SLD-aware guest, the host should tell the guest “hey, you may not do split locks — SLD is forced on” and the guest should somehow acknowledge it so that it sees the architectural behavior instead of something we made up. Hence my suggestion.

2020-02-01 17:00:44

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: VMX: Extend VMX's #AC handding

On 2/1/2020 5:33 AM, Andy Lutomirski wrote:
>
>
>> On Jan 31, 2020, at 1:04 PM, Sean Christopherson <[email protected]> wrote:
>>
>> On Fri, Jan 31, 2020 at 12:57:51PM -0800, Andy Lutomirski wrote:
>>>
>>>>> On Jan 31, 2020, at 12:18 PM, Sean Christopherson <[email protected]> wrote:
>>>>
>>>> This is essentially what I proposed a while back. KVM would allow enabling
>>>> split-lock #AC in the guest if and only if SMT is disabled or the enable bit
>>>> is per-thread, *or* the host is in "warn" mode (can live with split-lock #AC
>>>> being randomly disabled/enabled) and userspace has communicated to KVM that
>>>> it is pinning vCPUs.
>>>
>>> How about covering the actual sensible case: host is set to fatal? In this
>>> mode, the guest gets split lock detection whether it wants it or not. How do
>>> we communicate this to the guest?
>>
>> KVM doesn't advertise split-lock #AC to the guest and returns -EFAULT to the
>> userspace VMM if the guest triggers a split-lock #AC.
>>
>> Effectively the same behavior as any other userspace process, just that KVM
>> explicitly returns -EFAULT instead of the process getting a SIGBUS.
>
>
> Which helps how if the guest is actually SLD-aware?
>
> I suppose we could make the argument that, if an SLD-aware guest gets #AC at CPL0, it’s a bug, but it still seems rather nicer to forward the #AC to the guest instead of summarily killing it.

If KVM does advertise split-lock detection to the guest, then kvm/host
can know whether a guest is SLD-aware by checking guest's
MSR_TEST_CTRL.SPLIT_LOCK_DETECT bit.

- If guest's MSR_TEST_CTRL.SPLIT_LOCK_DETECT is set, it indicates
guest is SLD-aware so KVM forwards #AC to guest.

- If not set. It may be a old guest or a malicious guest or a guest
without SLD support, and we cannot figure it out. So we have to kill the
guest when host is SLD-fatal, and let guest survive when SLD-WARN for
old sane buggy guest.

In a word, all the above is on the condition that KVM advertise
split-lock detection to guest. But this patch doesn't do this. Maybe I
should add that part in v2.

> ISTM, on an SLD-fatal host with an SLD-aware guest, the host should tell the guest “hey, you may not do split locks — SLD is forced on” and the guest should somehow acknowledge it so that it sees the architectural behavior instead of something we made up. Hence my suggestion.
>

2020-02-01 17:57:28

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: VMX: Extend VMX's #AC handding



> On Feb 1, 2020, at 8:58 AM, Xiaoyao Li <[email protected]> wrote:
>
> On 2/1/2020 5:33 AM, Andy Lutomirski wrote:
>>>> On Jan 31, 2020, at 1:04 PM, Sean Christopherson <[email protected]> wrote:
>>>
>>> On Fri, Jan 31, 2020 at 12:57:51PM -0800, Andy Lutomirski wrote:
>>>>
>>>>>> On Jan 31, 2020, at 12:18 PM, Sean Christopherson <[email protected]> wrote:
>>>>>
>>>>> This is essentially what I proposed a while back. KVM would allow enabling
>>>>> split-lock #AC in the guest if and only if SMT is disabled or the enable bit
>>>>> is per-thread, *or* the host is in "warn" mode (can live with split-lock #AC
>>>>> being randomly disabled/enabled) and userspace has communicated to KVM that
>>>>> it is pinning vCPUs.
>>>>
>>>> How about covering the actual sensible case: host is set to fatal? In this
>>>> mode, the guest gets split lock detection whether it wants it or not. How do
>>>> we communicate this to the guest?
>>>
>>> KVM doesn't advertise split-lock #AC to the guest and returns -EFAULT to the
>>> userspace VMM if the guest triggers a split-lock #AC.
>>>
>>> Effectively the same behavior as any other userspace process, just that KVM
>>> explicitly returns -EFAULT instead of the process getting a SIGBUS.
>> Which helps how if the guest is actually SLD-aware?
>> I suppose we could make the argument that, if an SLD-aware guest gets #AC at CPL0, it’s a bug, but it still seems rather nicer to forward the #AC to the guest instead of summarily killing it.
>
> If KVM does advertise split-lock detection to the guest, then kvm/host can know whether a guest is SLD-aware by checking guest's MSR_TEST_CTRL.SPLIT_LOCK_DETECT bit.
>
> - If guest's MSR_TEST_CTRL.SPLIT_LOCK_DETECT is set, it indicates guest is SLD-aware so KVM forwards #AC to guest.
>

I disagree. If you advertise split-lock detection with the current core capability bit, it should *work*. And it won’t. The choices you’re actually giving the guest are:

a) Guest understands SLD and wants it on. The guest gets the same behavior as in bare metal.

b) Guest does not understand. Guest gets killed if it screws up as described below.

> - If not set. It may be a old guest or a malicious guest or a guest without SLD support, and we cannot figure it out. So we have to kill the guest when host is SLD-fatal, and let guest survive when SLD-WARN for old sane buggy guest.

All true, but the result of running a Linux guest in SLD-warn mode will be broken.

>
> In a word, all the above is on the condition that KVM advertise split-lock detection to guest. But this patch doesn't do this. Maybe I should add that part in v2.

I think you should think the details all the way through, and I think you’re likely to determine that the Intel architecture team needs to do *something* to clean up this mess.

There are two independent problems here. First, SLD *can’t* be virtualized sanely because it’s per-core not per-thread. Second, most users *won’t want* to virtualize it correctly even if they could: if a guest is allowed to do split locks, it can DoS the system.

So I think there should be an architectural way to tell a guest that SLD is on whether it likes it or not. And the guest, if booted with sld=warn, can print a message saying “haha, actually SLD is fatal” and carry on.

>
>> ISTM, on an SLD-fatal host with an SLD-aware guest, the host should tell the guest “hey, you may not do split locks — SLD is forced on” and the guest should somehow acknowledge it so that it sees the architectural behavior instead of something we made up. Hence my suggestion.
>

2020-02-02 04:40:06

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: VMX: Extend VMX's #AC handding

On 2/2/2020 1:56 AM, Andy Lutomirski wrote:
>
>
>> On Feb 1, 2020, at 8:58 AM, Xiaoyao Li <[email protected]> wrote:
>>
>> On 2/1/2020 5:33 AM, Andy Lutomirski wrote:
>>>>> On Jan 31, 2020, at 1:04 PM, Sean Christopherson <[email protected]> wrote:
>>>>
>>>> On Fri, Jan 31, 2020 at 12:57:51PM -0800, Andy Lutomirski wrote:
>>>>>
>>>>>>> On Jan 31, 2020, at 12:18 PM, Sean Christopherson <[email protected]> wrote:
>>>>>>
>>>>>> This is essentially what I proposed a while back. KVM would allow enabling
>>>>>> split-lock #AC in the guest if and only if SMT is disabled or the enable bit
>>>>>> is per-thread, *or* the host is in "warn" mode (can live with split-lock #AC
>>>>>> being randomly disabled/enabled) and userspace has communicated to KVM that
>>>>>> it is pinning vCPUs.
>>>>>
>>>>> How about covering the actual sensible case: host is set to fatal? In this
>>>>> mode, the guest gets split lock detection whether it wants it or not. How do
>>>>> we communicate this to the guest?
>>>>
>>>> KVM doesn't advertise split-lock #AC to the guest and returns -EFAULT to the
>>>> userspace VMM if the guest triggers a split-lock #AC.
>>>>
>>>> Effectively the same behavior as any other userspace process, just that KVM
>>>> explicitly returns -EFAULT instead of the process getting a SIGBUS.
>>> Which helps how if the guest is actually SLD-aware?
>>> I suppose we could make the argument that, if an SLD-aware guest gets #AC at CPL0, it’s a bug, but it still seems rather nicer to forward the #AC to the guest instead of summarily killing it.
>>
>> If KVM does advertise split-lock detection to the guest, then kvm/host can know whether a guest is SLD-aware by checking guest's MSR_TEST_CTRL.SPLIT_LOCK_DETECT bit.
>>
>> - If guest's MSR_TEST_CTRL.SPLIT_LOCK_DETECT is set, it indicates guest is SLD-aware so KVM forwards #AC to guest.
>>
>
> I disagree. If you advertise split-lock detection with the current core capability bit, it should *work*. And it won’t. The choices you’re actually giving the guest are:
>
> a) Guest understands SLD and wants it on. The guest gets the same behavior as in bare metal.
>
> b) Guest does not understand. Guest gets killed if it screws up as described below.
>
>> - If not set. It may be a old guest or a malicious guest or a guest without SLD support, and we cannot figure it out. So we have to kill the guest when host is SLD-fatal, and let guest survive when SLD-WARN for old sane buggy guest.
>
> All true, but the result of running a Linux guest in SLD-warn mode will be broken.
>
>>
>> In a word, all the above is on the condition that KVM advertise split-lock detection to guest. But this patch doesn't do this. Maybe I should add that part in v2.
>
> I think you should think the details all the way through, and I think you’re likely to determine that the Intel architecture team needs to do *something* to clean up this mess.
>
> There are two independent problems here. First, SLD *can’t* be virtualized sanely because it’s per-core not per-thread.

Sadly, it's the fact we cannot change. So it's better virtualized only
when SMT is disabled to make thing simple.

> Second, most users *won’t want* to virtualize it correctly even if they could: if a guest is allowed to do split locks, it can DoS the system.

To avoid DoS attack, it must use sld_fatal mode. In this case, guest are
forbidden to do split locks.

> So I think there should be an architectural way to tell a guest that SLD is on whether it likes it or not. And the guest, if booted with sld=warn, can print a message saying “haha, actually SLD is fatal” and carry on.

OK. Let me sort it out.

If SMT is disabled/unsupported, so KVM advertises SLD feature to guest.
below are all the case:

-----------------------------------------------------------------------
Host Guest Guest behavior
-----------------------------------------------------------------------
1. off same as in bare metal
-----------------------------------------------------------------------
2. warn off allow guest do split lock (for old guest):
hardware bit set initially, once split lock
happens, clear hardware bit when vcpu is running
So, it's the same as in bare metal

3. warn 1. user space: get #AC, then clear MSR bit, but
hardware bit is not cleared, #AC again, finally
clear hardware bit when vcpu is running.
So it's somehow the same as in bare-metal

2. kernel: same as in bare metal.

4. fatal same as in bare metal
----------------------------------------------------------------------
5.fatal off guest is killed when split lock,
or forward #AC to guest, this way guest gets an
unexpected #AC

6. warn 1. user space: get #AC, then clear MSR bit, but
hardware bit is not cleared, #AC again,
finally guest is killed, or KVM forwards #AC
to guest then guest gets an unexpected #AC.
2. kernel: same as in bare metal, call die();

7. fatal same as in bare metal
----------------------------------------------------------------------

Based on the table above, if we want guest has same behavior as in bare
metal, we can set host to sld_warn mode.
If we want prevent DoS from guest, we should set host to sld_fatal mode.


Now, let's analysis what if there is an architectural way to tell a
guest that SLD is forced on. Assume it's a SLD_forced_on cpuid bit.

- Host is sld_off, SLD_forced_on cpuid bit is not set, no change for
case #1

- Host is sld_fatal, SLD_forced_on cpuid bit must be set:
- if guest is SLD-aware, guest is supposed to only use fatal
mode that goes to case #7. And guest is not recommended
using warn mode. if guest persists, it goes to case #6

- if guest is not SLD-aware, maybe it's an old guest or it's a
malicious guest that pretends not SLD-aware, it goes to
case #5.

- Host is sld_warn, we have two choice
- set SLD_forced_on cpuid bit, it's the same as host is fatal.
- not set SLD_force_on_cpuid bit, it's the same as case #2,#3,#4

So I think introducing an architectural way to tell a guest that SLD is
forced on can make the only difference is that, there is a way to tell
guest not to use warn mode, thus eliminating case #6.

If you think it really matters, I can forward this requirement to our
Intel architecture people.
>>
>>> ISTM, on an SLD-fatal host with an SLD-aware guest, the host should tell the guest “hey, you may not do split locks — SLD is forced on” and the guest should somehow acknowledge it so that it sees the architectural behavior instead of something we made up. Hence my suggestion.
>>

2020-02-03 19:17:18

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: VMX: Extend VMX's #AC handding

On Sat, Feb 1, 2020 at 8:33 PM Xiaoyao Li <[email protected]> wrote:
>
> On 2/2/2020 1:56 AM, Andy Lutomirski wrote:
> >
> >
> > There are two independent problems here. First, SLD *can’t* be virtualized sanely because it’s per-core not per-thread.
>
> Sadly, it's the fact we cannot change. So it's better virtualized only
> when SMT is disabled to make thing simple.
>
> > Second, most users *won’t want* to virtualize it correctly even if they could: if a guest is allowed to do split locks, it can DoS the system.
>
> To avoid DoS attack, it must use sld_fatal mode. In this case, guest are
> forbidden to do split locks.
>
> > So I think there should be an architectural way to tell a guest that SLD is on whether it likes it or not. And the guest, if booted with sld=warn, can print a message saying “haha, actually SLD is fatal” and carry on.
>
> OK. Let me sort it out.
>
> If SMT is disabled/unsupported, so KVM advertises SLD feature to guest.
> below are all the case:
>
> -----------------------------------------------------------------------
> Host Guest Guest behavior
> -----------------------------------------------------------------------
> 1. off same as in bare metal
> -----------------------------------------------------------------------
> 2. warn off allow guest do split lock (for old guest):
> hardware bit set initially, once split lock
> happens, clear hardware bit when vcpu is running
> So, it's the same as in bare metal
>
> 3. warn 1. user space: get #AC, then clear MSR bit, but
> hardware bit is not cleared, #AC again, finally
> clear hardware bit when vcpu is running.
> So it's somehow the same as in bare-metal

Well, kind of, except that the warning is inaccurate -- there is no
guarantee that the hardware bit will be set at all when the guest is
running. This doesn't sound *that* bad, but it does mean that the
guest doesn't get the degree of DoS protection it thinks it's getting.

My inclination is that, the host mode is warn, then SLD should not be
exposed to the guest at all and the host should fully handle it.

>
> 2. kernel: same as in bare metal.
>
> 4. fatal same as in bare metal
> ----------------------------------------------------------------------
> 5.fatal off guest is killed when split lock,
> or forward #AC to guest, this way guest gets an
> unexpected #AC

Killing the guest seems like the right choice. But see below -- this
is not ideal if the guest is new.

>
> 6. warn 1. user space: get #AC, then clear MSR bit, but
> hardware bit is not cleared, #AC again,
> finally guest is killed, or KVM forwards #AC
> to guest then guest gets an unexpected #AC.
> 2. kernel: same as in bare metal, call die();
>
> 7. fatal same as in bare metal
> ----------------------------------------------------------------------
>
> Based on the table above, if we want guest has same behavior as in bare
> metal, we can set host to sld_warn mode.

I don't think this is correct. If the host is in warn mode, then the
guest behavior will be erratic. I'm not sure it makes sense for KVM
to expose such erratic behavior to the guest.

> If we want prevent DoS from guest, we should set host to sld_fatal mode.
>
>
> Now, let's analysis what if there is an architectural way to tell a
> guest that SLD is forced on. Assume it's a SLD_forced_on cpuid bit.
>
> - Host is sld_off, SLD_forced_on cpuid bit is not set, no change for
> case #1

Agreed.

>
> - Host is sld_fatal, SLD_forced_on cpuid bit must be set:
> - if guest is SLD-aware, guest is supposed to only use fatal
> mode that goes to case #7. And guest is not recommended
> using warn mode. if guest persists, it goes to case #6

Agreed, although I'm not sure I fully understand your #6 suggestion.
If the host is fatal and the guest is SLD-aware, then #AC should be
forwarded to the guest and the guest should act as if sld is fatal.
If the guest is booted with sld=off or sld=warn, the guest should log
a message saying that the configuration is unsupported and act as
though sld=fatal.

>
> - if guest is not SLD-aware, maybe it's an old guest or it's a
> malicious guest that pretends not SLD-aware, it goes to
> case #5.

By case 5 you mean kill the guest?

>
> - Host is sld_warn, we have two choice
> - set SLD_forced_on cpuid bit, it's the same as host is fatal.
> - not set SLD_force_on_cpuid bit, it's the same as case #2,#3,#4

I think the right solution is to treat the guest just like host
usermode: guest never gets killed and never gets #AC. Instead the
*host* logs a warning.

>
> So I think introducing an architectural way to tell a guest that SLD is
> forced on can make the only difference is that, there is a way to tell
> guest not to use warn mode, thus eliminating case #6.

It also tells the guest not to use off mode. Also, we don't know what
non-Linux guests are going to do.

2020-02-04 06:05:51

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: VMX: Extend VMX's #AC handding

On Mon, Feb 03, 2020 at 10:49:50AM -0800, Andy Lutomirski wrote:
> On Sat, Feb 1, 2020 at 8:33 PM Xiaoyao Li <[email protected]> wrote:
> >
> > On 2/2/2020 1:56 AM, Andy Lutomirski wrote:
> > >
> > >
> > > There are two independent problems here. First, SLD *can’t* be
> > > virtualized sanely because it’s per-core not per-thread.
> >
> > Sadly, it's the fact we cannot change. So it's better virtualized only when
> > SMT is disabled to make thing simple.
> >
> > > Second, most users *won’t want* to virtualize it correctly even if they
> > > could: if a guest is allowed to do split locks, it can DoS the system.
> >
> > To avoid DoS attack, it must use sld_fatal mode. In this case, guest are
> > forbidden to do split locks.
> >
> > > So I think there should be an architectural way to tell a guest that SLD
> > > is on whether it likes it or not. And the guest, if booted with sld=warn,
> > > can print a message saying “haha, actually SLD is fatal” and carry on.

Ya, but to make it architectural it needs to be actual hardware behavior.
I highly doubt we can get explicit documentation in the SDM regarding the
behavior of a hypervisor. E.g. the "official" hypervisor CPUID bit and
CPUID range is documented in the SDM as simply being reserved until the
end of time. Getting a bit reserved in the SDM does us no good as VMM
handling of the bit would still be determined by convention.

But, getting something in the SDM would serve our purposes, even if it's
too late to get it implemented for the first CPUs that support SLD. It
would, in theory, require kernels to be prepared to handle a sticky SLD
bit and define a common way for VMMs to virtualize the behavior.

A sticky/lock bit in the MSR is probably the easiest to implement in
ucode? That'd be easy for KVM to emulate, and then the kernel init code
would look something like:

static void split_lock_init(void)
{
u64 test_ctrl_val;

if (rdmsrl_safe(MSR_TEST_CTRL, &test_ctrl_val)) {
sld_state = sld_off;
return;
}

if (sld_state != sld_fatal &&
(test_ctrl_val & MSR_TEST_CTRL_LOCK_DETECT) &&
(test_ctrl_val & MSR_TEST_CTRL_LOCK_DETECT_STICKY)) {
pr_crit("haha, actually SLD is fatal\n");
sld_state = std_fatal;
return;
}

...
}

> > OK. Let me sort it out.
> >
> > If SMT is disabled/unsupported, so KVM advertises SLD feature to guest.
> > below are all the case:
> >
> > -----------------------------------------------------------------------
> > Host Guest Guest behavior
> > -----------------------------------------------------------------------
> > 1. off same as in bare metal
> > -----------------------------------------------------------------------
> > 2. warn off allow guest do split lock (for old guest):
> > hardware bit set initially, once split lock
> > happens, clear hardware bit when vcpu is running
> > So, it's the same as in bare metal
> >
> > 3. warn 1. user space: get #AC, then clear MSR bit, but
> > hardware bit is not cleared, #AC again, finally
> > clear hardware bit when vcpu is running.
> > So it's somehow the same as in bare-metal
>
> Well, kind of, except that the warning is inaccurate -- there is no
> guarantee that the hardware bit will be set at all when the guest is
> running. This doesn't sound *that* bad, but it does mean that the
> guest doesn't get the degree of DoS protection it thinks it's getting.
>
> My inclination is that, the host mode is warn, then SLD should not be
> exposed to the guest at all and the host should fully handle it.

KVM can expose it to the guest. KVM just needs to ensure SLD is turned
on prior to VM-Enter with vcpu->msr_test_ctrl.sld=1, which is easy enough.

> > 2. kernel: same as in bare metal.
> >
> > 4. fatal same as in bare metal
> > ----------------------------------------------------------------------
> > 5.fatal off guest is killed when split lock,
> > or forward #AC to guest, this way guest gets an
> > unexpected #AC
>
> Killing the guest seems like the right choice. But see below -- this
> is not ideal if the guest is new.
>
> >
> > 6. warn 1. user space: get #AC, then clear MSR bit, but
> > hardware bit is not cleared, #AC again,
> > finally guest is killed, or KVM forwards #AC
> > to guest then guest gets an unexpected #AC.
> > 2. kernel: same as in bare metal, call die();
> >
> > 7. fatal same as in bare metal
> > ----------------------------------------------------------------------
> >
> > Based on the table above, if we want guest has same behavior as in bare
> > metal, we can set host to sld_warn mode.
>
> I don't think this is correct. If the host is in warn mode, then the
> guest behavior will be erratic. I'm not sure it makes sense for KVM
> to expose such erratic behavior to the guest.

It's doable without introducing non-architectural behavior and without too
much pain on KVM's end.

https://lkml.kernel.org/r/[email protected]

> > If we want prevent DoS from guest, we should set host to sld_fatal mode.
> >
> >
> > Now, let's analysis what if there is an architectural way to tell a
> > guest that SLD is forced on. Assume it's a SLD_forced_on cpuid bit.
> >
> > - Host is sld_off, SLD_forced_on cpuid bit is not set, no change for
> > case #1

2020-02-04 14:48:30

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86: Emulate split-lock access as a write

Sean Christopherson <[email protected]> writes:

> Exiting to host userspace with "emulation failed" is the other reasonable
> alternative, but that's basically the same as killing the guest. We're
> arguing that, in the extremely unlikely event that there is a workload out
> there that hits this, it's preferable to *maybe* corrupt guest memory and
> log the anomaly in the kernel log, as opposed to outright killing the guest
> with a generic "emulation failed".
>

FWIW, if I was to cast a vote I'd pick 'kill the guest' one way or
another. "Maybe corrupt guest memory" scares me much more and in many
cases host and guest are different responsibility domains (think
'cloud').

--
Vitaly

2020-02-10 22:01:43

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86: Emulate split-lock access as a write

On Tue, Feb 04, 2020 at 03:47:15PM +0100, Vitaly Kuznetsov wrote:
> Sean Christopherson <[email protected]> writes:
>
> > Exiting to host userspace with "emulation failed" is the other reasonable
> > alternative, but that's basically the same as killing the guest. We're
> > arguing that, in the extremely unlikely event that there is a workload out
> > there that hits this, it's preferable to *maybe* corrupt guest memory and
> > log the anomaly in the kernel log, as opposed to outright killing the guest
> > with a generic "emulation failed".
> >
>
> FWIW, if I was to cast a vote I'd pick 'kill the guest' one way or
> another. "Maybe corrupt guest memory" scares me much more and in many
> cases host and guest are different responsibility domains (think
> 'cloud').

I'm ok with that route as well. What I don't want to do is add a bunch of
logic to inject #AC at this point.