Hi,
Legacy instruction VERW was overloaded by some processors to clear
micro-architectural CPU buffers as a mitigation for CPU bugs. This series
moves VERW execution to a later point in the exit-to-user path. This is
needed because in some cases it may be possible for kernel data to be
accessed after VERW in arch_exit_to_user_mode(). Such accesses may put
data into MDS-affected CPU buffers, for example:
1. Kernel data accessed by an NMI between VERW and return-to-user can
remain in CPU buffers (since NMI returning to kernel does not
execute VERW to clear CPU buffers).
2. Alyssa reported that after VERW is executed,
CONFIG_GCC_PLUGIN_STACKLEAK=y scrubs the stack used by a system
call. Memory accesses during stack scrubbing can move kernel stack
contents into CPU buffers.
3. When caller-saved registers are restored after a return from the
function executing VERW, the data from those kernel stack accesses can
remain in CPU buffers (since they occur after VERW).
Although these cases are less practical to exploit, moving VERW closer
to the ring transition reduces the attack surface.
Overview of the series:
Patch 1: Prepares VERW macros for use in asm.
Patch 2: Adds macros to 64-bit entry/exit points.
Patch 3: Adds macros to 32-bit entry/exit points.
Patch 4: Enables the new macros.
Patch 5: Cleans up C implementation.
Patch 6: Adds macro to VMenter.
Below is some performance data collected on a Skylake client, compared
with the previous implementation:
Baseline: v6.6-rc5
| Test | Configuration | Relative |
| ------------------ | ---------------------- | -------- |
| build-linux-kernel | defconfig | 1.00 |
| hackbench | 32 - Process | 1.02 |
| nginx | Short Connection - 500 | 1.01 |
Signed-off-by: Pawan Gupta <[email protected]>
---
Pawan Gupta (6):
x86/bugs: Add asm helpers for executing VERW
x86/entry_64: Add VERW just before userspace transition
x86/entry_32: Add VERW just before userspace transition
x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
x86/bugs: Cleanup mds_user_clear
KVM: VMX: Move VERW closer to VMentry for MDS mitigation
Documentation/arch/x86/mds.rst | 20 +++++++++----------
arch/x86/entry/entry_32.S | 8 ++++++++
arch/x86/entry/entry_64.S | 14 ++++++++++++++
arch/x86/entry/entry_64_compat.S | 2 ++
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/include/asm/entry-common.h | 1 -
arch/x86/include/asm/nospec-branch.h | 37 ++++++++++++++++++++++++------------
arch/x86/kernel/cpu/bugs.c | 13 +++++--------
arch/x86/kernel/nmi.c | 2 --
arch/x86/kvm/vmx/vmenter.S | 9 +++++++++
arch/x86/kvm/vmx/vmx.c | 10 +++++++---
11 files changed, 80 insertions(+), 38 deletions(-)
---
base-commit: 58720809f52779dc0f08e53e54b014209d13eebb
change-id: 20231011-delay-verw-d0474986b2c3
Best regards,
--
Thanks,
Pawan
The VERW mitigation at exit-to-user is enabled via the static branch
mds_user_clear. This static branch is never toggled after boot, and can
be safely replaced with an ALTERNATIVE(), which is convenient to use in
asm.
Switch to ALTERNATIVE() so the VERW mitigation can be used late in the
exit-to-user path. Also remove the now-redundant VERW in exc_nmi() and
arch_exit_to_user_mode().
Signed-off-by: Pawan Gupta <[email protected]>
---
Documentation/arch/x86/mds.rst | 20 +++++++++-----------
arch/x86/include/asm/entry-common.h | 1 -
arch/x86/include/asm/nospec-branch.h | 11 -----------
arch/x86/kernel/cpu/bugs.c | 10 +++++-----
arch/x86/kernel/nmi.c | 2 --
arch/x86/kvm/vmx/vmx.c | 2 +-
6 files changed, 15 insertions(+), 31 deletions(-)
diff --git a/Documentation/arch/x86/mds.rst b/Documentation/arch/x86/mds.rst
index e73fdff62c0a..05090f33f65e 100644
--- a/Documentation/arch/x86/mds.rst
+++ b/Documentation/arch/x86/mds.rst
@@ -95,6 +95,9 @@ The kernel provides a function to invoke the buffer clearing:
mds_clear_cpu_buffers()
+Also, the macro USER_CLEAR_CPU_BUFFERS is meant to be used in asm late in
+the exit-to-user path, and works for cases where GPRs can't be clobbered.
+
The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
(idle) transitions.
@@ -138,18 +141,13 @@ Mitigation points
When transitioning from kernel to user space the CPU buffers are flushed
on affected CPUs when the mitigation is not disabled on the kernel
- command line. The migitation is enabled through the static key
- mds_user_clear.
-
- The mitigation is invoked in prepare_exit_to_usermode() which covers
- all but one of the kernel to user space transitions. The exception
- is when we return from a Non Maskable Interrupt (NMI), which is
- handled directly in do_nmi().
-
- (The reason that NMI is special is that prepare_exit_to_usermode() can
- enable IRQs. In NMI context, NMIs are blocked, and we don't want to
- enable IRQs with NMIs blocked.)
+ command line. The mitigation is enabled through the feature flag
+ X86_FEATURE_USER_CLEAR_CPU_BUF.
+ The mitigation is invoked just before transitioning to userspace after
+ user registers are restored. This is done to minimize the window in
+ which kernel data could be accessed after VERW, e.g. by an NMI that
+ hits after VERW.
2. C-State transition
^^^^^^^^^^^^^^^^^^^^^
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index ce8f50192ae3..7e523bb3d2d3 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -91,7 +91,6 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
static __always_inline void arch_exit_to_user_mode(void)
{
- mds_user_clear_cpu_buffers();
amd_clear_divider();
}
#define arch_exit_to_user_mode arch_exit_to_user_mode
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index e1b623a27e1b..501c26ecd722 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -595,17 +595,6 @@ static __always_inline void mds_clear_cpu_buffers(void)
asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
}
-/**
- * mds_user_clear_cpu_buffers - Mitigation for MDS and TAA vulnerability
- *
- * Clear CPU buffers if the corresponding static key is enabled
- */
-static __always_inline void mds_user_clear_cpu_buffers(void)
-{
- if (static_branch_likely(&mds_user_clear))
- mds_clear_cpu_buffers();
-}
-
/**
* mds_idle_clear_cpu_buffers - Mitigation for MDS vulnerability
*
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 10499bcd4e39..75fc0a70877f 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -252,7 +252,7 @@ static void __init mds_select_mitigation(void)
if (!boot_cpu_has(X86_FEATURE_MD_CLEAR))
mds_mitigation = MDS_MITIGATION_VMWERV;
- static_branch_enable(&mds_user_clear);
+ setup_force_cpu_cap(X86_FEATURE_USER_CLEAR_CPU_BUF);
if (!boot_cpu_has(X86_BUG_MSBDS_ONLY) &&
(mds_nosmt || cpu_mitigations_auto_nosmt()))
@@ -356,7 +356,7 @@ static void __init taa_select_mitigation(void)
* For guests that can't determine whether the correct microcode is
* present on host, enable the mitigation for UCODE_NEEDED as well.
*/
- static_branch_enable(&mds_user_clear);
+ setup_force_cpu_cap(X86_FEATURE_USER_CLEAR_CPU_BUF);
if (taa_nosmt || cpu_mitigations_auto_nosmt())
cpu_smt_disable(false);
@@ -424,7 +424,7 @@ static void __init mmio_select_mitigation(void)
*/
if (boot_cpu_has_bug(X86_BUG_MDS) || (boot_cpu_has_bug(X86_BUG_TAA) &&
boot_cpu_has(X86_FEATURE_RTM)))
- static_branch_enable(&mds_user_clear);
+ setup_force_cpu_cap(X86_FEATURE_USER_CLEAR_CPU_BUF);
else
static_branch_enable(&mmio_stale_data_clear);
@@ -484,11 +484,11 @@ static void __init md_clear_update_mitigation(void)
if (cpu_mitigations_off())
return;
- if (!static_key_enabled(&mds_user_clear))
+ if (!boot_cpu_has(X86_FEATURE_USER_CLEAR_CPU_BUF))
goto out;
/*
- * mds_user_clear is now enabled. Update MDS, TAA and MMIO Stale Data
+ * X86_FEATURE_USER_CLEAR_CPU_BUF is now enabled. Update MDS, TAA and MMIO Stale Data
* mitigation, if necessary.
*/
if (mds_mitigation == MDS_MITIGATION_OFF &&
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index a0c551846b35..ebfff8dca661 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -551,8 +551,6 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
if (this_cpu_dec_return(nmi_state))
goto nmi_restart;
- if (user_mode(regs))
- mds_user_clear_cpu_buffers();
if (IS_ENABLED(CONFIG_NMI_CHECK_CPU)) {
WRITE_ONCE(nsp->idt_seq, nsp->idt_seq + 1);
WARN_ON_ONCE(nsp->idt_seq & 0x1);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 72e3943f3693..c16297a49e4d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7229,7 +7229,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
/* L1D Flush includes CPU buffer clear to mitigate MDS */
if (static_branch_unlikely(&vmx_l1d_should_flush))
vmx_l1d_flush(vcpu);
- else if (static_branch_unlikely(&mds_user_clear))
+ else if (cpu_feature_enabled(X86_FEATURE_USER_CLEAR_CPU_BUF))
mds_clear_cpu_buffers();
else if (static_branch_unlikely(&mmio_stale_data_clear) &&
kvm_arch_has_assigned_device(vcpu->kvm))
--
2.34.1
As done for entry_64, add support for executing VERW late in exit to
user path for 32-bit mode.
Signed-off-by: Pawan Gupta <[email protected]>
---
arch/x86/entry/entry_32.S | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 6e6af42e044a..bbf77d2aab2e 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -886,6 +886,9 @@ SYM_FUNC_START(entry_SYSENTER_32)
popfl
popl %eax
+ /* Mitigate CPU data sampling attacks .e.g. MDS */
+ USER_CLEAR_CPU_BUFFERS
+
/*
* Return back to the vDSO, which will pop ecx and edx.
* Don't bother with DS and ES (they already contain __USER_DS).
@@ -954,6 +957,9 @@ restore_all_switch_stack:
/* Restore user state */
RESTORE_REGS pop=4 # skip orig_eax/error_code
+
+ /* Mitigate CPU data sampling attacks .e.g. MDS */
+ USER_CLEAR_CPU_BUFFERS
.Lirq_return:
/*
* ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
@@ -1146,6 +1152,8 @@ SYM_CODE_START(asm_exc_nmi)
/* Not on SYSENTER stack. */
call exc_nmi
+ /* Mitigate CPU data sampling attacks .e.g. MDS */
+ USER_CLEAR_CPU_BUFFERS
jmp .Lnmi_return
.Lnmi_from_sysenter_stack:
--
2.34.1
The MDS mitigation requires clearing CPU buffers before returning to
user. This needs to be done late in the exit-to-user path. The current
location of VERW leaves a possibility of kernel data ending up in CPU
buffers due to memory accesses done after VERW, such as:
1. Kernel data accessed by an NMI between VERW and return-to-user can
remain in CPU buffers (since an NMI returning to kernel does not
execute VERW to clear CPU buffers).
2. Alyssa reported that after VERW is executed,
CONFIG_GCC_PLUGIN_STACKLEAK=y scrubs the stack used by a system
call. Memory accesses during stack scrubbing can move kernel stack
contents into CPU buffers.
3. When caller-saved registers are restored after a return from the
function executing VERW, the data from those kernel stack accesses can
remain in CPU buffers (since they occur after VERW).
To fix this, VERW needs to be moved very late in the exit-to-user path.
In preparation for moving VERW to entry/exit asm code, create macros
that can be used in asm. Also make them depend on a new feature flag
X86_FEATURE_USER_CLEAR_CPU_BUF.
Reported-by: Alyssa Milburn <[email protected]>
Signed-off-by: Pawan Gupta <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/include/asm/nospec-branch.h | 25 +++++++++++++++++++++++++
2 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 58cb9495e40f..3f018dfedd5f 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -308,10 +308,10 @@
#define X86_FEATURE_SMBA (11*32+21) /* "" Slow Memory Bandwidth Allocation */
#define X86_FEATURE_BMEC (11*32+22) /* "" Bandwidth Monitoring Event Configuration */
#define X86_FEATURE_USER_SHSTK (11*32+23) /* Shadow stack support for user mode applications */
-
#define X86_FEATURE_SRSO (11*32+24) /* "" AMD BTB untrain RETs */
#define X86_FEATURE_SRSO_ALIAS (11*32+25) /* "" AMD BTB untrain RETs through aliasing */
#define X86_FEATURE_IBPB_ON_VMEXIT (11*32+26) /* "" Issue an IBPB only on VMEXIT */
+#define X86_FEATURE_USER_CLEAR_CPU_BUF (11*32+27) /* "" Clear CPU buffers before returning to user */
/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index c55cc243592e..e1b623a27e1b 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -111,6 +111,24 @@
#define RESET_CALL_DEPTH_FROM_CALL
#endif
+/*
+ * Macro to execute VERW instruction to mitigate transient data sampling
+ * attacks such as MDS. On affected systems a microcode update overloaded VERW
+ * instruction to also clear the CPU buffers.
+ *
+ * Note: Only the memory operand variant of VERW clears the CPU buffers. To
+ * handle the case when VERW is executed after user registers are restored, use
+ * RIP to point the memory operand to a part NOPL instruction that contains
+ * __KERNEL_DS.
+ */
+#define __EXEC_VERW(m) verw _ASM_RIP(m)
+
+#define EXEC_VERW \
+ __EXEC_VERW(551f); \
+ /* nopl __KERNEL_DS(%rax) */ \
+ .byte 0x0f, 0x1f, 0x80, 0x00, 0x00; \
+551: .word __KERNEL_DS; \
+
/*
* Fill the CPU return stack buffer.
*
@@ -329,6 +347,13 @@
#endif
.endm
+/* Clear CPU buffers before returning to user */
+.macro USER_CLEAR_CPU_BUFFERS
+ ALTERNATIVE "jmp .Lskip_verw_\@;", "", X86_FEATURE_USER_CLEAR_CPU_BUF
+ EXEC_VERW
+.Lskip_verw_\@:
+.endm
+
#else /* __ASSEMBLY__ */
#define ANNOTATE_RETPOLINE_SAFE \
--
2.34.1
There are no more users of mds_user_clear static key, remove it.
Signed-off-by: Pawan Gupta <[email protected]>
---
arch/x86/include/asm/nospec-branch.h | 1 -
arch/x86/kernel/cpu/bugs.c | 3 ---
2 files changed, 4 deletions(-)
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 501c26ecd722..520689325014 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -563,7 +563,6 @@ DECLARE_STATIC_KEY_FALSE(switch_to_cond_stibp);
DECLARE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
DECLARE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
-DECLARE_STATIC_KEY_FALSE(mds_user_clear);
DECLARE_STATIC_KEY_FALSE(mds_idle_clear);
DECLARE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 75fc0a70877f..cf2181b45055 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -111,9 +111,6 @@ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
/* Control unconditional IBPB in switch_mm() */
DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
-/* Control MDS CPU buffer clear before returning to user space */
-DEFINE_STATIC_KEY_FALSE(mds_user_clear);
-EXPORT_SYMBOL_GPL(mds_user_clear);
/* Control MDS CPU buffer clear before idling (halt, mwait) */
DEFINE_STATIC_KEY_FALSE(mds_idle_clear);
EXPORT_SYMBOL_GPL(mds_idle_clear);
--
2.34.1
During VMentry, VERW is executed to mitigate MDS. After VERW, any memory
access, such as a register push onto the stack, may put host data into
MDS-affected CPU buffers. A guest can then use MDS to sample host data.
Although the likelihood of secrets surviving in registers at the current
VERW callsite is low, it can't be ruled out. Harden the MDS mitigation
by moving VERW later in the VMentry path.
Note that VERW for MMIO Stale Data mitigation is unchanged because of
the complexity of per-guest conditional VERW which is not easy to handle
that late in asm with no GPRs available. If the CPU is also affected by
MDS, VERW is unconditionally executed late in asm regardless of guest
having MMIO access.
Signed-off-by: Pawan Gupta <[email protected]>
---
arch/x86/kvm/vmx/vmenter.S | 9 +++++++++
arch/x86/kvm/vmx/vmx.c | 10 +++++++---
2 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index be275a0410a8..efa716cf4727 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -1,6 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/linkage.h>
#include <asm/asm.h>
+#include <asm/segment.h>
#include <asm/bitsperlong.h>
#include <asm/kvm_vcpu_regs.h>
#include <asm/nospec-branch.h>
@@ -31,6 +32,8 @@
#define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
#endif
+#define GUEST_CLEAR_CPU_BUFFERS USER_CLEAR_CPU_BUFFERS
+
.macro VMX_DO_EVENT_IRQOFF call_insn call_target
/*
* Unconditionally create a stack frame, getting the correct RSP on the
@@ -177,10 +180,16 @@ SYM_FUNC_START(__vmx_vcpu_run)
* the 'vmx_vmexit' label below.
*/
.Lvmresume:
+ /* Mitigate CPU data sampling attacks .e.g. MDS */
+ GUEST_CLEAR_CPU_BUFFERS
+
vmresume
jmp .Lvmfail
.Lvmlaunch:
+ /* Mitigate CPU data sampling attacks .e.g. MDS */
+ GUEST_CLEAR_CPU_BUFFERS
+
vmlaunch
jmp .Lvmfail
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c16297a49e4d..e3d0eda292c3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7226,13 +7226,17 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
guest_state_enter_irqoff();
- /* L1D Flush includes CPU buffer clear to mitigate MDS */
+ /*
+ * L1D Flush includes CPU buffer clear to mitigate MDS, but VERW
+ * mitigation for MDS is done late in VMentry and is still executed
+	 * despite the L1D Flush. This is because an extra VERW should not matter
+ * much after the big hammer L1D Flush.
+ */
if (static_branch_unlikely(&vmx_l1d_should_flush))
vmx_l1d_flush(vcpu);
- else if (cpu_feature_enabled(X86_FEATURE_USER_CLEAR_CPU_BUF))
- mds_clear_cpu_buffers();
else if (static_branch_unlikely(&mmio_stale_data_clear) &&
kvm_arch_has_assigned_device(vcpu->kvm))
+ /* MMIO mitigation is mutually exclusive to MDS mitigation later in asm */
mds_clear_cpu_buffers();
vmx_disable_fb_clear(vmx);
--
2.34.1
The mitigation for MDS is to use the VERW instruction to clear any
secrets in CPU buffers. Data from memory accesses done after VERW can
still end up in CPU buffers, so it is safer to execute VERW late in the
return-to-user path to minimize the window in which kernel data can end
up in CPU buffers.
There are not many kernel secrets to be had after SWITCH_TO_USER_CR3.
Add support for deploying the VERW mitigation after user register state
is restored. This helps minimize the chances of kernel data ending up in
CPU buffers after executing VERW.
Note that the mitigation at the new location is not yet enabled.
Suggested-by: Dave Hansen <[email protected]>
Signed-off-by: Pawan Gupta <[email protected]>
---
arch/x86/entry/entry_64.S | 14 ++++++++++++++
arch/x86/entry/entry_64_compat.S | 2 ++
2 files changed, 16 insertions(+)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 43606de22511..e72ac30f0714 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -223,6 +223,8 @@ syscall_return_via_sysret:
SYM_INNER_LABEL(entry_SYSRETQ_unsafe_stack, SYM_L_GLOBAL)
ANNOTATE_NOENDBR
swapgs
+ /* Mitigate CPU data sampling attacks .e.g. MDS */
+ USER_CLEAR_CPU_BUFFERS
sysretq
SYM_INNER_LABEL(entry_SYSRETQ_end, SYM_L_GLOBAL)
ANNOTATE_NOENDBR
@@ -663,6 +665,10 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
/* Restore RDI. */
popq %rdi
swapgs
+
+ /* Mitigate CPU data sampling attacks .e.g. MDS */
+ USER_CLEAR_CPU_BUFFERS
+
jmp .Lnative_iret
@@ -774,6 +780,9 @@ native_irq_return_ldt:
*/
popq %rax /* Restore user RAX */
+ /* Mitigate CPU data sampling attacks .e.g. MDS */
+ USER_CLEAR_CPU_BUFFERS
+
/*
* RSP now points to an ordinary IRET frame, except that the page
* is read-only and RSP[31:16] are preloaded with the userspace
@@ -1502,6 +1511,9 @@ nmi_restore:
std
movq $0, 5*8(%rsp) /* clear "NMI executing" */
+ /* Mitigate CPU data sampling attacks .e.g. MDS */
+ USER_CLEAR_CPU_BUFFERS
+
/*
* iretq reads the "iret" frame and exits the NMI stack in a
* single instruction. We are returning to kernel mode, so this
@@ -1520,6 +1532,8 @@ SYM_CODE_START(ignore_sysret)
UNWIND_HINT_END_OF_STACK
ENDBR
mov $-ENOSYS, %eax
+ /* Mitigate CPU data sampling attacks .e.g. MDS */
+ USER_CLEAR_CPU_BUFFERS
sysretl
SYM_CODE_END(ignore_sysret)
#endif
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 70150298f8bd..d2ccd9148239 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -271,6 +271,8 @@ SYM_INNER_LABEL(entry_SYSRETL_compat_unsafe_stack, SYM_L_GLOBAL)
xorl %r9d, %r9d
xorl %r10d, %r10d
swapgs
+ /* Mitigate CPU data sampling attacks .e.g. MDS */
+ USER_CLEAR_CPU_BUFFERS
sysretl
SYM_INNER_LABEL(entry_SYSRETL_compat_end, SYM_L_GLOBAL)
ANNOTATE_NOENDBR
--
2.34.1
On Fri, Oct 20, 2023, Pawan Gupta wrote:
> During VMentry VERW is executed to mitigate MDS. After VERW, any memory
> access like register push onto stack may put host data in MDS affected
> CPU buffers. A guest can then use MDS to sample host data.
>
> Although likelihood of secrets surviving in registers at current VERW
> callsite is less, but it can't be ruled out. Harden the MDS mitigation
> by moving the VERW mitigation late in VMentry path.
>
> Note that VERW for MMIO Stale Data mitigation is unchanged because of
> the complexity of per-guest conditional VERW which is not easy to handle
> that late in asm with no GPRs available. If the CPU is also affected by
> MDS, VERW is unconditionally executed late in asm regardless of guest
> having MMIO access.
>
> Signed-off-by: Pawan Gupta <[email protected]>
> ---
> arch/x86/kvm/vmx/vmenter.S | 9 +++++++++
> arch/x86/kvm/vmx/vmx.c | 10 +++++++---
> 2 files changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
> index be275a0410a8..efa716cf4727 100644
> --- a/arch/x86/kvm/vmx/vmenter.S
> +++ b/arch/x86/kvm/vmx/vmenter.S
> @@ -1,6 +1,7 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> #include <linux/linkage.h>
> #include <asm/asm.h>
> +#include <asm/segment.h>
> #include <asm/bitsperlong.h>
> #include <asm/kvm_vcpu_regs.h>
> #include <asm/nospec-branch.h>
> @@ -31,6 +32,8 @@
> #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
> #endif
>
> +#define GUEST_CLEAR_CPU_BUFFERS USER_CLEAR_CPU_BUFFERS
> +
> .macro VMX_DO_EVENT_IRQOFF call_insn call_target
> /*
> * Unconditionally create a stack frame, getting the correct RSP on the
> @@ -177,10 +180,16 @@ SYM_FUNC_START(__vmx_vcpu_run)
> * the 'vmx_vmexit' label below.
> */
> .Lvmresume:
> + /* Mitigate CPU data sampling attacks .e.g. MDS */
> + GUEST_CLEAR_CPU_BUFFERS
I have a very hard time believing that it's worth duplicating the mitigation
for VMRESUME vs. VMLAUNCH just to land it after a Jcc.
3b1: 48 8b 00 mov (%rax),%rax
3b4: 74 18 je 3ce <__vmx_vcpu_run+0x9e>
3b6: eb 0e jmp 3c6 <__vmx_vcpu_run+0x96>
3b8: 0f 00 2d 05 00 00 00 verw 0x5(%rip) # 3c4 <__vmx_vcpu_run+0x94>
3bf: 0f 1f 80 00 00 18 00 nopl 0x180000(%rax)
3c6: 0f 01 c3 vmresume
3c9: e9 c9 00 00 00 jmp 497 <vmx_vmexit+0xa7>
3ce: eb 0e jmp 3de <__vmx_vcpu_run+0xae>
3d0: 0f 00 2d 05 00 00 00 verw 0x5(%rip) # 3dc <__vmx_vcpu_run+0xac>
3d7: 0f 1f 80 00 00 18 00 nopl 0x180000(%rax)
3de: 0f 01 c2 vmlaunch
Also, would it be better to put the NOP first? Or even better, out of line?
It'd be quite hilarious if the CPU pulled a stupid and speculated on the operand
of the NOP, i.e. if the user/guest controlled RAX allowed for pulling in data
after the VERW.
On Fri, Oct 20, 2023, Pawan Gupta wrote:
> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> index c55cc243592e..e1b623a27e1b 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -111,6 +111,24 @@
> #define RESET_CALL_DEPTH_FROM_CALL
> #endif
>
> +/*
> + * Macro to execute VERW instruction to mitigate transient data sampling
> + * attacks such as MDS. On affected systems a microcode update overloaded VERW
> + * instruction to also clear the CPU buffers.
> + *
> + * Note: Only the memory operand variant of VERW clears the CPU buffers. To
> + * handle the case when VERW is executed after user registers are restored, use
> + * RIP to point the memory operand to a part NOPL instruction that contains
> + * __KERNEL_DS.
> + */
> +#define __EXEC_VERW(m) verw _ASM_RIP(m)
> +
> +#define EXEC_VERW \
> + __EXEC_VERW(551f); \
> + /* nopl __KERNEL_DS(%rax) */ \
> + .byte 0x0f, 0x1f, 0x80, 0x00, 0x00; \
> +551: .word __KERNEL_DS; \
Why are there so many macro layers? Nothing jumps out to justify two
layers, let alone three.
> +
> /*
> * Fill the CPU return stack buffer.
> *
> @@ -329,6 +347,13 @@
> #endif
> .endm
>
> +/* Clear CPU buffers before returning to user */
> +.macro USER_CLEAR_CPU_BUFFERS
> + ALTERNATIVE "jmp .Lskip_verw_\@;", "", X86_FEATURE_USER_CLEAR_CPU_BUF
> + EXEC_VERW
Rather than a NOP after VERW, why not something like this?
/* Clear CPU buffers before returning to user */
.macro USER_CLEAR_CPU_BUFFERS
ALTERNATIVE "jmp .Lskip_verw_\@;", "jmp .Ldo_verw_\@;", X86_FEATURE_USER_CLEAR_CPU_BUF
551: .word __KERNEL_DS
.Ldo_verw_\@: verw _ASM_RIP(551b)
.Lskip_verw_\@:
.endm
> +.Lskip_verw_\@:
> +.endm
On Fri, Oct 20, 2023 at 01:45:09PM -0700, Pawan Gupta wrote:
> As done for entry_64, add support for executing VERW late in exit to
> user path for 32-bit mode.
>
> Signed-off-by: Pawan Gupta <[email protected]>
> ---
> arch/x86/entry/entry_32.S | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index 6e6af42e044a..bbf77d2aab2e 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -886,6 +886,9 @@ SYM_FUNC_START(entry_SYSENTER_32)
> popfl
> popl %eax
>
> + /* Mitigate CPU data sampling attacks .e.g. MDS */
> + USER_CLEAR_CPU_BUFFERS
> +
> /*
> * Return back to the vDSO, which will pop ecx and edx.
> * Don't bother with DS and ES (they already contain __USER_DS).
Did you forget the INT 0x80 entry point?
-Andi
On 20/10/2023 9:44 pm, Pawan Gupta wrote:
> +#define EXEC_VERW \
> + __EXEC_VERW(551f); \
> + /* nopl __KERNEL_DS(%rax) */ \
> + .byte 0x0f, 0x1f, 0x80, 0x00, 0x00; \
> +551: .word __KERNEL_DS; \
Is this actually wise from a perf point of view?
You're causing a data access to the instruction stream, and not only
that, the immediate next instruction. Some parts don't take kindly to
snoops hitting L1I.
A better option would be to simply have
.section .text.entry
.align CACHELINE
mds_verw_sel:
.word __KERNEL_DS
int3
.align CACHELINE
And then just have EXEC_VERW be
verw mds_verw_sel(%rip)
in the fastpaths. That keeps the memory operand in .text.entry so it
works on Meltdown-vulnerable CPUs, but effectively creates a data
cacheline that isn't mixed into anything in the frontend, and which also
gets far better locality of reference rather than being duplicated in 9
different places.
Also it avoids playing games with hiding data inside an instruction.
It's a neat trick, but the neater trick is avoid it whenever possible.
~Andrew
On Fri, Oct 20, 2023 at 03:55:07PM -0700, Sean Christopherson wrote:
> On Fri, Oct 20, 2023, Pawan Gupta wrote:
> > During VMentry VERW is executed to mitigate MDS. After VERW, any memory
> > access like register push onto stack may put host data in MDS affected
> > CPU buffers. A guest can then use MDS to sample host data.
> >
> > Although likelihood of secrets surviving in registers at current VERW
> > callsite is less, but it can't be ruled out. Harden the MDS mitigation
> > by moving the VERW mitigation late in VMentry path.
> >
> > Note that VERW for MMIO Stale Data mitigation is unchanged because of
> > the complexity of per-guest conditional VERW which is not easy to handle
> > that late in asm with no GPRs available. If the CPU is also affected by
> > MDS, VERW is unconditionally executed late in asm regardless of guest
> > having MMIO access.
> >
> > Signed-off-by: Pawan Gupta <[email protected]>
> > ---
> > arch/x86/kvm/vmx/vmenter.S | 9 +++++++++
> > arch/x86/kvm/vmx/vmx.c | 10 +++++++---
> > 2 files changed, 16 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
> > index be275a0410a8..efa716cf4727 100644
> > --- a/arch/x86/kvm/vmx/vmenter.S
> > +++ b/arch/x86/kvm/vmx/vmenter.S
> > @@ -1,6 +1,7 @@
> > /* SPDX-License-Identifier: GPL-2.0 */
> > #include <linux/linkage.h>
> > #include <asm/asm.h>
> > +#include <asm/segment.h>
> > #include <asm/bitsperlong.h>
> > #include <asm/kvm_vcpu_regs.h>
> > #include <asm/nospec-branch.h>
> > @@ -31,6 +32,8 @@
> > #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
> > #endif
> >
> > +#define GUEST_CLEAR_CPU_BUFFERS USER_CLEAR_CPU_BUFFERS
> > +
> > .macro VMX_DO_EVENT_IRQOFF call_insn call_target
> > /*
> > * Unconditionally create a stack frame, getting the correct RSP on the
> > @@ -177,10 +180,16 @@ SYM_FUNC_START(__vmx_vcpu_run)
> > * the 'vmx_vmexit' label below.
> > */
> > .Lvmresume:
> > + /* Mitigate CPU data sampling attacks .e.g. MDS */
> > + GUEST_CLEAR_CPU_BUFFERS
>
> I have a very hard time believing that it's worth duplicating the mitigation
> for VMRESUME vs. VMLAUNCH just to land it after a Jcc.
VERW modifies the flags, so it either needs to be placed after the Jcc,
or we push/pop the flags, which adds 2 extra memory operations. Please
let me know if there is a better option.
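For reference, a rough sketch of the two options (illustration only, not
the actual patch code; it assumes an mds_verw_sel-style RIP-relative
memory operand as Andrew suggested elsewhere in this thread):

	/*
	 * Option A: a single VERW before the Jcc, at the cost of
	 * saving/restoring EFLAGS (2 extra memory operations).
	 */
	pushf				/* preserve ZF from the earlier test */
	verw	mds_verw_sel(%rip)	/* clobbers ZF */
	popf				/* restore flags for the Jcc below */
	jz	.Lvmlaunch

	/*
	 * Option B: keep VERW after the Jcc, which means duplicating it
	 * on both the VMRESUME and VMLAUNCH paths.
	 */
	jz	.Lvmlaunch
	verw	mds_verw_sel(%rip)	/* ZF already consumed by the Jcc */
	vmresume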
> Also, would it'd be better to put the NOP first? Or even better, out of line?
> It'd be quite hilarious if the CPU pulled a stupid and speculated on the operand
> of the NOP, i.e. if the user/guest controlled RAX allowed for pulling in data
> after the VERW.
I did confirm with CPU architects that the NOP operand won't be
dereferenced, even speculatively. But yes, even if it were, moving the
NOP first would take care of it.
On Fri, Oct 20, 2023 at 04:13:13PM -0700, Sean Christopherson wrote:
> On Fri, Oct 20, 2023, Pawan Gupta wrote:
> > diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> > index c55cc243592e..e1b623a27e1b 100644
> > --- a/arch/x86/include/asm/nospec-branch.h
> > +++ b/arch/x86/include/asm/nospec-branch.h
> > @@ -111,6 +111,24 @@
> > #define RESET_CALL_DEPTH_FROM_CALL
> > #endif
> >
> > +/*
> > + * Macro to execute VERW instruction to mitigate transient data sampling
> > + * attacks such as MDS. On affected systems a microcode update overloaded VERW
> > + * instruction to also clear the CPU buffers.
> > + *
> > + * Note: Only the memory operand variant of VERW clears the CPU buffers. To
> > + * handle the case when VERW is executed after user registers are restored, use
> > + * RIP to point the memory operand to a part NOPL instruction that contains
> > + * __KERNEL_DS.
> > + */
> > +#define __EXEC_VERW(m) verw _ASM_RIP(m)
> > +
> > +#define EXEC_VERW \
> > + __EXEC_VERW(551f); \
> > + /* nopl __KERNEL_DS(%rax) */ \
> > + .byte 0x0f, 0x1f, 0x80, 0x00, 0x00; \
> > +551: .word __KERNEL_DS; \
>
> Why are there so many macro layers? Nothing jumps out to justfying two layers,
> let alone three.
I can't remember the exact reason, but I think I added __EXEC_VERW()
because using EXEC_VERW() in a C function was leading to a build error
in an internal draft version. This version does not call it from C, so
yes, I can get rid of the extra layer.
> > /*
> > * Fill the CPU return stack buffer.
> > *
> > @@ -329,6 +347,13 @@
> > #endif
> > .endm
> >
> > +/* Clear CPU buffers before returning to user */
> > +.macro USER_CLEAR_CPU_BUFFERS
> > + ALTERNATIVE "jmp .Lskip_verw_\@;", "", X86_FEATURE_USER_CLEAR_CPU_BUF
> > + EXEC_VERW
>
> Rather than a NOP after VERW, why not something like this?
>
> /* Clear CPU buffers before returning to user */
> .macro USER_CLEAR_CPU_BUFFERS
> ALTERNATIVE "jmp .Lskip_verw_\@;", "jmp .Ldo_verw_\@;", X86_FEATURE_USER_CLEAR_CPU_BUF
> 551: .word __KERNEL_DS
> .Ldo_verw_\@: verw _ASM_RIP(551b)
> .Lskip_verw_\@:
> .endm
I wasn't comfortable adding a variable directly in the instruction
stream because the CPU might interpret it wrongly. With the NOP, it is
bound to ignore the data part.
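For reference, here is how the trick decodes (a sketch matching the
disassembly Sean posted earlier, not new code):

	verw	551f(%rip)		/* 0f 00 2d 05 00 00 00 */
	/* nopl: opcode + modrm + disp32 bytes 0-1 */
	.byte	0x0f, 0x1f, 0x80, 0x00, 0x00
551:	.word	__KERNEL_DS		/* disp32 bytes 2-3 = 0x0018 */

	/*
	 * The front end sees a 7-byte "nopl 0x180000(%rax)"; the VERW
	 * memory operand points at its last two displacement bytes,
	 * which hold __KERNEL_DS.
	 */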
On Sat, Oct 21, 2023 at 12:55:45AM +0100, Andrew Cooper wrote:
> On 20/10/2023 9:44 pm, Pawan Gupta wrote:
> > +#define EXEC_VERW \
> > + __EXEC_VERW(551f); \
> > + /* nopl __KERNEL_DS(%rax) */ \
> > + .byte 0x0f, 0x1f, 0x80, 0x00, 0x00; \
> > +551: .word __KERNEL_DS; \
>
> Is this actually wise from a perf point of view?
>
> You're causing a data access to the instruction stream, and not only
> that, the immediate next instruction. Some parts don't take kindly to
> snoops hitting L1I.
I suspected the same and asked CPU architects; they did not anticipate
the reads being interpreted as part of self-modifying code. The perf
numbers do not indicate a problem, but they don't speak for all the
parts. It could be an issue with some parts.
> A better option would be to simply have
>
> .section .text.entry
> .align CACHELINE
> mds_verw_sel:
> .word __KERNEL_DS
> int3
> .align CACHELINE
>
>
> And then just have EXEC_VERW be
>
> verw mds_verw_sel(%rip)
>
> in the fastpaths. That keeps the memory operand in .text.entry it works
> on Meltdown-vulnerable CPUs, but creates effectively a data cacheline
> that isn't mixed into anywhere in the frontend, which also gets far
> better locality of reference rather than having it duplicated in 9
> different places.
> Also it avoids playing games with hiding data inside an instruction.
> It's a neat trick, but the neater trick is avoid it whenever possible.
Thanks for the pointers. I think verw in 32-bit mode won't be able to
address the operand outside of 4GB range. Maybe this is fine or could it
be a problem addressing from e.g. KVM module?
On Fri, Oct 20, 2023 at 04:49:34PM -0700, Andi Kleen wrote:
> On Fri, Oct 20, 2023 at 01:45:09PM -0700, Pawan Gupta wrote:
> > As done for entry_64, add support for executing VERW late in exit to
> > user path for 32-bit mode.
> >
> > Signed-off-by: Pawan Gupta <[email protected]>
> > ---
> > arch/x86/entry/entry_32.S | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> > index 6e6af42e044a..bbf77d2aab2e 100644
> > --- a/arch/x86/entry/entry_32.S
> > +++ b/arch/x86/entry/entry_32.S
> > @@ -886,6 +886,9 @@ SYM_FUNC_START(entry_SYSENTER_32)
> > popfl
> > popl %eax
> >
> > + /* Mitigate CPU data sampling attacks .e.g. MDS */
> > + USER_CLEAR_CPU_BUFFERS
> > +
> > /*
> > * Return back to the vDSO, which will pop ecx and edx.
> > * Don't bother with DS and ES (they already contain __USER_DS).
>
> Did you forget the INT 0x80 entry point?
I do have VERW in the INT80 path; the diff is just showing the label
restore_all_switch_stack. Below is the sequence:
SYM_FUNC_START(entry_INT80_32)
ASM_CLAC
pushl %eax /* pt_regs->orig_ax */
SAVE_ALL pt_regs_ax=$-ENOSYS switch_stacks=1 /* save rest */
movl %esp, %eax
call do_int80_syscall_32
.Lsyscall_32_done:
STACKLEAK_ERASE
restore_all_switch_stack:
SWITCH_TO_ENTRY_STACK
CHECK_AND_APPLY_ESPFIX
/* Switch back to user CR3 */
SWITCH_TO_USER_CR3 scratch_reg=%eax
BUG_IF_WRONG_CR3
/* Restore user state */
RESTORE_REGS pop=4 # skip orig_eax/error_code
/* Mitigate CPU data sampling attacks .e.g. MDS */
USER_CLEAR_CPU_BUFFERS
^^^^^^^^^^^^^^^^^^^^^^
On 21/10/2023 2:18 am, Pawan Gupta wrote:
> On Sat, Oct 21, 2023 at 12:55:45AM +0100, Andrew Cooper wrote:
>> Also it avoids playing games with hiding data inside an instruction.
>> It's a neat trick, but the neater trick is avoid it whenever possible.
> Thanks for the pointers. I think verw in 32-bit mode won't be able to
> address the operand outside of 4GB range.
And? In a 32bit kernel, what lives outside of a 4G range?
> Maybe this is fine or could it
> be a problem addressing from e.g. KVM module?
RIP-relative addressing is disp32. Which is the same as it is for
direct calls.
So if your module is far enough away for VERW to have issues, you've got
far more basic problems to solve first.
~Andrew
On Sat, Oct 21, 2023 at 02:33:47AM +0100, Andrew Cooper wrote:
> On 21/10/2023 2:18 am, Pawan Gupta wrote:
> > On Sat, Oct 21, 2023 at 12:55:45AM +0100, Andrew Cooper wrote:
> >> Also it avoids playing games with hiding data inside an instruction.
> >> It's a neat trick, but the neater trick is avoid it whenever possible.
> > Thanks for the pointers. I think verw in 32-bit mode won't be able to
> > address the operand outside of 4GB range.
>
> And? In a 32bit kernel, what lives outside of a 4G range?
>
> > Maybe this is fine or could it
> > be a problem addressing from e.g. KVM module?
>
> RIP-relative addressing is disp32. Which is the same as it is for
> direct calls.
>
> So if your module is far enough away for VERW to have issues, you've got
> far more basic problems to solve first.
Sorry, I raised the wrong problem. In 64-bit mode, verw only has 32 bits
of relative addressing, so the memory operand has to be within 4GB of
the callsite. That could be a constraint.
On 20.10.23 г. 23:45 ч., Pawan Gupta wrote:
> There are no more users of mds_user_clear static key, remove it.
>
> Signed-off-by: Pawan Gupta <[email protected]>
This patch can be squashed into the previous one. You've already done
the bulk of the work to eliminate usage of mds_user_clear there.
On Fri, Oct 20, 2023, Pawan Gupta wrote:
> On Fri, Oct 20, 2023 at 03:55:07PM -0700, Sean Christopherson wrote:
> > On Fri, Oct 20, 2023, Pawan Gupta wrote:
> > > During VMentry VERW is executed to mitigate MDS. After VERW, any memory
> > > access like register push onto stack may put host data in MDS affected
> > > CPU buffers. A guest can then use MDS to sample host data.
> > >
> > > Although likelihood of secrets surviving in registers at current VERW
> > > callsite is less, but it can't be ruled out. Harden the MDS mitigation
> > > by moving the VERW mitigation late in VMentry path.
> > >
> > > Note that VERW for MMIO Stale Data mitigation is unchanged because of
> > > the complexity of per-guest conditional VERW which is not easy to handle
> > > that late in asm with no GPRs available. If the CPU is also affected by
> > > MDS, VERW is unconditionally executed late in asm regardless of guest
> > > having MMIO access.
> > >
> > > Signed-off-by: Pawan Gupta <[email protected]>
> > > ---
> > > arch/x86/kvm/vmx/vmenter.S | 9 +++++++++
> > > arch/x86/kvm/vmx/vmx.c | 10 +++++++---
> > > 2 files changed, 16 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
> > > index be275a0410a8..efa716cf4727 100644
> > > --- a/arch/x86/kvm/vmx/vmenter.S
> > > +++ b/arch/x86/kvm/vmx/vmenter.S
> > > @@ -1,6 +1,7 @@
> > > /* SPDX-License-Identifier: GPL-2.0 */
> > > #include <linux/linkage.h>
> > > #include <asm/asm.h>
> > > +#include <asm/segment.h>
> > > #include <asm/bitsperlong.h>
> > > #include <asm/kvm_vcpu_regs.h>
> > > #include <asm/nospec-branch.h>
> > > @@ -31,6 +32,8 @@
> > > #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
> > > #endif
> > >
> > > +#define GUEST_CLEAR_CPU_BUFFERS USER_CLEAR_CPU_BUFFERS
> > > +
> > > .macro VMX_DO_EVENT_IRQOFF call_insn call_target
> > > /*
> > > * Unconditionally create a stack frame, getting the correct RSP on the
> > > @@ -177,10 +180,16 @@ SYM_FUNC_START(__vmx_vcpu_run)
> > > * the 'vmx_vmexit' label below.
> > > */
> > > .Lvmresume:
> > > + /* Mitigate CPU data sampling attacks .e.g. MDS */
> > > + GUEST_CLEAR_CPU_BUFFERS
> >
> > I have a very hard time believing that it's worth duplicating the mitigation
> > for VMRESUME vs. VMLAUNCH just to land it after a Jcc.
>
> VERW modifies the flags, so it either needs to be after Jcc or we
> push/pop flags that adds 2 extra memory operations. Please let me know
> if there is a better option.
Ugh, I assumed that piggybacking VERW overrode the original behavior
entirely; I didn't realize it sacrifices EFLAGS.ZF on the altar of
mitigations.
Luckily, this is easy to solve now that VMRESUME vs. VMLAUNCH uses a flag instead
of a dedicated bool.
From: Sean Christopherson <[email protected]>
Date: Mon, 23 Oct 2023 07:44:35 -0700
Subject: [PATCH] KVM: VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs.
VMLAUNCH
Use EFLAGS.CF instead of EFLAGS.ZF to track whether to use VMRESUME versus
VMLAUNCH. Freeing up EFLAGS.ZF will allow doing VERW, which clobbers ZF,
for MDS mitigations as late as possible without needing to duplicate VERW
for both paths.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/run_flags.h | 7 +++++--
arch/x86/kvm/vmx/vmenter.S | 6 +++---
2 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/vmx/run_flags.h b/arch/x86/kvm/vmx/run_flags.h
index edc3f16cc189..6a9bfdfbb6e5 100644
--- a/arch/x86/kvm/vmx/run_flags.h
+++ b/arch/x86/kvm/vmx/run_flags.h
@@ -2,7 +2,10 @@
#ifndef __KVM_X86_VMX_RUN_FLAGS_H
#define __KVM_X86_VMX_RUN_FLAGS_H
-#define VMX_RUN_VMRESUME (1 << 0)
-#define VMX_RUN_SAVE_SPEC_CTRL (1 << 1)
+#define VMX_RUN_VMRESUME_SHIFT 0
+#define VMX_RUN_SAVE_SPEC_CTRL_SHIFT 1
+
+#define VMX_RUN_VMRESUME BIT(VMX_RUN_VMRESUME_SHIFT)
+#define VMX_RUN_SAVE_SPEC_CTRL BIT(VMX_RUN_SAVE_SPEC_CTRL_SHIFT)
#endif /* __KVM_X86_VMX_RUN_FLAGS_H */
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index be275a0410a8..b3b13ec04bac 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -139,7 +139,7 @@ SYM_FUNC_START(__vmx_vcpu_run)
mov (%_ASM_SP), %_ASM_AX
/* Check if vmlaunch or vmresume is needed */
- test $VMX_RUN_VMRESUME, %ebx
+ bt $VMX_RUN_VMRESUME_SHIFT, %ebx
/* Load guest registers. Don't clobber flags. */
mov VCPU_RCX(%_ASM_AX), %_ASM_CX
@@ -161,8 +161,8 @@ SYM_FUNC_START(__vmx_vcpu_run)
/* Load guest RAX. This kills the @regs pointer! */
mov VCPU_RAX(%_ASM_AX), %_ASM_AX
- /* Check EFLAGS.ZF from 'test VMX_RUN_VMRESUME' above */
- jz .Lvmlaunch
+ /* Check EFLAGS.CF from the VMX_RUN_VMRESUME bit test above. */
+ jnc .Lvmlaunch
/*
* After a successful VMRESUME/VMLAUNCH, control flow "magically"
base-commit: ec2f1daad460c6201338dae606466220ccaa96d5
--
On Mon, Oct 23, 2023 at 11:51:39AM +0300, Nikolay Borisov wrote:
>
>
> On 20.10.23 г. 23:45 ч., Pawan Gupta wrote:
> > There are no more users of mds_user_clear static key, remove it.
> >
> > Signed-off-by: Pawan Gupta <[email protected]>
>
> This patch can be squashed into the previous one. You've already done the
> bulk of the work to eliminate usage of mds_user_clear there.
Yes, it can be squashed into the previous one. Will do that in the next
revision.
On Mon, Oct 23, 2023 at 07:58:57AM -0700, Sean Christopherson wrote:
> On Fri, Oct 20, 2023, Pawan Gupta wrote:
> > On Fri, Oct 20, 2023 at 03:55:07PM -0700, Sean Christopherson wrote:
> > > On Fri, Oct 20, 2023, Pawan Gupta wrote:
> > > > During VMentry VERW is executed to mitigate MDS. After VERW, any memory
> > > > access like register push onto stack may put host data in MDS affected
> > > > CPU buffers. A guest can then use MDS to sample host data.
> > > >
> > > > Although likelihood of secrets surviving in registers at current VERW
> > > > callsite is less, but it can't be ruled out. Harden the MDS mitigation
> > > > by moving the VERW mitigation late in VMentry path.
> > > >
> > > > Note that VERW for MMIO Stale Data mitigation is unchanged because of
> > > > the complexity of per-guest conditional VERW which is not easy to handle
> > > > that late in asm with no GPRs available. If the CPU is also affected by
> > > > MDS, VERW is unconditionally executed late in asm regardless of guest
> > > > having MMIO access.
> > > >
> > > > Signed-off-by: Pawan Gupta <[email protected]>
> > > > ---
> > > > arch/x86/kvm/vmx/vmenter.S | 9 +++++++++
> > > > arch/x86/kvm/vmx/vmx.c | 10 +++++++---
> > > > 2 files changed, 16 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
> > > > index be275a0410a8..efa716cf4727 100644
> > > > --- a/arch/x86/kvm/vmx/vmenter.S
> > > > +++ b/arch/x86/kvm/vmx/vmenter.S
> > > > @@ -1,6 +1,7 @@
> > > > /* SPDX-License-Identifier: GPL-2.0 */
> > > > #include <linux/linkage.h>
> > > > #include <asm/asm.h>
> > > > +#include <asm/segment.h>
> > > > #include <asm/bitsperlong.h>
> > > > #include <asm/kvm_vcpu_regs.h>
> > > > #include <asm/nospec-branch.h>
> > > > @@ -31,6 +32,8 @@
> > > > #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
> > > > #endif
> > > >
> > > > +#define GUEST_CLEAR_CPU_BUFFERS USER_CLEAR_CPU_BUFFERS
> > > > +
> > > > .macro VMX_DO_EVENT_IRQOFF call_insn call_target
> > > > /*
> > > > * Unconditionally create a stack frame, getting the correct RSP on the
> > > > @@ -177,10 +180,16 @@ SYM_FUNC_START(__vmx_vcpu_run)
> > > > * the 'vmx_vmexit' label below.
> > > > */
> > > > .Lvmresume:
> > > > + /* Mitigate CPU data sampling attacks .e.g. MDS */
> > > > + GUEST_CLEAR_CPU_BUFFERS
> > >
> > > I have a very hard time believing that it's worth duplicating the mitigation
> > > for VMRESUME vs. VMLAUNCH just to land it after a Jcc.
> >
> > VERW modifies the flags, so it either needs to be after Jcc or we
> > push/pop flags that adds 2 extra memory operations. Please let me know
> > if there is a better option.
>
> Ugh, I assumed that piggybacking VERW overrode the original behavior entirely, I
> didn't realize it sacrifices EFLAGS.ZF on the altar of mitigations.
>
> Luckily, this is easy to solve now that VMRESUME vs. VMLAUNCH uses a flag instead
> of a dedicated bool.
Thats great.
> From: Sean Christopherson <[email protected]>
> Date: Mon, 23 Oct 2023 07:44:35 -0700
> Subject: [PATCH] KVM: VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs.
> VMLAUNCH
>
> Use EFLAGS.CF instead of EFLAGS.ZF to track whether to use VMRESUME versus
> VMLAUNCH. Freeing up EFLAGS.ZF will allow doing VERW, which clobbers ZF,
> for MDS mitigations as late as possible without needing to duplicate VERW
> for both paths.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Thanks for the patch, I will include it in the next revision.
On Fri, Oct 20, 2023 at 07:21:34PM -0700, Pawan Gupta wrote:
> On Sat, Oct 21, 2023 at 02:33:47AM +0100, Andrew Cooper wrote:
> > On 21/10/2023 2:18 am, Pawan Gupta wrote:
> > > On Sat, Oct 21, 2023 at 12:55:45AM +0100, Andrew Cooper wrote:
> > >> Also it avoids playing games with hiding data inside an instruction.
> > >> It's a neat trick, but the neater trick is avoid it whenever possible.
> > > Thanks for the pointers. I think verw in 32-bit mode won't be able to
> > > address the operand outside of 4GB range.
> >
> > And? In a 32bit kernel, what lives outside of a 4G range?
> >
> > > Maybe this is fine or could it
> > > be a problem addressing from e.g. KVM module?
> >
> > RIP-relative addressing is disp32. Which is the same as it is for
> > direct calls.
> >
> > So if your module is far enough away for VERW to have issues, you've got
> > far more basic problems to solve first.
>
> Sorry, I raised the wrong problem. In 64-bit mode, verww only has 32-bit
> of relative addressing, so memory operand has to be within 4GB of
> callsite. That could be a constraint.
Even on x86-64, modules are mapped within 4GB of the kernel, so I don't
think that's a concern.
--
Josh
On Fri, Oct 20, 2023 at 01:45:03PM -0700, Pawan Gupta wrote:
> + /* Mitigate CPU data sampling attacks .e.g. MDS */
> + USER_CLEAR_CPU_BUFFERS
> +
> jmp .Lnative_iret
>
>
> @@ -774,6 +780,9 @@ native_irq_return_ldt:
> */
> popq %rax /* Restore user RAX */
>
> + /* Mitigate CPU data sampling attacks .e.g. MDS */
> + USER_CLEAR_CPU_BUFFERS
> +
I'm thinking the comments add unnecessary noise here. The macro name is
self-documenting enough.
The detail about what mitigations are being done can go above the macro
definition itself, which the reader can refer to if they want more
detail about what the macro is doing and why.
Speaking of the macro name, I think just "CLEAR_CPU_BUFFERS" is
sufficient. The "USER_" prefix makes it harder to read IMO.
--
Josh
On Fri, Oct 20, 2023 at 01:45:03PM -0700, Pawan Gupta wrote:
> @@ -663,6 +665,10 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
> /* Restore RDI. */
> popq %rdi
> swapgs
> +
> + /* Mitigate CPU data sampling attacks .e.g. MDS */
> + USER_CLEAR_CPU_BUFFERS
> +
> jmp .Lnative_iret
>
>
> @@ -774,6 +780,9 @@ native_irq_return_ldt:
> */
> popq %rax /* Restore user RAX */
>
> + /* Mitigate CPU data sampling attacks .e.g. MDS */
> + USER_CLEAR_CPU_BUFFERS
> +
Can the above two USER_CLEAR_CPU_BUFFERS be replaced with a single one
just above native_irq_return_iret? Otherwise the native_irq_return_ldt
case ends up getting two VERWs.
> /*
> * RSP now points to an ordinary IRET frame, except that the page
> * is read-only and RSP[31:16] are preloaded with the userspace
> @@ -1502,6 +1511,9 @@ nmi_restore:
> std
> movq $0, 5*8(%rsp) /* clear "NMI executing" */
>
> + /* Mitigate CPU data sampling attacks .e.g. MDS */
> + USER_CLEAR_CPU_BUFFERS
> +
> /*
> * iretq reads the "iret" frame and exits the NMI stack in a
> * single instruction. We are returning to kernel mode, so this
This isn't needed here. This is the NMI return-to-kernel path.
The NMI return-to-user path is already mitigated as it goes through
swapgs_restore_regs_and_return_to_usermode.
--
Josh
On Fri, Oct 20, 2023 at 01:45:15PM -0700, Pawan Gupta wrote:
> @@ -484,11 +484,11 @@ static void __init md_clear_update_mitigation(void)
> if (cpu_mitigations_off())
> return;
>
> - if (!static_key_enabled(&mds_user_clear))
> + if (!boot_cpu_has(X86_FEATURE_USER_CLEAR_CPU_BUF))
> goto out;
>
> /*
> - * mds_user_clear is now enabled. Update MDS, TAA and MMIO Stale Data
> + * X86_FEATURE_USER_CLEAR_CPU_BUF is now enabled. Update MDS, TAA and MMIO Stale Data
> * mitigation, if necessary.
> */
This comment line got too long; the paragraph can be reformatted.
--
Josh
On Fri, Oct 20, 2023 at 01:45:29PM -0700, Pawan Gupta wrote:
> @@ -31,6 +32,8 @@
> #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
> #endif
>
> +#define GUEST_CLEAR_CPU_BUFFERS USER_CLEAR_CPU_BUFFERS
I don't think the extra macro buys anything here.
--
Josh
On Mon, Oct 23, 2023 at 11:08:06AM -0700, Josh Poimboeuf wrote:
> On Fri, Oct 20, 2023 at 07:21:34PM -0700, Pawan Gupta wrote:
> > On Sat, Oct 21, 2023 at 02:33:47AM +0100, Andrew Cooper wrote:
> > > On 21/10/2023 2:18 am, Pawan Gupta wrote:
> > > > On Sat, Oct 21, 2023 at 12:55:45AM +0100, Andrew Cooper wrote:
> > > >> Also it avoids playing games with hiding data inside an instruction.
> > > >> It's a neat trick, but the neater trick is avoid it whenever possible.
> > > > Thanks for the pointers. I think verw in 32-bit mode won't be able to
> > > > address the operand outside of 4GB range.
> > >
> > > And? In a 32bit kernel, what lives outside of a 4G range?
> > >
> > > > Maybe this is fine or could it
> > > > be a problem addressing from e.g. KVM module?
> > >
> > > RIP-relative addressing is disp32. Which is the same as it is for
> > > direct calls.
> > >
> > > So if your module is far enough away for VERW to have issues, you've got
> > > far more basic problems to solve first.
> >
> > Sorry, I raised the wrong problem. In 64-bit mode, verww only has 32-bit
> > of relative addressing, so memory operand has to be within 4GB of
> > callsite. That could be a constraint.
>
> Even on x86-64, modules are mapped within 4GB of the kernel, so I don't
> think that's a concern.
You are correct, modules are indeed mapped within 4GB of the kernel. So
what Andrew suggested is feasible. Is that your preference?
On Mon, Oct 23, 2023 at 11:22:11AM -0700, Josh Poimboeuf wrote:
> On Fri, Oct 20, 2023 at 01:45:03PM -0700, Pawan Gupta wrote:
> > + /* Mitigate CPU data sampling attacks .e.g. MDS */
> > + USER_CLEAR_CPU_BUFFERS
> > +
> > jmp .Lnative_iret
> >
> >
> > @@ -774,6 +780,9 @@ native_irq_return_ldt:
> > */
> > popq %rax /* Restore user RAX */
> >
> > + /* Mitigate CPU data sampling attacks .e.g. MDS */
> > + USER_CLEAR_CPU_BUFFERS
> > +
>
> I'm thinking the comments add unnecessary noise here. The macro name is
> self-documenting enough.
>
> The detail about what mitigations are being done can go above the macro
> definition itself, which the reader can refer to if they want more
> detail about what the macro is doing and why.
Sure, I will move the comments to definition.
> Speaking of the macro name, I think just "CLEAR_CPU_BUFFERS" is
> sufficient. The "USER_" prefix makes it harder to read IMO.
Ok, will drop "USER_".
On 10/23/23 11:22, Josh Poimboeuf wrote:
> On Fri, Oct 20, 2023 at 01:45:03PM -0700, Pawan Gupta wrote:
>> + /* Mitigate CPU data sampling attacks .e.g. MDS */
>> + USER_CLEAR_CPU_BUFFERS
>> +
>> jmp .Lnative_iret
>>
>>
>> @@ -774,6 +780,9 @@ native_irq_return_ldt:
>> */
>> popq %rax /* Restore user RAX */
>>
>> + /* Mitigate CPU data sampling attacks .e.g. MDS */
>> + USER_CLEAR_CPU_BUFFERS
>> +
>
> I'm thinking the comments add unnecessary noise here. The macro name is
> self-documenting enough.
>
> The detail about what mitigations are being done can go above the macro
> definition itself, which the reader can refer to if they want more
> detail about what the macro is doing and why.
>
> Speaking of the macro name, I think just "CLEAR_CPU_BUFFERS" is
> sufficient. The "USER_" prefix makes it harder to read IMO.
Yes, please. The "USER_" prefix should be reserved for things that are
uniquely for the unambiguous return-to-userspace paths.
On Mon, Oct 23, 2023 at 11:35:21AM -0700, Josh Poimboeuf wrote:
> On Fri, Oct 20, 2023 at 01:45:03PM -0700, Pawan Gupta wrote:
> > @@ -663,6 +665,10 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
> > /* Restore RDI. */
> > popq %rdi
> > swapgs
> > +
> > + /* Mitigate CPU data sampling attacks .e.g. MDS */
> > + USER_CLEAR_CPU_BUFFERS
> > +
> > jmp .Lnative_iret
> >
> >
> > @@ -774,6 +780,9 @@ native_irq_return_ldt:
> > */
> > popq %rax /* Restore user RAX */
> >
> > + /* Mitigate CPU data sampling attacks .e.g. MDS */
> > + USER_CLEAR_CPU_BUFFERS
> > +
>
> Can the above two USER_CLEAR_CPU_BUFFERS be replaced with a single one
> just above native_irq_return_iret? Otherwise the native_irq_return_ldt
> case ends up getting two VERWs.
Wouldn't that make interrupts returning to kernel also execute VERWs?
idtentry_body
error_return
restore_regs_and_return_to_kernel
verw
native_irq_return_ldt doesn't look to be a common case. Anyway, I will
see how to remove the extra VERW.
> > /*
> > * RSP now points to an ordinary IRET frame, except that the page
> > * is read-only and RSP[31:16] are preloaded with the userspace
> > @@ -1502,6 +1511,9 @@ nmi_restore:
> > std
> > movq $0, 5*8(%rsp) /* clear "NMI executing" */
> >
> > + /* Mitigate CPU data sampling attacks .e.g. MDS */
> > + USER_CLEAR_CPU_BUFFERS
> > +
> > /*
> > * iretq reads the "iret" frame and exits the NMI stack in a
> > * single instruction. We are returning to kernel mode, so this
>
> This isn't needed here. This is the NMI return-to-kernel path.
Yes, the VERW here can be omitted. But we probably need to check whether
an NMI occurring between VERW and the ring transition will still execute
VERW after the NMI.
On Mon, Oct 23, 2023 at 11:48:44AM -0700, Josh Poimboeuf wrote:
> On Fri, Oct 20, 2023 at 01:45:15PM -0700, Pawan Gupta wrote:
> > @@ -484,11 +484,11 @@ static void __init md_clear_update_mitigation(void)
> > if (cpu_mitigations_off())
> > return;
> >
> > - if (!static_key_enabled(&mds_user_clear))
> > + if (!boot_cpu_has(X86_FEATURE_USER_CLEAR_CPU_BUF))
> > goto out;
> >
> > /*
> > - * mds_user_clear is now enabled. Update MDS, TAA and MMIO Stale Data
> > + * X86_FEATURE_USER_CLEAR_CPU_BUF is now enabled. Update MDS, TAA and MMIO Stale Data
> > * mitigation, if necessary.
> > */
>
> This comment line got long, the paragraph can be reformatted.
Yes, will fix.
On Mon, Oct 23, 2023 at 11:56:43AM -0700, Josh Poimboeuf wrote:
> On Fri, Oct 20, 2023 at 01:45:29PM -0700, Pawan Gupta wrote:
> > @@ -31,6 +32,8 @@
> > #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
> > #endif
> >
> > +#define GUEST_CLEAR_CPU_BUFFERS USER_CLEAR_CPU_BUFFERS
>
> I don't think the extra macro buys anything here.
Using USER_CLEAR_CPU_BUFFERS in the VMentry path didn't feel right. But
after "USER_" is gone, as per your comment on the 2/6 patch,
GUEST_CLEAR_CPU_BUFFERS can also go away.
On Mon, Oct 23, 2023 at 02:04:10PM -0700, Pawan Gupta wrote:
> On Mon, Oct 23, 2023 at 11:35:21AM -0700, Josh Poimboeuf wrote:
> > On Fri, Oct 20, 2023 at 01:45:03PM -0700, Pawan Gupta wrote:
> > > @@ -663,6 +665,10 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
> > > /* Restore RDI. */
> > > popq %rdi
> > > swapgs
> > > +
> > > + /* Mitigate CPU data sampling attacks .e.g. MDS */
> > > + USER_CLEAR_CPU_BUFFERS
> > > +
> > > jmp .Lnative_iret
> > >
> > >
> > > @@ -774,6 +780,9 @@ native_irq_return_ldt:
> > > */
> > > popq %rax /* Restore user RAX */
> > >
> > > + /* Mitigate CPU data sampling attacks .e.g. MDS */
> > > + USER_CLEAR_CPU_BUFFERS
> > > +
> >
> > Can the above two USER_CLEAR_CPU_BUFFERS be replaced with a single one
> > just above native_irq_return_iret? Otherwise the native_irq_return_ldt
> > case ends up getting two VERWs.
>
> Wouldn't that make interrupts returning to kernel also execute VERWs?
>
> idtentry_body
> error_return
> restore_regs_and_return_to_kernel
> verw
>
> native_irq_return_ldt doesn't look to be a common case. Anyway, I will
> see how to remove the extra VERW.
Ah, right.
> > > /*
> > > * RSP now points to an ordinary IRET frame, except that the page
> > > * is read-only and RSP[31:16] are preloaded with the userspace
> > > @@ -1502,6 +1511,9 @@ nmi_restore:
> > > std
> > > movq $0, 5*8(%rsp) /* clear "NMI executing" */
> > >
> > > + /* Mitigate CPU data sampling attacks, e.g. MDS */
> > > + USER_CLEAR_CPU_BUFFERS
> > > +
> > > /*
> > > * iretq reads the "iret" frame and exits the NMI stack in a
> > > * single instruction. We are returning to kernel mode, so this
> >
> > This isn't needed here. This is the NMI return-to-kernel path.
>
> Yes, the VERW here can be omitted. But probably need to check if an NMI
> occurring between VERW and ring transition will still execute VERW after
> the NMI.
That window does exist, though I'm not sure it's worth worrying about.
--
Josh
On Mon, Oct 23, 2023 at 02:47:52PM -0700, Josh Poimboeuf wrote:
> > > > /*
> > > > * RSP now points to an ordinary IRET frame, except that the page
> > > > * is read-only and RSP[31:16] are preloaded with the userspace
> > > > @@ -1502,6 +1511,9 @@ nmi_restore:
> > > > std
> > > > movq $0, 5*8(%rsp) /* clear "NMI executing" */
> > > >
> > > > + /* Mitigate CPU data sampling attacks, e.g. MDS */
> > > > + USER_CLEAR_CPU_BUFFERS
> > > > +
> > > > /*
> > > > * iretq reads the "iret" frame and exits the NMI stack in a
> > > > * single instruction. We are returning to kernel mode, so this
> > >
> > > This isn't needed here. This is the NMI return-to-kernel path.
> >
> > Yes, the VERW here can be omitted. But probably need to check if an NMI
> > occurring between VERW and ring transition will still execute VERW after
> > the NMI.
>
> That window does exist, though I'm not sure it's worth worrying about.
I am in favor of omitting the VERW here, unless someone objects with a
rationale. IMO, precisely timing the NMIs in such a narrow window is
impractical.
On 10/23/23 15:30, Pawan Gupta wrote:
>>>>> /*
>>>>> * iretq reads the "iret" frame and exits the NMI stack in a
>>>>> * single instruction. We are returning to kernel mode, so this
>>>> This isn't needed here. This is the NMI return-to-kernel path.
>>> Yes, the VERW here can be omitted. But probably need to check if an NMI
>>> occurring between VERW and ring transition will still execute VERW after
>>> the NMI.
>> That window does exist, though I'm not sure it's worth worrying about.
> I am in favor of omitting the VERW here, unless someone objects with a
> rationale. IMO, precisely timing the NMIs in such a narrow window is
> impractical.
I'd bet that given the right PMU event you could make this pretty
reliable. But normal users can't do that by default. That leaves the
NMI watchdog which (I bet) you can still time, but which is pretty low
frequency.
Are there any other NMI sources that a normal user can cause problems with?
Let's at least leave a marker in here that folks can grep for:
/* Skip CLEAR_CPU_BUFFERS since it will rarely help */
and some nice logic in the changelog that they can dig out if need be.
But, basically it sounds like the logic is:
1. It's rare to get an NMI after VERW but before returning to userspace
2. There is no known way to make that NMI less rare or target it
3. It would take a large number of these precisely-timed NMIs to mount
an actual attack. There's presumably not enough bandwidth.
Anything else?
On Mon, Oct 23, 2023 at 03:45:41PM -0700, Dave Hansen wrote:
> On 10/23/23 15:30, Pawan Gupta wrote:
> >>>>> /*
> >>>>> * iretq reads the "iret" frame and exits the NMI stack in a
> >>>>> * single instruction. We are returning to kernel mode, so this
> >>>> This isn't needed here. This is the NMI return-to-kernel path.
> >>> Yes, the VERW here can be omitted. But probably need to check if an NMI
> >>> occurring between VERW and ring transition will still execute VERW after
> >>> the NMI.
> >> That window does exist, though I'm not sure it's worth worrying about.
> > I am in favor of omitting the VERW here, unless someone objects with a
> > rationale. IMO, precisely timing the NMIs in such a narrow window is
> > impractical.
>
> I'd bet that given the right PMU event you could make this pretty
> reliable. But normal users can't do that by default. That leaves the
> NMI watchdog which (I bet) you can still time, but which is pretty low
> frequency.
>
> Are there any other NMI sources that a normal user can cause problems with?
Generating recoverable parity check errors using rowhammer? But that's
probably going too far for very little gain.
> Let's at least leave a marker in here that folks can grep for:
>
> /* Skip CLEAR_CPU_BUFFERS since it will rarely help */
Sure.
> and some nice logic in the changelog that they can dig out if need be.
>
> But, basically it sounds like the logic is:
>
> 1. It's rare to get an NMI after VERW but before returning to userspace
> 2. There is no known way to make that NMI less rare or target it
> 3. It would take a large number of these precisely-timed NMIs to mount
> an actual attack. There's presumably not enough bandwidth.
Thanks for this.
> Anything else?
4. The NMI in question occurs after a VERW, i.e. when user state is
restored and most interesting data is already scrubbed. What's left is
only the data that the NMI touches, and that may or may not be
interesting.
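If we do leave the marker Dave suggested, the tail of nmi_restore would look
roughly like this (a sketch based on the hunk quoted above, with the new
USER_CLEAR_CPU_BUFFERS dropped again; exact comment wording TBD):

        std
        movq    $0, 5*8(%rsp)           /* clear "NMI executing" */

        /* Skip CLEAR_CPU_BUFFERS since it will rarely help */

        /*
         * iretq reads the "iret" frame and exits the NMI stack in a
         * single instruction. We are returning to kernel mode, so this
         * ...
         */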
On Sat, Oct 21, 2023 at 12:55:45AM +0100, Andrew Cooper wrote:
> On 20/10/2023 9:44 pm, Pawan Gupta wrote:
> > +#define EXEC_VERW \
> > + __EXEC_VERW(551f); \
> > + /* nopl __KERNEL_DS(%rax) */ \
> > + .byte 0x0f, 0x1f, 0x80, 0x00, 0x00; \
> > +551: .word __KERNEL_DS; \
>
> Is this actually wise from a perf point of view?
>
> You're causing a data access to the instruction stream, and not only
> that, to the immediately following instruction. Some parts don't take
> kindly to snoops hitting L1I.
>
> A better option would be to simply have
>
> .section .text.entry
> .align CACHELINE
> mds_verw_sel:
> .word __KERNEL_DS
> int3
> .align CACHELINE
>
>
> And then just have EXEC_VERW be
>
> verw mds_verw_sel(%rip)
>
> in the fastpaths. That keeps the memory operand in .text.entry so it
> works on Meltdown-vulnerable CPUs, but effectively creates a data
> cacheline that isn't mixed into the frontend anywhere, which also gets
> far better locality of reference.
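At the use sites, EXEC_VERW could then collapse to a single instruction
behind an alternative. A rough sketch, assuming the ALTERNATIVE() wiring
used elsewhere in this series, the 64-bit RIP-relative form (32-bit would
need a plain absolute operand), and a feature flag name that is not final:

        .macro USER_CLEAR_CPU_BUFFERS
                ALTERNATIVE "", "verw mds_verw_sel(%rip)", X86_FEATURE_USER_CLEAR_CPU_BUF
        .endm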
With the .text.entry section I am getting the below warnings and an
error:
-----------------------------------------------------------------
LD vmlinux.o
vmlinux.o: warning: objtool: .text.entry+0x0: unreachable instruction
vmlinux.o: warning: objtool: .text.entry+0x40: unreachable instruction
vmlinux.o: warning: objtool: .text.entry+0x80: unreachable instruction
vmlinux.o: warning: objtool: .text.entry+0xc0: unreachable instruction
vmlinux.o: warning: objtool: .text.entry+0x100: unreachable instruction
vmlinux.o: warning: objtool: .text.entry+0x140: unreachable instruction
vmlinux.o: warning: objtool: .text.entry+0x180: unreachable instruction
vmlinux.o: warning: objtool: .text.entry+0x1c0: unreachable instruction
vmlinux.o: warning: objtool: .text.entry+0x200: unreachable instruction
vmlinux.o: warning: objtool: .text.entry+0x240: unreachable instruction
vmlinux.o: warning: objtool: .text.entry+0x280: unreachable instruction
vmlinux.o: warning: objtool: .text.entry+0x2c0: unreachable instruction
vmlinux.o: warning: objtool: .text.entry+0x300: unreachable instruction
vmlinux.o: warning: objtool: .altinstr_replacement+0x2c: relocation to !ENDBR: .text.entry+0x0
vmlinux.o: warning: objtool: .altinstr_replacement+0x1c4: relocation to !ENDBR: .text.entry+0x0
vmlinux.o: warning: objtool: .altinstr_replacement+0x1d0: relocation to !ENDBR: .text.entry+0x0
vmlinux.o: warning: objtool: .altinstr_replacement+0x2d2: relocation to !ENDBR: .text.entry+0x80
vmlinux.o: warning: objtool: .altinstr_replacement+0x5d5: relocation to !ENDBR: .text.entry+0xc0
OBJCOPY modules.builtin.modinfo
GEN modules.builtin
MODPOST vmlinux.symvers
UPD include/generated/utsversion.h
CC init/version-timestamp.o
LD .tmp_vmlinux.kallsyms1
ld: error: unplaced orphan section `.text.entry' from `vmlinux.o'
make[2]: *** [scripts/Makefile.vmlinux:36: vmlinux] Error 1
-----------------------------------------------------------------
... because my config has CONFIG_LD_ORPHAN_WARN_LEVEL="error" and
objtool needs to be told about this entry.
Do you think it's worth fighting these warnings and the error, or should
I simply use the .rodata section for the VERW memory operand?
On Tue, Oct 24, 2023 at 11:28:18PM -0700, Pawan Gupta wrote:
> With the .text.entry section I am getting the below warnings and an
> error:
>
> -----------------------------------------------------------------
> LD vmlinux.o
> vmlinux.o: warning: objtool: .text.entry+0x0: unreachable instruction
> vmlinux.o: warning: objtool: .text.entry+0x40: unreachable instruction
> vmlinux.o: warning: objtool: .text.entry+0x80: unreachable instruction
> vmlinux.o: warning: objtool: .text.entry+0xc0: unreachable instruction
> vmlinux.o: warning: objtool: .text.entry+0x100: unreachable instruction
> vmlinux.o: warning: objtool: .text.entry+0x140: unreachable instruction
> vmlinux.o: warning: objtool: .text.entry+0x180: unreachable instruction
> vmlinux.o: warning: objtool: .text.entry+0x1c0: unreachable instruction
> vmlinux.o: warning: objtool: .text.entry+0x200: unreachable instruction
> vmlinux.o: warning: objtool: .text.entry+0x240: unreachable instruction
> vmlinux.o: warning: objtool: .text.entry+0x280: unreachable instruction
> vmlinux.o: warning: objtool: .text.entry+0x2c0: unreachable instruction
> vmlinux.o: warning: objtool: .text.entry+0x300: unreachable instruction
> vmlinux.o: warning: objtool: .altinstr_replacement+0x2c: relocation to !ENDBR: .text.entry+0x0
> vmlinux.o: warning: objtool: .altinstr_replacement+0x1c4: relocation to !ENDBR: .text.entry+0x0
> vmlinux.o: warning: objtool: .altinstr_replacement+0x1d0: relocation to !ENDBR: .text.entry+0x0
> vmlinux.o: warning: objtool: .altinstr_replacement+0x2d2: relocation to !ENDBR: .text.entry+0x80
> vmlinux.o: warning: objtool: .altinstr_replacement+0x5d5: relocation to !ENDBR: .text.entry+0xc0
> OBJCOPY modules.builtin.modinfo
> GEN modules.builtin
> MODPOST vmlinux.symvers
> UPD include/generated/utsversion.h
> CC init/version-timestamp.o
> LD .tmp_vmlinux.kallsyms1
> ld: error: unplaced orphan section `.text.entry' from `vmlinux.o'
> make[2]: *** [scripts/Makefile.vmlinux:36: vmlinux] Error 1
> -----------------------------------------------------------------
>
> ... because my config has CONFIG_LD_ORPHAN_WARN_LEVEL="error" and
> objtool needs to be told about this entry.
>
> Do you think it's worth fighting these warnings and the error, or should
> I simply use the .rodata section for the VERW memory operand?
I'm thinking you need to at the very least stay in a section that's
actually still mapped with PTI :-)
.entry.text is the only thing that is reliably mapped with PTI (IIRC),
some configs also get a chunk of the kernel image, but not all.
Something like the below should do I suppose.
---
arch/x86/entry/entry.S | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/arch/x86/entry/entry.S b/arch/x86/entry/entry.S
index bfb7bcb362bc..9eb2b532c92a 100644
--- a/arch/x86/entry/entry.S
+++ b/arch/x86/entry/entry.S
@@ -6,6 +6,8 @@
#include <linux/linkage.h>
#include <asm/export.h>
#include <asm/msr-index.h>
+#include <asm/unwind_hints.h>
+#include <asm/segment.h>
.pushsection .noinstr.text, "ax"
@@ -20,3 +22,16 @@ SYM_FUNC_END(entry_ibpb)
EXPORT_SYMBOL_GPL(entry_ibpb);
.popsection
+
+.pushsection .entry.text, "ax"
+
+.align 64
+SYM_CODE_START_NOALIGN(mds_verw_sel)
+ UNWIND_HINT_UNDEFINED
+ ANNOTATE_NOENDBR
+1:
+ .word __KERNEL_DS
+ .skip 64 - (. - 1b), 0xcc
+SYM_CODE_END(mds_verw_sel);
+
+.popsection
On 25/10/2023 8:22 am, Peter Zijlstra wrote:
> On Tue, Oct 24, 2023 at 11:28:18PM -0700, Pawan Gupta wrote:
>
>> With the .text.entry section I am getting the below warnings and an
>> error:
>>
>> -----------------------------------------------------------------
>> LD vmlinux.o
>> vmlinux.o: warning: objtool: .text.entry+0x0: unreachable instruction
>> vmlinux.o: warning: objtool: .text.entry+0x40: unreachable instruction
>> vmlinux.o: warning: objtool: .text.entry+0x80: unreachable instruction
>> vmlinux.o: warning: objtool: .text.entry+0xc0: unreachable instruction
>> vmlinux.o: warning: objtool: .text.entry+0x100: unreachable instruction
>> vmlinux.o: warning: objtool: .text.entry+0x140: unreachable instruction
>> vmlinux.o: warning: objtool: .text.entry+0x180: unreachable instruction
>> vmlinux.o: warning: objtool: .text.entry+0x1c0: unreachable instruction
>> vmlinux.o: warning: objtool: .text.entry+0x200: unreachable instruction
>> vmlinux.o: warning: objtool: .text.entry+0x240: unreachable instruction
>> vmlinux.o: warning: objtool: .text.entry+0x280: unreachable instruction
>> vmlinux.o: warning: objtool: .text.entry+0x2c0: unreachable instruction
>> vmlinux.o: warning: objtool: .text.entry+0x300: unreachable instruction
>> vmlinux.o: warning: objtool: .altinstr_replacement+0x2c: relocation to !ENDBR: .text.entry+0x0
>> vmlinux.o: warning: objtool: .altinstr_replacement+0x1c4: relocation to !ENDBR: .text.entry+0x0
>> vmlinux.o: warning: objtool: .altinstr_replacement+0x1d0: relocation to !ENDBR: .text.entry+0x0
>> vmlinux.o: warning: objtool: .altinstr_replacement+0x2d2: relocation to !ENDBR: .text.entry+0x80
>> vmlinux.o: warning: objtool: .altinstr_replacement+0x5d5: relocation to !ENDBR: .text.entry+0xc0
>> OBJCOPY modules.builtin.modinfo
>> GEN modules.builtin
>> MODPOST vmlinux.symvers
>> UPD include/generated/utsversion.h
>> CC init/version-timestamp.o
>> LD .tmp_vmlinux.kallsyms1
>> ld: error: unplaced orphan section `.text.entry' from `vmlinux.o'
>> make[2]: *** [scripts/Makefile.vmlinux:36: vmlinux] Error 1
>> -----------------------------------------------------------------
>>
>> ... because my config has CONFIG_LD_ORPHAN_WARN_LEVEL="error" and
>> objtool needs to be told about this entry.
>>
>> Do you think it's worth fighting these warnings and the error, or should
>> I simply use the .rodata section for the VERW memory operand?
> I'm thinking you need to at the very least stay in a section that's
> actually still mapped with PTI :-)
Sorry. Xen and Linux have this section named opposite ways around.
> diff --git a/arch/x86/entry/entry.S b/arch/x86/entry/entry.S
> index bfb7bcb362bc..9eb2b532c92a 100644
> --- a/arch/x86/entry/entry.S
> +++ b/arch/x86/entry/entry.S
> @@ -20,3 +22,16 @@ SYM_FUNC_END(entry_ibpb)
> EXPORT_SYMBOL_GPL(entry_ibpb);
>
> .popsection
> +
> +.pushsection .entry.text, "ax"
> +
> +.align 64
> +SYM_CODE_START_NOALIGN(mds_verw_sel)
> + UNWIND_HINT_UNDEFINED
> + ANNOTATE_NOENDBR
> +1:
> + .word __KERNEL_DS
> + .skip 64 - (. - 1b), 0xcc
The 1: label aliases mds_verw_sel, and it must remain that way for the
construct to work.
So instead of .skip, why not simply .align 64, 0xcc and get rid of the
1: label?
Do we have a suitably named constant for the cacheline size, rather than
open-coding 64?
> +SYM_CODE_END(mds_verw_sel);
Given that KVM needs it, this probably needs an EXPORT_SYMBOL_GPL() on it.
~Andrew
On Wed, Oct 25, 2023 at 08:52:50AM +0100, Andrew Cooper wrote:
> > +.pushsection .entry.text, "ax"
> > +
> > +.align 64
> > +SYM_CODE_START_NOALIGN(mds_verw_sel)
> > + UNWIND_HINT_UNDEFINED
> > + ANNOTATE_NOENDBR
> > +1:
> > + .word __KERNEL_DS
> > + .skip 64 - (. - 1b), 0xcc
>
> The 1: label aliases mds_verw_sel, and it must remain that way for the
> construct to work.
>
> So instead of .skip, why not simply .align 64, 0xcc and get rid of the
> 1: label?
Because I forgot you can add a filler byte to .align :/ Yes, that's much
saner.
> Do we have a suitably named constant for the cacheline size, rather than
> open-coding 64?
L1_CACHE_BYTES probably.
>
> > +SYM_CODE_END(mds_verw_sel);
>
> Given that KVM needs it, this probably needs an EXPORT_SYMBOL_GPL() on it.
localyesconfig ftw ;-)
/me runs
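Folding both of Andrew's points into the snippet above, the helper would
end up roughly as below (a sketch; it assumes <asm/cache.h> is pulled in
for L1_CACHE_BYTES and that the export for KVM sits next to the symbol):

.pushsection .entry.text, "ax"

.align L1_CACHE_BYTES, 0xcc
SYM_CODE_START_NOALIGN(mds_verw_sel)
        UNWIND_HINT_UNDEFINED
        ANNOTATE_NOENDBR
        .word __KERNEL_DS
.align L1_CACHE_BYTES, 0xcc
SYM_CODE_END(mds_verw_sel)
EXPORT_SYMBOL_GPL(mds_verw_sel);

.popsection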
On Wed, Oct 25, 2023 at 09:22:55AM +0200, Peter Zijlstra wrote:
> I'm thinking you need to at the very least stay in a section that's
> actually still mapped with PTI :-)
>
> .entry.text is the only thing that is reliably mapped with PTI (IIRC),
> some configs also get a chunk of the kernel image, but not all.
>
> Something like the below should do I suppose.
Thanks, will do this with Andrew's improvements.
> ---
> arch/x86/entry/entry.S | 15 +++++++++++++++