As a mitigation for MDS and RFDS, the CLEAR_CPU_BUFFERS macro executes the
VERW instruction to clear the CPU buffers before returning to user space.
Currently, VERW is executed after the user CR3 is restored. This causes
vm86() to fault because VERW takes a memory operand that is not mapped in
the user page tables when the vm86() syscall returns. This is an issue
with 32-bit kernels only, as 64-bit kernels do not support vm86().
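For reference, CLEAR_CPU_BUFFERS is essentially (a simplified sketch of
the definition in <asm/nospec-branch.h>):

        /* no-op unless X86_FEATURE_CLEAR_CPU_BUF is set */
        ALTERNATIVE "", __stringify(verw _ASM_RIP(mds_verw_sel)), X86_FEATURE_CLEAR_CPU_BUF

i.e. a VERW whose memory operand, mds_verw_sel, must be dereferenceable
at the point where the macro runs.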
Move the VERW before the CR3 switch for 32-bit kernels as a workaround.
This is slightly less secure because sensitive data in the registers may
not get cleared from the CPU buffers. As 32-bit kernels haven't received
some of the other transient execution mitigations, this is a reasonable
trade-off to ensure that the vm86() syscall works.
Fixes: a0e2dab44d22 ("x86/entry_32: Add VERW just before userspace transition")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218707
Closes: https://lore.kernel.org/all/[email protected]/
Reported-by: Robert Gill <[email protected]>
Signed-off-by: Pawan Gupta <[email protected]>
---
arch/x86/entry/entry_32.S | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d3a814efbff6..1b9c1587f06e 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -837,6 +837,7 @@ SYM_FUNC_START(entry_SYSENTER_32)
jz .Lsyscall_32_done
STACKLEAK_ERASE
+ CLEAR_CPU_BUFFERS
/* Opportunistic SYSEXIT */
@@ -881,7 +882,6 @@ SYM_FUNC_START(entry_SYSENTER_32)
BUG_IF_WRONG_CR3 no_user_check=1
popfl
popl %eax
- CLEAR_CPU_BUFFERS
/*
* Return back to the vDSO, which will pop ecx and edx.
@@ -941,6 +941,7 @@ SYM_FUNC_START(entry_INT80_32)
STACKLEAK_ERASE
restore_all_switch_stack:
+ CLEAR_CPU_BUFFERS
SWITCH_TO_ENTRY_STACK
CHECK_AND_APPLY_ESPFIX
@@ -951,7 +952,6 @@ restore_all_switch_stack:
/* Restore user state */
RESTORE_REGS pop=4 # skip orig_eax/error_code
- CLEAR_CPU_BUFFERS
.Lirq_return:
/*
* ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
---
base-commit: 0bbac3facb5d6cc0171c45c9873a2dc96bea9680
change-id: 20240426-fix-dosemu-vm86-dd111a01737e
On 27.04.24 01:48, Pawan Gupta wrote:
> As a mitigation for MDS and RFDS, the CLEAR_CPU_BUFFERS macro executes the
> VERW instruction to clear the CPU buffers before returning to user space.
> Currently, VERW is executed after the user CR3 is restored. This causes
> vm86() to fault because VERW takes a memory operand that is not mapped in
> the user page tables when the vm86() syscall returns. This is an issue
> with 32-bit kernels only, as 64-bit kernels do not support vm86().
>
> Move the VERW before the CR3 switch for 32-bit kernels as a workaround.
> This is slightly less secure because sensitive data in the registers may
> not get cleared from the CPU buffers. As 32-bit kernels haven't received
> some of the other transient execution mitigations, this is a reasonable
> trade-off to ensure that the vm86() syscall works.
>
> Fixes: a0e2dab44d22 ("x86/entry_32: Add VERW just before userspace transition")
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218707
> Closes: https://lore.kernel.org/all/[email protected]/
> Reported-by: Robert Gill <[email protected]>
> Signed-off-by: Pawan Gupta <[email protected]>
Did this fall through the cracks? Just wondering, as from here it looks
like nothing has happened for about two weeks to fix the regression
linked above. But I might have missed something.
Ciao, Thorsten
> ---
> arch/x86/entry/entry_32.S | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index d3a814efbff6..1b9c1587f06e 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -837,6 +837,7 @@ SYM_FUNC_START(entry_SYSENTER_32)
> jz .Lsyscall_32_done
>
> STACKLEAK_ERASE
> + CLEAR_CPU_BUFFERS
>
> /* Opportunistic SYSEXIT */
>
> @@ -881,7 +882,6 @@ SYM_FUNC_START(entry_SYSENTER_32)
> BUG_IF_WRONG_CR3 no_user_check=1
> popfl
> popl %eax
> - CLEAR_CPU_BUFFERS
>
> /*
> * Return back to the vDSO, which will pop ecx and edx.
> @@ -941,6 +941,7 @@ SYM_FUNC_START(entry_INT80_32)
> STACKLEAK_ERASE
>
> restore_all_switch_stack:
> + CLEAR_CPU_BUFFERS
> SWITCH_TO_ENTRY_STACK
> CHECK_AND_APPLY_ESPFIX
>
> @@ -951,7 +952,6 @@ restore_all_switch_stack:
>
> /* Restore user state */
> RESTORE_REGS pop=4 # skip orig_eax/error_code
> - CLEAR_CPU_BUFFERS
> .Lirq_return:
> /*
> * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
>
> ---
> base-commit: 0bbac3facb5d6cc0171c45c9873a2dc96bea9680
> change-id: 20240426-fix-dosemu-vm86-dd111a01737e
>
>
On 4/26/24 16:48, Pawan Gupta wrote:
> As a mitigation for MDS and RFDS, the CLEAR_CPU_BUFFERS macro executes the
> VERW instruction to clear the CPU buffers before returning to user space.
> Currently, VERW is executed after the user CR3 is restored. This causes
> vm86() to fault because VERW takes a memory operand that is not mapped in
> the user page tables when the vm86() syscall returns. This is an issue
> with 32-bit kernels only, as 64-bit kernels do not support vm86().
entry.S has this handy comment:
/*
* Define the VERW operand that is disguised as entry code so that
* it can be referenced with KPTI enabled. This ensures VERW can be
* used late in exit-to-user path after page tables are switched.
*/
Why isn't that working?
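(For context, that comment sits above roughly this definition in
arch/x86/entry/entry.S, an abbreviated sketch rather than the exact
source:)

        .pushsection .entry.text, "ax"
        SYM_CODE_START_NOALIGN(mds_verw_sel)
                .word __KERNEL_DS       /* the VERW memory operand */
        SYM_CODE_END(mds_verw_sel)
        .popsection

The point being that .entry.text remains mapped in the user page tables
under KPTI, so the operand is normally still dereferenceable after the
CR3 switch.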
> Move the VERW before the CR3 switch for 32-bit kernels as a workaround.
> This is slightly less secure because sensitive data in the registers may
> not get cleared from the CPU buffers. As 32-bit kernels haven't received
> some of the other transient execution mitigations, this is a reasonable
> trade-off to ensure that the vm86() syscall works.
>
> Fixes: a0e2dab44d22 ("x86/entry_32: Add VERW just before userspace transition")
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218707
> Closes: https://lore.kernel.org/all/[email protected]/
> Reported-by: Robert Gill <[email protected]>
> Signed-off-by: Pawan Gupta <[email protected]>
> ---
> arch/x86/entry/entry_32.S | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index d3a814efbff6..1b9c1587f06e 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -837,6 +837,7 @@ SYM_FUNC_START(entry_SYSENTER_32)
> jz .Lsyscall_32_done
>
> STACKLEAK_ERASE
> + CLEAR_CPU_BUFFERS
>
> /* Opportunistic SYSEXIT */
>
> @@ -881,7 +882,6 @@ SYM_FUNC_START(entry_SYSENTER_32)
> BUG_IF_WRONG_CR3 no_user_check=1
> popfl
> popl %eax
> - CLEAR_CPU_BUFFERS
Right now, this code basically does:
STACKLEAK_ERASE
/* Restore user registers and segments */
movl PT_EIP(%esp), %edx
...
SWITCH_TO_USER_CR3 scratch_reg=%eax
...
CLEAR_CPU_BUFFERS
The proposed patch is:
STACKLEAK_ERASE
+ CLEAR_CPU_BUFFERS
/* Restore user registers and segments */
movl PT_EIP(%esp), %edx
...
SWITCH_TO_USER_CR3 scratch_reg=%eax
...
- CLEAR_CPU_BUFFERS
That's a bit confusing to me. I would have expected the
CLEAR_CPU_BUFFERS to go _just_ before the SWITCH_TO_USER_CR3 and after
the user register restore.
Is there a reason it can't go there? I think only %eax is "live" with
kernel state at that point and it's only an entry stack pointer, so not
a secret.
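In other words, something like this (a sketch of the ordering being
suggested, not a tested patch):

        STACKLEAK_ERASE
        /* Restore user registers and segments */
        movl    PT_EIP(%esp), %edx
        ...
        CLEAR_CPU_BUFFERS
        SWITCH_TO_USER_CR3 scratch_reg=%eax
        ...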
> /*
> * Return back to the vDSO, which will pop ecx and edx.
> @@ -941,6 +941,7 @@ SYM_FUNC_START(entry_INT80_32)
> STACKLEAK_ERASE
>
> restore_all_switch_stack:
> + CLEAR_CPU_BUFFERS
> SWITCH_TO_ENTRY_STACK
> CHECK_AND_APPLY_ESPFIX
>
> @@ -951,7 +952,6 @@ restore_all_switch_stack:
>
> /* Restore user state */
> RESTORE_REGS pop=4 # skip orig_eax/error_code
> - CLEAR_CPU_BUFFERS
> .Lirq_return:
> /*
> * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
There is a working stack here, on both sides of the CR3 switch. It's
annoying to do another push/pop which won't get patched out, but this
_could_ just do:
RESTORE_REGS pop=4
CLEAR_CPU_BUFFERS
pushl %eax
SWITCH_TO_USER_CR3 scratch_reg=%eax
popl %eax
right?
That would only expose the CR3 value, which isn't a secret.
On 5/9/24 15:17, Pawan Gupta wrote:
> It probably can go right before the SWITCH_TO_USER_CR3. I didn't have
> 32-bit setup with dosemu to experiment with. I will attach a debug patch to
> the bugzilla and request the reporter to test it.
There's an entry_from_vm86.c in the kernel selftests. That will at
least exercise this path.
On 4/26/24 16:48, Pawan Gupta wrote:
> Move the VERW before the CR3 switch for 32-bit kernels as a workaround.
I look at the 32-bit code so rarely that I seem to forget and have to
re-learn this gunk every time I look at it. Take a look at RESTORE_INT_REGS. On
32-bit, we actually restore %ds:
popl %ds
So even doing this:
> + CLEAR_CPU_BUFFERS
> /* Restore user state */
> RESTORE_REGS pop=4 # skip orig_eax/error_code
> - CLEAR_CPU_BUFFERS
> .Lirq_return:
fixes the issue. Moving it above the CR3 switch also works of course,
but I don't think this has anything to do with CR3. It's just that
userspace sets a funky %ds value and CLEAR_CPU_BUFFERS uses ds:.
I don't think any of the segment registers can have secrets in them, can
they? I mean, it's possible, but in practice I can't imagine.
So why not just do the CLEAR_CPU_BUFFERS in RESTORE_REGS but after
RESTORE_INT_REGS? You might be able to do it universally, or you could
pass in a macro argument to do it conditionally.
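Something like this, maybe (a rough sketch: the "cpu_buf" argument name
is made up and the macro body is abbreviated, so treat it as an
illustration rather than a patch):

        .macro RESTORE_REGS pop=0 cpu_buf=0
                RESTORE_INT_REGS
        .if \cpu_buf
                /* %ds still holds the kernel segment here, so the
                 * ds:-based VERW operand is reachable */
                CLEAR_CPU_BUFFERS
        .endif
                popl    %ds
                popl    %es
                popl    %fs
                ...
        .endm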
P.S. Can we remove 32-bit support yet? Please? :)
On Thu, May 09, 2024 at 05:20:31PM -0700, Dave Hansen wrote:
> On 4/26/24 16:48, Pawan Gupta wrote:
> > Move the VERW before the CR3 switch for 32-bit kernels as a workaround.
>
> I look at the 32-bit code so rarely that I seem to forget and have to
> re-learn this gunk every time I look at it. Take a look at RESTORE_INT_REGS. On
> 32-bit, we actually restore %ds:
>
> popl %ds
>
> So even doing this:
>
> > + CLEAR_CPU_BUFFERS
> > /* Restore user state */
> > RESTORE_REGS pop=4 # skip orig_eax/error_code
> > - CLEAR_CPU_BUFFERS
> > .Lirq_return:
>
> fixes the issue. Moving it above the CR3 switch also works of course,
> but I don't think this has anything to do with CR3. It's just that
> userspace sets a funky %ds value and CLEAR_CPU_BUFFERS uses ds:.
I will test it out, but I think you are right. The VERW documentation says:
#GP(0) If a memory operand effective address is outside the CS,
DS, ES, FS, or GS segment limit.
> I don't think any of the segment registers can have secrets in them, can
> they? I mean, it's possible, but in practice I can't imagine.
I don't think they are secrets. AFAICT, their values are build-time
constants and can be easily deduced.
> So why not just do the CLEAR_CPU_BUFFERS in RESTORE_REGS but after
> RESTORE_INT_REGS? You might be able to do it universally, or you could
> pass in a macro argument to do it conditionally.
Sounds good. I will try that, possibly tomorrow.
> P.S. Can we remove 32-bit support yet? Please? :)
+1 ... or at least the mitigations for 32-bit :)
On Thu, May 09, 2024 at 09:14:01AM -0700, Dave Hansen wrote:
> On 4/26/24 16:48, Pawan Gupta wrote:
> > As a mitigation for MDS and RFDS, the CLEAR_CPU_BUFFERS macro executes the
> > VERW instruction to clear the CPU buffers before returning to user space.
> > Currently, VERW is executed after the user CR3 is restored. This causes
> > vm86() to fault because VERW takes a memory operand that is not mapped in
> > the user page tables when the vm86() syscall returns. This is an issue
> > with 32-bit kernels only, as 64-bit kernels do not support vm86().
>
> entry.S has this handy comment:
>
> /*
> * Define the VERW operand that is disguised as entry code so that
> * it can be referenced with KPTI enabled. This ensures VERW can be
> * used late in exit-to-user path after page tables are switched.
> */
>
> Why isn't that working?
It works in general, but not for the vm86() syscall. I don't know much
about how vm86() works, but it seems to emulate 16-bit real mode with
limited memory mapped in the user page table. Most likely, the user page
table doesn't have a mapping for mds_verw_sel, resulting in a #GP fault.
[...]
> Right now, this code basically does:
>
> STACKLEAK_ERASE
> /* Restore user registers and segments */
> movl PT_EIP(%esp), %edx
> ...
> SWITCH_TO_USER_CR3 scratch_reg=%eax
> ...
> CLEAR_CPU_BUFFERS
>
> The proposed patch is:
>
> STACKLEAK_ERASE
> + CLEAR_CPU_BUFFERS
> /* Restore user registers and segments */
> movl PT_EIP(%esp), %edx
> ...
> SWITCH_TO_USER_CR3 scratch_reg=%eax
> ...
> - CLEAR_CPU_BUFFERS
>
> That's a bit confusing to me. I would have expected the
> CLEAR_CPU_BUFFERS to go _just_ before the SWITCH_TO_USER_CR3 and after
> the user register restore.
>
> Is there a reason it can't go there? I think only %eax is "live" with
> kernel state at that point and it's only an entry stack pointer, so not
> a secret.
It probably can go right before the SWITCH_TO_USER_CR3. I didn't have
32-bit setup with dosemu to experiment with. I will attach a debug patch to
the bugzilla and request the reporter to test it.
> > /*
> > * Return back to the vDSO, which will pop ecx and edx.
> > @@ -941,6 +941,7 @@ SYM_FUNC_START(entry_INT80_32)
> > STACKLEAK_ERASE
> >
> > restore_all_switch_stack:
> > + CLEAR_CPU_BUFFERS
> > SWITCH_TO_ENTRY_STACK
> > CHECK_AND_APPLY_ESPFIX
> >
> > @@ -951,7 +952,6 @@ restore_all_switch_stack:
> >
> > /* Restore user state */
> > RESTORE_REGS pop=4 # skip orig_eax/error_code
> > - CLEAR_CPU_BUFFERS
> > .Lirq_return:
> > /*
> > * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
>
> There is a working stack here, on both sides of the CR3 switch. It's
> annoying to do another push/pop which won't get patched out, but this
> _could_ just do:
>
> RESTORE_REGS pop=4
> CLEAR_CPU_BUFFERS
>
> pushl %eax
> SWITCH_TO_USER_CR3 scratch_reg=%eax
> popl %eax
>
> right?
We can probably avoid the push/pop as well, because CLEAR_CPU_BUFFERS will
only clobber the ZF.
> That would only expose the CR3 value, which isn't a secret.
Right.
On Thu, May 09, 2024 at 05:04:36PM -0700, Dave Hansen wrote:
> On 5/9/24 15:17, Pawan Gupta wrote:
> > It probably can go right before the SWITCH_TO_USER_CR3. I didn't have
> > 32-bit setup with dosemu to experiment with. I will attach a debug patch to
> > the bugzilla and request the reporter to test it.
>
> There's an entry_from_vm86.c in the kernel selftests. That will at
> least exercise this path.
Yes, entry_from_vm86.c should be a good test case for this, thanks.