2021-01-25 01:16:40

by Lai Jiangshan

Subject: [PATCH] x86/entry/64: De-Xen-ify our NMI code further

From: Lai Jiangshan <[email protected]>

Commit 929bacec21478 ("x86/entry/64: De-Xen-ify our NMI code") simplified
the NMI code by converting paravirt code into native code, and it left a
comment about "inspecting RIP instead". Until now, "inspecting RIP instead"
has not been implemented; this patch completes it.

Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 46 +++++++++++----------------------------
1 file changed, 13 insertions(+), 33 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cad08703c4ad..cb6b8a6c6652 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1268,32 +1268,12 @@ SYM_CODE_START(asm_exc_nmi)
je nested_nmi

/*
- * Now test if the previous stack was an NMI stack. This covers
- * the case where we interrupt an outer NMI after it clears
- * "NMI executing" but before IRET. We need to be careful, though:
- * there is one case in which RSP could point to the NMI stack
- * despite there being no NMI active: naughty userspace controls
- * RSP at the very beginning of the SYSCALL targets. We can
- * pull a fast one on naughty userspace, though: we program
- * SYSCALL to mask DF, so userspace cannot cause DF to be set
- * if it controls the kernel's RSP. We set DF before we clear
- * "NMI executing".
+ * Now test if we interrupt an outer NMI after it clears
+ * "NMI executing" but before iret.
*/
- lea 6*8(%rsp), %rdx
- /* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) */
- cmpq %rdx, 4*8(%rsp)
- /* If the stack pointer is above the NMI stack, this is a normal NMI */
- ja first_nmi
-
- subq $EXCEPTION_STKSZ, %rdx
- cmpq %rdx, 4*8(%rsp)
- /* If it is below the NMI stack, it is a normal NMI */
- jb first_nmi
-
- /* Ah, it is within the NMI stack. */
-
- testb $(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
- jz first_nmi /* RSP was user controlled. */
+ movq $nmi_executing_cleared, %rdx
+ cmpq 8(%rsp), %rdx
+ jne first_nmi

/* This is a nested NMI. */

@@ -1438,16 +1418,16 @@ nmi_restore:
addq $6*8, %rsp

/*
- * Clear "NMI executing". Set DF first so that we can easily
- * distinguish the remaining code between here and IRET from
- * the SYSCALL entry and exit paths.
- *
- * We arguably should just inspect RIP instead, but I (Andy) wrote
- * this code when I had the misapprehension that Xen PV supported
- * NMIs, and Xen PV would break that approach.
+ * Clear "NMI executing". It also leaves a window after it before
+ * iret which should be also considered to be "NMI executing" albeit
+ * with "NMI executing" variable being zero. So we should also check
+ * the RIP after it when checking "NMI executing". See the code
+ * before nested_nmi. No code is allowed to be added to between
+ * clearing "NMI executing" and iret unless we check a larger window
+ * with a range of RIPs instead of currently a single-RIP window.
*/
- std
movq $0, 5*8(%rsp) /* clear "NMI executing" */
+nmi_executing_cleared:

/*
* iretq reads the "iret" frame and exits the NMI stack in a
--
2.19.1.6.gb485710b


2021-01-25 03:04:12

by Andy Lutomirski

Subject: Re: [PATCH] x86/entry/64: De-Xen-ify our NMI code further

On Sun, Jan 24, 2021 at 5:13 PM Lai Jiangshan <[email protected]> wrote:
>
> From: Lai Jiangshan <[email protected]>
>
> Commit 929bacec21478 ("x86/entry/64: De-Xen-ify our NMI code") simplified
> the NMI code by converting paravirt code into native code, and it left a
> comment about "inspecting RIP instead". Until now, "inspecting RIP instead"
> has not been implemented; this patch completes it.
>
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> arch/x86/entry/entry_64.S | 46 +++++++++++----------------------------
> 1 file changed, 13 insertions(+), 33 deletions(-)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index cad08703c4ad..cb6b8a6c6652 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1268,32 +1268,12 @@ SYM_CODE_START(asm_exc_nmi)
> je nested_nmi
>
> /*
> - * Now test if the previous stack was an NMI stack. This covers
> - * the case where we interrupt an outer NMI after it clears
> - * "NMI executing" but before IRET. We need to be careful, though:
> - * there is one case in which RSP could point to the NMI stack
> - * despite there being no NMI active: naughty userspace controls
> - * RSP at the very beginning of the SYSCALL targets. We can
> - * pull a fast one on naughty userspace, though: we program
> - * SYSCALL to mask DF, so userspace cannot cause DF to be set
> - * if it controls the kernel's RSP. We set DF before we clear
> - * "NMI executing".
> + * Now test if we interrupt an outer NMI after it clears
> + * "NMI executing" but before iret.

s/interrupt/interrupted

But let's make it a lot more clear:


Now test if we interrupted an outer NMI that just cleared "NMI
executing" and is about to IRET. This is a single-instruction window.
This check does not handle the case in which we get a nested interrupt
(#MC, #VE, #VC, etc.) after clearing "NMI executing" but before the
outer NMI executes IRET.

> + movq $nmi_executing_cleared, %rdx
> + cmpq 8(%rsp), %rdx
> + jne first_nmi

If we're okay with non-PIC code, then this is suboptimal -- you can
just compare directly. But using PIC is polite, so that movq should
be a RIP-relative leaq.
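
For instance (an untested sketch; both forms test the same thing):

	/* non-PIC: compare the saved RIP against the label directly */
	cmpq	$nmi_executing_cleared, 8(%rsp)
	jne	first_nmi

	/* PIC-friendly: materialize the address RIP-relatively first */
	leaq	nmi_executing_cleared(%rip), %rdx
	cmpq	%rdx, 8(%rsp)
	jne	first_nmi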

>
> /* This is a nested NMI. */
>
> @@ -1438,16 +1418,16 @@ nmi_restore:
> addq $6*8, %rsp
>
> /*
> - * Clear "NMI executing". Set DF first so that we can easily
> - * distinguish the remaining code between here and IRET from
> - * the SYSCALL entry and exit paths.
> - *
> - * We arguably should just inspect RIP instead, but I (Andy) wrote
> - * this code when I had the misapprehension that Xen PV supported
> - * NMIs, and Xen PV would break that approach.
> + * Clear "NMI executing". It also leaves a window after it before
> + * iret which should be also considered to be "NMI executing" albeit
> + * with "NMI executing" variable being zero. So we should also check
> + * the RIP after it when checking "NMI executing". See the code
> + * before nested_nmi. No code is allowed to be added to between
> + * clearing "NMI executing" and iret unless we check a larger window
> + * with a range of RIPs instead of currently a single-RIP window.

Let's simplify this comment:

Clear "NMI executing". This leaves a window in which a nested NMI
could observe "NMI executing" cleared, and a nested NMI will detect
this by inspecting RIP.

> */
> - std
> movq $0, 5*8(%rsp) /* clear "NMI executing" */
> +nmi_executing_cleared:
>

This should be local. Let's call it .Lnmi_iret. And add a comment:

.Lnmi_iret: /* must be immediately after clearing "NMI executing" */

> /*
> * iretq reads the "iret" frame and exits the NMI stack in a

2021-01-25 06:50:31

by Lai Jiangshan

Subject: [PATCH V2] x86/entry/64: De-Xen-ify our NMI code further

From: Lai Jiangshan <[email protected]>

Commit 929bacec21478 ("x86/entry/64: De-Xen-ify our NMI code") simplified
the NMI code by converting paravirt code into native code, and it left a
comment about "inspecting RIP instead". Until now, "inspecting RIP instead"
has not been implemented; this patch completes it.

The comments in the code are from Andy Lutomirski. Thanks!

Signed-off-by: Lai Jiangshan <[email protected]>
---
arch/x86/entry/entry_64.S | 44 ++++++++++-----------------------------
1 file changed, 11 insertions(+), 33 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cad08703c4ad..21f67ea62341 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1268,32 +1268,14 @@ SYM_CODE_START(asm_exc_nmi)
je nested_nmi

/*
- * Now test if the previous stack was an NMI stack. This covers
- * the case where we interrupt an outer NMI after it clears
- * "NMI executing" but before IRET. We need to be careful, though:
- * there is one case in which RSP could point to the NMI stack
- * despite there being no NMI active: naughty userspace controls
- * RSP at the very beginning of the SYSCALL targets. We can
- * pull a fast one on naughty userspace, though: we program
- * SYSCALL to mask DF, so userspace cannot cause DF to be set
- * if it controls the kernel's RSP. We set DF before we clear
- * "NMI executing".
+ * Now test if we interrupted an outer NMI that just cleared "NMI
+ * executing" and is about to IRET. This is a single-instruction
+ * window. This check does not handle the case in which we get a
+ * nested interrupt (#MC, #VE, #VC, etc.) after clearing
+ * "NMI executing" but before the outer NMI executes IRET.
*/
- lea 6*8(%rsp), %rdx
- /* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) */
- cmpq %rdx, 4*8(%rsp)
- /* If the stack pointer is above the NMI stack, this is a normal NMI */
- ja first_nmi
-
- subq $EXCEPTION_STKSZ, %rdx
- cmpq %rdx, 4*8(%rsp)
- /* If it is below the NMI stack, it is a normal NMI */
- jb first_nmi
-
- /* Ah, it is within the NMI stack. */
-
- testb $(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
- jz first_nmi /* RSP was user controlled. */
+ cmpq $.Lnmi_iret, 8(%rsp)
+ jne first_nmi

/* This is a nested NMI. */

@@ -1438,17 +1420,13 @@ nmi_restore:
addq $6*8, %rsp

/*
- * Clear "NMI executing". Set DF first so that we can easily
- * distinguish the remaining code between here and IRET from
- * the SYSCALL entry and exit paths.
- *
- * We arguably should just inspect RIP instead, but I (Andy) wrote
- * this code when I had the misapprehension that Xen PV supported
- * NMIs, and Xen PV would break that approach.
+ * Clear "NMI executing". This leaves a window in which a nested NMI
+ * could observe "NMI executing" cleared, and a nested NMI will detect
+ * this by inspecting RIP.
*/
- std
movq $0, 5*8(%rsp) /* clear "NMI executing" */

+.Lnmi_iret: /* must be immediately after clearing "NMI executing" */
/*
* iretq reads the "iret" frame and exits the NMI stack in a
* single instruction. We are returning to kernel mode, so this
--
2.19.1.6.gb485710b

2021-01-25 17:43:26

by Steven Rostedt

Subject: Re: [PATCH V2] x86/entry/64: De-Xen-ify our NMI code further

On Mon, 25 Jan 2021 15:45:06 +0800
Lai Jiangshan <[email protected]> wrote:

> From: Lai Jiangshan <[email protected]>
>
> Commit 929bacec21478 ("x86/entry/64: De-Xen-ify our NMI code") simplified
> the NMI code by converting paravirt code into native code, and it left a
> comment about "inspecting RIP instead". Until now, "inspecting RIP instead"
> has not been implemented; this patch completes it.
>
> The comments in the code are from Andy Lutomirski. Thanks!
>
> Signed-off-by: Lai Jiangshan <[email protected]>
> ---
> arch/x86/entry/entry_64.S | 44 ++++++++++-----------------------------
> 1 file changed, 11 insertions(+), 33 deletions(-)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index cad08703c4ad..21f67ea62341 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1268,32 +1268,14 @@ SYM_CODE_START(asm_exc_nmi)
> je nested_nmi
>
> /*
> - * Now test if the previous stack was an NMI stack. This covers
> - * the case where we interrupt an outer NMI after it clears
> - * "NMI executing" but before IRET. We need to be careful, though:
> - * there is one case in which RSP could point to the NMI stack
> - * despite there being no NMI active: naughty userspace controls
> - * RSP at the very beginning of the SYSCALL targets. We can
> - * pull a fast one on naughty userspace, though: we program
> - * SYSCALL to mask DF, so userspace cannot cause DF to be set
> - * if it controls the kernel's RSP. We set DF before we clear
> - * "NMI executing".
> + * Now test if we interrupted an outer NMI that just cleared "NMI
> + * executing" and is about to IRET. This is a single-instruction
> + * window. This check does not handle the case in which we get a
> + * nested interrupt (#MC, #VE, #VC, etc.) after clearing
> + * "NMI executing" but before the outer NMI executes IRET.
> */
> - lea 6*8(%rsp), %rdx
> - /* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) */
> - cmpq %rdx, 4*8(%rsp)
> - /* If the stack pointer is above the NMI stack, this is a normal NMI */
> - ja first_nmi
> -
> - subq $EXCEPTION_STKSZ, %rdx
> - cmpq %rdx, 4*8(%rsp)
> - /* If it is below the NMI stack, it is a normal NMI */
> - jb first_nmi
> -
> - /* Ah, it is within the NMI stack. */
> -
> - testb $(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
> - jz first_nmi /* RSP was user controlled. */

So we no longer check whether the current stack is the NMI stack. That
makes sense, since this beginning of the NMI code cannot be interrupted,
as no breakpoints or faults can occur there. The "NMI executing" variable
is set in all other locations except for:

repeat_nmi - end_repeat_nmi
and for the iretq itself (.Lnmi_iret).

Thus, by checking the "NMI executing" variable on the stack, along with
comparing the RIP against repeat_nmi - end_repeat_nmi and .Lnmi_iret, we
can safely tell whether the NMI is nested or not.
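
With the patch applied, the whole nesting decision would read roughly like
this (paraphrased from memory plus the hunk above, not the literal source):

	/* Interrupted the outer NMI between repeat_nmi and
	 * end_repeat_nmi?  It is about to repeat anyway; just leave. */
	movq	$repeat_nmi, %rdx
	cmpq	8(%rsp), %rdx
	ja	1f
	movq	$end_repeat_nmi, %rdx
	cmpq	8(%rsp), %rdx
	ja	nested_nmi_out
1:
	/* "NMI executing" variable still set?  We are nested. */
	cmpl	$1, -8(%rsp)
	je	nested_nmi

	/* In the one-instruction window after clearing it?  Also nested. */
	cmpq	$.Lnmi_iret, 8(%rsp)
	jne	first_nmi

	/* This is a nested NMI. */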

I was working on rewriting the comments at the beginning to reflect these
updates, and discovered a possible pre-existing bug (unrelated to this
patch).

> + cmpq $.Lnmi_iret, 8(%rsp)
> + jne first_nmi
>

On triggering an NMI from user space, I see the switch to the thread stack
is done, and "exc_nmi" is called.

The problem I see with this is that exc_nmi is called on the thread
stack; if it were to take an exception, NMIs would be re-enabled, allowing
a nested NMI to run. From what I can tell, nothing stops that NMI from
executing on top of the currently running NMI. That is, NMI handlers are
now re-entrant.

Yes, the stack issue is not a problem here, but NMI handlers are not
allowed to be re-entrant. For example, we have spin locks in NMI handlers
that are considered fine as long as they are only used in NMI handlers. But
if there is a way to make NMI handlers re-entrant, those spin locks can
deadlock.

I'm guessing that we need to add some tricks to the user-space path to
set and clear the "NMI executing" variable, but the return path may become
a bit complex in order to clear it without races.

-- Steve

2021-01-25 17:56:39

by Steven Rostedt

Subject: Re: [PATCH V2] x86/entry/64: De-Xen-ify our NMI code further

On Mon, 25 Jan 2021 12:38:59 -0500
Steven Rostedt <[email protected]> wrote:

> On triggering an NMI from user space, I see the switch to the thread stack
> is done, and "exc_nmi" is called.
>
> The problem I see with this is that exc_nmi is called on the thread
> stack; if it were to take an exception, NMIs would be re-enabled, allowing
> a nested NMI to run. From what I can tell, nothing stops that NMI from
> executing on top of the currently running NMI. That is, NMI handlers are
> now re-entrant.
>
> Yes, the stack issue is not a problem here, but NMI handlers are not
> allowed to be re-entrant. For example, we have spin locks in NMI handlers
> that are considered fine as long as they are only used in NMI handlers. But
> if there is a way to make NMI handlers re-entrant, those spin locks can
> deadlock.
>
> I'm guessing that we need to add some tricks to the user-space path to
> set and clear the "NMI executing" variable, but the return path may become
> a bit complex in order to clear it without races.

I think this may work if we wrap the exc_nmi call with the following:

Overwrite the NMI HW stack frame on the NMI stack as if an NMI had come in
at the user-space return path of the NMI handler. Set the stack pointer to
the NMI stack just after the frame that was updated. Then jump to
asm_exc_nmi.
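
A very rough sketch of that idea (hypothetical and untested; the
nmi_ist_top symbol and the .Lnmi_user_return label are invented here for
illustration):

	movq	%rsp, %rdi			/* remember the thread stack */
	movq	PER_CPU_VAR(nmi_ist_top), %rsp	/* invented: top of NMI IST stack */

	/* Fabricate an "NMI from kernel" iret frame whose RIP is the
	 * user-space return path of the NMI handler. */
	pushq	$__KERNEL_DS			/* SS */
	pushq	%rdi				/* RSP: thread stack to resume on */
	pushq	$X86_EFLAGS_FIXED		/* RFLAGS */
	pushq	$__KERNEL_CS			/* CS */
	pushq	$.Lnmi_user_return		/* RIP: invented label */

	jmp	asm_exc_nmi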

Then the code would act as if it came in from kernel mode and execute the
NMI nesting code normally. When it finishes and does the iretq, it will
return to the user-space-return part of the NMI handler on the kernel
thread stack, and then the special code for returning to user space can be
called.

The exc_nmi C code will need to handle this case and update pt_regs so
that the registered NMI handlers still see the pt_regs from user space. But
I think something like this may be the easiest way to handle it without
dealing with more NMI stack nesting races.

I could try to write something up to implement this.

-- Steve

2021-01-25 18:21:51

by Steven Rostedt

Subject: Re: [PATCH V2] x86/entry/64: De-Xen-ify our NMI code further

On Mon, 25 Jan 2021 09:51:45 -0800
Andy Lutomirski <[email protected]> wrote:

> > The problem I see with this is that exc_nmi is called on the thread
> > stack; if it were to take an exception, NMIs would be re-enabled, allowing
> > a nested NMI to run. From what I can tell, nothing stops that NMI from
> > executing on top of the currently running NMI. That is, NMI handlers are
> > now re-entrant.
>
> That was intentional on my part. The C code checks for this condition and handles it, just like it does on 32-bit kernels.

I vaguely remember you implementing this, and me reviewing it. I'm getting
to that age where there's less and less I remember doing :-/

I'll include a comment about that in my rewrite. But first, I'll re-review
your changes to make sure there are no one-offs from the mixture of NMI
stack handling and the x86_32 version doing the same thing.

Thanks for the reminder.

-- Steve

2021-01-26 02:15:43

by Andy Lutomirski

Subject: Re: [PATCH V2] x86/entry/64: De-Xen-ify our NMI code further


> On Jan 25, 2021, at 9:39 AM, Steven Rostedt <[email protected]> wrote:
>
> On Mon, 25 Jan 2021 15:45:06 +0800
> Lai Jiangshan <[email protected]> wrote:
>
>> From: Lai Jiangshan <[email protected]>
>>
>> Commit 929bacec21478 ("x86/entry/64: De-Xen-ify our NMI code") simplified
>> the NMI code by converting paravirt code into native code, and it left a
>> comment about "inspecting RIP instead". Until now, "inspecting RIP instead"
>> has not been implemented; this patch completes it.
>>
>> The comments in the code are from Andy Lutomirski. Thanks!
>>
>> Signed-off-by: Lai Jiangshan <[email protected]>
>> ---
>> arch/x86/entry/entry_64.S | 44 ++++++++++-----------------------------
>> 1 file changed, 11 insertions(+), 33 deletions(-)
>>
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index cad08703c4ad..21f67ea62341 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -1268,32 +1268,14 @@ SYM_CODE_START(asm_exc_nmi)
>> je nested_nmi
>>
>> /*
>> - * Now test if the previous stack was an NMI stack. This covers
>> - * the case where we interrupt an outer NMI after it clears
>> - * "NMI executing" but before IRET. We need to be careful, though:
>> - * there is one case in which RSP could point to the NMI stack
>> - * despite there being no NMI active: naughty userspace controls
>> - * RSP at the very beginning of the SYSCALL targets. We can
>> - * pull a fast one on naughty userspace, though: we program
>> - * SYSCALL to mask DF, so userspace cannot cause DF to be set
>> - * if it controls the kernel's RSP. We set DF before we clear
>> - * "NMI executing".
>> + * Now test if we interrupted an outer NMI that just cleared "NMI
>> + * executing" and is about to IRET. This is a single-instruction
>> + * window. This check does not handle the case in which we get a
>> + * nested interrupt (#MC, #VE, #VC, etc.) after clearing
>> + * "NMI executing" but before the outer NMI executes IRET.
>> */
>> - lea 6*8(%rsp), %rdx
>> - /* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) */
>> - cmpq %rdx, 4*8(%rsp)
>> - /* If the stack pointer is above the NMI stack, this is a normal NMI */
>> - ja first_nmi
>> -
>> - subq $EXCEPTION_STKSZ, %rdx
>> - cmpq %rdx, 4*8(%rsp)
>> - /* If it is below the NMI stack, it is a normal NMI */
>> - jb first_nmi
>> -
>> - /* Ah, it is within the NMI stack. */
>> -
>> - testb $(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
>> - jz first_nmi /* RSP was user controlled. */
>
> So we no longer check whether the current stack is the NMI stack. That
> makes sense, since this beginning of the NMI code cannot be interrupted,
> as no breakpoints or faults can occur there. The "NMI executing" variable
> is set in all other locations except for:
>
> repeat_nmi - end_repeat_nmi
> and for the iretq itself (.Lnmi_iret).
>
> Thus, by checking the "NMI executing" variable on the stack, along with
> comparing the RIP against repeat_nmi - end_repeat_nmi and .Lnmi_iret, we
> can safely tell whether the NMI is nested or not.
>
> I was working on rewriting the comments at the beginning to reflect these
> updates, and discovered a possible pre-existing bug (unrelated to this
> patch).
>
>> + cmpq $.Lnmi_iret, 8(%rsp)
>> + jne first_nmi
>>
>
> On triggering an NMI from user space, I see the switch to the thread stack
> is done, and "exc_nmi" is called.
>
> The problem I see with this is that exc_nmi is called on the thread
> stack; if it were to take an exception, NMIs would be re-enabled, allowing
> a nested NMI to run. From what I can tell, nothing stops that NMI from
> executing on top of the currently running NMI. That is, NMI handlers are
> now re-entrant.

That was intentional on my part. The C code checks for this condition and handles it, just like it does on 32-bit kernels.