The recent changes to NMI allow exceptions to take place in NMI
handlers, but I think that a #PF (say, due to access to vmalloc space)
is still problematic. Consider the sequence
#PF (cr2 set by processor)
NMI
...
#PF (cr2 clobbered)
do_page_fault()
IRET
...
IRET
do_page_fault()
address = read_cr2()
The last line reads the overwritten cr2 value.
I vaguely remember some discussion about this back in the day, but I
can't find anything in the code to save/restore cr2 in the NMI handler.
Did I miss it? Or perhaps the page fault handler ignores the incorrect
cr2 and IRETs, to fault back immediately?
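To make the interleaving concrete, here is a toy user-space model of it
(purely illustrative: a plain variable stands in for cr2, and nothing here
is kernel code):

	#include <stdio.h>

	static unsigned long cr2;	/* stands in for the %cr2 register */

	static void page_fault(unsigned long addr, int in_nmi);

	static void nmi(void)
	{
		unsigned long nmi_addr = 0xbad0f000UL;

		cr2 = nmi_addr;			/* "hardware" sets cr2 for the NMI's #PF */
		page_fault(nmi_addr, 1);
	}

	static void page_fault(unsigned long addr, int in_nmi)
	{
		if (!in_nmi)
			nmi();			/* NMI arrives before we read cr2 */

		printf("#PF handler reads cr2 = %#lx (real fault was %#lx)\n",
		       cr2, addr);
	}

	int main(void)
	{
		unsigned long user_addr = 0x12348UL;

		cr2 = user_addr;		/* "hardware" sets cr2 for the first #PF */
		page_fault(user_addr, 0);	/* reads the NMI's cr2, not its own */
		return 0;
	}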
--
error compiling committee.c: too many arguments to function
On Tue, 2012-05-22 at 15:53 +0300, Avi Kivity wrote:
> The recent changes to NMI allow exceptions to take place in NMI
> handlers, but I think that a #PF (say, due to access to vmalloc space)
> is still problematic. Consider the sequence
>
> #PF (cr2 set by processor)
> NMI
> ...
> #PF (cr2 clobbered)
> do_page_fault()
> IRET
> ...
> IRET
> do_page_fault()
> address = read_cr2()
This is still problematic. But the "allow faults in NMI" change wasn't
written for page faults, although they won't totally crash the system like
they used to. If an NMI triggers while a page fault handler is running,
before it has read cr2, and the NMI itself takes a page fault, then yes,
this will corrupt the cr2 value and cause unpredictable results (not good).
That said, we still should not be having page faults in NMI. The fault
handling was to allow breakpoints in the NMI code, which should not be a
problem here. There is code to handle nested breakpoints because of
NMIs.
The only time I found #PF useful in NMIs was for debugging. Having a
stack dump of all tasks (sysrq-t) when the NMI watchdog detects a
deadlock can be very useful. But stack traces can trigger page faults,
and before this fault handling in NMI code went in, I could not get a
full task state dump from NMI context. This was because the first page
fault taken by a stack dump would re-enable NMIs, and as dumping the state
of all tasks to the serial port took a long time, another NMI would come in
and corrupt the NMI stack, leading to a system hang or triple-fault reboot
and never letting the task dump finish.
This code now alleviates that problem.
>
> The last line reads the overwritten cr2 value.
>
> I vaguely remember some discussion about this back in the day, but I
> can't find anything in the code to save/restore cr2 in the NMI handler.
> Did I miss it? Or perhaps the page fault handler ignores the incorrect
> cr2 and IRETs, to fault back immediately?
>
Now if we want to handle page faults from NMI context, we could do some
tricks to have the NMI detect that it interrupted a page fault before it
read the cr2 and in that case, save off the cr2 register, and restore it
before returning.
Or we could just have the NMI always restore the cr2 register.
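In C, the "always restore" variant would amount to something like this
(just a sketch of the idea; the real save/restore would have to live in the
NMI entry stub in entry_64.S, and the wrapper name here is made up):

	static void nmi_with_cr2_restore(struct pt_regs *regs)
	{
		unsigned long saved_cr2 = read_cr2();

		do_nmi(regs, -1);

		/* undo any clobbering by a #PF taken inside the NMI */
		write_cr2(saved_cr2);
	}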
-- Steve
On 05/22/2012 04:30 PM, Steven Rostedt wrote:
> On Tue, 2012-05-22 at 15:53 +0300, Avi Kivity wrote:
>> The recent changes to NMI allow exceptions to take place in NMI
>> handlers, but I think that a #PF (say, due to access to vmalloc space)
>> is still problematic. Consider the sequence
>>
>> #PF (cr2 set by processor)
>> NMI
>> ...
>> #PF (cr2 clobbered)
>> do_page_fault()
>> IRET
>> ...
>> IRET
>> do_page_fault()
>> address = read_cr2()
>
> This is still problematic. But the "allow faults in NMI" change wasn't
> written for page faults, although they won't totally crash the system like
> they used to. If an NMI triggers while a page fault handler is running,
> before it has read cr2, and the NMI itself takes a page fault, then yes,
> this will corrupt the cr2 value and cause unpredictable results (not good).
>
> That said, we still should not be having page faults in NMI. The fault
> handling was to allow breakpoints in the NMI code, which should not be a
> problem here. There is code to handle nested breakpoints because of
> NMIs.
I thought the whole thing was started by someone adding a
vmalloc_sync_all() to prevent this scenario, and Linus wanting to
fix NMI instead. But maybe I'm confusing two threads.
> Now if we want to handle page faults from NMI context, we could do some
> tricks to have the NMI detect that it interrupted a page fault before it
> read the cr2 and in that case, save off the cr2 register, and restore it
> before returning.
>
> Or we could just have the NMI always restore the cr2 register.
IMO that's best.
--
error compiling committee.c: too many arguments to function
On Tue, 2012-05-22 at 16:45 +0300, Avi Kivity wrote:
> I thought the whole thing was started by someone adding a
> vmalloc_sync_all() to prevent this scenario, and Linus wanting to
> fix NMI instead. But maybe I'm confusing two threads.
No, you are rightfully confused.
That was the case that started the thread and created the eventual code
that we have. But Mathieu Desnoyers brought this up not just to solve
the page fault issue for perf, but also to handle page faults for his
own ring buffer and, as a nice side effect, to fix breakpoints, which is
critical for modifying kernel text without the use of stop_machine.
But the answer was to change perf so that it does not use vmalloc'd data
from NMI context, or at least never does so where a page fault could happen.
My motivation for allowing faults in NMI context was breakpoints, which
were required to remove stop_machine from the function tracer.
>
> > Now if we want to handle page faults from NMI context, we could do some
> > tricks to have the NMI detect that it interrupted a page fault before it
> > read the cr2 and in that case, save off the cr2 register, and restore it
> > before returning.
> >
> > Or we could just have the NMI always restore the cr2 register.
>
> IMO that's best.
OK, I can whip up a patch, but I won't push that in till 3.6.
-- Steve
On 05/22/2012 05:09 PM, Steven Rostedt wrote:
>> >
>> > Or we could just have the NMI always restore the cr2 register.
>>
>> IMO that's best.
>
> OK, I can whip up a patch, but I won't push that in till 3.6.
>
Thanks. Something I've noticed is that writing cr2 is slow, so you may
want to write it only if it has changed (which should be very rare).
--
error compiling committee.c: too many arguments to function
On Tue, 2012-05-22 at 17:20 +0300, Avi Kivity wrote:
> On 05/22/2012 05:09 PM, Steven Rostedt wrote:
> >> >
> >> > Or we could just have the NMI always restore the cr2 register.
> >>
> >> IMO that's best.
> >
> > OK, I can whip up a patch, but I won't push that in till 3.6.
> >
>
> Thanks. Something I've noticed is that writing cr2 is slow, so you may
> want to write it only if it has changed (which should be very rare).
>
Is reading it fast? Then we could do two reads and only write when
needed.
Something like this pseudo assembly
mov cr2, rax
push rax
call do_nmi
pop rax
mov cr2, rbx
cmp rax, rbx
be skip
mov rax, cr2
skip:
-- Steve
On 05/22/2012 05:27 PM, Steven Rostedt wrote:
> On Tue, 2012-05-22 at 17:20 +0300, Avi Kivity wrote:
>> On 05/22/2012 05:09 PM, Steven Rostedt wrote:
>> >> >
>> >> > Or we could just have the NMI always restore the cr2 register.
>> >>
>> >> IMO that's best.
>> >
>> > OK, I can whip up a patch, but I won't push that in till 3.6.
>> >
>>
>> Thanks. Something I've noticed is that writing cr2 is slow, so you may
>> want to write it only if it has changed (which should be very rare).
>>
>
> Is reading it fast? Then we could do two reads and only write when
> needed.
The upside is 70 cycles on one machine, see d3edefc0035669.
>
> Something like this pseudo assembly
>
> mov cr2, rax
> push rax
>
> call do_nmi
>
> pop rax
> mov cr2, rbx
> cmp rax, rbx
> be skip
> mov rax, cr2
> skip:
>
Yes, provided no exceptions can happen at those points.
--
error compiling committee.c: too many arguments to function
On Tue, 2012-05-22 at 17:37 +0300, Avi Kivity wrote:
> >
> >
> > Is reading it fast? Then we could do two reads and only write when
> > needed.
>
> The upside is 70 cycles on one machine, see d3edefc0035669.
Thanks
>
>
> >
> > Something like this pseudo assembly
> >
> > mov cr2, rax
> > push rax
> >
> > call do_nmi
> >
> > pop rax
> > mov cr2, rbx
> > cmp rax, rbx
> > be skip
> > mov rax, cr2
> > skip:
> >
>
>
> Yes, provided no exceptions can happen at those points.
Yes, exceptions can only happen in the do_nmi area. There should not be
any breakpoints or page faults in the assembly code of the NMI handler.
Now another NMI may come in at any point here, but it will detect that
it is nested and return without doing anything (but telling this NMI to
repeat itself).
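For reference, the nesting logic described above looks roughly like this
(a simplified C sketch of the per-CPU state machine in
arch/x86/kernel/nmi.c; on x86-64 the nested-NMI check itself is done in the
entry assembly):

	enum nmi_states { NMI_NOT_RUNNING = 0, NMI_EXECUTING, NMI_LATCHED };
	static DEFINE_PER_CPU(enum nmi_states, nmi_state);

	if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) {
		/* nested NMI: ask the first one to run again, then leave */
		this_cpu_write(nmi_state, NMI_LATCHED);
		return;
	}
	this_cpu_write(nmi_state, NMI_EXECUTING);
nmi_restart:
	/* ... handle the NMI ... */
	if (this_cpu_dec_return(nmi_state))
		goto nmi_restart;	/* a nested NMI was latched meanwhile */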
-- Steve
* Steven Rostedt ([email protected]) wrote:
> On Tue, 2012-05-22 at 17:37 +0300, Avi Kivity wrote:
> > >
> > >
> > > Is reading it fast? Then we could do two reads and only write when
> > > needed.
> >
> > The upside is 70 cycles on one machine, see d3edefc0035669.
>
> Thanks
>
> >
> >
> > >
> > > Something like this pseudo assembly
> > >
> > > mov cr2, rax
> > > push rax
> > >
> > > call do_nmi
> > >
> > > pop rax
> > > mov cr2, rbx
> > > cmp rax, rbx
> > > be skip
> > > mov rax, cr2
> > > skip:
> > >
> >
> >
> > Yes, provided no exceptions can happen at those points.
>
> Yes, exceptions can only happen in the do_nmi area. There should not be
> any breakpoints or page faults in the assembly code of the NMI handler.
>
> Now another NMI may come in at any point here, but it will detect that
> it is nested and return without doing anything (but telling this NMI to
> repeat itself).
That should take care of cr2. Those are faraway memories, but I think we
should be careful about pgd_offset too. If we look at x86-64
vmalloc_fault(), we notice that it uses the current task struct:
	WARN_ON_ONCE(in_nmi());		<--- we should take that as a hint ;)

	/*
	 * Copy kernel mappings over when needed. This can also
	 * happen within a race in page table update. In the later
	 * case just flush:
	 */
	pgd = pgd_offset(current->active_mm, address);
x86-32 does not have this problem, since it reads the cr3 register to
get the pgd_addr.
x86-64 using the current task can be an issue if the NMI nests over the
scheduler execution.
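For comparison, the 32-bit vmalloc_fault() goes through cr3 instead of
current, roughly like this (quoting from memory, simplified):

	unsigned long pgd_paddr;
	pmd_t *pmd_k;

	/*
	 * Do _not_ use "current" here: we might be inside an interrupt
	 * in the middle of a task switch, so derive the pgd from cr3.
	 */
	pgd_paddr = read_cr3();
	pmd_k = vmalloc_sync_one(__va(pgd_paddr), address);
	if (!pmd_k)
		return -1;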
A few years ago, I posted this patch
http://www.gossamer-threads.com/lists/linux/kernel/1249694?do=post_view_threaded
that tried to fix this by reading cr3 on x86_64. However, after reports
that it caused some x86_64 machines to fail to boot, I removed this
patch from the LTTng patchset. So there was certainly something I missed
back then.
Just food for thought,
Thanks,
Mathieu
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
On Tue, May 22, 2012 at 7:27 AM, Steven Rostedt <[email protected]> wrote:
>
> Is reading it fast? Then we could do two reads and only write when
> needed.
Even better: we could do nothing at all.
We could just say: let's make sure that any #PF case that can happen
in #NMI can also be re-done with arbitrary 'error_code' and 'struct
regs' contents.
At that point, what could happen is
- #PF
- NMI
- #PF
- read cr2 for NMI fault
- handle the NMI #PF
- return from #PF
- return from #NMI
- read cr2 for original #PF fault - but get the NMI cr2 again
- handle the #PF again (this should be a no-op now)
- return from #PF
- instruction restart causes new #PF
- now we do the original page fault
So one option is to just make sure that the few cases (just the
vmalloc area?) that NMI can trigger are all ok to be re-done with
other state.
I note that right now we have
	if (unlikely(fault_in_kernel_space(address))) {
		if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
			if (vmalloc_fault(address) >= 0)
				return;
and that the error_code check means that the retried NMI #PF would not
go through that. But maybe we don't even need that check?
That error_code thing seems to literally be the only thing that keeps
us from just re-doing the vmalloc_fault() silently.
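In other words, something along these lines (only a sketch of the idea, not
a tested change; whether it is actually safe for user-mode faults is exactly
the open question):

	if (unlikely(fault_in_kernel_space(address))) {
		/*
		 * Try the vmalloc fixup regardless of error_code, so that a
		 * replayed #PF reading a stale cr2 becomes a harmless no-op.
		 */
		if (vmalloc_fault(address) >= 0)
			return;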
Linus
On 05/22/2012 06:33 PM, Linus Torvalds wrote:
> On Tue, May 22, 2012 at 7:27 AM, Steven Rostedt <[email protected]> wrote:
>>
>> Is reading it fast? Then we could do two reads and only write when
>> needed.
>
> Even better: we could do nothing at all.
>
> We could just say: let's make sure that any #PF case that can happen
> in #NMI can also be re-done with arbitrary 'error_code' and 'struct
> regs' contents.
>
> At that point, what could happen is
> - #PF
> - NMI
> - #PF
> - read cr2 for NMI fault
> - handle the NMI #PF
> - return from #PF
> - return from #NMI
> - read cr2 for original #PF fault - but get the NMI cr2 again
> - handle the #PF again (this should be a no-op now)
> - return from #PF
> - instruction restart causes new #PF
> - now we do the original page fault
>
> So one option is to just make sure that the few cases (just the
> vmalloc area?) that NMI can trigger are all ok to be re-done with
> other state.
>
> I note that right now we have
>
> if (unlikely(fault_in_kernel_space(address))) {
> if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
> if (vmalloc_fault(address) >= 0)
> return;
>
> and that the error_code check means that the retried NMI #PF would not
> go through that. But maybe we don't even need that check?
>
> That error_code thing seems to literally be the only thing that keeps
> us from just re-doing the vmalloc_fault() silently.
>
do_page_fault() is not the only code that relies on cr2; vmx_vcpu_run()
is the other. If an NMI happens there, and takes a #PF, then the guest
will run with a bad cr2.
(svm saves and restores cr2 in microcode, and also provides a way to
mask NMIs, so it isn't vulnerable to this issue).
--
error compiling committee.c: too many arguments to function
>
> Even better: we could do nothing at all.
>
> We could just say: let's make sure that any #PF case that can happen
> in #NMI can also be re-done with arbitrary 'error_code' and 'struct
> regs' contents.
>
> At that point, what could happen is
> - #PF
> - NMI
> - #PF
> - read cr2 for NMI fault
> - handle the NMI #PF
> - return from #PF
> - return from #NMI
> - read cr2 for original #PF fault - but get the NMI cr2 again
> - handle the #PF again (this should be a no-op now)
> - return from #PF
> - instruction restart causes new #PF
> - now we do the original page fault
>
> So one option is to just make sure that the few cases (just the
> vmalloc area?) that NMI can trigger are all ok to be re-done with
> other state.
>
> I note that right now we have
>
> if (unlikely(fault_in_kernel_space(address))) {
> if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
> if (vmalloc_fault(address) >= 0)
> return;
>
> and that the error_code check means that the retried NMI #PF would not
> go through that. But maybe we don't even need that check?
>
> That error_code thing seems to literally be the only thing that keeps
> us from just re-doing the vmalloc_fault() silently.
>
This concerns me for two reasons:
- We would have to process "chimera" pagefaults like the one you showed
above, where we have the right struct regs and the right error code, but
the wrong %cr2 pointing to the page fault context.
- Getting all this right, reliable, tested and robust, and having it stay
that way for what is effectively a race between multiple events, seems
implausible. I really worry that we'll have subtle failures in the
field when people are using their debugging tools.
As such I'd prefer if NMI would save and restore %cr2, or, alternatively,
NMI can save %cr2 and the #PF handler could check if it is in NMI
context and then restore %cr2 -- the latter depends on the #PF handler
being able to hide the cost of a load - test - not-taken branch in the
common case, otherwise that is an obvious lose.
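As a sketch of that second option (the per-CPU variable name here is made
up, not an existing symbol):

	static DEFINE_PER_CPU(unsigned long, nmi_saved_cr2);

	/* in the NMI entry path, once per NMI: */
	this_cpu_write(nmi_saved_cr2, read_cr2());

	/* at the tail of do_page_fault(); common case is a not-taken branch: */
	if (unlikely(in_nmi()))
		write_cr2(this_cpu_read(nmi_saved_cr2));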
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Tue, 2012-05-22 at 08:33 -0700, Linus Torvalds wrote:
> On Tue, May 22, 2012 at 7:27 AM, Steven Rostedt <[email protected]> wrote:
> >
> > Is reading it fast? Then we could do two reads and only write when
> > needed.
>
> Even better: we could do nothing at all.
>
> We could just say: let's make sure that any #PF case that can happen
> in #NMI can also be re-done with arbitrary 'error_code' and 'struct
> regs' contents.
>
> At that point, what could happen is
> - #PF
> - NMI
> - #PF
> - read cr2 for NMI fault
> - handle the NMI #PF
> - return from #PF
> - return from #NMI
> - read cr2 for original #PF fault - but get the NMI cr2 again
> - handle the #PF again (this should be a no-op now)
I tested this by forcing all NMIs to take a page fault, and then running
perf against hackbench to see what breaks. This doesn't work, because cr2
then contains the address of the NMI's fault, not the original one.
If the NMI faulting address was in kernel space, the page fault it
interrupted will think the user task is trying to access kernel memory
and fault.
Basically, the address will not be valid for the user task and this will
make the task take a SEGFAULT.
I had the NMI fault on the address 0x12348, but had an exception table to
handle it.
hackbench[2236]: segfault at 12348 ip 0000003054a091c4 sp 00007fff22c6e840 error 4 in ld-2.12.2.so[3054a00000+1e000]
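Forcing that kind of recoverable fault can look something like this (a
guess at the construct used, not the actual test code):

	int dummy;

	/* touch an unmapped address; the extable entry makes the #PF recoverable */
	asm volatile("1:	movl (%1), %0\n"
		     "2:\n"
		     _ASM_EXTABLE(1b, 2b)
		     : "=r" (dummy)
		     : "r" (0x12348UL));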
With the below patch, the faulting NMIs don't seem to be harming anything.
> - return from #PF
> - instruction restart causes new #PF
> - now we do the original page fault
>
> So one option is to just make sure that the few cases (just the
> vmalloc area?) that NMI can trigger are all ok to be re-done with
> other state.
>
> I note that right now we have
>
> if (unlikely(fault_in_kernel_space(address))) {
> if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
> if (vmalloc_fault(address) >= 0)
> return;
>
> and that the error_code check means that the retried NMI #PF would not
> go through that. But maybe we don't even need that check?
>
> That error_code thing seems to literally be the only thing that keeps
> us from just re-doing the vmalloc_fault() silently.
Avi, you want to test this?
-- Steve
Index: linux-trace.git/arch/x86/kernel/entry_64.S
===================================================================
--- linux-trace.git.orig/arch/x86/kernel/entry_64.S
+++ linux-trace.git/arch/x86/kernel/entry_64.S
@@ -1586,7 +1586,7 @@ ENTRY(nmi)
* Check the special variable on the stack to see if NMIs are
* executing.
*/
- cmpl $1, -8(%rsp)
+ cmpl $1, -16(%rsp)
je nested_nmi
/*
@@ -1615,7 +1615,7 @@ nested_nmi:
1:
/* Set up the interrupted NMIs stack to jump to repeat_nmi */
- leaq -6*8(%rsp), %rdx
+ leaq -7*8(%rsp), %rdx
movq %rdx, %rsp
CFI_ADJUST_CFA_OFFSET 6*8
pushq_cfi $__KERNEL_DS
@@ -1625,7 +1625,7 @@ nested_nmi:
pushq_cfi $repeat_nmi
/* Put stack back */
- addq $(11*8), %rsp
+ addq $(12*8), %rsp
CFI_ADJUST_CFA_OFFSET -11*8
nested_nmi_out:
@@ -1650,6 +1650,8 @@ first_nmi:
* +-------------------------+
* | temp storage for rdx |
* +-------------------------+
+ * | saved cr2 |
+ * +-------------------------+
* | NMI executing variable |
* +-------------------------+
* | Saved SS |
@@ -1672,8 +1674,20 @@ first_nmi:
* to the repeat_nmi. The original stack frame and the temp storage
* is also used by nested NMIs and can not be trusted on exit.
*/
+
+ /*
+ * Save off the CR2 register. If we take a page fault in the NMI then
+ * it could corrupt the CR2 value. If the NMI preempts a page fault
+ * handler before it was able to read the CR2 register, and then the
+ * NMI itself takes a page fault, the page fault that was preempted
+ * will read the information from the NMI page fault and not the
+ * origin fault. Save it off and restore it if it changes.
+ */
+ movq %cr2, %rdx
+ pushq_cfi %rdx
+
/* Do not pop rdx, nested NMIs will corrupt that part of the stack */
- movq (%rsp), %rdx
+ movq 8(%rsp), %rdx
CFI_RESTORE rdx
/* Leave room for the "in NMI" variable. */
@@ -1682,7 +1696,7 @@ first_nmi:
/* Copy the stack frame to the Saved frame */
.rept 5
- pushq_cfi 6*8(%rsp)
+ pushq_cfi 7*8(%rsp)
.endr
CFI_DEF_CFA_OFFSET SS+8-RIP
@@ -1734,6 +1748,13 @@ end_repeat_nmi:
nmi_swapgs:
SWAPGS_UNSAFE_STACK
nmi_restore:
+ /* Test if the cr2 reg changed */
+ movq ORIG_RAX-R15+(12*8)(%rsp), %rdx
+ movq %cr2, %rcx
+ cmp %rdx, %rcx
+ je 1f
+ movq %rdx, %cr2
+1:
RESTORE_ALL 8
/* Clear the NMI executing stack variable */
movq $0, 10*8(%rsp)
On Tue, May 22, 2012 at 8:39 PM, Steven Rostedt <[email protected]> wrote:
> On Tue, 2012-05-22 at 08:33 -0700, Linus Torvalds wrote:
>> On Tue, May 22, 2012 at 7:27 AM, Steven Rostedt <[email protected]> wrote:
>> >
>> > Is reading it fast? Then we could do two reads and only write when
>> > needed.
>>
>> Even better: we could do nothing at all.
>>
>> We could just say: let's make sure that any #PF case that can happen
>> in #NMI can also be re-done with arbitrary 'error_code' and 'struct
>> regs' contents.
>>
>> At that point, what could happen is
>> - #PF
>> - NMI
>> - #PF
>> - read cr2 for NMI fault
>> - handle the NMI #PF
>> - return from #PF
>> - return from #NMI
>> - read cr2 for original #PF fault - but get the NMI cr2 again
>> - handle the #PF again (this should be a no-op now)
>
> I tested this by forcing all NMIs to take a page fault, and then running
> perf against hackbench to see what breaks. This doesn't work, because cr2
> then contains the address of the NMI's fault, not the original one.
>
> If the NMI faulting address was in kernel space, the page fault it
> interrupted will think the user task is trying to access kernel memory
> and fault.
>
> Basically, the address will not be valid for the user task and this will
> make the task take a SEGFAULT.
>
> I had the NMI fault on the address 0x12348, but had an exception table to
> handle it.
>
> hackbench[2236]: segfault at 12348 ip 0000003054a091c4 sp 00007fff22c6e840 error 4 in ld-2.12.2.so[3054a00000+1e000]
>
> With the below patch, the faulting NMIs don't seem to be harming anything.
>
>
>> - return from #PF
>> - instruction restart causes new #PF
>> - now we do the original page fault
>>
>> So one option is to just make sure that the few cases (just the
>> vmalloc area?) that NMI can trigger are all ok to be re-done with
>> other state.
>>
>> I note that right now we have
>>
>> if (unlikely(fault_in_kernel_space(address))) {
>> if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
>> if (vmalloc_fault(address) >= 0)
>> return;
>>
>> and that the error_code check means that the retried NMI #PF would not
>> go through that. But maybe we don't even need that check?
>>
>> That error_code thing seems to literally be the only thing that keeps
>> us from just re-doing the vmalloc_fault() silently.
>
> Avi, you want to test this?
>
> -- Steve
>
> Index: linux-trace.git/arch/x86/kernel/entry_64.S
> ===================================================================
> --- linux-trace.git.orig/arch/x86/kernel/entry_64.S
> +++ linux-trace.git/arch/x86/kernel/entry_64.S
> @@ -1586,7 +1586,7 @@ ENTRY(nmi)
> * Check the special variable on the stack to see if NMIs are
> * executing.
> */
> - cmpl $1, -8(%rsp)
> + cmpl $1, -16(%rsp)
> je nested_nmi
>
> /*
> @@ -1615,7 +1615,7 @@ nested_nmi:
>
> 1:
> /* Set up the interrupted NMIs stack to jump to repeat_nmi */
> - leaq -6*8(%rsp), %rdx
> + leaq -7*8(%rsp), %rdx
> movq %rdx, %rsp
> CFI_ADJUST_CFA_OFFSET 6*8
> pushq_cfi $__KERNEL_DS
> @@ -1625,7 +1625,7 @@ nested_nmi:
> pushq_cfi $repeat_nmi
>
> /* Put stack back */
> - addq $(11*8), %rsp
> + addq $(12*8), %rsp
> CFI_ADJUST_CFA_OFFSET -11*8
>
> nested_nmi_out:
> @@ -1650,6 +1650,8 @@ first_nmi:
> * +-------------------------+
> * | temp storage for rdx |
> * +-------------------------+
> + * | saved cr2 |
> + * +-------------------------+
> * | NMI executing variable |
> * +-------------------------+
> * | Saved SS |
> @@ -1672,8 +1674,20 @@ first_nmi:
> * to the repeat_nmi. The original stack frame and the temp storage
> * is also used by nested NMIs and can not be trusted on exit.
> */
> +
> + /*
> + * Save off the CR2 register. If we take a page fault in the NMI then
> + * it could corrupt the CR2 value. If the NMI preempts a page fault
> + * handler before it was able to read the CR2 register, and then the
> + * NMI itself takes a page fault, the page fault that was preempted
> + * will read the information from the NMI page fault and not the
> + * origin fault. Save it off and restore it if it changes.
> + */
> + movq %cr2, %rdx
> + pushq_cfi %rdx
> +
> /* Do not pop rdx, nested NMIs will corrupt that part of the stack */
> - movq (%rsp), %rdx
> + movq 8(%rsp), %rdx
> CFI_RESTORE rdx
>
> /* Leave room for the "in NMI" variable. */
> @@ -1682,7 +1696,7 @@ first_nmi:
>
> /* Copy the stack frame to the Saved frame */
> .rept 5
> - pushq_cfi 6*8(%rsp)
> + pushq_cfi 7*8(%rsp)
> .endr
> CFI_DEF_CFA_OFFSET SS+8-RIP
>
> @@ -1734,6 +1748,13 @@ end_repeat_nmi:
> nmi_swapgs:
> SWAPGS_UNSAFE_STACK
> nmi_restore:
> + /* Test if the cr2 reg changed */
> + movq ORIG_RAX-R15+(12*8)(%rsp), %rdx
> + movq %cr2, %rcx
> + cmp %rdx, %rcx
> + je 1f
> + movq %rdx, %cr2
> +1:
> RESTORE_ALL 8
> /* Clear the NMI executing stack variable */
> movq $0, 10*8(%rsp)
You could save cr2 in a callee-saved register (like r12) instead of
putting it on the stack.
--
Brian Gerst
On Tue, 2012-05-22 at 21:26 -0400, Brian Gerst wrote:
> > @@ -1734,6 +1748,13 @@ end_repeat_nmi:
> > nmi_swapgs:
> > SWAPGS_UNSAFE_STACK
> > nmi_restore:
> > + /* Test if the cr2 reg changed */
> > + movq ORIG_RAX-R15+(12*8)(%rsp), %rdx
> > + movq %cr2, %rcx
> > + cmp %rdx, %rcx
> > + je 1f
> > + movq %rdx, %cr2
> > +1:
> > RESTORE_ALL 8
> > /* Clear the NMI executing stack variable */
> > movq $0, 10*8(%rsp)
>
> You could save cr2 in a callee-saved register (like r12) instead of
> putting it on the stack.
>
You know, I thought about that but decided against it. My rationale was
that I wanted the cr2 stored in the first NMI, where it would be stored
again if the NMI had to be repeated. At first I thought that a repeated
NMI would corrupt the cr2, but that is not the case, as the cr2 would
have been restored before repeating the NMI.
I guess I also wanted to limit the number of reads of the cr2 as well.
But repeated NMIs are such a seldom case (they require a fault and then
another NMI to come in) that this optimization is practically useless.
I agree, it would be better to just use one of the non-clobbered regs.
Thanks, I'll try that out and this should make the patch much simpler.
-- Steve
On Tue, 2012-05-22 at 21:26 -0400, Brian Gerst wrote:
>
> You could save cr2 in a callee-saved register (like r12) instead of
> putting it on the stack.
>
Much simpler!
-- Steve
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index f41b3b1..59bae47 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1725,10 +1725,30 @@ end_repeat_nmi:
*/
call save_paranoid
DEFAULT_FRAME 0
+
+ /*
+ * Save off the CR2 register. If we take a page fault in the NMI then
+ * it could corrupt the CR2 value. If the NMI preempts a page fault
+ * handler before it was able to read the CR2 register, and then the
+ * NMI itself takes a page fault, the page fault that was preempted
+ * will read the information from the NMI page fault and not the
+ * origin fault. Save it off and restore it if it changes.
+ * Use the r12 callee-saved register.
+ */
+ movq %cr2, %r12
+
/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
movq %rsp,%rdi
movq $-1,%rsi
call do_nmi
+
+ /* Did the NMI take a page fault? Restore cr2 if it did */
+ movq %cr2, %rcx
+ cmpq %rcx, %r12
+ je 1f
+ movq %r12, %cr2
+1:
+
testl %ebx,%ebx /* swapgs needed? */
jnz nmi_restore
nmi_swapgs:
Commit-ID: 7fbb98c5cb07563d3ee08714073a8e5452a96be2
Gitweb: http://git.kernel.org/tip/7fbb98c5cb07563d3ee08714073a8e5452a96be2
Author: Steven Rostedt <[email protected]>
AuthorDate: Thu, 7 Jun 2012 10:21:21 -0400
Committer: Steven Rostedt <[email protected]>
CommitDate: Thu, 7 Jun 2012 10:21:21 -0400
x86: Save cr2 in NMI in case NMIs take a page fault
Avi Kivity reported that page faults in NMIs could cause havoc if
the NMI preempted another page fault handler:
The recent changes to NMI allow exceptions to take place in NMI
handlers, but I think that a #PF (say, due to access to vmalloc space)
is still problematic. Consider the sequence
#PF (cr2 set by processor)
NMI
...
#PF (cr2 clobbered)
do_page_fault()
IRET
...
IRET
do_page_fault()
address = read_cr2()
The last line reads the overwritten cr2 value.
Originally I wrote a patch to solve this by saving the cr2 on the stack.
Brian Gerst suggested saving it in the r12 register, as both r12 and rbx
are callee-saved registers preserved across the do_nmi call by the C ABI.
But rbx is already used to record whether swapgs needs to be run on exit
of the NMI handler.
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Avi Kivity <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Suggested-by: Brian Gerst <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86/kernel/entry_64.S | 20 ++++++++++++++++++++
1 files changed, 20 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 7d65133..111f6bb 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1758,10 +1758,30 @@ end_repeat_nmi:
*/
call save_paranoid
DEFAULT_FRAME 0
+
+ /*
+ * Save off the CR2 register. If we take a page fault in the NMI then
+ * it could corrupt the CR2 value. If the NMI preempts a page fault
+ * handler before it was able to read the CR2 register, and then the
+ * NMI itself takes a page fault, the page fault that was preempted
+ * will read the information from the NMI page fault and not the
+ * origin fault. Save it off and restore it if it changes.
+ * Use the r12 callee-saved register.
+ */
+ movq %cr2, %r12
+
/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
movq %rsp,%rdi
movq $-1,%rsi
call do_nmi
+
+ /* Did the NMI take a page fault? Restore cr2 if it did */
+ movq %cr2, %rcx
+ cmpq %rcx, %r12
+ je 1f
+ movq %r12, %cr2
+1:
+
testl %ebx,%ebx /* swapgs needed? */
jnz nmi_restore
nmi_swapgs:
Commit-ID: 70fb74a5420f9caa3e001d65004e4b669124283e
Gitweb: http://git.kernel.org/tip/70fb74a5420f9caa3e001d65004e4b669124283e
Author: Steven Rostedt <[email protected]>
AuthorDate: Thu, 7 Jun 2012 11:54:37 -0400
Committer: Steven Rostedt <[email protected]>
CommitDate: Fri, 8 Jun 2012 18:51:12 -0400
x86: Save cr2 in NMI in case NMIs take a page fault (for i386)
Avi Kivity reported that page faults in NMIs could cause havoc if
the NMI preempted another page fault handler:
The recent changes to NMI allow exceptions to take place in NMI
handlers, but I think that a #PF (say, due to access to vmalloc space)
is still problematic. Consider the sequence
#PF (cr2 set by processor)
NMI
...
#PF (cr2 clobbered)
do_page_fault()
IRET
...
IRET
do_page_fault()
address = read_cr2()
The last line reads the overwritten cr2 value.
This is the i386 version, which has the luxury of doing the work
in C code.
Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Avi Kivity <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86/kernel/nmi.c | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index a15a888..f84f5c5 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -395,6 +395,14 @@ static __kprobes void default_do_nmi(struct pt_regs *regs)
* thus there is no race between the first check of state for NOT_RUNNING
* and setting it to NMI_EXECUTING. The HW will prevent nested NMIs
* at this point.
+ *
+ * In case the NMI takes a page fault, we need to save off the CR2
+ * because the NMI could have preempted another page fault and corrupt
+ * the CR2 that is about to be read. As nested NMIs must be restarted
+ * and they can not take breakpoints or page faults, the update of the
+ * CR2 must be done before converting the nmi state back to NOT_RUNNING.
+ * Otherwise, there would be a race of another nested NMI coming in
+ * after setting state to NOT_RUNNING but before updating the nmi_cr2.
*/
enum nmi_states {
NMI_NOT_RUNNING = 0,
@@ -402,6 +410,7 @@ enum nmi_states {
NMI_LATCHED,
};
static DEFINE_PER_CPU(enum nmi_states, nmi_state);
+static DEFINE_PER_CPU(unsigned long, nmi_cr2);
#define nmi_nesting_preprocess(regs) \
do { \
@@ -410,11 +419,14 @@ static DEFINE_PER_CPU(enum nmi_states, nmi_state);
return; \
} \
this_cpu_write(nmi_state, NMI_EXECUTING); \
+ this_cpu_write(nmi_cr2, read_cr2()); \
} while (0); \
nmi_restart:
#define nmi_nesting_postprocess() \
do { \
+ if (unlikely(this_cpu_read(nmi_cr2) != read_cr2())) \
+ write_cr2(this_cpu_read(nmi_cr2)); \
if (this_cpu_dec_return(nmi_state)) \
goto nmi_restart; \
} while (0)