LinuxLists.cc - [patch 2/2] x86 NMI-safe INT3 and Page Fault

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

* Maciej W. Rozycki ([email protected]) wrote:
> On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:
>
> > This patch makes all faults, traps and exception safe to be called from NMI
> > context *except* single-stepping, which requires iret to restore the TF (trap
> > flag) and jump to the return address in a single instruction. Sorry, no kprobes
>
> Watch out for the RF flag too, that is not set correctly by POPFD -- that
> may be important for faulting instructions that also have a hardware
> breakpoint set at their address.
>
> > support in NMI handlers because of this limitation. This cannot be emulated
> > with popf/lret, because lret would be single-stepped. It does not apply to
> > "immediate values" because they do not use single-stepping. This code detects if
> > the TF flag is set and uses the iret path for single-stepping, even if it
> > reactivates NMIs prematurely.
>
> What about the VM flag for VM86 tasks? It cannot be changed by POPFD
> either.
>
> How about only using the special return path when a nested exception is
> about to return to the NMI handler? You'd avoid all the odd cases then
> that do not happen in the NMI context.

This is exactly what this patch does :-)

It selects the return path with

+ testl $NMI_MASK,TI_preempt_count(%ebp)
+ jz resume_kernel /* Not nested over NMI ? */

In addition, about int3 breakpoints use in the kernel, AFAIK the handler does
not explicitly set the RF flag, and the breakpoint instruction (int3) appears
not to set it. (from my understanding of Intel's
Intel Architecture Software Developer’s Manual Volume 3: System Programming
15.3.1.1. INSTRUCTION-BREAKPOINT EXCEPTION C)

So it should be safe to set a int3 breakpoint in a NMI handler with this patch.
It's just the "single-stepping" feature of kprobes which is problematic.
Luckily, only int3 is needed for code patching bypass.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-14 19:21:33

by Maciej W. Rozycki

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:

> > How about only using the special return path when a nested exception is
> > about to return to the NMI handler? You'd avoid all the odd cases then
> > that do not happen in the NMI context.
>
> This is exactly what this patch does :-)

Ah, OK then -- I understood you actually tested the value of TF in the
image to be restored.

> It selects the return path with
>
> + testl $NMI_MASK,TI_preempt_count(%ebp)
> + jz resume_kernel /* Not nested over NMI ? */
>
> In addition, about int3 breakpoints use in the kernel, AFAIK the handler does
> not explicitly set the RF flag, and the breakpoint instruction (int3) appears
> not to set it. (from my understanding of Intel's
> Intel Architecture Software Developer’s Manual Volume 3: System Programming
> 15.3.1.1. INSTRUCTION-BREAKPOINT EXCEPTION C)

The CPU only sets RF itself in the image saved in certain cases -- you'd
see it set in the page fault handler for example, so that once the handler
has finished any instruction breakpoint does not hit (presumably again,
because the instruction breakpoint debug exception has the highest
priority). You mentioned the need to handle these faults.

> So it should be safe to set a int3 breakpoint in a NMI handler with this patch.
>
> It's just the "single-stepping" feature of kprobes which is problematic.
> Luckily, only int3 is needed for code patching bypass.

Actually the breakpoint exception handler should actually probably set RF
explicitly, but that depends on the exact debugging scenario, so I can't
comment on it further. I don't know how INT3 is used in this context, so
I'm just noting this may be a danger zone.

Maciej

2010-07-14 19:58:44

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

* Maciej W. Rozycki ([email protected]) wrote:
> On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:
>
> > > How about only using the special return path when a nested exception is
> > > about to return to the NMI handler? You'd avoid all the odd cases then
> > > that do not happen in the NMI context.
> >
> > This is exactly what this patch does :-)
>
> Ah, OK then -- I understood you actually tested the value of TF in the
> image to be restored.

It tests it too. When it detects that the return path is about to return to a
NMI handler, it checks if the TF flag is set. If it is set, then "iret" is
really needed, because TF can only single-step an instruction when set by
"iret". The popf/ret scheme would otherwise trap at the "ret" instruction that
follows popf. Anyway, single-stepping is really discouraged in nmi handlers,
because there is no way to go around the iret.

>
> > It selects the return path with
> >
> > + testl $NMI_MASK,TI_preempt_count(%ebp)
> > + jz resume_kernel /* Not nested over NMI ? */
> >
> > In addition, about int3 breakpoints use in the kernel, AFAIK the handler does
> > not explicitly set the RF flag, and the breakpoint instruction (int3) appears
> > not to set it. (from my understanding of Intel's
> > Intel Architecture Software Developer’s Manual Volume 3: System Programming
> > 15.3.1.1. INSTRUCTION-BREAKPOINT EXCEPTION C)
>
> The CPU only sets RF itself in the image saved in certain cases -- you'd
> see it set in the page fault handler for example, so that once the handler
> has finished any instruction breakpoint does not hit (presumably again,
> because the instruction breakpoint debug exception has the highest
> priority). You mentioned the need to handle these faults.

Well, the only case where I think it might make sense to allow a breakpoint in
NMI handler code would be to temporarily replace a static branch, which should
in no way be able to trigger any other fault.

>
> > So it should be safe to set a int3 breakpoint in a NMI handler with this patch.
> >
> > It's just the "single-stepping" feature of kprobes which is problematic.
> > Luckily, only int3 is needed for code patching bypass.
>
> Actually the breakpoint exception handler should actually probably set RF
> explicitly, but that depends on the exact debugging scenario, so I can't
> comment on it further. I don't know how INT3 is used in this context, so
> I'm just noting this may be a danger zone.

In the case of temporary bypass, the int3 is only there to divert the
instruction execution flow to somewhere else, and we come back to the original
code at the address following the instruction which has the breakpoint. So
basically, we never come back to the original instruction, ever. We might as
well just clear the RF flag from the EFLAGS image before popf.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-14 20:36:35

by Maciej W. Rozycki

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:

> It tests it too. When it detects that the return path is about to return to a
> NMI handler, it checks if the TF flag is set. If it is set, then "iret" is
> really needed, because TF can only single-step an instruction when set by
> "iret". The popf/ret scheme would otherwise trap at the "ret" instruction that
> follows popf. Anyway, single-stepping is really discouraged in nmi handlers,
> because there is no way to go around the iret.

Hmm, with Pentium Pro and more recent processors there is actually a
nasty hack that will let you get away with POPF/RET and TF set. ;) You
can try it if you like and can arrange for an appropriate scenario.

> In the case of temporary bypass, the int3 is only there to divert the
> instruction execution flow to somewhere else, and we come back to the original
> code at the address following the instruction which has the breakpoint. So
> basically, we never come back to the original instruction, ever. We might as
> well just clear the RF flag from the EFLAGS image before popf.

Yes, if you return to elsewhere, then that's actually quite desirable
IMHO.

This RF flag is quite complicated to handle and there are some errata
involved too. If I understand it correctly, all fault-class exception
handlers are expected to set it manually in the image to be restored if
they return to the original faulting instruction (that includes the debug
exception handler if it was invoked as a fault, i.e. in response to an
instruction breakpoint). Then all trap-class exception handlers are
expected to clear the flag (and that includes the debug exception handler
if it was invoked as a trap, e.g. in response to a data breakpoint or a
single step). I haven't checked if Linux gets these bits right, but it
may be worth doing so.

For the record -- GDB hardly cares, because it removes any instruction
breakpoints before it is asked to resume execution of an instruction that
has a breakpoint set at, single-steps the instruction with all the other
threads locked out and then reinserts the breakpoints so that they can hit
again. Then it proceeds with whatever should be done next to fulfil the
execution request.

Maciej

2010-07-16 12:30:21

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/14/2010 06:49 PM, Mathieu Desnoyers wrote:
> Implements an alternative iret with popf and return so trap and exception
> handlers can return to the NMI handler without issuing iret. iret would cause
> NMIs to be reenabled prematurely. x86_32 uses popf and far return. x86_64 has to
> copy the return instruction pointer to the top of the previous stack, issue a
> popf, loads the previous esp and issue a near return (ret).
>
> It allows placing dynamically patched static jumps in asm gotos, which will be
> used for optimized tracepoints, in NMI code since returning from a breakpoint
> would be valid. Accessing vmalloc'd memory, which allows executing module code
> or accessing vmapped or vmalloc'd areas from NMI context, would also be valid.
> This is very useful to tracers like LTTng.
>
> This patch makes all faults, traps and exception safe to be called from NMI
> context*except* single-stepping, which requires iret to restore the TF (trap
> flag) and jump to the return address in a single instruction. Sorry, no kprobes
> support in NMI handlers because of this limitation. This cannot be emulated
> with popf/lret, because lret would be single-stepped. It does not apply to
> "immediate values" because they do not use single-stepping. This code detects if
> the TF flag is set and uses the iret path for single-stepping, even if it
> reactivates NMIs prematurely.
>

You need to save/restore cr2 in addition, otherwise the following hits you

- page fault
- processor writes cr2, enters fault handler
- nmi
- page fault
- cr2 overwritten

I guess you would usually not notice the corruption since you'd just see
a spurious fault on the page the NMI handler touched, but if the first
fault happened in a kvm guest, then we'd corrupt the guest's cr2.

But the whole thing strikes me as overkill. If it's 8k per-cpu, what's
wrong with using a per-cpu pointer to a kmalloc() area?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 14:49:32

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

* Avi Kivity ([email protected]) wrote:
> On 07/14/2010 06:49 PM, Mathieu Desnoyers wrote:
>> Implements an alternative iret with popf and return so trap and exception
>> handlers can return to the NMI handler without issuing iret. iret would cause
>> NMIs to be reenabled prematurely. x86_32 uses popf and far return. x86_64 has to
>> copy the return instruction pointer to the top of the previous stack, issue a
>> popf, loads the previous esp and issue a near return (ret).
>>
>> It allows placing dynamically patched static jumps in asm gotos, which will be
>> used for optimized tracepoints, in NMI code since returning from a breakpoint
>> would be valid. Accessing vmalloc'd memory, which allows executing module code
>> or accessing vmapped or vmalloc'd areas from NMI context, would also be valid.
>> This is very useful to tracers like LTTng.
>>
>> This patch makes all faults, traps and exception safe to be called from NMI
>> context*except* single-stepping, which requires iret to restore the TF (trap
>> flag) and jump to the return address in a single instruction. Sorry, no kprobes
>> support in NMI handlers because of this limitation. This cannot be emulated
>> with popf/lret, because lret would be single-stepped. It does not apply to
>> "immediate values" because they do not use single-stepping. This code detects if
>> the TF flag is set and uses the iret path for single-stepping, even if it
>> reactivates NMIs prematurely.
>>
>
> You need to save/restore cr2 in addition, otherwise the following hits you
>
> - page fault
> - processor writes cr2, enters fault handler
> - nmi
> - page fault
> - cr2 overwritten
>
> I guess you would usually not notice the corruption since you'd just see
> a spurious fault on the page the NMI handler touched, but if the first
> fault happened in a kvm guest, then we'd corrupt the guest's cr2.

OK, just to make sure: you mean we'd have to save/restore the cr2 register
at the beginning/end of the NMI handler execution, right ? The shouldn't we
save/restore cr3 too ?

> But the whole thing strikes me as overkill. If it's 8k per-cpu, what's
> wrong with using a per-cpu pointer to a kmalloc() area?

Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
much more than perf) can potentially cause large latencies, which could be
squashed by allowing page faults in NMI handlers. This looks like a stronger
argument to me.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-16 15:34:41

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> much more than perf) can potentially cause large latencies, which could be

You need to fix all other code too that walks tasks lists to avoid all those.

% gid for_each_process | wc -l

In fact the mm-struct walk is cheaper than a task-list walk because there
are always less than tasks.

-Andi

2010-07-16 15:40:28

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

* Andi Kleen ([email protected]) wrote:
> > Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> > much more than perf) can potentially cause large latencies, which could be
>
> You need to fix all other code too that walks tasks lists to avoid all those.
>
> % gid for_each_process | wc -l

This can very well be done incrementally. And I agree, these should eventually
targeted too, especially those which hold locks. We've already started hearing
about tasklist lock live-locks in the past year, so I think we're pretty much at
the point where it should be looked at.

Thanks,

Mathieu

>
> In fact the mm-struct walk is cheaper than a task-list walk because there
> are always less than tasks.

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-16 16:48:08

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 05:49 PM, Mathieu Desnoyers wrote:
>
>> You need to save/restore cr2 in addition, otherwise the following hits you
>>
>> - page fault
>> - processor writes cr2, enters fault handler
>> - nmi
>> - page fault
>> - cr2 overwritten
>>
>> I guess you would usually not notice the corruption since you'd just see
>> a spurious fault on the page the NMI handler touched, but if the first
>> fault happened in a kvm guest, then we'd corrupt the guest's cr2.
>>
> OK, just to make sure: you mean we'd have to save/restore the cr2 register
> at the beginning/end of the NMI handler execution, right ?

Yes.

> The shouldn't we
> save/restore cr3 too ?
>
>

No, faults should not change cr3.

>> But the whole thing strikes me as overkill. If it's 8k per-cpu, what's
>> wrong with using a per-cpu pointer to a kmalloc() area?
>>
> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> much more than perf) can potentially cause large latencies, which could be
> squashed by allowing page faults in NMI handlers. This looks like a stronger
> argument to me.

Why is that kernel code calling vmalloc_sync_all()? If it is only NMI
which cannot take vmalloc faults, why bother? If not, why not?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 16:59:01

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

* Avi Kivity ([email protected]) wrote:
> On 07/16/2010 05:49 PM, Mathieu Desnoyers wrote:
>>
>>> You need to save/restore cr2 in addition, otherwise the following hits you
>>>
>>> - page fault
>>> - processor writes cr2, enters fault handler
>>> - nmi
>>> - page fault
>>> - cr2 overwritten
>>>
>>> I guess you would usually not notice the corruption since you'd just see
>>> a spurious fault on the page the NMI handler touched, but if the first
>>> fault happened in a kvm guest, then we'd corrupt the guest's cr2.
>>>
>> OK, just to make sure: you mean we'd have to save/restore the cr2 register
>> at the beginning/end of the NMI handler execution, right ?
>
> Yes.

OK

>
>> The shouldn't we
>> save/restore cr3 too ?
>>
>>
>
> No, faults should not change cr3.

Ah, right.

>
>>> But the whole thing strikes me as overkill. If it's 8k per-cpu, what's
>>> wrong with using a per-cpu pointer to a kmalloc() area?
>>>
>> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
>> much more than perf) can potentially cause large latencies, which could be
>> squashed by allowing page faults in NMI handlers. This looks like a stronger
>> argument to me.
>
> Why is that kernel code calling vmalloc_sync_all()? If it is only NMI
> which cannot take vmalloc faults, why bother? If not, why not?

Modules come as yet another example of stuff that is loaded in vmalloc'd space
and can be accesses from NMI context. That would include oprofile, tracers, and
probably others I'm forgetting about.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-16 17:54:59

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 07:58 PM, Mathieu Desnoyers wrote:
>
>> Why is that kernel code calling vmalloc_sync_all()? If it is only NMI
>> which cannot take vmalloc faults, why bother? If not, why not?
>>
> Modules come as yet another example of stuff that is loaded in vmalloc'd space
> and can be accesses from NMI context. That would include oprofile, tracers, and
> probably others I'm forgetting about.
>

Module loading can certainly take a vmalloc_sync_all() (though I agree
it's unpleasant). Anything else?

Note perf is not modular at this time, but could be made so with
preempt/sched notifiers to hook the context switch.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 18:07:08

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 10:54 AM, Avi Kivity wrote:
> On 07/16/2010 07:58 PM, Mathieu Desnoyers wrote:
>>
>>> Why is that kernel code calling vmalloc_sync_all()? If it is only NMI
>>> which cannot take vmalloc faults, why bother? If not, why not?
>>>
>> Modules come as yet another example of stuff that is loaded in vmalloc'd space
>> and can be accesses from NMI context. That would include oprofile, tracers, and
>> probably others I'm forgetting about.
>>
>
> Module loading can certainly take a vmalloc_sync_all() (though I agree
> it's unpleasant). Anything else?
>
> Note perf is not modular at this time, but could be made so with
> preempt/sched notifiers to hook the context switch.
>

Actually, module loading is already a performance problem; a lot of
distros load sometimes hundreds of modules on startup, and it's heavily
serialized, so I can see this being desirable to skip.

I really hope noone ever gets the idea of touching user space from an
NMI handler, though, and expecting it to work...

-hpa

2010-07-16 18:16:15

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 09:05 PM, H. Peter Anvin wrote:
>
>> Module loading can certainly take a vmalloc_sync_all() (though I agree
>> it's unpleasant). Anything else?
>>
>> Note perf is not modular at this time, but could be made so with
>> preempt/sched notifiers to hook the context switch.
>>
>>
> Actually, module loading is already a performance problem; a lot of
> distros load sometimes hundreds of modules on startup, and it's heavily
> serialized, so I can see this being desirable to skip.
>

There aren't that many processes at this time (or there shouldn't be,
don't know how fork-happy udev is at this stage), so the sync should be
pretty fast. In any case, we can sync only modules that contain NMI
handlers.

> I really hope noone ever gets the idea of touching user space from an
> NMI handler, though, and expecting it to work...
>

I think the concern here is about an NMI handler's code running in
vmalloc space, or is it something else?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 18:19:21

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 11:15 AM, Avi Kivity wrote:
>
> There aren't that many processes at this time (or there shouldn't be,
> don't know how fork-happy udev is at this stage), so the sync should be
> pretty fast. In any case, we can sync only modules that contain NMI
> handlers.
>
>> I really hope noone ever gets the idea of touching user space from an
>> NMI handler, though, and expecting it to work...
>>
>
> I think the concern here is about an NMI handler's code running in
> vmalloc space, or is it something else?
>

Code or data, yes; including percpu data.

-hpa

2010-07-16 18:22:44

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

* Avi Kivity ([email protected]) wrote:
> On 07/16/2010 09:05 PM, H. Peter Anvin wrote:
>>
>>> Module loading can certainly take a vmalloc_sync_all() (though I agree
>>> it's unpleasant). Anything else?
>>>
>>> Note perf is not modular at this time, but could be made so with
>>> preempt/sched notifiers to hook the context switch.
>>>
>>>
>> Actually, module loading is already a performance problem; a lot of
>> distros load sometimes hundreds of modules on startup, and it's heavily
>> serialized, so I can see this being desirable to skip.
>>
>
> There aren't that many processes at this time (or there shouldn't be,
> don't know how fork-happy udev is at this stage), so the sync should be
> pretty fast. In any case, we can sync only modules that contain NMI
> handlers.

USB hotplug is a use-case happening randomly after the system is well there and
running; I'm afraid this does not fit in your module loading expectations. It
triggers tons of events, many of these actually load modules.

Thanks,

Mathieu

>
>> I really hope noone ever gets the idea of touching user space from an
>> NMI handler, though, and expecting it to work...
>>
>
> I think the concern here is about an NMI handler's code running in
> vmalloc space, or is it something else?
>
> --
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-16 18:30:26

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 09:17 PM, H. Peter Anvin wrote:
>
>> I think the concern here is about an NMI handler's code running in
>> vmalloc space, or is it something else?
>>
>>
> Code or data, yes; including percpu data.
>

Use kmalloc and percpu pointers, it's not that onerous.

Oh, and you can access vmalloc space by switching cr3 temporarily to
init_mm's, no? Obviously not a very performant solution, at least
without PCIDs, but can be used if needed.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 18:30:47

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 11:15 AM, Avi Kivity <[email protected]> wrote:
>
> I think the concern here is about an NMI handler's code running in vmalloc
> space, or is it something else?

I think the concern was also potentially doing things like backtraces
etc that may need access to the module data structures (I think the
ELF headers end up all being in vmalloc space too, for example).

The whole debugging thing is also an issue. Now, I obviously am not a
big fan of remote debuggers, but everybody tells me I'm wrong. And
putting a breakpoint on NMI is certainly not insane if you are doing
debugging in the first place. So it's not necessarily always about the
page faults.

Linus

2010-07-16 18:33:23

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 09:22 PM, Mathieu Desnoyers wrote:
>
>> There aren't that many processes at this time (or there shouldn't be,
>> don't know how fork-happy udev is at this stage), so the sync should be
>> pretty fast. In any case, we can sync only modules that contain NMI
>> handlers.
>>
> USB hotplug is a use-case happening randomly after the system is well there and
> running; I'm afraid this does not fit in your module loading expectations. It
> triggers tons of events, many of these actually load modules.
>

How long would vmalloc_sync_all take with a few thousand mm_struct take?

We share the pmds, yes? So it's a few thousand memory accesses. The
direct impact is probably negligible, compared to actually loading the
module from disk. All we need is to make sure the locking doesn't slow
down unrelated stuff.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 18:44:40

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 11:28 AM, Avi Kivity <[email protected]> wrote:
>
> Use kmalloc and percpu pointers, it's not that onerous.

What people don't seem to understand is that WE SHOULD NOT MAKE NMI
FORCE US TO DO "STRANGE" CODE IN CODE-PATHS THAT HAVE NOTHING
WHAT-SO-EVER TO DO WITH NMI.

I'm shouting, because this point seems to have been continually
missed. It was missed in the original patches, and it's been missed in
the discussions.

Non-NMI code should simply never have to even _think_ about NMI's. Why
should it? It should just do whatever comes "natural" within its own
context.

This is why I've been pushing for the "let's just fix NMI" approach.
Not adding random hacks to other code sequences that have nothing
what-so-ever to do with NMI.

So don't add NMI code to the page fault code. Not to the debug code,
or to the module loading code. Don't say "use special allocations
because the NMI code may care about these particular data structures".
Because that way lies crap and unmaintainability.

Linus

2010-07-16 19:27:56

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 09:37 PM, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 11:28 AM, Avi Kivity<[email protected]> wrote:
>
>> Use kmalloc and percpu pointers, it's not that onerous.
>>
> What people don't seem to understand is that WE SHOULD NOT MAKE NMI
> FORCE US TO DO "STRANGE" CODE IN CODE-PATHS THAT HAVE NOTHING
> WHAT-SO-EVER TO DO WITH NMI.
>
> I'm shouting, because this point seems to have been continually
> missed. It was missed in the original patches, and it's been missed in
> the discussions.
>
> Non-NMI code should simply never have to even _think_ about NMI's. Why
> should it? It should just do whatever comes "natural" within its own
> context.
>
>

But we're not talking about non-NMI code. The 8k referred to in the
original patch are buffers used by NMI stack recording. Module code
vmalloc_sync_all() is only need by code that is executed during NMI,
hence must be NMI aware.

> This is why I've been pushing for the "let's just fix NMI" approach.
> Not adding random hacks to other code sequences that have nothing
> what-so-ever to do with NMI.
>

"fixing NMI" will result in code that is understandable by maybe three
people after long and hard thinking. NMI can happen in too many
semi-defined contexts, so there will be plenty of edge cases. I'm not
sure we can ever trust such trickery.

> So don't add NMI code to the page fault code. Not to the debug code,
> or to the module loading code. Don't say "use special allocations
> because the NMI code may care about these particular data structures".
> Because that way lies crap and unmaintainability.
>

If NMI code can call random hooks and access random data, yes. But I
don't think we're at that point yet.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 19:28:54

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

> Actually, module loading is already a performance problem; a lot of
> distros load sometimes hundreds of modules on startup, and it's heavily

On startup you don't have many processes. If there's a problem
it's surely not the fault of vmalloc_sync_all()

BTW in my experience one reason module loading was traditionally slow was
that it did a stop_machine(). I think(?) that has been fixed
at some point. But even with that's it's more an issue on larger
systems.

> I really hope noone ever gets the idea of touching user space from an
> NMI handler, though, and expecting it to work...

It can make sense for a backtrace in a profiler.

In fact perf is nearly doing it I believe, but moves
it to the self IPI handler in most cases.

-Andi
--
[email protected] -- Speaking for myself only.

2010-07-16 19:30:52

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 11:32 AM, Avi Kivity wrote:
>
> How long would vmalloc_sync_all take with a few thousand mm_struct take?
>
> We share the pmds, yes? So it's a few thousand memory accesses. The
> direct impact is probably negligible, compared to actually loading the
> module from disk. All we need is to make sure the locking doesn't slow
> down unrelated stuff.
>

It's not the memory accesses, it's the need to synchronize all the CPUs.

-hpa

2010-07-16 19:30:57

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 11:25:19AM -0700, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 11:15 AM, Avi Kivity <[email protected]> wrote:
> >
> > I think the concern here is about an NMI handler's code running in vmalloc
> > space, or is it something else?
>
> I think the concern was also potentially doing things like backtraces
> etc that may need access to the module data structures (I think the
> ELF headers end up all being in vmalloc space too, for example).
>
> The whole debugging thing is also an issue. Now, I obviously am not a
> big fan of remote debuggers, but everybody tells me I'm wrong. And
> putting a breakpoint on NMI is certainly not insane if you are doing
> debugging in the first place. So it's not necessarily always about the
> page faults.

We already have infrastructure for kprobes to prevent breakpoints
on critical code (the __kprobes section). In principle kgdb/kdb
could be taught about honoring those too.

That wouldn't help for truly external JTAG debuggers, but I would assume
those generally can (should) handle any contexts anyways.

-Andi

--
[email protected] -- Speaking for myself only.

2010-07-16 19:32:22

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 09:32:00PM +0300, Avi Kivity wrote:
> On 07/16/2010 09:22 PM, Mathieu Desnoyers wrote:
> >
> >>There aren't that many processes at this time (or there shouldn't be,
> >>don't know how fork-happy udev is at this stage), so the sync should be
> >>pretty fast. In any case, we can sync only modules that contain NMI
> >>handlers.
> >USB hotplug is a use-case happening randomly after the system is well there and
> >running; I'm afraid this does not fit in your module loading expectations. It
> >triggers tons of events, many of these actually load modules.
>
> How long would vmalloc_sync_all take with a few thousand mm_struct take?
>
> We share the pmds, yes? So it's a few thousand memory accesses.
> The direct impact is probably negligible, compared to actually
> loading the module from disk. All we need is to make sure the
> locking doesn't slow down unrelated stuff.

Also you have to remember that vmalloc_sync_all() only does something
when the top level page is actually updated. That is very rare.
(in many cases it should happen at most once per boot)
Most mapping changes update lower levels, and those are already
shared.

-Andi
--
[email protected] -- Speaking for myself only.

2010-07-16 19:32:55

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 10:28 PM, Andi Kleen wrote:
>
>> I really hope noone ever gets the idea of touching user space from an
>> NMI handler, though, and expecting it to work...
>>
> It can make sense for a backtrace in a profiler.
>
> In fact perf is nearly doing it I believe, but moves
> it to the self IPI handler in most cases.
>

Interesting, is the self IPI guaranteed to execute synchronously after
the NMI's IRET? Or can the core IRET faster than the APIC and so we get
the backtrace at the wrong place?

(and does it matter? the NMI itself is not always accurate)

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 19:34:11

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 10:32:13PM +0300, Avi Kivity wrote:
> On 07/16/2010 10:28 PM, Andi Kleen wrote:
> >
> >>I really hope noone ever gets the idea of touching user space from an
> >>NMI handler, though, and expecting it to work...
> >It can make sense for a backtrace in a profiler.
> >
> >In fact perf is nearly doing it I believe, but moves
> >it to the self IPI handler in most cases.
>
> Interesting, is the self IPI guaranteed to execute synchronously
> after the NMI's IRET? Or can the core IRET faster than the APIC and
> so we get the backtrace at the wrong place?
>
> (and does it matter? the NMI itself is not always accurate)

self ipi runs after the next STI (or POPF)

-Andi
--
[email protected] -- Speaking for myself only.

2010-07-16 19:40:56

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 10:29 PM, H. Peter Anvin wrote:
> On 07/16/2010 11:32 AM, Avi Kivity wrote:
>
>> How long would vmalloc_sync_all take with a few thousand mm_struct take?
>>
>> We share the pmds, yes? So it's a few thousand memory accesses. The
>> direct impact is probably negligible, compared to actually loading the
>> module from disk. All we need is to make sure the locking doesn't slow
>> down unrelated stuff.
>>
>>
> It's not the memory accesses, it's the need to synchronize all the CPUs.
>

I'm missing something. Why do we need to sync all cpus? the
vmalloc_sync_all() I'm reading doesn't.

Even if we do an on_each_cpu() somewhere, it isn't the end of the world.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 21:46:15

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 12:26 PM, Avi Kivity <[email protected]> wrote:
> On 07/16/2010 09:37 PM, Linus Torvalds wrote:
>>
>> Non-NMI code should simply never have to even _think_ about NMI's. Why
>> should it? It should just do whatever comes "natural" within its own
>> context.
>
> But we're not talking about non-NMI code.

Yes, we are. We're talking about breakpoints (look at the subject
line), and you are very much talking about things like that _idiotic_
vmalloc_sync_all() by module loading code etc etc.

Every _single_ "solution" I have seen - apart from my suggestion - has
been about making code "special" because some other code might run in
an NMI. Module init sequences having to do idiotic things just because
they have data structures that might get accessed by NMI.

And the thing is, if we just do NMI's correctly, and allow nesting,
ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
stupid things elsewhere.

In other words, why the hell are you arguing? Help Mathieu write the
low-level NMI handler right, and remove that idiotic
"vmalloc_sync_all()" that is fundamentally broken and should not
exist. Rather than talk about adding more of that kind of crap.

Linus

2010-07-16 22:02:53

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

Date Fri, 16 Jul 2010 14:39:50 -0700
>From Linus Torvalds

> Linus Torvalds wrote:
> But we're not talking about non-NMI code.Yes, we are. We're talking about breakpoints (look at the subjectline), and you are very much talking about things like that _idiotic_vmalloc_sync_all() by module
> loading code etc etc.Every _single_ "solution" I have seen - apart from my suggestion - hasbeen about making code "special" because some other code might run inan NMI. Module init sequences having to > do idiotic things just becausethey have data structures that might get accessed by NMI.And the thing is, if we just do NMI's correctly, and allow nesting,ALL THOSE PROBLEMS GO AWAY. And there is no > reason what-so-ever to dostupid things elsewhere.In other words, why the hell are you arguing? Help Mathieu write thelow-level NMI handler right, and remove that idiotic"vmalloc_sync_all()" that is
> fundamentally broken and should notexist. Rather than talk about adding more of that kind of crap.
>
> Linus

So Linus, my understanding of Intel's processor design is that the
processor will NEVER singal a nested NMI until it sees an iret from
the first NMI exception. At least that's how the processors were
working
when I started this unless this behavior has changed. Just put a
gate on the exception that uses its own stack (which I think we do
anyway).

Jeff

2010-07-16 22:07:36

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

> And the thing is, if we just do NMI's correctly, and allow nesting,
> ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
> stupid things elsewhere.

One issue I have with nesting NMIs is that you need
a nesting limit, otherwise you'll overflow the NMI stack.

We just got rid of nesting for normal interrupts because
of this stack overflow problem which hit in real situations.

In some cases you can get quite high NMI frequencies, e.g. with
performance counters. Now the current performance counter handlers
do not nest by themselves of course, but they might nest
with other longer running NMI users.

I think none of the current handlers are likely to nest
for very long, but there's more and more NMI coded all the time,
so it's definitely a concern.

-Andi

--
[email protected] -- Speaking for myself only.

2010-07-16 22:22:48

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 3:02 PM, Jeffrey Merkey <[email protected]> wrote:
>
> So Linus, my understanding of Intel's processor design is that the
> processor will NEVER singal a nested NMI until it sees an iret from
> the first NMI exception.

Wrong.

I like x86, but it has warts. The NMI blocking is one of them.

The NMI's will be nested until the _next_ "iret", but it has no
nesting. So if you take a fault during the NMI (debug, page table
fixup, whatever), the iret in the faulthandler will re-enable NMI's
even though we're still busy with the original NMI. There is no
nesting, or any way to say that "this is a NMI-releasing iret". They
could even do it still - make a new "iret that doesn't clear NMI" by
adding a segment override prefix to iret or whatever. But it's not
going to happen, and it's just one of those ugly special cases that
has various historical reasons (recursive faults during NMI sure as
hell didn't make sense back in the real-mode 8086 days).

So we have to handle it in software. Or not ever trap at all inside
the NMI handler.

The original patch - and the patch I detest - is to make the normal
fault paths use a "popf + ret" to emulate iret, but without the NMI
release.

Now, I could live with that if it's the only solution, but it _is_
pretty damn ugly.

If somebody shows that it's actually faster to do "popf + ret" when
retuning to kernel space (a poor mans special-case iret), maybe it
would be worth it, but the really critical code sequence is actually
not "return to kernel space", but the "return to user space" case that
really wants the iret. And I just think it's disgusting to add extra
tests to that path.

The other alternative would be to just make the rule be "NMI can never
take traps". It's possible to do that, but quite frankly, it's a pain.
It's a pain for page faults due to the whole vmalloc thing, and it's a
pain if you ever want to debug an NMI in any way (or put a breakpoint
on anything that is accessed from an NMI, which could potentially be
quite a lot of things).

If it was just the debug issue, I'd say "neener neener, debuggers are
for wimps", but it's clearly not just about debug. It's a whole lot of
other thigs. Random percpu datastructures used for tracing, kernel
pointer verification code, yadda yadda.

Linus

2010-07-16 22:34:46

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 3:07 PM, Andi Kleen <[email protected]> wrote:
>
> One issue I have with nesting NMIs is that you need
> a nesting limit, otherwise you'll overflow the NMI stack.

Have you actually looked at the suggestion I (and now Mathieu)
suggested code for?

The nesting is very limited. NMI's would nest just once, and when that
happens, the nested NMI would never use more than something like a
hundred bytes of stack (most of which is what the CPU pushes
directly). And there would be no device interrupts that nest, and
practically the faults that nest obviously aren't going to be complex
faults either (ie the page fault would be the simple case that never
calls to 'handle_vm_fault()', but handles it all in
arch/x86/mm/fault.c.

IOW, there is absolutely _no_ issues with nesting. It's two levels
deep, and a much smaller stack footprint than our regular exception
nesting for those two levels too.

And your argument that there would be more and more NMI usage only
makes it more important that we handle NMI's without going crazy. Just
handle them cleanly instead of making them something totally special.

Linus

2010-07-16 22:40:47

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

* Andi Kleen ([email protected]) wrote:
> > And the thing is, if we just do NMI's correctly, and allow nesting,
> > ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
> > stupid things elsewhere.
>
> One issue I have with nesting NMIs is that you need
> a nesting limit, otherwise you'll overflow the NMI stack.
>
> We just got rid of nesting for normal interrupts because
> of this stack overflow problem which hit in real situations.
>
> In some cases you can get quite high NMI frequencies, e.g. with
> performance counters. Now the current performance counter handlers
> do not nest by themselves of course, but they might nest
> with other longer running NMI users.
>
> I think none of the current handlers are likely to nest
> for very long, but there's more and more NMI coded all the time,
> so it's definitely a concern.

We're not proposing to actually "nest" NMIs per se. We copy the stack at the
beginning of the NMI handler (and then use the copy) to permit nesting of faults
over NMI handlers. Following NMIs that would "try" to nest over the NMI handler
would see their regular execution postponed until the end of the currently
running NMI handler. It's OK for these "nested" NMI handlers to use the bottom
of NMI stack because the NMI handler on which they are trying to nest is only
using the stack copy. These "nested" handlers return to the original NMI handler
very early just after setting a "pending nmi" flag. There is more to it (e.g.
handling NMI handler exit atomically with respect to incoming NMIs); please
refer to the last assembly code snipped I sent to Linus a little earlier today
for details.

Thanks,

Mathieu

>
> -Andi
>
> --
> [email protected] -- Speaking for myself only.

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-16 22:41:43

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 03:26:32PM -0700, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 3:07 PM, Andi Kleen <[email protected]> wrote:
> >
> > One issue I have with nesting NMIs is that you need
> > a nesting limit, otherwise you'll overflow the NMI stack.
>
> Have you actually looked at the suggestion I (and now Mathieu)
> suggested code for?

Maybe I'm misunderstanding everything (and it has been a lot of emails
in the thread), but the case I was thinking of would be if the second NMI
faults too, and then another one comes in after the IRET etc.

-Andi

2010-07-16 22:48:06

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 4:22 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Jul 16, 2010 at 3:02 PM, Jeffrey Merkey <[email protected]> wrote:
>>
>> So Linus, my understanding of Intel's processor design is that the
>> processor will NEVER singal a nested NMI until it sees an iret from
>> the first NMI exception.
>
> Wrong.
>
> I like x86, but it has warts. The NMI blocking is one of them.
>
> The NMI's will be nested until the _next_ "iret", but it has no
> nesting. So if you take a fault during the NMI (debug, page table
> fixup, whatever), the iret in the faulthandler will re-enable NMI's
> even though we're still busy with the original NMI. There is no
> nesting, or any way to say that "this is a NMI-releasing iret". They
> could even do it still - make a new "iret that doesn't clear NMI" by
> adding a segment override prefix to iret or whatever. But it's not
> going to happen, and it's just one of those ugly special cases that
> has various historical reasons (recursive faults during NMI sure as
> hell didn't make sense back in the real-mode 8086 days).
>
> So we have to handle it in software. Or not ever trap at all inside
> the NMI handler.
>
> The original patch - and the patch I detest - is to make the normal
> fault paths use a "popf + ret" to emulate iret, but without the NMI
> release.
>
> Now, I could live with that if it's the only solution, but it _is_
> pretty damn ugly.
>
> If somebody shows that it's actually faster to do "popf + ret" when
> retuning to kernel space (a poor mans special-case iret), maybe it
> would be worth it, but the really critical code sequence is actually
> not "return to kernel space", but the "return to user space" case that
> really wants the iret. And I just think it's disgusting to add extra
> tests to that path.
>
> The other alternative would be to just make the rule be "NMI can never
> take traps". It's possible to do that, but quite frankly, it's a pain.
> It's a pain for page faults due to the whole vmalloc thing, and it's a
> pain if you ever want to debug an NMI in any way (or put a breakpoint
> on anything that is accessed from an NMI, which could potentially be
> quite a lot of things).
>
> If it was just the debug issue, I'd say "neener neener, debuggers are
> for wimps", but it's clearly not just about debug. It's a whole lot of
> other thigs. Random percpu datastructures used for tracing, kernel
> pointer verification code, yadda yadda.
>
> ? ? ? ? ? ? ? ? ?Linus
>

Well, the way I handled this problem on NetWare SMP and that other
kernel was to create a pool of TSS descriptors and reload each during
the exception to swap stacks before any handlers were called. Allowed
it to nest until I ran out of TSS descriptors (64 levels). Not sure
that's the way to go here though but it worked on that case.

Jeff

2010-07-16 22:50:05

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

>
> If it was just the debug issue, I'd say "neener neener, debuggers are
> for wimps", but it's clearly not just about debug. It's a whole lot of
> other thigs. Random percpu datastructures used for tracing, kernel
> pointer verification code, yadda yadda.
>
> ? ? ? ? ? ? ? ? ?Linus
>

I guess I am a wimp then ... :-)

Jeff

2010-07-16 22:53:42

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

>
> Well, the way I handled this problem on NetWare SMP and that other
> kernel was to create a pool of TSS descriptors and reload each during
> the exception to swap stacks before any handlers were called. ?Allowed
> it to nest until I ran out of TSS descriptors (64 levels). ?Not sure
> that's the way to go here though but it worked on that case.
>
> Jeff
>

Here is where that old dusty code lives these days - it deals with this problem.

http://open-source-netware.googlecode.com/files/manos-06-26-2010.tar.gz

file to look at is startup.386

;
; nmi entry code
;

nmi_entry macro
cli
push ebx
push ebp
mov ebp, esp
sub ebp, SIZE TaskStateSegment
mov ebx, ebp

mov [ebp].tSS, ss
mov [ebp].tGS, gs ; save segment registers
mov [ebp].tFS, fs
mov [ebp].tES, es
mov [ebp].tDS, ds
pop [ebp].tEBP
mov [ebp].tEDI, edi
mov [ebp].tESI, esi
mov [ebp].tEDX, edx
mov [ebp].tECX, ecx
pop [ebp].tEBX
mov [ebp].tEAX, eax

pop [ebp].tEIP ; remove return address
pop eax
mov [ebp].tCS, ax
pop [ebp].tSystemFlags ; get flags into TSS

mov [ebp].tESP, esp ; save true stack address
mov esp, ebx ; cover stack frame

mov eax, CR0
and eax, 0FFFFFFF7h ; clear task switch bit in CR0 to
mov CR0, eax ; avoid NPX exceptions

xor eax, eax
mov dr7, eax ; disable breakpoints

mov eax, CR3 ;
mov [ebp].tCR3, eax ;
mov eax, DebuggerPDE
mov CR3, eax

;
; if we do not clear the NESTED_TASK_FLAG, then the IRET
; at the end of this function will cause
; an invalid TSS exception to be generated because the
; task busy bit was cleared earlier
;

pushfd
and dword ptr [esp], NOT (NESTED_TASK_FLAG OR SINGLE_STEP_FLAG)
or dword ptr [esp], RESUME_FLAG
popfd

mov eax, 0FFFFFFFFh ; mark as a non-pooled TSS exception
push eax

push 0
push 0
push ebp

endm

;
; TSS entry code
;

task_entry macro
LOCAL @TSSNotNested, @NoLTR
LOCAL @UsedDefaultSegment
LOCAL @UsedPooledSegment
LOCAL @EnterTheDebugger

cli
xor eax, eax
str ax
mov esi, offset SystemGDTTable
mov esi, dword ptr [esi + 2]
lea ebx, [esi + eax]
mov al, [ebx].TSSBase2
mov ah, [ebx].TSSBase3
shl eax, 16
mov ax, [ebx].TSSBase1

;
; eax -> TSS Segment (Current)
; ebx -> TSS Descriptor (Current)
;

movzx ecx, word ptr [eax].tBackLink
or ecx, ecx
jz @TSSNotNested

mov esi, offset SystemGDTTable
mov esi, dword ptr [esi + 2]
lea edx, [esi + ecx]
mov cl, [edx].TSSBase2
mov ch, [edx].TSSBase3
shl ecx, 16
mov cx, [edx].TSSBase1

mov ebp, ecx

;
; edx -> TSS Descriptor (Previous)
; ebp -> TSS Segment (Previous)
;
; clear busy state and reset TSS
;

mov [edx].TSSType, 10001001b

@TSSNotNested:
mov [ebx].TSSType, 10001001b

lgdt ds: SystemGDTTable ; reset GDT TSS Busy bit

movzx eax, word ptr [eax].tBackLink
or eax, eax
jz @NoLTR

ltr ax

@NoLTR:

mov eax, CR0
and eax, 0FFFFFFF7h ; clear task switch bit in CR0 to
mov CR0, eax ; avoid NPX exceptions

xor eax, eax
mov dr7, eax ; disable breakpoints

pushfd
and dword ptr [esp], NOT (NESTED_TASK_FLAG OR SINGLE_STEP_FLAG)
or dword ptr [esp], RESUME_FLAG
popfd

push ebp
call AllocPooledResource
pop ebp

or eax, eax
jz @UsedDefaultSegment

lea ebp, [eax].TSSSegment
mov esp, [eax].StackTop

push eax ; push address of pooled resource
jmp @UsedPooledSegment

@UsedDefaultSegment:
mov eax, 0FFFFFFFFh ; push non-pooled marker onto the stack
push eax

@UsedPooledSegment:

push 0
mov eax, CR2 ; get fault address
push eax
push ebp ; pass the TSS

endm

;
; TSS exit code
;

Jeff

2010-07-17 01:16:33

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 3:41 PM, Andi Kleen <[email protected]> wrote:
>
> Maybe I'm misunderstanding everything (and it has been a lot of emails
> in the thread), but the case I was thinking of would be if the second NMI
> faults too, and then another one comes in after the IRET etc.

No, the nested NMI cannot fault, because it never even enters C code.
It literally just returns immediately after having noticed it is
nested (and corrupted the stack of the original one, so that the
original NMI will re-do itself at return)..

So the nested NMI will use some few tens of bytes of stack. In fact,
it will use the stack "above" the stack that the original NMI handler
is using, because it will reset the stack pointer back to the top of
the NMI stack. So in a very real sense, it is not even extending the
stack, it is just re-using a small part of the same stack that the
original NMI used (and that we copied away so that it doesn't matter
that it gets re-used)

As to another small but important detail: the _nested_ NMI actually
returns using "popf+ret", leaving NMI's blocked again. Thus
guaranteeing forward progress and lack of NMI storms.

To summarize:

- the "original" (first-level) NMI can take faults (like the page
fault to fill in vmalloc pages lazily, or debug faults). That will
actually cause two stack frames (or three, if you debug a page fault
that happened while NMI was active). So there is certainly exception
nesting going on, but we're talking _much_ less stack than normal
stack usage where the nesting can be deep and in complex routines.

- any "nested" NMI's will not actually use any more stack at all than
a non-nested one, because we've pre-reserved space for them (and we
_had_ to reserve space for them due to IST)

- even if we get NMI's during the execution of the original NMI,
there can be only one such "spurious" NMI per nested exception. So if
we take a single page fault, that exception will re-enable NMI
(because it returns with "iret"), and as a result we may take a
_single_ new nested NMI until we disable NMI's again.

In other words, the approach is not all that different from doing
"lazy irq disable" like powerpc does for regular interrupts. For
NMI's, we do it because it's impossible (on x86) to disable NMI's
without actually taking one.

Linus

2010-07-18 09:25:22

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/17/2010 12:39 AM, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 12:26 PM, Avi Kivity<[email protected]> wrote:
>
>> On 07/16/2010 09:37 PM, Linus Torvalds wrote:
>>
>>> Non-NMI code should simply never have to even _think_ about NMI's. Why
>>> should it? It should just do whatever comes "natural" within its own
>>> context.
>>>
>> But we're not talking about non-NMI code.
>>
> Yes, we are. We're talking about breakpoints (look at the subject
> line), and you are very much talking about things like that _idiotic_
> vmalloc_sync_all() by module loading code etc etc.
>

Well, I'd put it in the nmi handler registration code, but you're
right. A user placing breakpoints can't even tell whether the
breakpoint will be hit by NMI code, especially data breakpoints.

> Every _single_ "solution" I have seen - apart from my suggestion - has
> been about making code "special" because some other code might run in
> an NMI. Module init sequences having to do idiotic things just because
> they have data structures that might get accessed by NMI.
>
> And the thing is, if we just do NMI's correctly, and allow nesting,
> ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
> stupid things elsewhere.
>
> In other words, why the hell are you arguing? Help Mathieu write the
> low-level NMI handler right, and remove that idiotic
> "vmalloc_sync_all()" that is fundamentally broken and should not
> exist. Rather than talk about adding more of that kind of crap.
>

Well, at least we'll get a good test case for kvm's nmi blocking
emulation (it's tricky since if we fault on an iret sometimes nmis get
unblocked even though the instruction did not complete).

--
error compiling committee.c: too many arguments to function

2010-07-18 09:27:50

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 10:30 PM, Andi Kleen wrote:
> We already have infrastructure for kprobes to prevent breakpoints
> on critical code (the __kprobes section). In principle kgdb/kdb
> could be taught about honoring those too.
>
>

It doesn't help with NMI code calling other functions, or with data
breakpoints.

--
error compiling committee.c: too many arguments to function

2010-08-04 09:47:45

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, 2010-07-16 at 11:05 -0700, H. Peter Anvin wrote:
>
> I really hope noone ever gets the idea of touching user space from an
> NMI handler, though, and expecting it to work...

Perf actually already does that to unwind user-space stacks... ;-)

See arch/x86/kernel/cpu/perf_event.c:copy_from_user_nmi() and its users.

What we do is a manual page table walk (using __get_user_pages_fast) and
simply bail when the page is not available.

That said, I think that the thing that started the whole
per-cpu-per-context temp stack-frame storage story also means that that
function is now broken and can lead to kmap_atomic corruption.

I really should brush up that stack based kmap_atomic thing, last time I
got stuck on FRV wanting things.

Linus should I refresh that whole series and give a FRV a slow but
working implementation and then let David Howells sort out things if he
cares about that?

2010-08-04 20:27:08