[moved to a new thread, cc list trimmed]
Hi all-
We've considered two approaches to dealing with NMIs:
1. Allow nesting. We know quite well how messy that is.
2. Forbid IRET inside NMIs. Doable but maybe not that pretty.
We haven't considered:
3. Forbid faults (other than MCE) inside NMI.
Option 3 is almost easy. There are really only two kinds of faults
that can legitimately nest inside NMI: #PF and #DB. #DB is easy to
fix (e.g. with my patches or Peter's patches).
What if we went all out and forbade page faults in NMI as well? There
are two reasons that I can think of that we might page fault inside an
NMI:
a) vmalloc fault. I think Ingo already half-implemented a rework to
eliminate vmalloc faults entirely.
b) User memory access faults.
The reason we access user state in general from an NMI is to allow
perf to capture enough user stack data to let the tooling backtrace
back to user space. What if we did it differently? Instead of
capturing this data in NMI context, capture it in
prepare_exit_to_usermode. That would let us capture user state
*correctly*, which we currently can't really do. There's a
never-ending series of minor bugs in which we try to guess the user
register state from NMI context, and it sort of works. In
prepare_exit_to_usermode, we really truly know the user state.
There's a race where an NMI hits during or after
prepare_exit_to_usermode, but maybe that's okay -- just admit defeat
in that case and don't show the user state. (Realistically, without
CFI data, we're not going to be guaranteed to get the right state
anyway.)
To make this work, we'd have to teach NMI-from-userspace to call the
callback itself. It would look like:
prepare_exit_to_usermode() {
    ...
    while (blah blah blah) {
        if (cached_flags & TIF_PERF_CAPTURE_USER_STATE)
            perf_capture_user_state();
        ...
    }
    ...
}
and then, on NMI exit, we'd call perf_capture_user_state directly,
since we don't want to enable IRQs or do opportunistic sysret on exit
from NMI. (Why not? Because NMIs are still masked, and we don't want
to pay for double-IRET to unmask them, so we really want to leave IRQs
off and IRET straight back to user mode.)
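(A rough sketch of that NMI-exit hook, purely to make the idea concrete
-- TIF_PERF_CAPTURE_USER_STATE and perf_capture_user_state() are the
hypothetical names from above, and the function itself is made up:)

        /* Called from the NMI-from-userspace exit path, IRQs still off. */
        static void nmi_capture_user_state(void)
        {
                if (test_and_clear_thread_flag(TIF_PERF_CAPTURE_USER_STATE)) {
                        /* The pt_regs saved at NMI entry are the true user state. */
                        perf_capture_user_state();
                }
                /* ...and then IRET straight back to user mode. */
        }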
There's an unavoidable race in which we enter user mode with
TIF_PERF_CAPTURE_USER_STATE still set. In principle, we could
IPI-to-self from the NMI handler to cover that case (mostly -- we
capture the wrong state if we're on our way to an IRET fault), or we
could just check on entry if the flag is still set and, if so, admit
defeat.
Peter, can this be done without breaking the perf ABI? If we were
designing all of this stuff from scratch right now, I'd suggest doing
it this way, but I'm not sure whether it makes sense to try to
retrofit it in.
If we decide to stick with option 2, then I've now convinced myself
that banning all kernel breakpoints and watchpoints during NMI
processing is probably for the best. Maybe we should go one step
farther and ban all DR7 breakpoints period. Sure, it will slow down
perf if there are user breakpoints or watchpoints set, but, having
looked at the asm, returning from #DB using RET is, while doable,
distinctly ugly.
--Andy
On Thu, Jul 23, 2015 at 1:21 PM, Andy Lutomirski <[email protected]> wrote:
>
> 2. Forbid IRET inside NMIs. Doable but maybe not that pretty.
>
> We haven't considered:
>
> 3. Forbid faults (other than MCE) inside NMI.
I'd really prefer #2. #3 depends on us getting many things right, and
never introducing new cases in the future.
#2, in contrast, seems to be fairly localized. Yes, RF is an issue,
but returning to user space with RF clear doesn't really seem to be
all that problematic.
The point of RF is to make forward progress in the face of debug
register faults, but I don't see what was wrong with the whole
"disable any debug events that happen with interrupts disabled".
And no, I do *not* believe that we should disable debug faults ahead
of time. We should take them, disable them, and return with 'ret'. No
complex "you can't put breakpoints in this region" crap, no magic
rules, no subtle issues.
I really think your "disallow #DB" is pointless. I think your "prevent
instruction breakpoints in NMI" is wrong. Let them happen. Take them
and disable them. Return with RF clear. Go on with your life.
And the "take them and disable them" is really simple. No "am I in an
NMI contect" thing (because that leads to the whole question about
"what is NMI context"). That's not the real rule anyway.
No, make it very simple and straightforward. Make the test be "uhhuh,
I got a #DB in kernel mode, and interrupts were disabled - I know I'm
going to return with "ret", so I'm just going to have to disable this
breakpoint".
Nothing clever. Nothing subtle. Nothing that needs "this range of
instructions is magical". No. Just a very simple rule: if the context
we return to is kernel mode and interrupts are disabled, we're using
'ret', so we cannot suppress debug faults.
Did I miss something? There were a lot of emails flying around, but I
*thought* I saw them all..
Linus
On Thu, Jul 23, 2015 at 1:38 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Jul 23, 2015 at 1:21 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> 2. Forbid IRET inside NMIs. Doable but maybe not that pretty.
>>
>> We haven't considered:
>>
>> 3. Forbid faults (other than MCE) inside NMI.
>
> I'd really prefer #2. #3 depends on us getting many things right, and
> never introducing new cases in the future.
>
> #2, in contrast, seems to be fairly localized. Yes, RF is an issue,
> but returning to user space with RF clear doesn't really seem to be
> all that problematic.
>
> The point of RF is to make forward progress in the face of debug
> register faults, but I don't see what was wrong with the whole
> "disable any debug events that happen with interrupts disabled".
>
> And no, I do *not* believe that we should disable debug faults ahead
> of time. We should take them, disable them, and return with 'ret'. No
> complex "you can't put breakpoints in this region" crap, no magic
> rules, no subtle issues.
>
> I really think your "disallow #DB" is pointless. I think your "prevent
> instruction breakpoints in NMI" is wrong. Let them happen. Take them
> and disable them. Return with RF clear. Go on with your life.
>
> And the "take them and disable them" is really simple. No "am I in an
> NMI contect" thing (because that leads to the whole question about
> "what is NMI context"). That's not the real rule anyway.
>
> No, make it very simple and straightforward. Make the test be "uhhuh,
> I got a #DB in kernel mode, and interrupts were disabled - I know I'm
> going to return with "ret", so I'm just going to have to disable this
> breakpoint".
>
> Nothing clever. Nothing subtle. Nothing that needs "this range of
> instructions is magical". No. Just a very simple rule: if the context
> we return to is kernel mode and interrupts are disabled, we're using
> 'ret', so we cannot suppress debug faults.
There are some subtleties in here.
Issue A: to return with RF clear, we need to disarm the breakpoint.
If it's limited to the duration of the NMI, that's easy. If not, when
do we re-arm? New prepare_exit_to_usermode hook? Hmm, setting ti
flags during context switch may target the wrong task.
Issue B: single-step exception after SYSENTER. The patches I just
sent fix that, though.
Issue C: #DB with invalid stack pointer (can happen due to watchpoints
during SYSCALL entry or SYSRET exit). I guess we need to ban such
watchpoints.
Issue D: debug exception inside EFI (especially mixed-mode EFI). We
can't return using RET, so we need to catch that case.
These issues mostly go away if we preemptively disarm DR7 early in NMI
processing and rearm it at the end.
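(Concretely, something like this bracketing the C-level NMI handler --
get_debugreg()/set_debugreg() exist today, the exact placement is
hand-waved:)

        unsigned long dr7;

        get_debugreg(dr7, 7);
        set_debugreg(0, 7);     /* no breakpoints can fire from here on */

        /* ... normal NMI processing ... */

        set_debugreg(dr7, 7);   /* rearm on the way out */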
--Andy
On Thu, Jul 23, 2015 at 01:38:33PM -0700, Linus Torvalds wrote:
> On Thu, Jul 23, 2015 at 1:21 PM, Andy Lutomirski <[email protected]> wrote:
> >
> > 2. Forbid IRET inside NMIs. Doable but maybe not that pretty.
> >
> > We haven't considered:
> >
> > 3. Forbid faults (other than MCE) inside NMI.
>
> I'd really prefer #2. #3 depends on us getting many things right, and
> never introducing new cases in the future.
>
> #2, in contrast, seems to be fairly localized. Yes, RF is an issue,
> but returning to user space with RF clear doesn't really seem to be
> all that problematic.
What's the worst case that can happen with RF cleared when returning
to user space? My understanding is that it's just that we risk
breaking again on an instruction that had a breakpoint set and which
already triggered the breakpoint, right?
If so the problem probably is whether there's a risk of looping again
without ever getting a chance to execute this instruction normally.
But if the NMIs don't bomb as fast as we can process them, at some
point the instruction should get a chance to be executed, so the
problem doesn't seem dramatic.
That makes me think that I have no idea what happens if we try to
step-trace "int 2", I don't even know if we pass through the NMI
handler.
Willy
On Thu, Jul 23, 2015 at 1:52 PM, Willy Tarreau <[email protected]> wrote:
> On Thu, Jul 23, 2015 at 01:38:33PM -0700, Linus Torvalds wrote:
>> On Thu, Jul 23, 2015 at 1:21 PM, Andy Lutomirski <[email protected]> wrote:
>> >
>> > 2. Forbid IRET inside NMIs. Doable but maybe not that pretty.
>> >
>> > We haven't considered:
>> >
>> > 3. Forbid faults (other than MCE) inside NMI.
>>
>> I'd really prefer #2. #3 depends on us getting many things right, and
>> never introducing new cases in the future.
>>
>> #2, in contrast, seems to be fairly localized. Yes, RF is an issue,
>> but returning to user space with RF clear doesn't really seem to be
>> all that problematic.
>
> What's the worst case that can happen with RF cleared when returning
> to user space? My understanding is that it's just that we risk
> breaking again on an instruction that had a breakpoint set and which
> already triggered the breakpoint, right?
I assume Linus meant returning to kernel space with RF clear. Returns
to userspace have their own fancy logic here, and it's survived for a
couple of releases, including through an explicit test of RF handling
:)
--Andy
On Thu, Jul 23, 2015 at 01:53:34PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 23, 2015 at 1:52 PM, Willy Tarreau <[email protected]> wrote:
> > On Thu, Jul 23, 2015 at 01:38:33PM -0700, Linus Torvalds wrote:
> >> On Thu, Jul 23, 2015 at 1:21 PM, Andy Lutomirski <[email protected]> wrote:
> >> >
> >> > 2. Forbid IRET inside NMIs. Doable but maybe not that pretty.
> >> >
> >> > We haven't considered:
> >> >
> >> > 3. Forbid faults (other than MCE) inside NMI.
> >>
> >> I'd really prefer #2. #3 depends on us getting many things right, and
> >> never introducing new cases in the future.
> >>
> >> #2, in contrast, seems to be fairly localized. Yes, RF is an issue,
> >> but returning to user space with RF clear doesn't really seem to be
> >> all that problematic.
> >
> > What's the worst case that can happen with RF cleared when returning
> > to user space? My understanding is that it's just that we risk
> > breaking again on an instruction that had a breakpoint set and which
> > already triggered the breakpoint, right?
>
> I assume Linus meant returning to kernel space with RF clear. Returns
> to userspace have their own fancy logic here, and it's survived for a
> couple of releases, including through an explicit test of RF handling
> :)
Ah you must be right, got it. Yes you want to break into the NMI handler
and you either disable all breakpoints/single-step until the NMI's iret
by clearing DR7, or you loop over and over on the same instruction if
you try to restart the stopped instruction with RF clear. That makes
sense.
Thanks,
Willy
On Thu, Jul 23, 2015 at 1:49 PM, Andy Lutomirski <[email protected]> wrote:
>
> Issue A: to return with RF clear, we need to disarm the breakpoint.
> If it's limited to the duration of the NMI, that's easy. If not, when
> do we re-arm? New prepare_exit_to_usermode hook? Hmm, setting ti
> flags during context switch may target the wrong task.
We don't re-arm it.
We can entertain the notion *eventually* to do something clever, but
for now, just say: stability and simplicity is more important.
People can use tracepoints in interrupts-off code (they get rewritten
with 'int3', that's fine), but not instruction breakpoints.
> Issue C: #DB with invalid stack pointer (can happen due to watchpoints
> during SYSCALL entry or SYSRET exit). I guess we need to ban such
> watchpoints.
.. but this is unrelated to NMI, just "syscall is a nasty interface".
Don't we already ban them?
> Issue D: debug exception inside EFI (especially mixed-mode EFI). We
> can't return using RET, so we need to catch that case.
If NMI code calls EFI code, then it's broken.
> These issues mostly go away if we preemptively disarm DR7 early in NMI
> processing and rearm it at the end.
I'm not *violently* opposed to that, but it's just a band-aid. It
doesn't *fix* anything. You aren't protecting against random DB
exceptions just because somebody put a data breakpoint on the NMI
stack, for example. You still get page faults. Etc etc.
So I think the whole "use ret instead" is a pretty simple approach.
Make that "just work".
Then, if you want to play with dr7 inside NMI to make it more likely
that you can have breakpoints live in irq-off situation, I think
that's a magic special case. It shouldn't be part of the design.
Things should work without it.
Linus
On Thu, Jul 23, 2015 at 1:52 PM, Willy Tarreau <[email protected]> wrote:
>
> What's the worst case that can happen with RF cleared when returning
> to user space?
Not a good idea. We are fine breaking breakpoints on the kernel ("use
the tracing infrastructure instead"). Breaking it in user space is not
really an option.
And we really don't need to. We'd only use 'ret' when returning to
kernel code. And not even for the usual case, only for the "interrupts
are off" case. If somebody tries to put a breakpoint on something
that is used in an irq-off situation, they are doing something very
specialized, and we can tell them: "sorry, we had to break your use
case because it's crazy any other way".
Those kind of people are by definition not "users". They are mucking
with kernel internals. Breaking them is not a regression.
Btw, we should still ask Intel for that "fast iret that doesn't
re-enable NMI". So for possible future CPU's we might let people do
crazy things again.
Linus
On Thu, Jul 23, 2015 at 01:21:16PM -0700, Andy Lutomirski wrote:
> 3. Forbid faults (other than MCE) inside NMI.
>
> Option 3 is almost easy. There are really only two kinds of faults
> that can legitimately nest inside NMI: #PF and #DB. #DB is easy to
> fix (e.g. with my patches or Peter's patches).
>
> What if we went all out and forbade page faults in NMI as well. There
> are two reasons that I can think of that we might page fault inside an
> NMI:
>
> b) User memory access faults.
>
> The reason we access user state in general from an NMI is to allow
> perf to capture enough user stack data to let the tooling backtrace
> back to user space. What if we did it differently? Instead of
> capturing this data in NMI context, capture it in
> prepare_exit_to_usermode.
> Peter, can this be done without breaking the perf ABI? If we were
> designing all of this stuff from scratch right now, I'd suggest doing
> it this way, but I'm not sure whether it makes sense to try to
> retrofit it in.
Not really; but also almost :/
So the thing is that we currently attach the user backtrace to all
events -- and there can be many before we return to userspace again.
So none of those events would have a userspace stack, I'm sure that's
going to confuse the tooling.
OTOH, userspace stacks are a best effort thing, we bail at the first
sign of trouble (eg. the stack page is not there).
Now realistically this 'never' happens, and it would result in
consistently truncated user traces, where your proposal would result in
a whole bunch of events with no user traces and then an 'extra' event
with one.
On Thu, Jul 23, 2015 at 02:13:16PM -0700, Linus Torvalds wrote:
> On Thu, Jul 23, 2015 at 1:52 PM, Willy Tarreau <[email protected]> wrote:
> >
> > What's the worst case that can happen with RF cleared when returning
> > to user space?
>
> Not a good idea. We are fine breaking breakpoints on the kernel ("use
> the tracing infrastructure instead"). Breaking it in user space is not
> really an option.
But that wouldn't disable the breakpoint, just make it strike again,
so the user would not be hurt.
> And we really don't need to. We'd only use 'ret' when returning to
> kernel code. And not even for the usual case, only for the "interrupts
> are off" case. If somebody tries to put a breakpoint on something
> that is used in an irq-off situation, they are doing something very
> specialized, and we can tell them: "sorry, we had to break your use
> case because it's crazy any other way".
>
> Those kind of people are by definition not "users". They are mucking
> with kernel internals. Breaking them is not a regression.
>
> Btw, we should still ask Intel for that "fast iret that doesn't
> re-enable NMI". So for possible future CPU's we might let people do
> crazy things again.
I'm just thinking that there should be an option for this: task switching.
You can store the EFLAGS in the TSS, so by preparing a dummy task with
everything needed to emulate iret, we might be able to do it without the
iret instruction. Or is this a stupid idea? At least now I've well
understood that ugliness is not an excuse for not proposing something :-)
Willy
On Thu, Jul 23, 2015 at 01:38:33PM -0700, Linus Torvalds wrote:
> And the "take them and disable them" is really simple. No "am I in an
> NMI contect" thing (because that leads to the whole question about
> "what is NMI context"). That's not the real rule anyway.
>
> No, make it very simple and straightforward. Make the test be "uhhuh,
> I got a #DB in kernel mode, and interrupts were disabled - I know I'm
> going to return with "ret", so I'm just going to have to disable this
> breakpoint".
>
> Nothing clever. Nothing subtle. Nothing that needs "this range of
> instructions is magical". No. Just a very simple rule: if the context
> we return to is kernel mode and interrupts are disabled, we're using
> 'ret', so we cannot suppress debug faults.
>
> Did I miss something? There were a lot of emails flying around, but I
> *thought* I saw them all..
So the NMI could trigger userspace debug register faults, and simply
disabling them would make the whole debug register thing entirely
unreliable.
On Thu, 23 Jul 2015 13:21:16 -0700
Andy Lutomirski <[email protected]> wrote:
> 3. Forbid faults (other than MCE) inside NMI.
>
> Option 3 is almost easy. There are really only two kinds of faults
> that can legitimately nest inside NMI: #PF and #DB. #DB is easy to
> fix (e.g. with my patches or Peter's patches).
What about int3? Which is needed to make ftrace work. This was a
requirement to get rid of stomp-machine when updating ftrace functions,
as well as the rationale for doing the whole NMI nesting work in the
first place.
>
> What if we went all out and forbade page faults in NMI as well. There
> are two reasons that I can think of that we might page fault inside an
> NMI:
>
> a) vmalloc fault. I think Ingo already half-implemented a rework to
> eliminate vmalloc faults entirely.
>
> b) User memory access faults.
c) stack tracing faults
I would have NMIs print stack traces to debug deadlocks. The stack
tracer can page fault, and before the NMI nesting code, while debugging
machines, these stack dumps would randomly reboot the box. While
writing the NMI nesting code I realized why those reboots happened:
the stack trace would fault, and the printk from NMI was slow enough
to have another NMI go off and stomp over the outer NMI's stack, which
led to triple faults and such.
-- Steve
On Thu, 23 Jul 2015 14:08:59 -0700
Linus Torvalds <[email protected]> wrote:
> On Thu, Jul 23, 2015 at 1:49 PM, Andy Lutomirski <[email protected]> wrote:
> >
> > Issue A: to return with RF clear, we need to disarm the breakpoint.
> > If it's limited to the duration of the NMI, that's easy. If not, when
> > do we re-arm? New prepare_exit_to_usermode hook? Hmm, setting ti
> > flags during context switch may target the wrong task.
>
> We don't re-arm it.
>
Let me get this straight. The idea is in the #DB handler to detect that
it was triggered in NMI context, and if so, simply disarm that
breakpoint permanently, right?
Nothing should be adding hw breakpoints to NMI code anyway. Sounds
perfectly reasonable to me. Of course, how we tell we are in NMI
brings back all the races as we had in the nesting code. We can check
the per-cpu variable that is set with nmi_enter() and cleared at
nmi_exit() but what happens if the breakpoint is outside those calls.
We can check the stack pointer, but then we are back to userspace
fooling us. Maybe add the DF trick again?
-- Steve
On Thu, Jul 23, 2015 at 2:20 PM, Peter Zijlstra <[email protected]> wrote:
>
> So the NMI could trigger userspace debug register faults, and simply
> disabling them would make the whole debug register thing entirely
> unreliable.
We could easily set something to re-enable them for when we actually
return to user space. I'd be ok with just setting the
_TIF_USER_WORK_MASK.
But even that should not be a requirement for the basic stability and
core integrity of the kernel. Not like the current horrid mess with
NMI nesting and ESP fixing etc.
And realistically, nobody will ever even notice. So the whole "ok, we
can use _TIF_USER_WORK_MASK to re-enable dr7" is a tiny tiny detail
that is more like cleaning up things, not a core issue.
Linus
On Thu, Jul 23, 2015 at 2:35 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Jul 23, 2015 at 2:20 PM, Peter Zijlstra <[email protected]> wrote:
>>
>> So the NMI could trigger userspace debug register faults, and simply
>> disabling them would make the whole debug register thing entirely
>> unreliable.
>
> We could easily set something to re-enable them for when we actually
> return to user space. I'd be ok with just setting the
> _TIF_USER_WORK_MASK.
>
> But even that should not be a requirement for the basic stability and
> core integrity of the kernel. Not like the current horrid mess with
> NMI nesting and ESP fixing etc.
>
> And realistically, nobody will ever even notice. So the whole "ok, we
> can use _TIF_USER_WORK_MASK to re-enable dr7" is a tiny tiny detail
> that is more like cleaning up things, not a core issue.
>
Or we just re-enable them on the way out of NMI (i.e. the very last
thing we do in the NMI handler). I don't want to break regular
userspace gdb when perf is running.
--Andy
On Thu, Jul 23, 2015 at 05:31:05PM -0400, Steven Rostedt wrote:
> On Thu, 23 Jul 2015 14:08:59 -0700
> Linus Torvalds <[email protected]> wrote:
>
> > On Thu, Jul 23, 2015 at 1:49 PM, Andy Lutomirski <[email protected]> wrote:
> > >
> > > Issue A: to return with RF clear, we need to disarm the breakpoint.
> > > If it's limited to the duration of the NMI, that's easy. If not, when
> > > do we re-arm? New prepare_exit_to_usermode hook? Hmm, setting ti
> > > flags during context switch may target the wrong task.
> >
> > We don't re-arm it.
> >
>
> Let me get this straight. The idea is in the #DB handler to detect that
> it was triggered in NMI context, and if so, simply disarm that
> breakpoint permanently, right?
>
> Nothing should be adding hw breakpoints to NMI code anyway. Sounds
> perfectly reasonable to me. Of course, how we tell we are in NMI
> brings back all the races as we had in the nesting code. We can check
> the per-cpu variable that is set with nmi_enter() and cleared at
> nmi_exit() but what happens if the breakpoint is outside those calls.
> We can check the stack pointer, but then we are back to userspace
> fooling us. Maybe add the DF trick again?
Can't the back link of the TSS tell us where we come from ? At least
it should not be manipulable from user-space.
Willy
On Thu, Jul 23, 2015 at 2:20 PM, Steven Rostedt <[email protected]> wrote:
> On Thu, 23 Jul 2015 13:21:16 -0700
> Andy Lutomirski <[email protected]> wrote:
>
>> 3. Forbid faults (other than MCE) inside NMI.
>>
>> Option 3 is almost easy. There are really only two kinds of faults
>> that can legitimately nest inside NMI: #PF and #DB. #DB is easy to
>> fix (e.g. with my patches or Peter's patches).
>
> What about int3? Which is needed to make ftrace work. This was a
> requirement to get rid of stomp-machine when updating ftrace functions,
> as well as the rationale for doing the whole NMI nesting work in the
> first place.
OK, I'm convinced.
So I'll keep working on fixing up int3 to be less magical. Patches
coming eventually.
--Andy
On Thu, Jul 23, 2015 at 2:46 PM, Willy Tarreau <[email protected]> wrote:
> On Thu, Jul 23, 2015 at 05:31:05PM -0400, Steven Rostedt wrote:
>> On Thu, 23 Jul 2015 14:08:59 -0700
>> Linus Torvalds <[email protected]> wrote:
>>
>> > On Thu, Jul 23, 2015 at 1:49 PM, Andy Lutomirski <[email protected]> wrote:
>> > >
>> > > Issue A: to return with RF clear, we need to disarm the breakpoint.
>> > > If it's limited to the duration of the NMI, that's easy. If not, when
>> > > do we re-arm? New prepare_exit_to_usermode hook? Hmm, setting ti
>> > > flags during context switch may target the wrong task.
>> >
>> > We don't re-arm it.
>> >
>>
>> Let me get this straight. The idea is in the #DB handler to detect that
>> it was triggered in NMI context, and if so, simply disarm that
>> breakpoint permanently, right?
>>
>> Nothing should be adding hw breakpoints to NMI code anyway. Sounds
>> perfectly reasonable to me. Of course, how we tell we are in NMI
>> brings back all the races as we had in the nesting code. We can check
>> the per-cpu variable that is set with nmi_enter() and cleared at
>> nmi_exit() but what happens if the breakpoint is outside those calls.
>> We can check the stack pointer, but then we are back to userspace
>> fooling us. Maybe add the DF trick again?
>
> Can't the back link of the TSS tell us where we come from ? At least
> it should not be manipulable from user-space.
Not on 64-bit -- there are no tasks :)
--Andy
On Thu, Jul 23, 2015 at 2:31 PM, Steven Rostedt <[email protected]> wrote:
>
> Let me get this straight. The idea is in the #DB handler to detect that
> it was triggered in NMI context, and if so, simply disarm that
> breakpoint permanently, right?
No, for simplicity, I'd make it cover not just NMI code, but any
"kernel code with interrupts disabled".
Because that's the test we'd use for "use ret instead of iret".
And that wider test is exactly because it's so damn hard to get the
exact instruction boundaries right. Let's *not* go down the path
(again) of having to get the whole %rip range and "magic stack pointer
values" etc.
Make it simple and completely unambiguous. The rule really would be:
- if we return to kernel space and interrupts are disabled, we will
use "ret" rather than "iret"
Hard rule. Simple. Straightforward. No random %rip values. No
random %rsp values. NO CRAP.
- but because we use "ret" rather than "iret" we can't get RF
semantics, it means that #DB is special. RF is supposed to make us
make forward progress
So for that reason, #DB just says "if the breakpoint happened
during that interrupts-off region, I will clear %dr7 to guarantee
forward progress"
So those would be the two main rules. Very simple, and avoiding all nasty cases.
Now, I'd be willing to then hide the "oops, we clear dr7 very
agrressively" issue by having a few additional _heuristics_. But I
call them "heuristics" because unlike the current NMI nesting games,
they aren't about core stability. They are about "ok, maybe somebody
wants to trigger those faults, and we'll be _nice_ and try to make it
easy for them", but nothing more.
So for example, if that "#DB clears %dr7" happened, it sounds easy to
set _TIF_USER_WORK_MASK, and just force %dr7 to be re-loaded from a
cached value, so that if we disabled things because of some user stack
trace access, it will be re-enabled by the time we return to user
space. I think that sounds reasonable, but it's not something the core
low-level entry x86 assembly code needs to even care about. It's not
that level of "core", it's just being polite.
Linus
On Thu, Jul 23, 2015 at 2:48 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Jul 23, 2015 at 2:31 PM, Steven Rostedt <[email protected]> wrote:
>>
>> Let me get this straight. The idea is in the #DB handler to detect that
>> it was triggered in NMI context, and if so, simply disarm that
>> breakpoint permanently, right?
>
> No, for simplicity, I'd make it cover not just NMI code, but any
> "kernel code with interrupts disabled".
>
> Because that's the test we'd use for "use ret instead of iret".
>
> And that wider test is exactly because it's so damn hard to get the
> exact instruction boundaries right. Let's *not* go down the path
> (again) of having to get the whole %rip range and "magic stack pointer
> values" etc.
>
> Make it simple and completely unambiguous. The rule really would be:
>
> - if we return to kernel space and interrupts are disabled, we will
> use "ret" rather than "iret"
>
> Hard rule. Simple. Straightforward. No random %rip values. No
> random %rsp values. NO CRAP.
>
> - but because we use "ret" rather than "iret" we can't get RF
> semantics, it means that #DB is special. RF is supposed to make us
> make forward progress
>
> So for that reason, #DB just says "if the breakpoint happened
> during that interrupts-off region, I will clear %dr7 to guarantee
> forward progress"
What if we relax it slightly: "if the breakpoint happened during that
interrupts-off region, I will clear all *kernel breakpoints* in %dr7
to guarantee forward progress"?
Watchpoints don't need RF to make forward progress, and, by leaving
watchpoints alone, we avoid breaking gdb.
>
> So those would be the two main rules. Very simple, and avoiding all nasty cases.
>
> Now, I'd be willing to then hide the "oops, we clear dr7 very
> agrressively" issue by having a few additional _heuristics_. But I
> call them "heuristics" because unlike the current NMI nesting games,
> they aren't about core stability. They are about "ok, maybe somebody
> wants to trigger those faults, and we'll be _nice_ and try to make it
> easy for them", but nothing more.
>
> So for example, if that "#DB clears %dr7" happened, it sounds easy to
> set _TIF_USER_WORK_MASK, and just force %dr7 to be re-loaded from a
> cached value, so that if we disabled things because of some user stack
> trace access, it will be re-enabled by the time we return to user
> space. I think that sounds reasonable, but it's not something the core
> low-level entry x86 assembly code needs to even care about. It's not
> that level of "core", it's just being polite.
Once we limit it to instruction breakpoints, I don't think re-enabling
before returning to userspace matters.
--Andy
On Thu, Jul 23, 2015 at 02:46:49PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 23, 2015 at 2:46 PM, Willy Tarreau <[email protected]> wrote:
> > Can't the back link of the TSS tell us where we come from ? At least
> > it should not be manipulable from user-space.
>
> Not on 64-bit -- there are no tasks :)
Ah crap, sorry for the noise then!
Willy
On Thu, Jul 23, 2015 at 2:45 PM, Andy Lutomirski <[email protected]> wrote:
>
> Or we just re-enable them on the way out of NMI (i.e. the very last
> thing we do in the NMI handler). I don't want to break regular
> userspace gdb when perf is running.
I'd really prefer it if we don't touch NMI code in those kinds of
ways. The NMI code is fragile as hell. All the problems we have with
it is exactly due to "where is the boundary" issues.
That's why I *don't* want NMI code to do magic crap. Anything that
says "disable this during this magic window" is broken. The problems
we've had are exactly about atomicity of the entry/exit conditions,
and there is no really good way to get them right.
I'd be much happier with a _TIF_USER_WORK_MASK approach exactly
because it's so *obvious* that it's not a boundary condition.
I dislike the "disable and re-enable dr7 in the NMI handler" exactly
because it smells like "we can only handle faults in _this_ region".
It may be true, but it's also what I want us to get away from. I'd
much rather have the "big picture" be that we can take faults anywhere
at all (*), and that none of the core code really cares. Then we "fix
up" user space.
Linus
(*) And yes, sysenter and not having a stack at all is very special,
and I think we will *always* have to have that magical special case of
the first few instructions there. But that's a separate hardware
limitation we can't get around.
On Thu, Jul 23, 2015 at 2:54 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Jul 23, 2015 at 2:45 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> Or we just re-enable them on the way out of NMI (i.e. the very last
>> thing we do in the NMI handler). I don't want to break regular
>> userspace gdb when perf is running.
>
> I'd really prefer it if we don't touch NMI code in those kinds of
> ways. The NMI code is fragile as hell. All the problems we have with
> it is exactly due to "where is the boundary" issues.
>
> That's why I *don't* want NMI code to do magic crap. Anything that
> says "disable this during this magic window" is broken. The problems
> we've had are exactly about atomicity of the entry/exit conditions,
> and there is no really good way to get them right.
>
> I'd be much happier with a _TIF_USER_WORK_MASK approach exactly
> because it's so *obvious* that it's not a boundary condition.
>
> I dislike the "disable and re-enable dr7 in the NMI handler" exactly
> because it smells like "we can only handle faults in _this_ region".
> It may be true, but it's also what I want us to get away from. I'd
> much rather have the "big picture" be that we can take faults anywhere
> at all (*), and that none of the core code really cares. Then we "fix
> up" user space.
OK, new proposal:
In do_debug, if we trip an instruction breakpoint while
!user_mode(regs) && ((regs->flags & X86_EFLAGS_IF) == 0), then disarm
*that breakpoint*.
Why? It only looks at hardware state (dr6 and dr7), and it can't
break gdb, because gdb can't set a breakpoint that will cause this
problem.
All the other variants of this either need cached state or break gdb
watchpoints on stack variables with perf running.
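(Roughly what that could look like near the top of do_debug() -- an
untested sketch; DR_TRAP0 is the existing dr6 bit, and the DR7 R/W-type
field arithmetic is open-coded from the SDM:)

        unsigned long dr6, dr7;
        int i;

        get_debugreg(dr6, 6);
        get_debugreg(dr7, 7);

        if (!user_mode(regs) && !(regs->flags & X86_EFLAGS_IF)) {
                for (i = 0; i < 4; i++) {
                        /* slot i fired and its DR7 type field says "execute" */
                        if ((dr6 & (DR_TRAP0 << i)) &&
                            ((dr7 >> (16 + 4 * i)) & 0x3) == 0)
                                dr7 &= ~(0x3UL << (2 * i)); /* clear L<i>/G<i> */
                }
                set_debugreg(dr7, 7);
        }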
--Andy
>
> Linus
>
> (*) And yes, sysenter and not having a stack at all is very special,
> and I think we will *always* have to have that magical special case of
> the first few instructions there. But that's a separate hardware
> limitation we can't get around.
--
Andy Lutomirski
AMA Capital Management, LLC
On Thu, Jul 23, 2015 at 2:50 PM, Andy Lutomirski <[email protected]> wrote:
>
> What if we relax it slightly: "if the breakpoint happened during that
> interrupts-off region, I will clear all *kernel breakpoints* in %dr7
> to guarantee forward progress"?
>
> Watchpoints don't need RF to make forward progress, and, by leaving
> watchpoints alone, we avoid breaking gdb.
Hmmm. I thought watchpoints were "before the instruction" too, but
that's just because I haven't used them in ages, and I didn't remember
the details. I just looked it up.
You're right - the memory watchpoints trigger after the instruction
has executed, so RF isn't an issue. So yes, the only issue is
instruction breakpoints, and those are the only ones we need to clear.
And that makes it really easy.
So yes, I agree. We only need to clear all kernel breakpoints.
So we don't even need that _TIF_USER_WORK_MASK thing, because user
space isn't setting kernel code breakpoints, it's just kgdb.
Sounds good to me.
Linus
On Thu, Jul 23, 2015 at 2:59 PM, Andy Lutomirski <[email protected]> wrote:
> OK, new proposal:
>
> In do_debug, if we trip an instruction breakpoint while
> !user_mode(regs) && ((regs->flags & X86_EFLAGS_IF) == 0), then disarm
> *that breakpoint*.
Ack. The more targeted we can make this while still guaranteeing
forward progress, the better. So that sounds really good.
Linus
On Thu, Jul 23, 2015 at 02:59:56PM -0700, Linus Torvalds wrote:
> Hmmm. I thought watchpoints were "before the instruction" too, but
> that's just because I haven't used them in ages, and I didn't remember
> the details. I just looked it up.
>
> You're right - the memory watchpoints trigger after the instruction
> has executed, so RF isn't an issue. So yes, the only issue is
> instruction breakpoints, and those are the only ones we need to clear.
>
> And that makes it really easy.
>
> So yes, I agree. We only need to clear all kernel breakpoints.
But but but, we can access userspace with !IF, imagine someone doing:
local_irq_disable();
copy_from_user_inatomic();
and as luck would have it, there's a breakpoint on the user memory we
just touched. And we go and disable a user breakpoint.
On Fri, Jul 24, 2015 at 10:13:26AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 23, 2015 at 02:59:56PM -0700, Linus Torvalds wrote:
> > Hmmm. I thought watchpoints were "before the instruction" too, but
> > that's just because I haven't used them in ages, and I didn't remember
> > the details. I just looked it up.
> >
> > You're right - the memory watchpoints trigger after the instruction
> > has executed, so RF isn't an issue. So yes, the only issue is
> > instruction breakpoints, and those are the only ones we need to clear.
> >
> > And that makes it really easy.
> >
> > So yes, I agree. We only need to clear all kernel breakpoints.
>
> But but but, we can access userspace with !IF, imagine someone doing:
>
> local_irq_disable();
> copy_from_user_inatomic();
>
> and as luck would have it, there's a breakpoint on the user memory we
> just touched. And we go and disable a user breakpoint.
Then shouldn't we use !IF && RSP matches NMI's stack?
User-space cannot control the two at once.
Willy
On Thu, Jul 23, 2015 at 02:59:46PM -0700, Andy Lutomirski wrote:
> OK, new proposal:
>
> In do_debug, if we trip an instruction breakpoint while
> !user_mode(regs) && ((regs->flags & X86_EFLAGS_IF) == 0), then disarm
> *that breakpoint*.
Doesn't !IF already imply that it must be kernel space? AFAIK user space
cannot clear IF.
On Thu, Jul 23, 2015 at 02:54:54PM -0700, Linus Torvalds wrote:
> On Thu, Jul 23, 2015 at 2:45 PM, Andy Lutomirski <[email protected]> wrote:
> >
> > Or we just re-enable them on the way out of NMI (i.e. the very last
> > thing we do in the NMI handler). I don't want to break regular
> > userspace gdb when perf is running.
>
> I'd really prefer it if we don't touch NMI code in those kinds of
> ways. The NMI code is fragile as hell. All the problems we have with
> it is exactly due to "where is the boundary" issues.
>
> That's why I *don't* want NMI code to do magic crap. Anything that
> says "disable this during this magic window" is broken. The problems
> we've had are exactly about atomicity of the entry/exit conditions,
> and there is no really good way to get them right.
>
> I'd be much happier with a _TIF_USER_WORK_MASK approach exactly
> because it's so *obvious* that it's not a boundary condition.
>
> I dislike the "disable and re-enable dr7 in the NMI handler" exactly
> because it smells like "we can only handle faults in _this_ region".
> It may be true, but it's also what I want us to get away from. I'd
> much rather have the "big picture" be that we can take faults anywhere
> at all (*), and that none of the core code really cares. Then we "fix
> up" user space.
A wee bit something like so?
We need the intermediate self-IPI because NMI/MCE etc do not deal with
TIF flags.
I further cleared all of DR7 in an attempt at reducing the amount of
state tracked. And it doesn't distinguish between kernel/user
watchpoints because the kernel can touch both from !IF.
---
arch/x86/kernel/traps.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 8e65d8a9b8db..e8308e9c2b1e 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -570,6 +570,33 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
 NOKPROBE_SYMBOL(fixup_bad_iret);
 #endif
 
+struct do_debug_state {
+        unsigned long dr7;
+        struct irq_work irq_work;
+        struct callback_head task_work;
+};
+
+static void __debug_irq_trampoline(struct irq_work *work)
+{
+        struct do_debug_state *dds =
+                container_of(work, struct do_debug_state, irq_work);
+
+        task_work_add(current, &dds->task_work, true);
+}
+
+static void __debug_restore_dr7(struct callback_head *work)
+{
+        struct do_debug_state *dds =
+                container_of(work, struct do_debug_state, task_work);
+
+        set_debugreg(dds->dr7, 7);
+}
+
+static DEFINE_PER_CPU(struct do_debug_state, do_debug_state) = {
+        .irq_work = { .func = __debug_irq_trampoline, },
+        .task_work = { .func = __debug_restore_dr7, },
+};
+
 /*
  * Our handling of the processor debug registers is non-trivial.
  * We do not clear them on entry and exit from the kernel. Therefore
@@ -603,6 +630,16 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 
         ist_enter(regs);
 
+        if (arch_irqs_disabled_flags(regs->flags)) {
+                struct do_debug_state *dds = this_cpu_ptr(&do_debug_state);
+
+                get_debugreg(dds->dr7, 7);
+                set_debugreg(0, 7);
+                irq_work_queue(&dds->irq_work);
+
+                goto exit;
+        }
+
         get_debugreg(dr6, 6);
 
         /* Filter out all the reserved bits which are preset to 1 */
On Fri, 24 Jul 2015 10:13:26 +0200
Peter Zijlstra <[email protected]> wrote:
> On Thu, Jul 23, 2015 at 02:59:56PM -0700, Linus Torvalds wrote:
> > Hmmm. I thought watchpoints were "before the instruction" too, but
> > that's just because I haven't used them in ages, and I didn't remember
> > the details. I just looked it up.
> >
> > You're right - the memory watchpoints trigger after the instruction
> > has executed, so RF isn't an issue. So yes, the only issue is
> > instruction breakpoints, and those are the only ones we need to clear.
> >
> > And that makes it really easy.
> >
> > So yes, I agree. We only need to clear all kernel breakpoints.
>
> But but but, we can access userspace with !IF, imagine someone doing:
>
> local_irq_disable();
> copy_from_user_inatomic();
>
> and as luck would have it, there's a breakpoint on the user memory we
> just touched. And we go and disable a user breakpoint.
Where does the kernel do that to user text? I would think that user
data would only have watchpoints, and Andy and Linus said that those
would not be disabled (I'm guessing because they don't have the RF flag
set, and forward progress can proceed). If the kernel does the above to
user code and there's a breakpoint there, would it even trigger?
I'm not too familiar with how to use hw breakpoints, but I'm guessing
(correct me if I'm wrong) that breakpoints on code trigger when
executed, but watchpoints on data trigger when accessed. Then
copy_from_user_inatomic() would only trigger on watchpoints (it's not
executing that code, at least I hope it isn't!), and those won't bother
us.
Or am I totally off base here?
-- Steve
On Fri, Jul 24, 2015 at 07:58:41AM -0400, Steven Rostedt wrote:
> On Fri, 24 Jul 2015 10:13:26 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Thu, Jul 23, 2015 at 02:59:56PM -0700, Linus Torvalds wrote:
> > > Hmmm. I thought watchpoints were "before the instruction" too, but
> > > that's just because I haven't used them in ages, and I didn't remember
> > > the details. I just looked it up.
> > >
> > > You're right - the memory watchpoints trigger after the instruction
> > > has executed, so RF isn't an issue. So yes, the only issue is
> > > instruction breakpoints, and those are the only ones we need to clear.
> > >
> > > And that makes it really easy.
> > >
> > > So yes, I agree. We only need to clear all kernel breakpoints.
> >
> > But but but, we can access userspace with !IF, imagine someone doing:
> >
> > local_irq_disable();
> > copy_from_user_inatomic();
> >
> > and as luck would have it, there's a breakpoint on the user memory we
> > just touched. And we go and disable a user breakpoint.
>
> Where does the kernel do that to user text? I would think that user
> data would only have watchpoints, and Andy and Linus said that those
> would not be disabled (I'm guessing because they don't have the RF flag
> set, and forward progress can proceed). If the kernel does the above to
> user code and there's a breakpoint there, would it even trigger?
>
> I'm not too familiar with how to use hw breakpoints, but I'm guessing
> (correct me if I'm wrong) that breakpoints on code that trigger when
> executed, but watchpoints on data trigger when accessed. Then
> copy_from_user_inatomic() would only trigger on watchpoints (it's not
> executing that code, at least I hope it isn't!), and those wont bother
> us.
These things can be: RW, W, X.
Sure, hitting a user X watchpoint is going to be 'interesting', but it's
fairly easy to hit a RW one.
Just watch an on-stack variable and get perf to copy a huge chunk of
stack (like it does for the dwarf stuff).
On Fri, 24 Jul 2015 14:43:04 +0200
Peter Zijlstra <[email protected]> wrote:
> > I'm not too familiar with how to use hw breakpoints, but I'm guessing
> > (correct me if I'm wrong) that breakpoints on code that trigger when
> > executed, but watchpoints on data trigger when accessed. Then
> > copy_from_user_inatomic() would only trigger on watchpoints (it's not
> > executing that code, at least I hope it isn't!), and those wont bother
> > us.
>
> These things can be: RW, W, X.
>
> Sure, hitting a user X watchpoint is going to be 'interesting', but its
> fairly easy to hit a RW one.
But do we care if we do hit one? The return from the #DB handler can
use a RET. Right?
-- Steve
>
> Just watch an on-stack variable and get perf to copy a huge chunk of
> stack (like it does for the dwarf stuff).
On Fri, Jul 24, 2015 at 09:03:42AM -0400, Steven Rostedt wrote:
> On Fri, 24 Jul 2015 14:43:04 +0200
> Peter Zijlstra <[email protected]> wrote:
>
>
> > > I'm not too familiar with how to use hw breakpoints, but I'm guessing
> > > (correct me if I'm wrong) that breakpoints on code that trigger when
> > > executed, but watchpoints on data trigger when accessed. Then
> > > copy_from_user_inatomic() would only trigger on watchpoints (it's not
> > > executing that code, at least I hope it isn't!), and those wont bother
> > > us.
> >
> > These things can be: RW, W, X.
> >
> > Sure, hitting a user X watchpoint is going to be 'interesting', but its
> > fairly easy to hit a RW one.
>
> But do we care if we do hit one? The return from the #DB handler can
> use a RET. Right?
My understanding is that by using RET we can't set the RF flag and #DB
will immediately strike again when the operation is attempted again. Thus
we have to completely disable the breakpoints on leaving after the first
one strikes, resulting in some userland breakpoints being missed. Maybe
it can be accepted as a limitation when perf is running. I don't know if
the output of perf is that relevant when a debugger is present BTW.
Willy
On Fri, Jul 24, 2015 at 03:21:28PM +0200, Willy Tarreau wrote:
> On Fri, Jul 24, 2015 at 09:03:42AM -0400, Steven Rostedt wrote:
> > On Fri, 24 Jul 2015 14:43:04 +0200
> > Peter Zijlstra <[email protected]> wrote:
> >
> >
> > > > I'm not too familiar with how to use hw breakpoints, but I'm guessing
> > > > (correct me if I'm wrong) that breakpoints on code that trigger when
> > > > executed, but watchpoints on data trigger when accessed. Then
> > > > copy_from_user_inatomic() would only trigger on watchpoints (it's not
> > > > executing that code, at least I hope it isn't!), and those wont bother
> > > > us.
> > >
> > > These things can be: RW, W, X.
> > >
> > > Sure, hitting a user X watchpoint is going to be 'interesting', but its
> > > fairly easy to hit a RW one.
> >
> > But do we care if we do hit one? The return from the #DB handler can
> > use a RET. Right?
Look at do_debug(), it has lovely bits like:
preempt_conditional_sti();
in it, we do _NOT_ want to be re-enabling interrupts if we're called
from an !IF context, that'd be _bad_.
> My understanding is that by using RET we can't set the RF flag and #DB
> will immediately strike again when the operation is attempted again. Thus
> we have to completely disable the breakpoints on leaving after the first
> one strikes, resulting in some userland breakpoints being missed. Maybe
> it can be accepted as a limitation when perf is running. I don't know if
> the output of perf is that relevant when a debugger is present BTW.
The patch I posted will re-enable the breakpoints before returning to
userspace. So userspace will only 'miss' events generated by the kernel.
Missing reads from the kernel is not a problem -- and maybe even
expected, but certainly unavoidable.
Missing updates from the kernel might be a problem: you'd see a
variable's content change even though you have a W watchpoint on it,
and that'd be surprising.
Then again, I suppose we can argue the variable changed through another
mapping and watchpoints work on the virtual address, so tough cookies or
somesuch -- the kernel could in fact do this on a highmem kernel anyway.
On Fri, Jul 24, 2015 at 03:30:13PM +0200, Peter Zijlstra wrote:
> > > But do we care if we do hit one? The return from the #DB handler can
> > > use a RET. Right?
>
> Look at do_debug(), it has lovely bits like:
>
> preempt_conditional_sti();
>
> in it, we do _NOT_ want to be re-enabling interrupts if we're called
> from an !IF context, that'd be _bad_.
Ah, I forgot the conditional thing was the STI depending on regs->flags
& IF..
In any case, better safe than sorry and simply not do #DB ever if !IF.
On Fri, 24 Jul 2015 15:21:28 +0200
Willy Tarreau <[email protected]> wrote:
> My understanding is that by using RET we can't set the RF flag and #DB
But the RF flag is only set for instruction (executing) breakpoints. It
is not set for data (RW) ones.
-- Steve
> will immediately strike again when the operation is attempted again. Thus
> we have to completely disable the breakpoints on leaving after the first
> one strikes, resulting in some userland breakpoints being missed. Maybe
> it can be accepted as a limitation when perf is running. I don't know if
> the output of perf is that relevant when a debugger is present BTW.
>
> Willy
On Fri, Jul 24, 2015 at 10:31:27AM -0400, Steven Rostedt wrote:
> On Fri, 24 Jul 2015 15:21:28 +0200
> Willy Tarreau <[email protected]> wrote:
>
> > My understanding is that by using RET we can't set the RF flag and #DB
>
> But the RF flag is only set for instruction (executing) breakpoints. It
> is not set for data (RW) ones.
True but these also are the most complicated to deal with. The data
accesses can always be emulated (not what I'm suggesting here) while
instructions are much harder to emulate.
Willy
On Fri, 24 Jul 2015 16:59:01 +0200
Willy Tarreau <[email protected]> wrote:
> On Fri, Jul 24, 2015 at 10:31:27AM -0400, Steven Rostedt wrote:
> > On Fri, 24 Jul 2015 15:21:28 +0200
> > Willy Tarreau <[email protected]> wrote:
> >
> > > My understanding is that by using RET we can't set the RF flag and #DB
> >
> > But the RF flag is only set for instruction (executing) breakpoints. It
> > is not set for data (RW) ones.
>
> True but these also are the most complicated to deal with. The data
> accesses can always be emulated (not what I'm suggesting here) while
> instructions are much harder to emulate.
The point is, if we trigger a #DB on an instruction breakpoint
while !IF, then we simply disable that breakpoint and do the RET. What
emulation is needed?
-- Steve
On Fri, Jul 24, 2015 at 11:16:21AM -0400, Steven Rostedt wrote:
> On Fri, 24 Jul 2015 16:59:01 +0200
> Willy Tarreau <[email protected]> wrote:
>
> > On Fri, Jul 24, 2015 at 10:31:27AM -0400, Steven Rostedt wrote:
> > > On Fri, 24 Jul 2015 15:21:28 +0200
> > > Willy Tarreau <[email protected]> wrote:
> > >
> > > > My understanding is that by using RET we can't set the RF flag and #DB
> > >
> > > But the RF flag is only set for instruction (executing) breakpoints. It
> > > is not set for data (RW) ones.
> >
> > True but these also are the most complicated to deal with. The data
> > accesses can always be emulated (not what I'm suggesting here) while
> > instructions are much harder to emulate.
>
> The point is, if we trigger a #DB on an instruction breakpoint
> while !IF, then we simply disable that breakpoint and do the RET.
Yes but the breakpoint remains disabled then. Or I'm missing
something.
> What emulation is needed?
I was speaking about redoing the operation with BP disabled before
re-enabling it. But that's not the point here anyway.
Willy
On Fri, Jul 24, 2015 at 05:26:37PM +0200, Willy Tarreau wrote:
> >
> > The point is, if we trigger a #DB on an instruction breakpoint
> > while !IF, then we simply disable that breakpoint and do the RET.
>
> Yes but the breakpoint remains disabled then. Or I'm missing
> something.
http://marc.info/?l=linux-kernel&m=143773601130974
We re-enable before going back to userspace.
On Fri, Jul 24, 2015 at 05:30:54PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 24, 2015 at 05:26:37PM +0200, Willy Tarreau wrote:
> > >
> > > The point is, if we trigger a #DB on an instruction breakpoint
> > > while !IF, then we simply disable that breakpoint and do the RET.
> >
> > Yes but the breakpoint remains disabled then. Or I'm missing
> > something.
>
> http://marc.info/?l=linux-kernel&m=143773601130974
>
> We re-enable before going back to userspace.
Ah OK thanks Peter. I'm sorry if I'm adding more noise than
anything here, it's hard to follow and it becomes a bit complex.
Willy
On Fri, 24 Jul 2015 17:26:37 +0200
Willy Tarreau <[email protected]> wrote:
> > The point is, if we trigger a #DB on an instruction breakpoint
> > while !IF, then we simply disable that breakpoint and do the RET.
>
> Yes but the breakpoint remains disabled then. Or I'm missing
> something.
Do we care? If it was an instruction breakpoint with !IF set, then it
had to have happened in the kernel. And kgdb or whatever added it there
needs to deal with that.
There should be no instances in the kernel where we execute userspace
code with interrupts disabled.
-- Steve
On Fri, Jul 24, 2015 at 1:13 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Jul 23, 2015 at 02:59:56PM -0700, Linus Torvalds wrote:
>> Hmmm. I thought watchpoints were "before the instruction" too, but
>> that's just because I haven't used them in ages, and I didn't remember
>> the details. I just looked it up.
>>
>> You're right - the memory watchpoints trigger after the instruction
>> has executed, so RF isn't an issue. So yes, the only issue is
>> instruction breakpoints, and those are the only ones we need to clear.
>>
>> And that makes it really easy.
>>
>> So yes, I agree. We only need to clear all kernel breakpoints.
>
> But but but, we can access userspace with !IF, imagine someone doing:
>
> local_irq_disable();
> copy_from_user_inatomic();
>
> and as luck would have it, there's a breakpoint on the user memory we
> just touched. And we go and disable a user breakpoint.
>
The Intel SDM says:
17.3.1.2 Data Memory and I/O Breakpoint Exception Conditions
Data memory and I/O breakpoints are reported when the processor
attempts to access a memory or I/O address
specified in a breakpoint-address register (DR0 through DR3) that has
been set up to detect data or I/O accesses
(R/W flag is set to 1, 2, or 3). The processor generates the exception
after it executes the instruction that made the
access, so these breakpoint condition causes a trap-class exception to
be generated.
So by the time we detect that we've hit a watchpoint, the instruction
that tripped it is done and we don't need RF. Furthermore, after
reading 17.3.1.1: I *think* that regs->flags will have RF *clear* if
we hit a watchpoint. So this might be as simple as:
if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
    X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs))
        for (i = 0; i < 4; i++)
                if (dr6 & (DR_TRAP0<<i)) {
                        /* hit a kernel breakpoint with IF clear */
                        dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
                }
I'm not saying that your code is wrong, but I think this is simpler
and avoids poking at yet more per-cpu state from NMI context, which is
kind of nice.
If you don't like the RF games above, it would also be straightforward
to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a
breakpoint.
--Andy
On Fri, Jul 24, 2015 at 11:34:26AM -0400, Steven Rostedt wrote:
> On Fri, 24 Jul 2015 17:26:37 +0200
> Willy Tarreau <[email protected]> wrote:
>
>
> > > The point is, if we trigger a #DB on an instruction breakpoint
> > > while !IF, then we simply disable that breakpoint and do the RET.
> >
> > Yes but the breakpoint remains disabled then. Or I'm missing
> > something.
>
> Do we care? If it was an instruction breakpoint with !IF set, then it
> had to have happened in the kernel. And kgdb or whatever added it there
> needs to deal with that.
I was concerned that an RW BP would remain disabled when returning to
user space, but Peter cleared that up by pointing me to the discussion
where it was explained that they are re-enabled when returning to user
space.
So no problem here for me.
Thanks,
Willy
On Fri, 24 Jul 2015 08:48:57 -0700
Andy Lutomirski <[email protected]> wrote:
> So by the time we detect that we've hit a watchpoint, the instruction
> that tripped it is done and we don't need RF. Furthermore, after
> reading 17.3.1.1: I *think* that regs->flags will have RF *clear* if
> we hit a watchpoint. So this might be as simple as:
>
> if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
Um, isn't 0xf * DR_TRAP0 same as a constant "true"?
-- Steve
> X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs))
> for (i = 0; i < 4; i++)
> if (dr6 & (DR_TRAP0<<i)) {
> /* hit a kernel breakpoint with IF clear */
> dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
> }
>
> I'm not saying that your code is wrong, but I think this is simpler
> and avoids poking at yet more per-cpu state from NMI context, which is
> kind of nice.
>
> If you don't like the RF games above, it would also be straightforward
> to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a
> breakpoint.
>
> --Andy
On Fri, 24 Jul 2015 08:48:57 -0700
Andy Lutomirski <[email protected]> wrote:
> So by the time we detect that we've hit a watchpoint, the instruction
> that tripped it is done and we don't need RF. Furthermore, after
> reading 17.3.1.1: I *think* that regs->flags will have RF *clear* if
> we hit a watchpoint. So this might be as simple as:
>
> if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
> X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs))
Also, wouldn't !(regs->X86_EFLAGS_IF) && !user_mode(regs) be a bug?
When do we allow coming from userspace with interrupts disabled?
-- Steve
> for (i = 0; i < 4; i++)
> if (dr6 & (DR_TRAP0<<i)) {
> /* hit a kernel breakpoint with IF clear */
> dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
> }
>
> I'm not saying that your code is wrong, but I think this is simpler
> and avoids poking at yet more per-cpu state from NMI context, which is
> kind of nice.
>
> If you don't like the RF games above, it would also be straightforward
> to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a
> breakpoint.
>
> --Andy
On Fri, Jul 24, 2015 at 12:02:49PM -0400, Steven Rostedt wrote:
> On Fri, 24 Jul 2015 08:48:57 -0700
> Andy Lutomirski <[email protected]> wrote:
>
> > So by the time we detect that we've hit a watchpoint, the instruction
> > that tripped it is done and we don't need RF. Furthermore, after
> > reading 17.3.1.1: I *think* that regs->flags will have RF *clear* if
> > we hit a watchpoint. So this might be as simple as:
> >
> > if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
>
> Um, isn't 0xf * DR_TRAP0 same as a constant "true"?
For me it's a typo, it should have been :
if ((dr6 & (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
(the purpose was to read all 4 lower bits at once).
Willy
On Fri, Jul 24, 2015 at 08:48:57AM -0700, Andy Lutomirski wrote:
> So by the time we detect that we've hit a watchpoint, the instruction
> that tripped it is done and we don't need RF. Furthermore, after
> reading 17.3.1.1: I *think* that regs->flags will have RF *clear* if
> we hit a watchpoint.
Apparently after reading 17.3.1.1, it seems like RF can still be set
if a data breakpoint triggers in a repeated string instruction before
the last iteration. However I don't think we care because as long as
we return to the string instruction, since the data location was already
visited it won't trigger again so the loss of the flag should be safe.
Willy
On Fri, 24 Jul 2015 18:08:06 +0200
Willy Tarreau <[email protected]> wrote:
> > Um, isn't 0xf * DR_TRAP0 same as a constant "true"?
>
> For me it's a typo, it should have been :
>
> if ((dr6 & (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
>
> (the purpose was to read all 4 lower bits at once).
I figured that after I sent it, but the 0xf * DR_TRAP0 is also
confusing to me. Actually, why not use proper naming:
dr6 & DR_TRAP_BITS
-- Steve
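For reference, here is Andy's test restated with the named mask Steve suggests, and with the && that Willy spotted changed to & (still just an untested sketch, and it does not address the string-instruction RF case Willy raised earlier; the enable-bit line is kept exactly as in Andy's sketch):

        if ((dr6 & DR_TRAP_BITS) &&
            (regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_IF)) == X86_EFLAGS_RF &&
            !user_mode(regs)) {
                for (i = 0; i < 4; i++) {
                        if (dr6 & (DR_TRAP0 << i)) {
                                /* hit a kernel breakpoint with IF clear */
                                dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
                        }
                }
        }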
On Thu, 2015-07-23 at 13:21 -0700, Andy Lutomirski wrote:
> [moved to a new thread, cc list trimmed]
>
> Hi all-
>
> We've considered two approaches to dealing with NMIs:
>
> 1. Allow nesting. We know quite well how messy that is.
This might be a stupid question, but:
1. What exactly does the NMI handler handle?
2. Is it possible for the NMI handler to just increment a counter and
return if it nests, and let the outer handler notice and rerun itself?
On Fri, Jul 24, 2015 at 08:48:57AM -0700, Andy Lutomirski wrote:
> So by the time we detect that we've hit a watchpoint, the instruction
> that tripped it is done and we don't need RF. Furthermore, after
> reading 17.3.1.1: I *think* that regs->flags will have RF *clear* if
> we hit a watchpoint. So this might be as simple as:
>
> if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
> X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs))
> for (i = 0; i < 4; i++)
> if (dr6 & (DR_TRAP0<<i)) {
> /* hit a kernel breakpoint with IF clear */
> dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
> }
>
> I'm not saying that your code is wrong, but I think this is simpler
> and avoids poking at yet more per-cpu state from NMI context, which is
> kind of nice.
>
> If you don't like the RF games above, it would also be straightforward
> to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a
> breakpoint.
Andy, section 5.8 of the SDM makes me think we could possibly abuse SYSRET
to emulate IRET, and then possibly simplify the flags processing. It says
that it takes the CPL3 code segment but nowhere it says that the target is
validated for effectively being userland, and further it suggests that it
doesn't validate anything :
"It is the responsibility of the OS to ensure the descriptors in
the GDT/LDT correspond to the selectors loaded by SYSCALL/SYSRET
(consistent with the base, limit, and attribute values forced by
the instructions)."
The OS has to set the RSP by itself before doing SYSRET, which opens a
race between "mov rsp" and "sysret", but if we only take that path once
we figure we come from NMI (using just IF+RSP), we know that IRQs and
NMIs are still disabled and cannot strike at this instant. Maybe MCEs
can, but they would execute within the NMI's stack just as if they were
triggered inside the NMI as well so I don't see a problem here.
I tried to imagine a case where kernel page faults, then NMI comes in,
then debug strikes and we have to return from debug to NMI then to fault
handler and I don't think we break the chain. Of course there are many
subtleties I can't grasp because I don't understand all the details.
Do you think that could simplify things, or is it another stupid idea?
Willy
On Fri, Jul 24, 2015 at 10:10 AM, Willy Tarreau <[email protected]> wrote:
> On Fri, Jul 24, 2015 at 08:48:57AM -0700, Andy Lutomirski wrote:
>> So by the time we detect that we've hit a watchpoint, the instruction
>> that tripped it is done and we don't need RF. Furthermore, after
>> reading 17.3.1.1: I *think* that regs->flags will have RF *clear* if
>> we hit a watchpoint. So this might be as simple as:
>>
>> if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
>> X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs))
>> for (i = 0; i < 4; i++)
>> if (dr6 & (DR_TRAP0<<i)) {
>> /* hit a kernel breakpoint with IF clear */
>> dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
>> }
>>
>> I'm not saying that your code is wrong, but I think this is simpler
>> and avoids poking at yet more per-cpu state from NMI context, which is
>> kind of nice.
>>
>> If you don't like the RF games above, it would also be straightforward
>> to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a
>> breakpoint.
>
> Andy, section 5.8 of the SDM makes me think we could possibly abuse SYSRET
> to emulate IRET, and then possibly simplify the flags processing. It says
> that it takes the CPL3 code segment but nowhere it says that the target is
> validated for effectively being userland, and further it suggests that it
> doesn't validate anything :
>
> "It is the responsibility of the OS to ensure the descriptors in
> the GDT/LDT correspond to the selectors loaded by SYSCALL/SYSRET
> (consistent with the base, limit, and attribute values forced by
> the instructions)."
You are an evil bastard. I seriously doubt that this will work.
SYSRET goes to CPL3 no matter what. Also, I don't think you want to
start poking at MSRs to return.
--Andy
On Fri, Jul 24, 2015 at 07:10:18PM +0200, Willy Tarreau wrote:
> The OS has to set the RSP by itself before doing SYSRET, which opens a
> race between "mov rsp" and "sysret", but if we only take that path once
> we figure we come from NMI (using just IF+RSP), we know that IRQs and
> NMIs are still disabled and cannot strike at this instant. Maybe MCEs
> can, but they would execute within the NMI's stack just as if they were
> triggered inside the NMI as well so I don't see a problem here.
OK, too bad, I just found the answer here in the code :-(
* SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET,
* restoring TF results in a trap from userspace immediately after
* SYSRET. This would cause an infinite loop whenever #DB happens
* with register state that satisfies the opportunistic SYSRET
* conditions.
Willy
On Fri, Jul 24, 2015 at 9:25 AM, Willy Tarreau <[email protected]> wrote:
> On Fri, Jul 24, 2015 at 08:48:57AM -0700, Andy Lutomirski wrote:
>> So by the time we detect that we've hit a watchpoint, the instruction
>> that tripped it is done and we don't need RF. Furthermore, after
>> reading 17.3.1.1: I *think* that regs->flags will have RF *clear* if
>> we hit a watchpoint.
>
> Apparently after reading 17.3.1.1, it seems like RF can still be set
> if a data breakpoint triggers in a repeated string instruction before
> the last iteration. However I don't think we care because as long as
> we return to the string instruction, since the data location was already
> visited it won't trigger again so the loss of the flag should be safe.
>
Oh, right. So my proposal is wrong: it'll clear a watchpoint
incorrectly if we hit it in the middle of a string operation.
So we should either parse dr0..dr3 (whichever one triggered) or do
Peter's thing and clear dr7 entirely. I still prefer just clearing
the breakpoint that triggered.
--Andy
On Fri, Jul 24, 2015 at 8:30 AM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jul 24, 2015 at 05:26:37PM +0200, Willy Tarreau wrote:
>> >
>> > The point is, if we trigger a #DB on an instruction breakpoint
>> > while !IF, then we simply disable that breakpoint and do the RET.
>>
>> Yes but the breakpoint remains disabled then. Or I'm missing
>> something.
>
> http://marc.info/?l=linux-kernel&m=143773601130974
>
> We re-enable before going back to userspace.
Actually, Andy had a good argument that we don't even need this.
We just don't ever need to disable data breakpoints. Even if we end up doing
cli();
copy_from_user_inatomic();
that actually works fine. If there are data breakpoints, we will have
(a) things will run slow as hell anyway. Intel CPU's slow down to a
relative crawl.
(b) let's say we have a data breakpoint on the data we're reading above
(c) we take a #DB fault after the instruction has completed, so we
make forward progress even if we return with RF clear
(d) even if the data breakpoint is unaligned and triggers multiple
times, it's going to be a "small number" of multiple times. And see
(a). This never happens in practice, and the much bigger slowdown is
how data breakpoints tend to slow things down in general.
(e) yes, the string instructions may hit the data breakpoint multiple
times for the "same" instruction, but the forward progress part is
still true even for the string instructions. In fact, it's actually
likely *more* true for string instructions, because they are
documented to be less exact, and may trigger the data watchpoint only
after a "group of iterations".
so I think we just leave data breakpoints alone. The only debug
conditions that are *faults* rather than traps are the instruction
breakpoints, and we can detect and disable those by just saying "oh,
we're in kernel mode".
So in the #DB handler, we would basically only clear instruction
breakpoints, and only when they trigger. If we have a data breakpoint
that triggers (even in kernel mode, and with interrupts disabled), let
it trigger and return with "ret" anyway. No biggie.
(Ok, so the "General detect fault" is also a fault rather than a trap,
but that's the "write to debug registers when it's disabled" thing,
very different)
Linus
On Fri, Jul 24, 2015 at 11:29 AM, Linus Torvalds
<[email protected]> wrote:
>
> So in the #DB handler, we would basically only clear instruction
> breakpoints, and only when they trigger. If we have a data breakpoint
> that triggers (even in kernel mode, and with interrupts disabled), let
> it trigger and return with "ret" anyway. No biggie.
So we'd not only look at "which breakpoint triggered", we'd also look
at the actual debug register and check that "R/Wn == 0", and only
disable it for that case.
So you'd read %dr6 and %dr7, and then iterate 0..3 and check whether
it triggered (bit #n in %dr6), and that R/Wn (bits 16-17+n*4 of %dr7)
is zero, and if so, clear LGn bits (bits 0-1+n*2) in %dr7.
Something like
        unsigned long mask = 0;
        unsigned int dr6 = debug_read(6);
        unsigned int dr7 = debug_read(7);
        int i;

        for (i = 0; i < 4; i++) {
                if ((dr6 >> i) & 1) {
                        if (!((dr7 >> (4*i+16)) & 3))
                                mask |= 3 << (i*2);
                }
        }

        if (mask)
                debug_write(dr7 & ~mask, 7);
(yeah, I could easily have screwed that up)
But the above should only clear bits in dr7 that are actually
associated with the instruction breakpoint that triggered, and since
it's a _kernel_ instruction breakpoint, not a user one, we can clear
it and forget it. No need to re-enable at all.
Hmm?
Linus
On Fri, 24 Jul 2015 11:41:55 -0700
Linus Torvalds <[email protected]> wrote:
> On Fri, Jul 24, 2015 at 11:29 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > So in the #DB handler, we would basically only clear instruction
> > breakpoints, and only when they trigger. If we have a data breakpoint
> > that triggers (even in kernel mode, and with interrupts disabled), let
> > it trigger and return with "ret" anyway. No biggie.
>
> So we'd not only look at "which breakpoint triggered", we'd also look
> at the actual debug register and check that "R/Wn == 0", and only
> disable it for that case.
>
> So you'd read %dr6 and %dr7, and then iterate 0..3 and check whether
> it triggered (bit #n in %dr6), and that R/Wn (bits 16-17+n*4 of %dr7)
> is zero, and if so, clear LGn bits (bits 0-1+n*2) in %dr7.
>
> Something like
>
> unsigned long mask = 0;
> unsigned int dr6 = debug_read(6);
> unsigned int dr7 = debug_read(7);
> int i;
>
> for (i = 0; i < 4; i++) {
>         if ((dr6 >> i) & 1) {
>                 if (!((dr7 >> (4*i+16)) & 3))
>                         mask |= 3 << (i*2);
>         }
> }
>
> if (mask)
>         debug_write(dr7 & ~mask, 7);
Macros would be nice for readability.
        for (i = 0; i < 4; i++) {
                if ((dr6 >> i) & 1) {
                        int shift = DR_CONTROL_SIZE * i + DR_CONTROL_SHIFT;

                        if (!((dr7 >> shift) & DR_RW_READ))
                                mask |= (DR_LOCAL_ENABLE|DR_GLOBAL_ENABLE) << (i * DR_ENABLE_SIZE);
                }
        }
-- Steve
>
> (yeah, I could easily have screwed that up)
>
> But the above should only clear bits in dr7 that are actually
> associated with the instruction breakpoint that triggered, and since
> it's a _kernel_ instruction breakpoint, not a user one, we can clear
> it and forget it. No need to re-enable at all.
>
> Hmm?
>
> Linus
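Putting Linus's loop and Steve's names together, a self-contained sketch
might look like this (get_debugreg()/set_debugreg() are the existing
<asm/debugreg.h> accessors; the rest is untested):

        #include <asm/debugreg.h>

        /* Disable any *instruction* breakpoint that just fired in kernel mode. */
        static void clear_triggered_insn_breakpoints(void)
        {
                unsigned long dr6, dr7, mask = 0;
                int i;

                get_debugreg(dr6, 6);
                get_debugreg(dr7, 7);

                for (i = 0; i < 4; i++) {
                        unsigned long rw;

                        if (!(dr6 & (DR_TRAP0 << i)))
                                continue;       /* this breakpoint did not fire */

                        /* R/Wn == 0 means an instruction (execute) breakpoint */
                        rw = (dr7 >> (DR_CONTROL_SHIFT + i * DR_CONTROL_SIZE)) & 0x3;
                        if (rw == DR_RW_EXECUTE)
                                mask |= (DR_LOCAL_ENABLE | DR_GLOBAL_ENABLE)
                                                << (i * DR_ENABLE_SIZE);
                }

                if (mask)
                        set_debugreg(dr7 & ~mask, 7);
        }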
On Fri, Jul 24, 2015 at 11:29:29AM -0700, Linus Torvalds wrote:
> On Fri, Jul 24, 2015 at 8:30 AM, Peter Zijlstra <[email protected]> wrote:
> > On Fri, Jul 24, 2015 at 05:26:37PM +0200, Willy Tarreau wrote:
> >> >
> >> > The point is, if we trigger a #DB on an instruction breakpoint
> >> > while !IF, then we simply disable that breakpoint and do the RET.
> >>
> >> Yes but the breakpoint remains disabled then. Or I'm missing
> >> something.
> >
> > http://marc.info/?l=linux-kernel&m=143773601130974
> >
> > We re-enable before going back to userspace.
>
> Actually, Andy had a good argument that we don't even need this.
>
> We just don't ever need to disable data breakpoints. Even if we end up doing
>
> cli();
> copy_from_user_inatomic();
>
> that actually works fine. If there are data breakpoints, we will have
I worry that we'll end up running the do_debug() handlers from effective
NMI context.
The NMI might have preempted locks which these handlers require etc..
On Fri, Jul 24, 2015 at 12:55 PM, Peter Zijlstra <[email protected]> wrote:
>
> I worry that we'll end up running the do_debug() handlers from effective
> NMI context.
>
> The NMI might have preempted locks which these handlers require etc..
If #DB takes any locks like that, then #DB is broken.
Pretty much by definition, a data breakpoint can happen on pretty much
absolutely any code. This is in no way NMI-specific as far as I can
tell.
Do we really take locks in the #DB handler?
Linus
On Fri, Jul 24, 2015 at 01:22:11PM -0700, Linus Torvalds wrote:
> On Fri, Jul 24, 2015 at 12:55 PM, Peter Zijlstra <[email protected]> wrote:
> >
> > I worry that we'll end up running the do_debug() handlers from effective
> > NMI context.
> >
> > The NMI might have preempted locks which these handlers require etc..
>
> If #DB takes any locks like that, then #DB is broken.
>
> Pretty much by definition, a data breakpoint can happen on pretty much
> absolutely any code. This is in no way NMI-specific as far as I can
> tell.
>
> Do we really take locks in the #DB handler?
do_debug()
send_sigtrap()
force_sig_info()
spin_lock_irqsave()
Now, I don't pretend to understand the condition before send_sigtrap(),
so it _might_ be ok, but it sure as heck could do with a comment.
On Fri, 24 Jul 2015 22:51:19 +0200
Peter Zijlstra <[email protected]> wrote:
> > Do we really take locks in the #DB handler?
>
> do_debug()
> send_sigtrap()
> force_sig_info()
> spin_lock_irqsave()
>
> Now, I don't pretend to understand the condition before send_sigtrap(),
> so it _might_ be ok, but it sure as heck could do with a comment.
Or, force_sig_info() in send_sigtrap() looks like it could easily be
changed to use an irq work queue.
-- Steve
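A rough sketch of that deferral, assuming all the #DB handler really needs
is to get SIGTRAP to the current task once we're back in a context where
locking is legal (irq_work is NMI-safe, which is exactly why perf uses it;
losing the precise siginfo and racing with a context switch are the obvious
holes in this sketch):

        #include <linux/irq_work.h>
        #include <linux/percpu.h>
        #include <linux/sched.h>
        #include <linux/signal.h>

        static DEFINE_PER_CPU(struct irq_work, debug_sigtrap_work);

        /* Runs later in plain hard-IRQ context, where sighand locking is fine. */
        static void debug_sigtrap_fn(struct irq_work *work)
        {
                send_sig(SIGTRAP, current, 0);
        }

        /*
         * Called from do_debug() in place of send_sigtrap() when we must not
         * take locks.  Each per-CPU irq_work is assumed to have been set up
         * once at boot with init_irq_work(..., debug_sigtrap_fn).
         */
        static void defer_sigtrap(void)
        {
                irq_work_queue(this_cpu_ptr(&debug_sigtrap_work));
        }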
On Fri, Jul 24, 2015 at 1:51 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jul 24, 2015 at 01:22:11PM -0700, Linus Torvalds wrote:
>> On Fri, Jul 24, 2015 at 12:55 PM, Peter Zijlstra <[email protected]> wrote:
>> >
>> > I worry that we'll end up running the do_debug() handlers from effective
>> > NMI context.
>> >
>> > The NMI might have preempted locks which these handlers require etc..
>>
>> If #DB takes any locks like that, then #DB is broken.
>>
>> Pretty much by definition, a data breakpoint can happen on pretty much
>> absolutely any code. This is in no way NMI-specific as far as I can
>> tell.
>>
>> Do we really take locks in the #DB handler?
>
> do_debug()
> send_sigtrap()
> force_sig_info()
> spin_lock_irqsave()
>
> Now, I don't pretend to understand the condition before send_sigtrap(),
> so it _might_ be ok, but it sure as heck could do with a comment.
Let's try to decode it.
user_icebp is set if int $0x01 happens, except it isn't because user
code can't actually do that -- it'll cause #GP instead.
user_icebp is also set if the user has a bloody in-circuit emulator,
given the name. But who on Earth has one of those on a system new
enough to run Linux and, even if they have one, why on Earth are they
using it to send SIGTRAP.
In any event, user_icebp is only set if user_mode(regs), so it's safe
locking-wise. But please let's delete it.
Otherwise, we do send_sigtrap if we got a single-step exception from
user mode (because we suppress single-step exceptions from kernel mode
a couple lines above, but we should really BUG on those except for the
single case of SYSENTER with TF set) or if we get a breakpoint
exception that wasn't eaten by perf.
For *#&!'s sake, we should rewrite this pile of crap.
        // before kprobes and notify_die
        #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
        if (!user_mode(regs) && regs->ip == sysenter_target) {
                fix it up and return;
        }
        #endif

        notify_die, etc.

        preempt_conditional_sti(regs);
        do_trap(X86_TRAP_DB, SIGTRAP, "debug", regs, error_code, NULL);
        preempt_conditional_cli(regs);
except we should do something to disallow fixup_exception here. Or we
could open-code if(user_mode) send_sigtrap() else die() here.
I really don't think that we should be sending signals to userspace
due to user address watchpoints that hit in kernel mode. Or, if we do
think we should send signals for those, then, as Steven said, we
should make that explicit and use IRQ work for that.
As it stands, this is probably an exploitable DoS -- just point a
watchpoint down the stack a little bit from yourself and call raise().
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
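Spelled out, the open-coded tail Andy mentions might look something like
this (just a sketch reusing the helpers traps.c already has -- send_sigtrap(),
die(), preempt_conditional_sti() -- with si_code being whatever was computed
from DR6 earlier; the fixup_exception question is left open, as Andy says):

        if (user_mode(regs)) {
                preempt_conditional_sti(regs);
                send_sigtrap(current, regs, error_code, si_code);
                preempt_conditional_cli(regs);
        } else {
                die("kernel debug exception", regs, error_code);
        }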
On Fri, Jul 24, 2015 at 1:51 PM, Peter Zijlstra <[email protected]> wrote:
>
> do_debug()
> send_sigtrap()
> force_sig_info()
> spin_lock_irqsave()
>
> Now, I don't pretend to understand the condition before send_sigtrap(),
> so it _might_ be ok, but it sure as heck could do with a comment.
Ugh. As Andy said, I think that's ok, because it's actually the
single-step case, and won't trigger for kernel mode. So we should be
ok. Although I agree the code is not good.
I'd personally be more worried about the usual crazy "notify_die()"
crap. I absolutely detest those notifier chain things. They are hooks
for random crap that shouldn't be hooked into, but whatever. It's not
a problem in practice, it's just a sign of a certain kind of diseased
mind.
On the whole I think we're ok. I'd love to get rid of things, and yes,
I think we should probably explicitly handle the in-kernel case first
and just return without doing anything, just to make the code more
obviously safe. But it doesn't look like a fundamental problem spot.
Linus
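As a sketch, the ordering Linus describes would amount to something like
this near the end of do_debug() (the notifier chain, kprobes and the
single-step/SYSENTER fixups still have to run before it):

        if (!user_mode(regs))
                return;         /* kernel-mode #DB: nothing to signal */

        /* only user-mode #DB gets here, so send_sigtrap() and its locks are safe */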
On 24/07/2015 23:08, Andy Lutomirski wrote:
> user_icebp is set if int $0x01 happens, except it isn't because user
> code can't actually do that -- it'll cause #GP instead.
>
> user_icebp is also set if the user has a bloody in-circuit emulator,
> given the name. But who on Earth has one of those on a system new
> enough to run Linux and, even if they have one, why on Earth are they
> using it to send SIGTRAP.
You do not need either "int $0x01" or an ICE to set user_icebp = 1. You
can use the 0xf1 opcode, which is kinda like 0xcc but generates #DB
instead of #BP.
The historical name is ICEBP because in-circuit emulators used it for
software breakpoints, just like your usual debugger used 0xcc aka int3.
And just like 0xcc it's unprivileged, so you can actually get a SIGTRAP
with asm(".byte 0xf1").
So...
> In any event, user_icebp is only set if user_mode(regs), so it's safe
> locking-wise. But please let's delete it.
... it's safe, but it has some use (!).
Paolo
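A quick way to see that from userspace (a throwaway test, not from the
thread -- it just installs a SIGTRAP handler and executes the 0xf1 byte):

        #include <signal.h>
        #include <unistd.h>

        static void on_trap(int sig)
        {
                /* 0xf1 is a trap, so RIP already points past it; report and leave */
                write(1, "caught SIGTRAP from 0xf1\n", 25);
                _exit(0);
        }

        int main(void)
        {
                signal(SIGTRAP, on_trap);
                asm volatile(".byte 0xf1");     /* icebp / int1 */
                write(1, "no trap delivered\n", 18);
                return 1;
        }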
On 24/07/2015 19:20, Andy Lutomirski wrote:
> > Andy, section 5.8 of the SDM makes me think we could possibly abuse SYSRET
> > to emulate IRET, and then possibly simplify the flags processing. It says
> > that it takes the CPL3 code segment but nowhere it says that the target is
> > validated for effectively being userland, and further it suggests that it
> > doesn't validate anything :
> >
> > "It is the responsibility of the OS to ensure the descriptors in
> > the GDT/LDT correspond to the selectors loaded by SYSCALL/SYSRET
> > (consistent with the base, limit, and attribute values forced by
> > the instructions)."
> You are an evil bastard. I seriously doubt that this will work.
> SYSRET goes to CPL3 no matter what. Also, I don't think you want to
> start poking at MSRs to return.
On Intel the bottom two bits of the selector are forced to 11. The
pseudocode of SYSRET in the SDM has an explicit
CS.Selector ← (IA32_STAR[63:48]+ either 0 or 16) OR 3;
...
SS.Selector ← (IA32_STAR[63:48]+8) OR 3;
On AMD it's even worse, because you get a weird state with
CS.DPL=CS.RPL=SS.DPL=SS.RPL=0 but still the CPL is 3. This is seriously
messed up because the CPL is always SS.DPL except in this case. AMD
even had to add a separate field for the CPL to their VM control block,
just to account for this case. Intel more sanely uses SS.DPL.
Paolo
On Thu, Jul 30, 2015 at 8:41 AM, Paolo Bonzini <[email protected]> wrote:
>
>
> On 24/07/2015 23:08, Andy Lutomirski wrote:
>> user_icebp is set if int $0x01 happens, except it isn't because user
>> code can't actually do that -- it'll cause #GP instead.
>>
>> user_icebp is also set if the user has a bloody in-circuit emulator,
>> given the name. But who on Earth has one of those on a system new
>> enough to run Linux and, even if they have one, why on Earth are they
>> using it to send SIGTRAP.
>
> You do not need either "int $0x01" or an ICE to set user_icebp = 1. You
> can use the 0xf1 opcode, which is kinda like 0xcc but generates #DB
> instead of #BP.
Great. There's an opcode that invokes an interrupt gate that's not
marked as allowing unprivileged access, and that opcode doesn't appear
in the SDM. It appears in the APM opcode map with no explanation at
all.
Thanks, CPU vendors.
--Andy
On Thu, Jul 30, 2015 at 5:22 PM, Andy Lutomirski <[email protected]> wrote:
> On Thu, Jul 30, 2015 at 8:41 AM, Paolo Bonzini <[email protected]> wrote:
>>
>>
>> On 24/07/2015 23:08, Andy Lutomirski wrote:
>>> user_icebp is set if int $0x01 happens, except it isn't because user
>>> code can't actually do that -- it'll cause #GP instead.
>>>
>>> user_icebp is also set if the user has a bloody in-circuit emulator,
>>> given the name. But who on Earth has one of those on a system new
>>> enough to run Linux and, even if they have one, why on Earth are they
>>> using it to send SIGTRAP.
>>
>> You do not need either "int $0x01" or an ICE to set user_icebp = 1. You
>> can use the 0xf1 opcode, which is kinda like 0xcc but generates #DB
>> instead of #BP.
>
> Great. There's an opcode that invokes an interrupt gate that's not
> marked as allowing unprivileged access, and that opcode doesn't appear
> in the SDM. It appears in the APM opcode map with no explanation at
> all.
>
> Thanks, CPU vendors.
>
> --Andy
Some Windows programs (running in Wine) use this opcode for
anti-debugging code. See commit
a1e80fafc9f0742a1776a0490258cb64912411b0.
--
Brian Gerst
On Thu, 30 Jul 2015, Andy Lutomirski wrote:
> On Thu, Jul 30, 2015 at 8:41 AM, Paolo Bonzini <[email protected]> wrote:
> >
> >
> > On 24/07/2015 23:08, Andy Lutomirski wrote:
> >> user_icebp is set if int $0x01 happens, except it isn't because user
> >> code can't actually do that -- it'll cause #GP instead.
> >>
> >> user_icebp is also set if the user has a bloody in-circuit emulator,
> >> given the name. But who on Earth has one of those on a system new
> >> enough to run Linux and, even if they have one, why on Earth are they
> >> using it to send SIGTRAP.
> >
> > You do not need either "int $0x01" or an ICE to set user_icebp = 1. You
> > can use the 0xf1 opcode, which is kinda like 0xcc but generates #DB
> > instead of #BP.
>
> Great. There's an opcode that invokes an interrupt gate that's not
> marked as allowing unprivileged access, and that opcode doesn't appear
> in the SDM. It appears in the APM opcode map with no explanation at
> all.
The only SDM reference I found is:
"The opcodes D6 and F1 are undefined opcodes reserved by the Intel 64
and IA-32 architectures. These opcodes, even though undefined, do
not generate an invalid opcode exception."
D6 is actually something useful:
        if (carry flag set)
                AL = FF
        else
                AL = 0
It's been there since i386. It has been convenient for return code
magic from ASM to C. I haven't thought of it for at least a decade :)
So all we need to worry about is F1, but that's bad enough :(
Thanks,
tglx
On Thu, Jul 30, 2015 at 02:22:06PM -0700, Andy Lutomirski wrote:
> Great. There's an opcode that invokes an interrupt gate that's not
> marked as allowing unprivileged access, and that opcode doesn't appear
> in the SDM. It appears in the APM opcode map with no explanation at
> all.
>
> Thanks, CPU vendors.
Here's something better:
http://www.rcollins.org/secrets/opcodes/ICEBP.html
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
--
On Thu, Jul 30, 2015 at 9:22 PM, Borislav Petkov <[email protected]> wrote:
> On Thu, Jul 30, 2015 at 02:22:06PM -0700, Andy Lutomirski wrote:
>> Great. There's an opcode that invokes an interrupt gate that's not
>> marked as allowing unprivileged access, and that opcode doesn't appear
>> in the SDM. It appears in the APM opcode map with no explanation at
>> all.
>>
>> Thanks, CPU vendors.
>
> Here's something better:
>
> http://www.rcollins.org/secrets/opcodes/ICEBP.html
This instruction is awesome. Binutils can disassemble it (it's called
"icebp") but it can't assemble it. KVM has special handling for it on
VMX and actually reports it to QEMU on SVM (complete with a defined
ABI). We have an asm macro so we can assemble it for 32-bit but not
64-bit, despite the fact that it works on 64-bit.
The kernel instruction decoder can't decode it.
Fortunately, it looks like the vm86 case is correct (or as correct as
any of the vm86 junk can be), although I haven't tested it. I bet
that icebp is like int3 in that it punches through vm86 mode instead
of sending #GP.
--Andy
On 31/07/2015 07:11, Andy Lutomirski wrote:
> This instruction is awesome. Binutils can disassemble it (it's called
> "icebp") but it can't assemble it. KVM has special handling for it on
> VMX and actually reports it to QEMU on SVM (complete with a defined
> ABI).
FWIW it's not reported to QEMU, it's only reported to a nested
hypervisor. So the ABI is simply the SVM spec.
It's not surprising that VMX support was provided by the Wine guys...
Paolo
On Thu, Jul 30, 2015 at 10:11:40PM -0700, Andy Lutomirski wrote:
> This instruction is awesome. Binutils can disassemble it (it's called
> "icebp") but it can't assemble it. KVM has special handling for it on
> VMX and actually reports it to QEMU on SVM (complete with a defined
> ABI).
Fun.
> We have an asm macro so we can assemble it for 32-bit but not
> 64-bit, despite the fact that it works on 64-bit.
>
> The kernel instruction decoder can't decode it.
Yeah, the kernel insn decoder needs to be fixed. Even my decoder can
decode it:
$ echo "0xf1" | ./x86d -
0: f1 icebp
Big deal. :-)
Let's do some fun and games:
$ cat icebp.c
int main()
{
        asm volatile(".byte 0xf1");
        return 0;
}
$ gcc -Wall -o icebp{,.c}
$ objdump -d icebp
...
00000000004004ac <main>:
4004ac: 55 push %rbp
4004ad: 48 89 e5 mov %rsp,%rbp
4004b0: f1 icebp
4004b1: b8 00 00 00 00 mov $0x0,%eax
4004b6: 5d pop %rbp
4004b7: c3 retq
4004b8: 90 nop
...
$ ./icebp
Trace/breakpoint trap
^ this in qemu.
On baremetal it gets a SIGTRAP with TRAP_BRKPT. Looks like signal
handling knows about it...
$ strace /tmp/icebp
execve("/tmp/icebp", ["/tmp/icebp"], [/* 27 vars */]) = 0
brk(0) = 0x1680000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71e243d000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=127070, ...}) = 0
mmap(NULL, 127070, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f71e241d000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\34\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1729984, ...}) = 0
mmap(NULL, 3836448, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f71e1e76000
mprotect(0x7f71e2015000, 2097152, PROT_NONE) = 0
mmap(0x7f71e2215000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x19f000) = 0x7f71e2215000
mmap(0x7f71e221b000, 14880, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f71e221b000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71e241c000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71e241b000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71e241a000
arch_prctl(ARCH_SET_FS, 0x7f71e241b700) = 0
mprotect(0x7f71e2215000, 16384, PROT_READ) = 0
mprotect(0x7f71e243f000, 4096, PROT_READ) = 0
munmap(0x7f71e241d000, 127070) = 0
--- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=4195505, si_uid=0} ---
+++ killed by SIGTRAP +++
Trace/breakpoint trap
> Fortunately, it looks like the vm86 case is correct (or as correct as
> any of the vm86 junk can be), although I haven't tested it. I bet
> that icebp is like int3 in that it punches through vm86 mode instead
> of sending #GP.
Yeah, INT 1. I wonder whether INT 1, i.e. CD imm8 does the same thing.
But why do you say it is special - it simply raises #DB, i.e. vector 1.
Web page seems to say so when interrupt redirection is disabled. It
sounds like a nice and quick way to generate a breakpoint. You can do
that with INT 01, i.e., the CD opcode, too.
If I had to guess, it isn't documented because of the proprietary ICE
aspect. And no one uses ICEs anymore, so it is going to be forgotten,
with people popping up now and then to ask about the undocumented opcode.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
--
On 31/07/2015 10:03, Borislav Petkov wrote:
> $ ./icebp
> Trace/breakpoint trap
>
> ^ this in qemu.
Is the strace different between KVM and baremetal? QEMU dynamic
translation is broken I think, but KVM should be the same as baremetal.
>> Fortunately, it looks like the vm86 case is correct (or as correct as
>> any of the vm86 junk can be), although I haven't tested it. I bet
>> that icebp is like int3 in that it punches through vm86 mode instead
>> of sending #GP.
>
> Yeah, INT 1. I wonder whether INT 1, i.e. CD imm8 does the same thing.
No, it sends #GP.
> But why do you say it is special - it simply raises #DB, i.e. vector 1.
> Web page seems to say so when interrupt redirection is disabled. It
> sounds like a nice and quick way to generate a breakpoint. You can do
> that with INT 01, i.e., the CD opcode, too.
>
> If I'd had to guess, it isn't documented because of the proprietary ICE
> aspect. And no one uses ICEs anymore so it is going to be forgotten with
> people popping off and on and asking about the undocumented opcode.
The reason why it isn't documented is probably hidden within Intel.
Besides ICEBP, which is a bit fringe, there's no reason not to document
SALC, which Thomas mentioned. SALC has been there since the 8086,
and has been undocumented for thirty-odd years.
The AAM/AAD variants with immediates other than 10 also have been
undocumented for fifteen years or so (an instruction doing a division by
10 where the second byte of the opcode is 10? oh, certainly no one is
going to try changing the second byte...)
Paolo
On Fri, Jul 31, 2015 at 11:27:13AM +0200, Paolo Bonzini wrote:
> Is the strace different between KVM and baremetal?
Yes, the signal part is missing from kvm:
$ strace ./icebp
execve("./icebp", ["./icebp"], [/* 20 vars */]) = 0
brk(0) = 0x601000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7ff6000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=95207, ...}) = 0
mmap(NULL, 95207, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7ffff7fde000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\357\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1595408, ...}) = 0
mmap(NULL, 3709016, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7ffff7a53000
mprotect(0x7ffff7bd3000, 2097152, PROT_NONE) = 0
mmap(0x7ffff7dd3000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x180000) = 0x7ffff7dd3000
mmap(0x7ffff7dd8000, 18520, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ffff7dd8000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7fdd000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7fdc000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7fdb000
arch_prctl(ARCH_SET_FS, 0x7ffff7fdc700) = 0
mprotect(0x7ffff7dd3000, 16384, PROT_READ) = 0
mprotect(0x7ffff7ffc000, 4096, PROT_READ) = 0
munmap(0x7ffff7fde000, 95207) = 0
exit_group(0) = ?
> No, it sends #GP.
True story:
[ 697.707990] traps: icebp[3537] general protection ip:4004b0 sp:7fffffffe610 error:a in icebp[400000+1000]
but why? I guess our IDT entry at 1 is funny... Too lazy to check.
> The reason why it isn't documented is probably hidden within Intel.
> Besides ICEBP, which is a bit fringe, there's no reason not to document
> SALC, which Thomas mentioned. SALC has been there since the 8086,
> and has been undocumented for thirty-odd years.
That one is invalid (on an IVB):
[ 1306.231408] traps: icebp[3783] trap invalid opcode ip:4004b0 sp:7fffffffe610 error:0 in icebp[400000+1000]
AMD APM documents it as invalid too.
> The AAM/AAD variants with immediates other than 10 also have been
> undocumented for fifteen years or so (an instruction doing a division
> by 10 where the second byte of the opcode is 10? oh, certainly no one
> is going to try changing the second byte...)
There's this in the AMD APM:
"In most modern assemblers, the AAM instruction adjusts to base-10
values. However, by coding the instruction directly in binary, it can
adjust to any base specified by the immediate byte value (ib) suffixed
onto the D4h opcode. For example, code D408h for octal, D40Ah for
decimal, and D40Ch for duodecimal (base 12)."
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
--
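To make the non-10 immediates concrete (worked by hand, not taken from the
APM): AAM imm8 computes AH = AL / imm8 and AL = AL mod imm8, so starting
from AL = 63 (0x3F), D4 0A leaves AH = 6, AL = 3, while D4 0C leaves
AH = 5, AL = 3, i.e. the base-12 digits of 63. Like SALC, it is invalid in
64-bit mode.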
On 31/07/2015 12:25, Borislav Petkov wrote:
>> > The reason why it isn't documented is probably hidden within Intel.
>> > Besides ICEBP, which is a bit fringe, there's no reason not to document
>> > SALC, which Thomas mentioned. SALC has been there since the 8086,
>> > and has been undocumented for thirty-odd years.
> That one is invalid (on an IVB):
>
> [ 1306.231408] traps: icebp[3783] trap invalid opcode ip:4004b0 sp:7fffffffe610 error:0 in icebp[400000+1000]
>
> AMD APM documents it as invalid too.
It's valid in 32-bit.
Paolo
On Fri, Jul 31, 2015 at 12:26:34PM +0200, Paolo Bonzini wrote:
>
>
> On 31/07/2015 12:25, Borislav Petkov wrote:
> >> > The reason why it isn't documented is probably hidden within Intel.
> >> > Besides ICEBP, which is a bit fringe, there's no reason not to document
> >> > SALC, which Thomas mentioned. SALC has been there since the 8086,
> >> > and has been undocumented for thirty-odd years.
> > That one is invalid (on an IVB):
> >
> > [ 1306.231408] traps: icebp[3783] trap invalid opcode ip:4004b0 sp:7fffffffe610 error:0 in icebp[400000+1000]
> >
> > AMD APM documents it as invalid too.
>
> It's valid in 32-bit.
Yap, no invalid opcode there. I guess there's another bug in the APM's
opcode table then.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
--