2010-06-02 19:24:46

by Oleg Nesterov

[permalink] [raw]
Subject: Bug 16061 - single stepping in a signal handler can cause the single step flag to get "stuck"

(see https://bugzilla.kernel.org/show_bug.cgi?id=16061)

Roland McGrath wrote:
>
> Oleg, please get an appropriate test case for this into the ptrace-tests suite.

The first thing I did, I created the test-case ;)

#include <signal.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/user.h>
#include <assert.h>
#include <stddef.h>

void handler(int n)
{
asm("nop; nop; nop");
}

int child(void)
{
assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
signal(SIGALRM, handler);
kill(getpid(), SIGALRM);
return 0x23;
}

void *getip(int pid)
{
return (void*)ptrace(PTRACE_PEEKUSER, pid,
offsetof(struct user, regs.rip), 0);
}

int main(void)
{
int pid, status;

pid = fork();
if (!pid)
return child();

assert(wait(&status) == pid);
assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGALRM);

assert(ptrace(PTRACE_SINGLESTEP, pid, 0, SIGALRM) == 0);
assert(wait(&status) == pid);
assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP);
assert((getip(pid) - (void*)handler) == 0);

assert(ptrace(PTRACE_SINGLESTEP, pid, 0, SIGALRM) == 0);
assert(wait(&status) == pid);
assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP);
assert((getip(pid) - (void*)handler) == 1);

assert(ptrace(PTRACE_CONT, pid, 0,0) == 0);
assert(wait(&status) == pid);
assert(WIFEXITED(status) && WEXITSTATUS(status) == 0x23);

return 0;
}

It is x86 specific and needs -O2. Probably I can just remove getip()
and related asserts and send it to Jan.

> That change might be the right one, but we should discuss it more in email, and
> look at the situation on other machines.

Yes. And I think it is better to discuss this on lkml.

I do not know what is the right fix. I do not like the fix in
https://bugzilla.kernel.org/attachment.cgi?id=26587 even if it is correct.
I can't explain this, but I think that tracehook.h is not the right place
to call disable_step(). And note that handle_signal() plays with TF anyway.

I am starting to think we should fix this per arch. As for x86, perhaps
we should start with this one-liner

spin_unlock_irq(&current->sighand->siglock);

tracehook_signal_handler(sig, info, ka, regs,
- test_thread_flag(TIF_SINGLESTEP));
+ test_and_clear_thread_flag(TIF_SINGLESTEP));

return 0;
}

then do other changes.

However, what I am thinking about is the more "clever" change (it
passed ptrace-tests).

Do you think it can be correct? I am asking because I never understood
the TIF_SINGLESTEP/TIF_FORCED_TF interaction. But otoh, shouldn't
TIF_FORCED_TF + X86_EFLAGS_TF always imply TIF_SINGLESTEP? at least
in handle_signal().

IOW, help! To me, the patch below is also cleanup. But only if you think
it can fly ;)

Oleg.

--- 34-rc1/arch/x86/kernel/signal.c~BZ16061_MAYBE_FIX 2010-06-02 21:11:07.000000000 +0200
+++ 34-rc1/arch/x86/kernel/signal.c 2010-06-02 21:11:48.000000000 +0200
@@ -682,6 +682,7 @@ static int
handle_signal(unsigned long sig, siginfo_t *info, struct k_sigaction *ka,
sigset_t *oldset, struct pt_regs *regs)
{
+ bool stepping;
int ret;

/* Are we from a system call? */
@@ -706,13 +707,10 @@ handle_signal(unsigned long sig, siginfo
}
}

- /*
- * If TF is set due to a debugger (TIF_FORCED_TF), clear the TF
- * flag so that register information in the sigcontext is correct.
- */
- if (unlikely(regs->flags & X86_EFLAGS_TF) &&
- likely(test_and_clear_thread_flag(TIF_FORCED_TF)))
- regs->flags &= ~X86_EFLAGS_TF;
+ stepping = test_thread_flag(TIF_SINGLESTEP);
+ if (stepping)
+ // do this before setup_sigcontext()
+ user_disable_single_step(current);

ret = setup_rt_frame(sig, ka, info, oldset, regs);

@@ -748,8 +746,7 @@ handle_signal(unsigned long sig, siginfo
recalc_sigpending();
spin_unlock_irq(&current->sighand->siglock);

- tracehook_signal_handler(sig, info, ka, regs,
- test_thread_flag(TIF_SINGLESTEP));
+ tracehook_signal_handler(sig, info, ka, regs, stepping);

return 0;
}


2010-06-02 20:08:48

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Bug 16061 - single stepping in a signal handler can cause the single step flag to get "stuck"

sorry for noise, forgot to mention...

On 06/02, Oleg Nesterov wrote:
>
> However, what I am thinking about is the more "clever" change (it
> passed ptrace-tests).
>
> Do you think it can be correct? I am asking because I never understood
> the TIF_SINGLESTEP/TIF_FORCED_TF interaction. But otoh, shouldn't
> TIF_FORCED_TF + X86_EFLAGS_TF always imply TIF_SINGLESTEP? at least
> in handle_signal().
>
> IOW, help! To me, the patch below is also cleanup. But only if you think
> it can fly ;)

and it is not clear to me if we should keep this code

/*
* Clear TF when entering the signal handler, but
* notify any tracer that was single-stepping it.
* The tracer may want to single-step inside the
* handler too.
*/
regs->flags &= ~X86_EFLAGS_TF;

in handle_signal(). If TF was set by us, it was cleared by
user_disable_single_step(). Otherwise, why should we clear it if
the tracer did set_flags(X86_EFLAGS_TF) ?

Oleg.

> --- 34-rc1/arch/x86/kernel/signal.c~BZ16061_MAYBE_FIX 2010-06-02 21:11:07.000000000 +0200
> +++ 34-rc1/arch/x86/kernel/signal.c 2010-06-02 21:11:48.000000000 +0200
> @@ -682,6 +682,7 @@ static int
> handle_signal(unsigned long sig, siginfo_t *info, struct k_sigaction *ka,
> sigset_t *oldset, struct pt_regs *regs)
> {
> + bool stepping;
> int ret;
>
> /* Are we from a system call? */
> @@ -706,13 +707,10 @@ handle_signal(unsigned long sig, siginfo
> }
> }
>
> - /*
> - * If TF is set due to a debugger (TIF_FORCED_TF), clear the TF
> - * flag so that register information in the sigcontext is correct.
> - */
> - if (unlikely(regs->flags & X86_EFLAGS_TF) &&
> - likely(test_and_clear_thread_flag(TIF_FORCED_TF)))
> - regs->flags &= ~X86_EFLAGS_TF;
> + stepping = test_thread_flag(TIF_SINGLESTEP);
> + if (stepping)
> + // do this before setup_sigcontext()
> + user_disable_single_step(current);
>
> ret = setup_rt_frame(sig, ka, info, oldset, regs);
>
> @@ -748,8 +746,7 @@ handle_signal(unsigned long sig, siginfo
> recalc_sigpending();
> spin_unlock_irq(&current->sighand->siglock);
>
> - tracehook_signal_handler(sig, info, ka, regs,
> - test_thread_flag(TIF_SINGLESTEP));
> + tracehook_signal_handler(sig, info, ka, regs, stepping);
>
> return 0;
> }

2010-06-06 16:40:20

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH 0/1] ptrace: x86: stepping in a signal handler leaks X86_EFLAGS_TF

On 06/02, Oleg Nesterov wrote:
>
> I am starting to think we should fix this per arch. As for x86, perhaps
> we should start with this one-liner
>
> spin_unlock_irq(&current->sighand->siglock);
>
> tracehook_signal_handler(sig, info, ka, regs,
> - test_thread_flag(TIF_SINGLESTEP));
> + test_and_clear_thread_flag(TIF_SINGLESTEP));
>
> return 0;
> }
>
> then do other changes.

I am sending this patch. It is still not clear to me what is the
"right" fix, we need more discussion. Let's fix the bug first.

Oleg.

2010-06-06 16:40:57

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH 1/1] ptrace: x86: stepping in a signal handler leaks X86_EFLAGS_TF

See https://bugzilla.kernel.org/show_bug.cgi?id=16061

When the TIF_SINGLESTEP tracee dequeues a signal, handle_signal()
clears TIF_FORCED_TF and X86_EFLAGS_TF but leaves TIF_SINGLESTEP set.

If the tracer does PTRACE_SINGLESTEP again, enable_single_step() sets
X86_EFLAGS_TF but not TIF_FORCED_TF. This means that the subsequent
PTRACE_CONT doesn't not clear X86_EFLAGS_TF, and the tracee gets the
wrong SIGTRAP.

Test-case (based on the excellent report from Evan):

#include <unistd.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/user.h>
#include <assert.h>
#include <stddef.h>

void handler(int n)
{
asm("nop");
}

int child(void)
{
assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
signal(SIGALRM, handler);
kill(getpid(), SIGALRM);
return 0x23;
}

void *getip(int pid)
{
return (void*)ptrace(PTRACE_PEEKUSER, pid,
offsetof(struct user, regs.rip), 0);
}

int main(void)
{
int pid, status;

pid = fork();
if (!pid)
return child();

assert(wait(&status) == pid);
assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGALRM);

assert(ptrace(PTRACE_SINGLESTEP, pid, 0, SIGALRM) == 0);
assert(wait(&status) == pid);
assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP);
assert((getip(pid) - (void*)handler) == 0);

assert(ptrace(PTRACE_SINGLESTEP, pid, 0, SIGALRM) == 0);
assert(wait(&status) == pid);
assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP);
assert((getip(pid) - (void*)handler) == 1);

assert(ptrace(PTRACE_CONT, pid, 0,0) == 0);
assert(wait(&status) == pid);
assert(WIFEXITED(status) && WEXITSTATUS(status) == 0x23);

return 0;
}

Change handle_signal() to clear TIF_SINGLESTEP as well. We cleared
X86_EFLAGS_TF and TIF_FORCED_TF, we are not going to return to user-mode
if TIF_SINGLESTEP was set, and we already passed syscall_trace_leave().
We are going to sleep until the next ptrace_resume() which should set
these flags correctly if needed.

Note: this is the most simple fix now, most probably we need more
changes/cleanups on top of this patch.

Reported-by: Evan Teran <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
---

arch/x86/kernel/signal.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- 34-rc1/arch/x86/kernel/signal.c~BZ16061_TEMPORARY_FIX 2010-06-06 16:22:55.000000000 +0200
+++ 34-rc1/arch/x86/kernel/signal.c 2010-06-06 16:47:55.000000000 +0200
@@ -749,7 +749,7 @@ handle_signal(unsigned long sig, siginfo
spin_unlock_irq(&current->sighand->siglock);

tracehook_signal_handler(sig, info, ka, regs,
- test_thread_flag(TIF_SINGLESTEP));
+ test_and_clear_thread_flag(TIF_SINGLESTEP));

return 0;
}

2010-06-16 02:31:17

by Roland McGrath

[permalink] [raw]
Subject: Re: Bug 16061 - single stepping in a signal handler can cause the single step flag to get "stuck"

Sorry I'm slow in getting back to this. I was sick last week and I'm
still rather foggy-headed. I haven't actually tried today to think
through all the corners you've talked about.

I'm hoping a couple of comments might clarify things for you enough
that you'll restate it all in ways that don't require me to get less
foggy and follow every detail of what you've said so far. :-}

x86's TIF_SINGLESTEP is slightly confusingly overloaded for two things
that are sort of different, but also perhaps are entirely redundant.
(I'm the one who did that to begin with when originally fixing the
syscall-step and signal-step corners years ago, but I still get to
complain. ;-) Its original use is for do_debug:

if ((dr6 & DR_STEP) && !user_mode(regs)) {

This is for non-interrupt kernel entries where the hardware does not
clear TF from the incoming user-mode eflags. I believe that's only the
sysenter instruction, but I'm not entirely positive off hand. (This is
helpfully not true of the syscall instruction that x86-64 has and that
AMD CPUs use instead of sysenter for 32-bit processes on x86-64--grep
MSR_SYSCALL_MASK.) Here the original idea was that the user registers
have TF set, but it took effect in kernel mode. So it has to be cleared
to stop interfering with the normal kernel asm paths. But it wants to
be set again before we return to user mode. This is simplistic thinking
just trying to mimic the hardware logic for interrupt entries (including
'int $0x80' syscalls), where the hardware stores TF in the trap frame
with the rest of the user-mode state, and clears TF before executing the
first kernel-mode instruction. (Since we set the MSR right, that's what
'syscall' instructions do in hardware too.) In the problem cases (such
as ia32_sysenter_target), the hardware just goes to kernel mode and only
changes cs and ip so the kernel code has to manually store all the
user-mode state, including its eflags. Since eflags is left as it was,
TF there causes the trap to do_debug after the first kernel-mode
instruction--a few instructions before the kernel asm has stored the
user eflags anywhere. So the original meaning of TIF_SINGLESTEP was
simply to turn TF back on before going to user mode (I think it might
originally have been handled in do_notify_resume). That was before I
got involved. Perhaps some ancient LKML archives can reveal the actual
history, but I've never looked. I suspect the way it went was that
people noticed an unhandled trap in kernel mode after sysenter support
was added, and just scratched that itch by clearing it in do_debug.
Then it would have been noticed that a sysenter (vs an 'int $0x80')
caused TF state just to be lost, so that both user-mode setting TF
itself, and PTRACE_SINGLESTEP as it existed at the time, stopped working
as they had. That's solved by adding TIF_SINGLESTEP as it originally
was, just to get TF set again in user mode.

Then it was cited as a problem that single-step over a system call
instruction (of whichever flavor) didn't actually single-step, but
double-stepped. This is when I first got involved, and this was the
beginning of my understanding of the kernel entry asm paths--I was
probably quite vague then on how the do_debug trap in kernel mode case
might happen. For 'int $0x80' system calls, this is because the
user-mode TF is just restored by the hardware in 'iret' (which is what
transitions the hardware back from kernel to user mode), so it fires
when the next user instruction completes--but from the user perspective,
the last instruction was the 'int $0x80' itself (done with TF set), so
by the hardware rules as they apply to all unprivileged instructions,
the single-step trap should be right before the following instruction,
not after it. For sysenter or syscall, it's similar: the sysexit (or
sysret) instruction restores the user-mode eflags at the same time it
transitions to user mode, so that TF applies to the following user
instruction rather than the sysenter (syscall) that has just completed
from the user perspective.

To address that, we added the TIF_SINGLESTEP path to syscall_trace_leave
(to further confuse things, it was a unified syscall_trace at the time).
That originally just treated TIF_SINGLESTEP the same as it did
TIF_SYSCALL_TRACE for exit tracing, since basic ptrace stops look the
same either way. (We later cleaned that up since PTRACE_O_TRACESYSGOOD
makes a syscall-exit stop look different from a single-step trap, and
PTRACE_SINGLESTEP should look like a step, not like PTRACE_SYSCALL when
you didn't use that. I also tried to clean it up more with thoughts of
utrace, where syscall-tracing and single-step are independent knobs
rather than being mutually exclusive as the old ptrace interface has it.
And, of course, there was TIF_SYSCALL_EMU thrown in there somewhere to
really complicate the possible control flow.)

Much later, the TIF_SINGLESTEP path to syscall_trace_enter was added.
(I'm skipping ahead in the chronology here to stay closer on topic.)
This is for the (original) scenario of the DR_STEP trap in kernel mode,
i.e. the sysenter entry path. This is to notice that the user-mode TF
was artificially cleared by do_debug, and restore it into
task_pt_regs()->flags as early as possible. (This would have been a
good way to do the original TIF_SINGLESTEP back in prehistory, but
nobody thought of doing it that way.) This is so that an examination of
task_pt_regs() sees the "real" user-mode state, e.g. ptrace peeking at
it during the syscall-entry tracing stop.

Back to prehistory. It's always been possible in the hardware for
user-mode to set TF itself, i.e.:

pushf
orw $0x200, (%esp)
popf

Weird applications actually do this, for strange ideas about code that
checks who calls it, pure user self-debugging, and other weirdness.
(That all may be ill-advised to do, but it's always been well-defined.)
When user code does this, what happens is a natural analogue of what the
hardware does: you get a SIGTRAP signal (presumably having installed a
handler), with the whole register state (including TF set in eflags) in
the struct sigcontext for the handler, but TF is cleared before running
the signal handler, so it doesn't recurse. This is why signal.c clears
TF in the task_pt_regs() (same pointer as regs arg to handle_signal)
when setting up a signal handler.

Around the same time as the syscall-step "double-step" issue was
noticed, someone noticed that PTRACE_SINGLESTEP,signr for a handled
signal had a similar issue: it gives you a SIGTRAP after executing the
first instruction of the signal handler. But a person using a debugger
would like to see a "stepi" operation go from one known user instruction
that was about to be executed to the very next user-mode instruction
that will be executed--i.e., stop before the first instruction of the
signal handler and have the chance to look at registers and memory
before executing it. Since in the signal-handling scenario, there is no
state at which you were before one user instruction with TF set, and at
the completion of that instruction you are at the first user instruction
of the signal handler, there is no way to have an actual trap (do_debug)
at the user register state we want to see. So instead we added a "fake
trap" for PTRACE_SINGLESTEP, in the form of a ptrace_notify() call.
Later that became tracehook_signal_handler().

Not long after, I wanted to clean things up, and added TIF_FORCED_TF.
This is to keep debugger single-stepping coherent when user mode sets TF
itself, and to only ever show the "real" user mode eflags bits even when
debugger single-stepping is in use. This is for user_regset (e.g. for
ptrace peeking at registers) and for struct sigcontext when the debugger
single-steps into a signal handler. Linus had one of the aforementioned
weird applications in binary form and wanted to debug it, so he went
whole-hog and also added the is_setting_trap_flag() code that is now in
step.c--so single-stepping that "popf" clears TIF_FORCED_TF, and we
won't later confuse the user-chosen "real" TF (or lack thereof) with one
induced by user_enable_single_step(). (That detection is imperfect in
various ways, but solved the problem Linus had.)

To complete the background, there is one more set of wrinkles. There
are the various potential ptrace stops that take place "inside" some
system call (execve, clone, fork, or vfork). In these cases, from the
time a debugger could do PTRACE_SINGLESTEP with the user-mode PC about
to do the system call instruction (whichever one) until the next actual
execution of a user-mode instruction (i.e. the following one or a signal
handler preempting it, or the very first one at the entry point for an
exec), there is only a single "step" to make from one instruction to the
next. But there might be three places for ptrace stops before you get
to where resuming would actually execute that next instruction:
syscall-entry tracing, in-syscall tracing (i.e. PTRACE_EVENT_*), and
syscall-exit tracing. Then there might be a signal stop. At all these
places, the debugger can decide to PTRACE_SINGLESTEP or not, after
either it or user-mode originally was single-stepping into the system
call instruction. So what does that all mean? Historically, we've
decided that if the last kind of resumption before signal handling was
single-step (or now, block-step) then we act as if you'd single-stepped
into the system call instruction, i.e. stop before the next user-mode
instruction. So, if you had not been single-stepping and the first time
you looked at the user registers was in a PTRACE_EVENT_* stop, and then
you do PTRACE_SINGLESTEP, you'd next see a SIGTRAP (now sent inside
tracehook_report_syscall_exit(,1)) with the same PC that you just saw
(the one following the system call instruction).

One final note about the current behavior, for user-mode setting TF
itself. Because the trap in kernel mode (sysenter instruction) makes
do_debug set TIF_SINGLESTEP, this triggers the syscall_trace_leave path.
So a user that does sysenter with TF set will get his SIGTRAP
immediately after the sysenter instruction. Conversely, a syscall entry
with TF set via 'int $0x80' or via x86-64's 'syscall' just stores the
user's full eflags into task_pt_regs()->flags and never notices whether
it contains TF, never sets TIF_SINGLESTEP. Hence, one of these entries
(i.e. all syscalls from 64-bit user mode, and non-sysenter syscalls from
32-bit user mode on both 32-bit and 64-bit kernels) will "double-step"
over the system call instruction and the one after it. Unless I'm
mistaken, this is how it's always been for 'int $0x80' with TF set, and
was originally that way when there was first sysenter support too, and
from the time x86-64 came along, 64-bit system calls have behaved that
way too. In the now several years since we overloaded TIF_SINGLESTEP
for PTRACE_SINGLESTEP over system calls not to double-step, user-mode
setting TF executing sysenter has behaved differently from 'int $0x80'.
If it has ever been done in actual application code, that is. It's
likely that all the users of TF in pure user mode either don't care
about system calls at all, or use TF only on code that only uses 'int
$0x80' system calls. (But who knows.) So the status quo for user-mode
setting TF and doing system calls of various sorts is of dubious
correctness in the abstract, but apparently good enough for the music of
the people--and if the double-step is what it is, then the sysenter
variation from that is dubious; but since nobody complains, apparently
if anyone has ever actually cared they already catered to the quirk. So
there is not much motivation to have any of the behavior seen by pure
user TF machinations change at all, even if that would be in the
direction of "obviously right in the abstract" (i.e. single-step means
single-step, not double-step if the first instruction was too magical).

So there you have it. What does it all mean for how it ought to be today?

TIF_SINGLESTEP today means either or both of two things: TF was in force
when user-mode executed sysenter (whether that TF was set by user-mode
or by the debugger); or user_enable_single_step() was used more recently
than user_disable_single_step(), i.e. if we're in a system call now,
we're meant to act as if we'd single-stepped into the system call
instruction.

TIF_FORCED_TF today means that the user-mode eflags has TF clear,
whether or not TF is actually set in task_pt_regs()->flags at the
moment. Various things (both hardware and kernel code) can clear TF
from the user-mode eflags but fail to clear TIF_FORCED_TF.

It may make sense to have signal.c just notice TIF_FORCED_TF and mask TF
out of what it stores in the signal frame rather than actually clearing
it in the registers beforehand. (Note you must change ia32_signal.c
too. If you use a helper for that, perhaps just globalize and rename
get_flags() from ptrace.c, since we have it there too.) But it still
needs to clear TF for the handler entry if TIF_SINGLESTEP is not set.
And notice also restore_sigcontext (and ia32_restore_sigcontext), which
also lets userland choose the TF value without regard to TIF_FORCED_TF.
That should clear TIF_FORCED_TF if the new user value has TF, or else
keep TF set if TIF_FORCED_TF, like set_flags() from ptrace.c does.
(ptrace.c, signal.c, and ia32_signal.c all have their own ideas about
the mask of eflags that userland (i.e.) sigreturn-self or ptrace-other
is allowed to set or clear. That's another dubious situation of its
own, but if we harmonize them then both directions of sigcontext
handling can just share the ptrace.c code.)

The change to the tracehook_signal_handler() call seems quite wrong to
me. I'm not sure I could even define exactly what TIF_SINGLESTEP means
after that. I think the problem it tries to address won't exist if all
the signal paths that touch TF maintain TIF_FORCED_TF judiciously
(e.g. by sharing the ptrace.c code), as I just described.

It might make sense to replace these two TIF flags with two flags that
have different meanings than these, and adjust everything accordingly to
wind up with a situation that's easier to follow. But I don't have a
clear idea of what the definitions of those flags would be.


Thanks,
Roland

2010-06-16 19:58:29

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Bug 16061 - single stepping in a signal handler can cause the single step flag to get "stuck"

Roland, thanks a lot for this (long!) explanation. Another email from
you which I should save in ~/DOCS. I don't know how many time you spent
writing this message, but I bet I spent more trying to understand it ;)
And still I need to read it again later.

On 06/15, Roland McGrath wrote:
>
> x86's TIF_SINGLESTEP is slightly confusingly overloaded for two things
> that are sort of different, but also perhaps are entirely redundant.

Btw yes, I had the same feeling when I did ptrace-utrace.

> (I'm the one who did that to begin with when originally fixing the
> syscall-step and signal-step corners years ago, but I still get to
> complain. ;-) Its original use is for do_debug:
>
> if ((dr6 & DR_STEP) && !user_mode(regs)) {
>
> This is for non-interrupt kernel entries where the hardware does not
> clear TF from the incoming user-mode eflags.

Aha.

Damn. Can't resists. And when I read to this point, I decided to
learn what exactly DR_STEP and db6 (I looked at native_get_debugreg)
mean. More than hour I was trying to google and search in the intel
pdf's I have. No lack. Nothing about db[0-6]! However, I discovered
that volume 3b system programming guide describes DR6 register (not db!)
and it has BS=14 bit (matches DR_STEP == (1 << 14)) which merely means:
"the debug exception was triggered by the single-step execution mode".
Heh. Finally I can understand this code, more or less.

> [... snip a lot of good-to-know details ...]

Thanks.

> hardware stores TF in the trap frame
> with the rest of the user-mode state, and clears TF before executing the
> first kernel-mode instruction. (Since we set the MSR right, that's what
> 'syscall' instructions do in hardware too.)

grep, grep, grep, grep... syscall_init()->wrmsrl(X86_EFLAGS_TF), right?
I didn't know.

OK. And now I spent a couple more hours. Because I decided to learn
how actually syscalls work on x86_64. I read the sysenter code for
i386 a long ago, and I found my understanding doesn't match the current
reality. grep * a_lot.

So, the user-space doesn't use vdso/vsyscall, it merely calls the
"syscall" insn.

But what about vdso and vsyscall? Looks like, they both do (almost)
the same, but using different the methods.

IIUC, we put the context of vsyscall_64.o into the gate_vma address
(VSYSCALL_START), and then the user-space can call vgetcpu() directly
if the application knows about the VSYSCALL_ADDR() magic.

But, otoh, we also have vdso. And this is what ldd shows as
linux-vdso.so.1. And it also has the "getcpu" helper, __vdso_getcpu().
But, unlike vgetcpu(), this function is "visible" to dynamic linker
and thus looks like a normal function.

Correct? But why does it have both vdso and vsyscall? I guess, one
of them (probably vsyscall) is more or less historical?

OK, after this grepping I see it was off-topic. And I do not even
remember my concerns which forced me to study this magic now.

> So the original meaning of TIF_SINGLESTEP was
> simply to turn TF back on before going to user mode
> ...
> Much later, the TIF_SINGLESTEP path to syscall_trace_enter was added.
> This is to notice that the user-mode TF
> was artificially cleared by do_debug, and restore it into
> task_pt_regs()->flags as early as possible.
> This is so that an examination of
> task_pt_regs() sees the "real" user-mode state, e.g. ptrace peeking at
> it during the syscall-entry tracing stop.

Aha. So, in short: we restore TF in syscall_trace_enter() to expose
this flag to debugger, right? This was my (vague) understanding, but
I wasn't sure.

> Back to prehistory. It's always been possible in the hardware for
> user-mode to set TF itself, i.e.:
>
> pushf
> orw $0x200, (%esp)
> popf

Quick question: is it connected to is_setting_trap_flag() ? IOW,
what is_setting_trap_flag() actually tells us? Looks like, it should
return true if the next insn changes the flags, correct?

> but TF is cleared before running
> the signal handler, so it doesn't recurse. This is why signal.c clears
> TF in the task_pt_regs() (same pointer as regs arg to handle_signal)
> when setting up a signal handler.

This is not clear to me. Why should we clear TF? The text above says
"before running the signal handler". But

- if it was set by us: TIF_SINGLESTEP should be true, and
we are not going to run the signal handler, we are going
to ptrace_notify(SIGTRAP).

- else: it was set by the app or debugger. Why should we
clear TF?

OK, probably we need this if the app sets TF for the
self-debugging and has a handerl for SIGTRAP...

If it was debugger, then I am not sure. OTOH, I do not
know why debugger may want to do set_flags(X86_EFLAGS_TF).

> Around the same time as the syscall-step "double-step" issue was
> noticed, someone noticed that PTRACE_SINGLESTEP,signr for a handled
> signal had a similar issue: it gives you a SIGTRAP after executing the
> first instruction of the signal handler.
> So instead we added a "fake
> trap" for PTRACE_SINGLESTEP

Yes, this is clear. Thanks.

> Not long after, I wanted to clean things up, and added TIF_FORCED_TF.

In short, TIF_FORCED_TF means: we believe we "own" X86_EFLAGS_TF.

> and also added the is_setting_trap_flag() code that is now in
> step.c--so single-stepping that "popf" clears TIF_FORCED_TF, and we
> won't later confuse the user-chosen "real" TF (or lack thereof) with one
> induced by user_enable_single_step().

OK, this probably answers my question about is_setting_trap_flag().
But I need to re-read the code and your explanation again, I am
starting to lose the concentration.

> (That detection is imperfect in
> various ways

Ah, good. I thought that my misunderstanding is the only problem.

> To complete the background, there is one more set of wrinkles. There
> are the various potential ptrace stops that take place "inside" some
> system call (execve, clone, fork, or vfork).

Oh, thanks. I didn't think about this. Again, will re-read your explanation
tomorrow. (Probably this doesn't matter in the context of this particular
bug, but I am not sure...)

> One final note about the current behavior, for user-mode setting TF
> itself. Because the trap in kernel mode (sysenter instruction) makes
> do_debug set TIF_SINGLESTEP, this triggers the syscall_trace_leave path.

Another thing I didn't think about.

> So a user that does sysenter with TF set will get his SIGTRAP
> immediately after the sysenter instruction. Conversely, a syscall entry
> with TF set via 'int $0x80' or via x86-64's 'syscall' just stores the
> user's full eflags into task_pt_regs()->flags and never notices whether
> it contains TF, never sets TIF_SINGLESTEP.

Heh.

> Hence, one of these entries
> (i.e. all syscalls from 64-bit user mode, and non-sysenter syscalls from
> 32-bit user mode on both 32-bit and 64-bit kernels) will "double-step"
> over the system call instruction and the one after it.

Confused... syscall insn doesn't lead to TIF_SINGLESTEP, so
syscall_trace_leave() should return to user-space, and then we should
have a single step after the next instruction? (assuming TIF_SYSCALL_TRACE
is not set).

> It may make sense to have signal.c just notice TIF_FORCED_TF and mask TF
> out of what it stores in the signal frame rather than actually clearing
> it in the registers beforehand.

Yes, the patch I sent you privately does exactly this.

> (Note you must change ia32_signal.c

OOPS, indeed.

> If you use a helper for that, perhaps just globalize and rename
> get_flags() from ptrace.c, since we have it there too.)

Well, this is minor, but get_flags() from ptrace.c takes the task_struct
pointer, while we already have pt_regs in setup_sigcontext... OK, I agree,
but then we need a good name for this helper. ptrace_get_task_flags?

> But it still
> needs to clear TF for the handler entry if TIF_SINGLESTEP is not set.

Aha. This partly answers my "Why should we clear TF" question above.
At least you agree that if TIF_SINGLESTEP is set, this is not needed.

I still can't fully understand why should we clear it otherwise. But,
in any case, I guess we should do this because we do not want the user
visible changes.

> And notice also restore_sigcontext (and ia32_restore_sigcontext), which
> also lets userland choose the TF value without regard to TIF_FORCED_TF.

Oh! thanks... Again, again, again, will reread tomorrow. But at first
glance, doesn't this mean another problem/patch?

> The change to the tracehook_signal_handler() call seems quite wrong to
> me.

Yes, yes, this is clear.

See you tomorrow. Thanks!

Oleg.

2010-06-16 20:53:18

by Roland McGrath

[permalink] [raw]
Subject: Re: Bug 16061 - single stepping in a signal handler can cause the single step flag to get "stuck"

> Roland, thanks a lot for this (long!) explanation. Another email from
> you which I should save in ~/DOCS. I don't know how many time you spent
> writing this message, but I bet I spent more trying to understand it ;)

That was my plan. ;-)

> Damn. Can't resists. And when I read to this point, I decided to
> learn what exactly DR_STEP and db6 (I looked at native_get_debugreg)
> mean. More than hour I was trying to google and search in the intel
> pdf's I have. No lack. Nothing about db[0-6]! However, I discovered
> that volume 3b system programming guide describes DR6 register (not db!)

Yes, the assembly syntax is %dbN, some comments and docs say DBn, and
others say DRn. They're called the "debug registers", abbreviate randomly.
Go team.

> and it has BS=14 bit (matches DR_STEP == (1 << 14)) which merely means:
> "the debug exception was triggered by the single-step execution mode".
> Heh. Finally I can understand this code, more or less.

Nobody understands it more than more or less. ;-)
To get more confused, try to figure out when the hardware (or kernel) does
or doesn't clear the DR_STEP bit (or others) in %db6. (Actually, don't.
You can find some old LKML thread about hw_breakpoint where we figured that
out, but my brain hurts just recalling that I once almost knew.)

> grep, grep, grep, grep... syscall_init()->wrmsrl(X86_EFLAGS_TF), right?

Right.

> OK. And now I spent a couple more hours. Because I decided to learn
> how actually syscalls work on x86_64. I read the sysenter code for
> i386 a long ago, and I found my understanding doesn't match the current
> reality. grep * a_lot.

Muwhahaha. Indeed, this is not directly relevant and I judiciously said
"the next instruction" for sysenter without mentioning the vDSO magic that
is actually where that PC always is. You don't really need to know about
that stuff, for this subject. But go ahead and learn about it, and then
you'll get stuck fixing it next time there is a problem. ;-)

> So, the user-space doesn't use vdso/vsyscall, it merely calls the
> "syscall" insn.

Right. x86-64 started out with a single, sensible and optimal mechanism
for native system calls. The only reason i386 has __kernel_vsyscall is
that there are multiple variants best for different hardware, and you
don't know which flavor you want until you boot and see the actual CPU.

> But what about vdso and vsyscall? Looks like, they both do (almost)
> the same, but using different the methods.

Yes, the non-DSO vsyscall page is an older hack and now we are stuck with
that as a permanent part of the x86-64 user/kernel ABI. The vDSO is the
"modern" method, and is treated similarly on i386 and several other
machines (e.g. powerpc has exported calls there too).

> OK, after this grepping I see it was off-topic. And I do not even
> remember my concerns which forced me to study this magic now.

Too late! Now we know that you know, and you will be the expert when I am
hit by the bus.

> Aha. So, in short: we restore TF in syscall_trace_enter() to expose
> this flag to debugger, right? This was my (vague) understanding, but
> I wasn't sure.

Right. It also exposes it correctly in struct sigcontext if user-mode had
set TF itself. This fix came long after most of the rest of the logic. So
some other existing code might be redundant or strange-looking now that we
have this.

> > Back to prehistory. It's always been possible in the hardware for
> > user-mode to set TF itself, i.e.:
> >
> > pushf
> > orw $0x200, (%esp)
> > popf
>
> Quick question: is it connected to is_setting_trap_flag() ? IOW,
> what is_setting_trap_flag() actually tells us? Looks like, it should
> return true if the next insn changes the flags, correct?

Correct. is_setting_trap_flag() returns true when the PC is pointing at
that "popf", for example.

> - if it was set by us: TIF_SINGLESTEP should be true, and
> we are not going to run the signal handler, we are going
> to ptrace_notify(SIGTRAP).

Right.

> - else: it was set by the app or debugger. Why should we
> clear TF?
>
> OK, probably we need this if the app sets TF for the
> self-debugging and has a handerl for SIGTRAP...

Correct. That's always been the only way to usefully set TF yourself in
user mode (otherwise you'd just get infinite SIGTRAPs). It corresponds to
how the hardware trap behavior works, e.g. if you were using signal
handlers to execute old DOS code in emulation and had a DOS debugger in
there.

> If it was debugger, then I am not sure. OTOH, I do not
> know why debugger may want to do set_flags(X86_EFLAGS_TF).

It's just the general pedanticism that the debugger should be able to set
up any state that user-mode code could have created itself.

> > Not long after, I wanted to clean things up, and added TIF_FORCED_TF.
>
> In short, TIF_FORCED_TF means: we believe we "own" X86_EFLAGS_TF.

Well, only sort of. It means we "own" it being on. If it's off, that
doesn't mean we're choosing to turn it off when the userland state is on,
for example. That might be a nice and clear thing for a flag like that to
mean. But there are so many ways that it can be implicitly cleared by
hardware and such that it would take more work to make it mean that.

> > (That detection is imperfect in various ways
>
> Ah, good. I thought that my misunderstanding is the only problem.

I could tell you some ways it could be wrong. But that's left as an
exercise for the reader, and you really just don't care.

> > To complete the background, there is one more set of wrinkles. There
> > are the various potential ptrace stops that take place "inside" some
> > system call (execve, clone, fork, or vfork).
>
> Oh, thanks. I didn't think about this. Again, will re-read your explanation
> tomorrow. (Probably this doesn't matter in the context of this particular
> bug, but I am not sure...)

I don't think it directly influences anything we're talking about changing.
But it is a nonobvious thing I wanted to put in the record.

> Confused... syscall insn doesn't lead to TIF_SINGLESTEP, so
> syscall_trace_leave() should return to user-space, and then we should
> have a single step after the next instruction? (assuming TIF_SYSCALL_TRACE
> is not set).

You're not confused! That's exactly how it works.
>From the userland perspective this is a "double-step":

0000000000400078 <_start>:
400078: 9c pushfq
400079: 66 81 0c 24 00 02 orw $0x200,(%rsp)
40007f: 9d popfq
400080: 0f 05 syscall
400082: f4 hlt

This doesn't get a SIGTRAP at 400082 as it would for any normal instruction
instead of "syscall". Instead it gets SIGSEGV because "hlt" is executed.

> > It may make sense to have signal.c just notice TIF_FORCED_TF and mask TF
> > out of what it stores in the signal frame rather than actually clearing
> > it in the registers beforehand.
>
> Yes, the patch I sent you privately does exactly this.

Right. I think that direction is where the right fixes lie.

> Well, this is minor, but get_flags() from ptrace.c takes the task_struct
> pointer, while we already have pt_regs in setup_sigcontext... OK, I agree,
> but then we need a good name for this helper. ptrace_get_task_flags?

I'd call it user_[gs]et_eflags, but names don't matter much. Feel free to
change the signatures to take the pt_regs pointer too. Half the ptrace.c
callers have the pointer on hand already too. And it wouldn't hurt to
change the signature of the local {get,put}reg{,32} calls to take the
argument and have their callers use task_pt_regs() only once, too.

> > But it still
> > needs to clear TF for the handler entry if TIF_SINGLESTEP is not set.
>
> Aha. This partly answers my "Why should we clear TF" question above.
> At least you agree that if TIF_SINGLESTEP is set, this is not needed.

Right.

> I still can't fully understand why should we clear it otherwise. But,
> in any case, I guess we should do this because we do not want the user
> visible changes.

It's the only way that user-mode being able to set TF on itself can ever
make any kind of sense.

> > And notice also restore_sigcontext (and ia32_restore_sigcontext), which
> > also lets userland choose the TF value without regard to TIF_FORCED_TF.
>
> Oh! thanks... Again, again, again, will reread tomorrow. But at first
> glance, doesn't this mean another problem/patch?

Yes, those should use user_set_eflags. With today's kernel I think you
could write another test case where a signal handler returns after setting
TF in the sigcontext, so that PTRACE_SINGLESTEP over the sigreturn syscall
breaks the untraced userland behavior by the next PTRACE_CONT clearing TF
when userland wants it set.


Thanks,
Roland