LinuxLists.cc - Re: [RFC PATCH] set TASK_TRACED before arch

2008-06-03 06:04:54

Subject: Re: [RFC PATCH] set TASK_TRACED before arch_ptrace code to fix a race

On Tue, May 27, 2008 at 2:25 PM, Petr Tesarik <[email protected]> wrote:
> On Mon, 2008-05-26 at 23:12 -0700, Roland McGrath wrote:
>> > [<a00000010000a9f0>] skip_rbs_switch+0xe0/0x110
>> > sp=e000000141c9fe30 bsp=e000000141c90cf8
>> > [<a000000000010740>] __kernel_syscall_via_break+0x0/0x20
>> > sp=e000000141ca0000 bsp=e000000141c90cf8
>
> Indeed, there seems to be a large hole here. So, this is either a bug in
> the unwinder, or a bug in the RBS synchronization, which causes
> corruption. My test machine currently needs some work to run 2.6.25
> again, but I'll try your test case as soon as I re-install it later this
> week.

Just want to check if the test case works for you?

2008-06-03 09:02:44

by Petr Tesařík

[permalink] [raw]

Subject: Re: [RFC PATCH] set TASK_TRACED before arch_ptrace code to fix a race

Luming Yu wrote:
> On Tue, May 27, 2008 at 2:25 PM, Petr Tesarik <[email protected]> wrote:
>> On Mon, 2008-05-26 at 23:12 -0700, Roland McGrath wrote:
>>>> [<a00000010000a9f0>] skip_rbs_switch+0xe0/0x110
>>>> sp=e000000141c9fe30 bsp=e000000141c90cf8
>>>> [<a000000000010740>] __kernel_syscall_via_break+0x0/0x20
>>>> sp=e000000141ca0000 bsp=e000000141c90cf8
>> Indeed, there seems to be a large hole here. So, this is either a bug in
>> the unwinder, or a bug in the RBS synchronization, which causes
>> corruption. My test machine currently needs some work to run 2.6.25
>> again, but I'll try your test case as soon as I re-install it later this
>> week.
>
> Just want to check if the test case works for you?

Yes, the test case hangs here too. But the problem seems to be
elsewhere. Did you look into the strace output? This line is pretty
suspicious:

3258 clone2(child_stack=0, stack_size=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x200000000004e290) = 1

Obviously, strace cannot attach PID 1, and since it is not designed to
handle this situation, it hangs. I'm going to investigate why the return
value of the clone2 syscall is seen as 1 by the tracer. Might even turn
out to be a bug in strace...

Petr Tesarik

2008-06-03 14:34:22

by Petr Tesařík

[permalink] [raw]

Subject: Re: [RFC PATCH] set TASK_TRACED before arch_ptrace code to fix a race

Petr Tesarik wrote:
> Luming Yu wrote:
>> On Tue, May 27, 2008 at 2:25 PM, Petr Tesarik <[email protected]> wrote:
>>> On Mon, 2008-05-26 at 23:12 -0700, Roland McGrath wrote:
>>>>> [<a00000010000a9f0>] skip_rbs_switch+0xe0/0x110
>>>>> sp=e000000141c9fe30 bsp=e000000141c90cf8
>>>>> [<a000000000010740>] __kernel_syscall_via_break+0x0/0x20
>>>>> sp=e000000141ca0000 bsp=e000000141c90cf8
>>> Indeed, there seems to be a large hole here. So, this is either a bug in
>>> the unwinder, or a bug in the RBS synchronization, which causes
>>> corruption. My test machine currently needs some work to run 2.6.25
>>> again, but I'll try your test case as soon as I re-install it later this
>>> week.
>> Just want to check if the test case works for you?
>
> Yes, the test case hangs here too. But the problem seems to be
> elsewhere. Did you look into the strace output? This line is pretty
> suspicious:
>
> 3258 clone2(child_stack=0, stack_size=0,
> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x200000000004e290) = 1
>
> Obviously, strace cannot attach PID 1, and since it is not designed to
> handle this situation, it hangs. I'm going to investigate why the return
> value of the clone2 syscall is seen as 1 by the tracer. Might even turn
> out to be a bug in strace...

It's definitely a bug in strace. For some reason (I don't care about)
the execve() syscall produces an extra notification. However, this
notification message is suppressed when SIGTRAP is blocked. This
explains why the test case fails only when SIGTRAP is blocked.

Now, you may ask why it only fails on ia64 and not on i386 or x86_64.
Well, I was so good that I even looked into strace sources to make sure.
Whereas for i386 and x86_64, the value of EAX/RAX is checked for -ENOSYS
in syscall_fixup(), for ia64 the first ptrace() after an execve() is
unconditionally ignored, see code in get_scno().

I don't know why Luming's fix helps here, but, please, fix strace, don't
introduce weird behaviour in the kernel.

The only thing I'm willing to talk about is why the extra notification
message is sent, and how userspace (strace) is supposed to recognize it.
FWIW the backtrace (system tap was at __group_send_sig_info):

0xa0000001000b1a60 : __group_send_sig_info+0x0/0x180 []
0xa0000001000b1e30 : do_notify_parent_cldstop+0x250/0x2c0 []
0xa0000001000b2230 : ptrace_stop+0x2b0/0x3c0 []
0xa0000001000b5200 : get_signal_to_deliver+0x200/0xa40 []
0xa000000100035920 : ia64_do_signal+0xa0/0xee0 []
0xa000000100014b60 : do_notify_resume_user+0x100/0x160 []
0xa00000010000d040 : notify_resume_user+0x40/0x60 []
0xa00000010000cf40 : skip_rbs_switch+0xf0/0x150 []
0xa000000000010620 : __kernel_syscall_via_break+0x0/0x20 []

Regards,
Petr Tesarik

2008-06-03 21:01:43

by Roland McGrath

[permalink] [raw]

Subject: Re: [RFC PATCH] set TASK_TRACED before arch_ptrace code to fix a race

> It's definitely a bug in strace. For some reason (I don't care about)
> the execve() syscall produces an extra notification. However, this
> notification message is suppressed when SIGTRAP is blocked. This
> explains why the test case fails only when SIGTRAP is blocked.

What do you mean by "extra"? There is a SIGTRAP sent after execve
completes when ptraced, even when PTRACE_SYSCALL is not being used.
So for an execve that succeeds under PTRACE_SYSCALL, there is a
ptrace_notify at syscall entry, then a SIGTRAP queued (i.e., not seen
by ptrace if blocked), then a ptrace_notify at syscall exit. If
that's what's happening (including the blocked SIGTRAP not being seen
by the ptracer, i.e. strace), then there is no mystery (and no bug).

Thanks,
Roland

2008-06-03 21:32:21

by Luck, Tony

[permalink] [raw]

Subject: RE: [RFC PATCH] set TASK_TRACED before arch_ptrace code to fix a race

> What do you mean by "extra"? There is a SIGTRAP sent after execve
> completes when ptraced, even when PTRACE_SYSCALL is not being used.
> So for an execve that succeeds under PTRACE_SYSCALL, there is a
> ptrace_notify at syscall entry, then a SIGTRAP queued (i.e., not seen
> by ptrace if blocked), then a ptrace_notify at syscall exit. If
> that's what's happening (including the blocked SIGTRAP not being seen
> by the ptracer, i.e. strace), then there is no mystery (and no bug).

This might not be the same bug ... but I do have a definite 100%
reproducible bug (latest git kernel, old version of strace (4.5.15-1.el4.1))

Run:

$ strace -o logit -f make

in any directory where make is actually going to have to do some
work. You'll see that the command hangs after make outputs the
first action that it will take. Looking at the stack traces of
the 3 processes involved it seems that make forked, the child
stopped in ptrace waiting for some action from strace, but strace
isn't woken from its sleep in wait().

Backtrace of pid 6442 (strace)

Call Trace:
[<a0000001007069b0>] schedule+0x11f0/0x1380
sp=e0000001b28cfdb0 bsp=e0000001b28c0e00
[<a0000001000842d0>] do_wait+0x1110/0x1520
sp=e0000001b28cfdd0 bsp=e0000001b28c0d58
[<a0000001000849c0>] sys_wait4+0x140/0x1a0
sp=e0000001b28cfe30 bsp=e0000001b28c0cd8
[<a00000010000aa60>] ia64_ret_from_syscall+0x0/0x20
sp=e0000001b28cfe30 bsp=e0000001b28c0cd8
[<a000000000010740>] __kernel_syscall_via_break+0x0/0x20
sp=e0000001b28d0000 bsp=e0000001b28c0cd8

Backtrace of pid 6443 (make)

Call Trace:
[<a0000001007069b0>] schedule+0x11f0/0x1380
sp=e0000001b768fb40 bsp=e0000001b7680d58
[<a000000100707800>] schedule_timeout+0x40/0x180
sp=e0000001b768fb60 bsp=e0000001b7680d28
[<a000000100706d60>] wait_for_common+0x220/0x380
sp=e0000001b768fb90 bsp=e0000001b7680cd8
[<a000000100706f00>] wait_for_completion+0x40/0x60
sp=e0000001b768fbf0 bsp=e0000001b7680cb8
[<a0000001000794d0>] do_fork+0x430/0x4a0
sp=e0000001b768fbf0 bsp=e0000001b7680c60
[<a00000010000a340>] sys_clone+0x60/0x80
sp=e0000001b768fc20 bsp=e0000001b7680c10
[<a00000010000a990>] ia64_trace_syscall+0xd0/0x110
sp=e0000001b768fe30 bsp=e0000001b7680c10
[<a000000000010740>] __kernel_syscall_via_break+0x0/0x20
sp=e0000001b7690000 bsp=e0000001b7680c10

Backtrace of pid 6444 (make)

Call Trace:
[<a0000001007069b0>] schedule+0x11f0/0x1380
sp=e0000001b803fd60 bsp=e0000001b8030dd8
[<a000000100097590>] ptrace_stop+0x2d0/0x380
sp=e0000001b803fd80 bsp=e0000001b8030da0
[<a000000100097c90>] get_signal_to_deliver+0x1d0/0x6a0
sp=e0000001b803fd80 bsp=e0000001b8030d38
[<a000000100034a10>] ia64_do_signal+0xb0/0xd00
sp=e0000001b803fd80 bsp=e0000001b8030c90
[<a000000100012c60>] do_notify_resume_user+0x100/0x180
sp=e0000001b803fe20 bsp=e0000001b8030c60
[<a00000010000b0c0>] notify_resume_user+0x40/0x60
sp=e0000001b803fe20 bsp=e0000001b8030c10
[<a00000010000aff0>] skip_rbs_switch+0xe0/0x110
sp=e0000001b803fe30 bsp=e0000001b8030c10
[<a000000000010740>] __kernel_syscall_via_break+0x0/0x20
sp=e0000001b8040000 bsp=e0000001b8030c10

2008-06-03 22:14:21

by Roland McGrath

[permalink] [raw]

Subject: RE: [RFC PATCH] set TASK_TRACED before arch_ptrace code to fix a race

> This might not be the same bug ... but I do have a definite 100%
> reproducible bug (latest git kernel, old version of strace (4.5.15-1.el4.1))

Please start a thread with a sensical subject line about that.

> Backtrace of pid 6443 (make)
>
> Call Trace:
> [<a0000001007069b0>] schedule+0x11f0/0x1380
> sp=e0000001b768fb40 bsp=e0000001b7680d58
> [<a000000100707800>] schedule_timeout+0x40/0x180
> sp=e0000001b768fb60 bsp=e0000001b7680d28
> [<a000000100706d60>] wait_for_common+0x220/0x380
> sp=e0000001b768fb90 bsp=e0000001b7680cd8
> [<a000000100706f00>] wait_for_completion+0x40/0x60
> sp=e0000001b768fbf0 bsp=e0000001b7680cb8
> [<a0000001000794d0>] do_fork+0x430/0x4a0
> sp=e0000001b768fbf0 bsp=e0000001b7680c60
> [<a00000010000a340>] sys_clone+0x60/0x80
> sp=e0000001b768fc20 bsp=e0000001b7680c10
> [<a00000010000a990>] ia64_trace_syscall+0xd0/0x110
> sp=e0000001b768fe30 bsp=e0000001b7680c10
> [<a000000000010740>] __kernel_syscall_via_break+0x0/0x20
> sp=e0000001b7690000 bsp=e0000001b7680c10

This trace (do_fork->wait_for_completion) tells us this is a vfork call.
It is waiting for its child (presumably 6444) to exit or exec.

> Backtrace of pid 6444 (make)
>
> Call Trace:
> [<a0000001007069b0>] schedule+0x11f0/0x1380
> sp=e0000001b803fd60 bsp=e0000001b8030dd8
> [<a000000100097590>] ptrace_stop+0x2d0/0x380
> sp=e0000001b803fd80 bsp=e0000001b8030da0
> [<a000000100097c90>] get_signal_to_deliver+0x1d0/0x6a0
> sp=e0000001b803fd80 bsp=e0000001b8030d38
> [<a000000100034a10>] ia64_do_signal+0xb0/0xd00
> sp=e0000001b803fd80 bsp=e0000001b8030c90
> [<a000000100012c60>] do_notify_resume_user+0x100/0x180
> sp=e0000001b803fe20 bsp=e0000001b8030c60
> [<a00000010000b0c0>] notify_resume_user+0x40/0x60
> sp=e0000001b803fe20 bsp=e0000001b8030c10
> [<a00000010000aff0>] skip_rbs_switch+0xe0/0x110
> sp=e0000001b803fe30 bsp=e0000001b8030c10
> [<a000000000010740>] __kernel_syscall_via_break+0x0/0x20
> sp=e0000001b8040000 bsp=e0000001b8030c10

This is the normal trace for the child having received a signal and stopped
to tell ptrace about it (not a syscall tracing stop).

I think you need to look into what strace is doing. There is far too much
going to know much of anything just from the kernel state where the
processes sit. In particular, the sequence of ptrace and wait calls strace
made. If the same strace (identical everything) behaved differently with
an older kernel, then compare the sequence of ptrace and wait calls and see
where it differs.

Thanks,
Roland

2008-06-04 02:16:46

by Luming Yu

[permalink] [raw]

Subject: Re: [RFC PATCH] set TASK_TRACED before arch_ptrace code to fix a race

> It's definitely a bug in strace. For some reason (I don't care about)
> the execve() syscall produces an extra notification. However, this
> notification message is suppressed when SIGTRAP is blocked. This
> explains why the test case fails only when SIGTRAP is blocked.

This is exact problem I suspected and I was trying to address in my hack..
Since there are several processes involved in the pretty complex
ptrace scenario.,
I need to capture all processes context with kdump to confirm this is
exact root-cause
for the problem. But kdump doesn't work for me..I'm trying to solve it now..

I'm also in doubt about the semantic correctness of the test case..
Since SIGTRAP is so necessary to get ptrace work, is it legitimate to
block it in test case?

One more thing I need to say is:
Same strace works for utrace enabled kernel on IA64.. If the bug is in
strace, how could it happen?

2008-06-04 09:17:31

by Petr Tesařík

[permalink] [raw]

Subject: Re: [RFC PATCH] set TASK_TRACED before arch_ptrace code to fix a race

Luming Yu wrote:
>> It's definitely a bug in strace. For some reason (I don't care about)
>> the execve() syscall produces an extra notification. However, this
>> notification message is suppressed when SIGTRAP is blocked. This
>> explains why the test case fails only when SIGTRAP is blocked.
>
> This is exact problem I suspected and I was trying to address in my hack..
> Since there are several processes involved in the pretty complex
> ptrace scenario.,
> I need to capture all processes context with kdump to confirm this is
> exact root-cause
> for the problem. But kdump doesn't work for me..I'm trying to solve it now..
>
> I'm also in doubt about the semantic correctness of the test case..
> Since SIGTRAP is so necessary to get ptrace work, is it legitimate to
> block it in test case?
>
> One more thing I need to say is:
> Same strace works for utrace enabled kernel on IA64.. If the bug is in
> strace, how could it happen?

No idea, but send me the strace.log file from running

strace -o strace.log strace -f -o log.txt ./test1

and I may be able to tell.

Petr Tesarik

2008-06-05 01:49:54

by Luming Yu

[permalink] [raw]

Subject: Re: [RFC PATCH] set TASK_TRACED before arch_ptrace code to fix a race

On Wed, Jun 4, 2008 at 5:16 PM, Petr Tesarik <[email protected]> wrote:
> Luming Yu wrote:
>>> It's definitely a bug in strace. For some reason (I don't care about)
>>> the execve() syscall produces an extra notification. However, this
>>> notification message is suppressed when SIGTRAP is blocked. This
>>> explains why the test case fails only when SIGTRAP is blocked.
>>
>> This is exact problem I suspected and I was trying to address in my hack..
>> Since there are several processes involved in the pretty complex
>> ptrace scenario.,
>> I need to capture all processes context with kdump to confirm this is
>> exact root-cause
>> for the problem. But kdump doesn't work for me..I'm trying to solve it now..
>>
>> I'm also in doubt about the semantic correctness of the test case..
>> Since SIGTRAP is so necessary to get ptrace work, is it legitimate to
>> block it in test case?
>>
>> One more thing I need to say is:
>> Same strace works for utrace enabled kernel on IA64.. If the bug is in
>> strace, how could it happen?
>
> No idea, but send me the strace.log file from running
>
> strace -o strace.log strace -f -o log.txt ./test1
>
> and I may be able to tell.

Please check the attachment!

>
> Petr Tesarik
>

Attachments:

(No filename) (1.20 kB)
strace.log (15.71 kB)
Download all attachments

2008-06-05 11:17:10

by Petr Tesařík

[permalink] [raw]

Subject: Re: [RFC PATCH] set TASK_TRACED before arch_ptrace code to fix a race

On Thu, 2008-06-05 at 09:49 +0800, Luming Yu wrote:
> On Wed, Jun 4, 2008 at 5:16 PM, Petr Tesarik <[email protected]> wrote:
> > Luming Yu wrote:
> >>> It's definitely a bug in strace. For some reason (I don't care about)
> >>> the execve() syscall produces an extra notification. However, this
> >>> notification message is suppressed when SIGTRAP is blocked. This
> >>> explains why the test case fails only when SIGTRAP is blocked.
> >>
> >> This is exact problem I suspected and I was trying to address in my hack..
> >> Since there are several processes involved in the pretty complex
> >> ptrace scenario.,
> >> I need to capture all processes context with kdump to confirm this is
> >> exact root-cause
> >> for the problem. But kdump doesn't work for me..I'm trying to solve it now..
> >>
> >> I'm also in doubt about the semantic correctness of the test case..
> >> Since SIGTRAP is so necessary to get ptrace work, is it legitimate to
> >> block it in test case?
> >>
> >> One more thing I need to say is:
> >> Same strace works for utrace enabled kernel on IA64.. If the bug is in
> >> strace, how could it happen?
> >
> > No idea, but send me the strace.log file from running
> >
> > strace -o strace.log strace -f -o log.txt ./test1
> >
> > and I may be able to tell.
>
> Please check the attachment!
>
> >
> > Petr Tesarik
> >

Hm, I think without utrace, it gets out-of-sync once, so syscall entries
and exits are swapped from that point on. With utrace, it gets
out-of-sync _twice_, so it eventually looks fine. But the strace output
definitely looks incorrect even with utrace:

5718 execve("./test2.sh", [], [/* 23 vars */]) = 1
5718 execve("", [0x840c001000100003, 0x26230c14203032, 0x8cb0008800140a81, 0xa643100801808402, 0x2400905000040088, 0x11600a0072000001, 0xad814a00402e0, 0x2200012464009344, 0x1180418512c40026, 0x400003081880008, 0x2100010840910404, 0x8045120000800003, 0x6400000c0000600, 0xc20063440501400, 0x1048015002008081, 0xe02226005008c010, ...], [/* 0 vars */]) = 1
5718 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)

Note that strace missed a brk() syscall, although I can actually see
this in the other trace you sent me:

wait4(-1, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL, NULL) = 5704
ptrace(PTRACE_PEEKUSER, 5704, psr, NULL) = 4398046511120
ptrace(PTRACE_PEEKUSER, 5704, r15, NULL) = 1060
ptrace(PTRACE_SYSCALL, 5704, 0x1, SIG_0) = 0

Look at the value of r15, and compare it with unistd.h:
#define __NR_brk 1060

I _guess_ this is caused by the fact that test2.sh is a shell script, so
the kernel executes the shell, and maybe utrace produces a second execve
notifications in this case? Roland, can you shed some light?

Petr Tesarik

2008-06-10 08:23:45

by Luming Yu

[permalink] [raw]

Subject: Re: [RFC PATCH] set TASK_TRACED before arch_ptrace code to fix a race

On Wed, Jun 4, 2008 at 5:31 AM, Luck, Tony <[email protected]> wrote:
>> What do you mean by "extra"? There is a SIGTRAP sent after execve
>> completes when ptraced, even when PTRACE_SYSCALL is not being used.
>> So for an execve that succeeds under PTRACE_SYSCALL, there is a
>> ptrace_notify at syscall entry, then a SIGTRAP queued (i.e., not seen
>> by ptrace if blocked), then a ptrace_notify at syscall exit. If
>> that's what's happening (including the blocked SIGTRAP not being seen
>> by the ptracer, i.e. strace), then there is no mystery (and no bug).
>
> This might not be the same bug ... but I do have a definite 100%
> reproducible bug (latest git kernel, old version of strace (4.5.15-1.el4.1))
>
> Run:
>
> $ strace -o logit -f make

For this test case, utrace doesn't work too..

2008-06-27 20:51:08

by Roland McGrath

[permalink] [raw]

Subject: Re: [RFC PATCH] set TASK_TRACED before arch_ptrace code to fix a race

> I _guess_ this is caused by the fact that test2.sh is a shell script, so
> the kernel executes the shell, and maybe utrace produces a second execve
> notifications in this case? Roland, can you shed some light?

Not really. The utrace kernels Luming is trying are intended to match the
vanilla ptrace behavior. I don't think it's very useful to worry about the
difference between some utrace kernel and the current vanilla kernel.
Let's just look at what the current vanilla kernel is doing and compare
that to what an older vanilla kernel did if older versions produced
different results for the test case.

Thanks,
Roland