2003-02-16 19:05:49

by Russell King

Subject: Signal/gdb oddity in 2.5.61

Hi,

I'm seeing some weird behaviour with signal handling/gdb on 2.5.61:

[root@assabet /root]$cat /dev/zero > /dev/null &
[1] 132
[root@assabet /root]$gdb /bin/cat
GNU gdb 5.0
Copyright 2000 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "armv4l-rmk-linux"...(no debugging symbols found)...
(gdb) attach 132
Attaching to program: /bin/cat, Pid 132
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
0x20027a0 in _IO_putc ()
(gdb) stepi

Program received signal SIGSTOP, Stopped (signal).
0x20027a0 in _IO_putc ()
(gdb)
0x20027a4 in _IO_putc ()
(gdb)
0x4008d154 in putc () from /lib/libc.so.6
(gdb) quit

Notice the "Program received signal SIGSTOP".

Asking for the process list via <sysrq>t shows the following after
attaching gdb:

cat T C023CE94 3263624 132 135 (NOTLB)
[<c023cb98>] (schedule+0x0/0x3a0)
from [<c024d020>] (get_signal_to_deliver+0x1c0/0x3e4)
[<c024ce60>] (get_signal_to_deliver+0x0/0x3e4)
from [<c0226a00>] (do_signal+0x5c/0x13c)
[<c02269a4>] (do_signal+0x0/0x13c)
from [<c0226b10>] (do_notify_resume+0x30/0x34)
[<c0226ae0>] (do_notify_resume+0x0/0x34)
from [<c0222350>] (work_pending+0x1c/0x28)

and after the first stepi:

cat T C023CE94 3263624 132 135 (NOTLB)
[<c023cb98>] (schedule+0x0/0x3a0)
from [<c024cbb4>] (finish_stop+0xb0/0xc8)
[<c024cb04>] (finish_stop+0x0/0xc8)
from [<c024ce54>] (do_signal_stop+0x288/0x294)
[<c024cbcc>] (do_signal_stop+0x0/0x294)
from [<c024d184>] (get_signal_to_deliver+0x324/0x3e4)
[<c024ce60>] (get_signal_to_deliver+0x0/0x3e4)
from [<c0226a00>] (do_signal+0x5c/0x13c)
[<c02269a4>] (do_signal+0x0/0x13c)
from [<c0226b10>] (do_notify_resume+0x30/0x34)
[<c0226ae0>] (do_notify_resume+0x0/0x34)
from [<c0222350>] (work_pending+0x1c/0x28)

subsequent stepi's appear as per the first trace above.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html


2003-02-16 22:00:28

by Daniel Jacobowitz

Subject: Re: Signal/gdb oddity in 2.5.61

On Sun, Feb 16, 2003 at 07:15:43PM +0000, Russell King wrote:
> Hi,
>
> I'm seeing some weird behaviour with signal handling/gdb on 2.5.61:
>
> [root@assabet /root]$cat /dev/zero > /dev/null &
> [1] 132
> [root@assabet /root]$gdb /bin/cat
> GNU gdb 5.0
> Copyright 2000 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as "armv4l-rmk-linux"...(no debugging symbols found)...
> (gdb) attach 132
> Attaching to program: /bin/cat, Pid 132
> Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
> Loaded symbols for /lib/libc.so.6
> Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
> Loaded symbols for /lib/ld-linux.so.2
> 0x20027a0 in _IO_putc ()
> (gdb) stepi
>
> Program received signal SIGSTOP, Stopped (signal).
> 0x20027a0 in _IO_putc ()
> (gdb)
> 0x20027a4 in _IO_putc ()
> (gdb)
> 0x4008d154 in putc () from /lib/libc.so.6
> (gdb) quit
>
> Notice the "Program received signal SIGSTOP".
>
> Asking for the process list via <sysrq>t shows the following after
> attaching gdb:
>
> cat T C023CE94 3263624 132 135 (NOTLB)
> [<c023cb98>] (schedule+0x0/0x3a0)
> from [<c024d020>] (get_signal_to_deliver+0x1c0/0x3e4)
> [<c024ce60>] (get_signal_to_deliver+0x0/0x3e4)
> from [<c0226a00>] (do_signal+0x5c/0x13c)
> [<c02269a4>] (do_signal+0x0/0x13c)
> from [<c0226b10>] (do_notify_resume+0x30/0x34)
> [<c0226ae0>] (do_notify_resume+0x0/0x34)
> from [<c0222350>] (work_pending+0x1c/0x28)
>
> and after the first stepi:
>
> cat T C023CE94 3263624 132 135 (NOTLB)
> [<c023cb98>] (schedule+0x0/0x3a0)
> from [<c024cbb4>] (finish_stop+0xb0/0xc8)
> [<c024cb04>] (finish_stop+0x0/0xc8)
> from [<c024ce54>] (do_signal_stop+0x288/0x294)
> [<c024cbcc>] (do_signal_stop+0x0/0x294)
> from [<c024d184>] (get_signal_to_deliver+0x324/0x3e4)
> [<c024ce60>] (get_signal_to_deliver+0x0/0x3e4)
> from [<c0226a00>] (do_signal+0x5c/0x13c)
> [<c02269a4>] (do_signal+0x0/0x13c)
> from [<c0226b10>] (do_notify_resume+0x30/0x34)
> [<c0226ae0>] (do_notify_resume+0x0/0x34)
> from [<c0222350>] (work_pending+0x1c/0x28)
>
> subsequent stepi's appear as per the first trace above.


This is a consequence of ARM's separate get_signal_to_deliver.
Roland's changes for group stops require code in get_signal_to_deliver,
so if you aren't using the common version, you're out of luck.

I think you'll have to either update yours to match, or use the new
hooks David Miller added to use the common get_signal_to_deliver.

--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer

2003-02-16 22:05:00

by Russell King

Subject: Re: Signal/gdb oddity in 2.5.61

On Sun, Feb 16, 2003 at 05:10:23PM -0500, Daniel Jacobowitz wrote:
> This is a consequence of ARM's separate get_signal_to_deliver.
>
> Roland's changes for group stops require code in get_signal_to_deliver,
> so if you aren't using the common version, you're out of luck.
>
> I think you'll have to either update yours to match, or use the new
> hooks David Miller added to use the common get_signal_to_deliver.

This is using the common version in 2.5.61.

You might want to completely review your reply in light of this.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2003-02-16 22:11:10

by Daniel Jacobowitz

Subject: Re: Signal/gdb oddity in 2.5.61

On Sun, Feb 16, 2003 at 10:14:54PM +0000, Russell King wrote:
> On Sun, Feb 16, 2003 at 05:10:23PM -0500, Daniel Jacobowitz wrote:
> > This is a consequence of ARM's separate get_signal_to_deliver.
> >
> > Roland's changes for group stops require code in get_signal_to_deliver,
> > so if you aren't using the common version, you're out of luck.
> >
> > I think you'll have to either update yours to match, or use the new
> > hooks David Miller added to use the common get_signal_to_deliver.
>
> This is using the common version in 2.5.61.
>
> You might want to completely review your reply in light of this.

Just checking - do you mean "with a change to 2.5.61 for ARM to use the
common version"? The copy of 2.5.61 I'm staring at right now has:

include/asm-arm/signal.h:#define HAVE_ARCH_GET_SIGNAL_TO_DELIVER

--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer

2003-02-16 22:18:27

by Roland McGrath

Subject: Re: Signal/gdb oddity in 2.5.61

I call that a bug in gdb's "attach" code. If you use strace or
suchlike on gdb to see what it's doing, you'll see that it does
PTRACE_CONT and passes SIGSTOP. The old kernel treated SIGSTOP unlike
all other signals given to PTRACE_CONT, and just ignored it. I
changed this for two reasons.

One is consistency. Now all signals are treated alike by ptrace
tracing (except of course SIGKILL, which cannot be caught by ptrace).
Previously doing "signal SIGSTOP" in gdb would not behave as
advertised, but instead act like "cont".

Second is that it's the sane way for tracing SIGSTOP to behave in a
multithreaded program with the new flavor of threads. When something
generates a SIGSTOP (kill, tkill, or PTRACE_ATTACH), the thread in
question stops to report to ptrace as for all other signals (save
SIGKILL). When a stop signal with default action (e.g. SIGSTOP) is
passed to PTRACE_CONT, it performs the normal signal action, which is
to stop all the threads in the group. In cases other than SIGSTOP,
gdb doesn't know until it passes it along to the program whether it
will ignore it, handle it, or stop the process--so gdb shouldn't do
anything different. With this behavior, SIGSTOP can be treated like
the others rather than being a special case.

This is a slightly incompatible change in behavior, and I'm sorry I
forgot to mention it earlier. But I think it's an acceptable and
appropriate change. It is simple enough to fix gdb. (What it
currently does is pretty odd: it knows that the first SIGSTOP was
caused by its PTRACE_ATTACH and so doesn't report it to the user, but
still records it such that the next PTRACE_CONT or PTRACE_SINGLESTEP
it does uses SIGSTOP instead of 0.) A fixed gdb will work on older
kernels as it did before, because they treat SIGSTOP like 0 and the
fix will be to pass 0 instead of SIGSTOP when 0 is what gdb really
meant. The gdb maintainers are already aware of the issue.
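
The fix described above can be sketched in C. This is only an illustration, not gdb's actual code: the helper names resume_signal() and attach_and_step() are hypothetical, and PTRACE_SINGLESTEP is assumed to be available on the target architecture. The point is that the SIGSTOP reported after PTRACE_ATTACH is swallowed by the debugger (0 is passed on resume), while every other signal is forwarded unchanged.

```c
#include <stddef.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Hypothetical helper: pick the signal argument for the next
 * PTRACE_CONT / PTRACE_SINGLESTEP. The SIGSTOP that results from our
 * own PTRACE_ATTACH is not meant for the program, so pass 0 instead. */
int resume_signal(int stop_sig, int caused_by_attach)
{
    if (caused_by_attach && stop_sig == SIGSTOP)
        return 0;
    return stop_sig;
}

/* Hypothetical helper: attach to pid, wait for the attach stop, then
 * single-step once. Returns 0 on success, -1 on any error. */
int attach_and_step(pid_t pid)
{
    int status;
    int sig;

    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
        return -1;
    if (waitpid(pid, &status, 0) == -1 || !WIFSTOPPED(status))
        return -1;

    /* This stop was caused by the attach, so SIGSTOP becomes 0 here. */
    sig = resume_signal(WSTOPSIG(status), 1);
    if (ptrace(PTRACE_SINGLESTEP, pid, NULL, (void *)(long)sig) == -1)
        return -1;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return 0;
}
```

Because older kernels treat SIGSTOP like 0 in this position, passing 0 explicitly should behave the same on old and new kernels.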

An old gdb on a new kernel is affected as reported. But the only
failure mode is the minor annoyance of a spurious SIGSTOP report and
having to repeat the first cont or step/stepi command after an attach.
Users can work around this with "handle SIGSTOP nopass" (which has no
other practical effect, safe to put in your ~/.gdbinit).

This does break things like running an old gdb release's test suite
on a bleeding-edge kernel (only its "attach" test cases are affected).
But I think this is a worthwhile bit of suffering for progress.
There should be a fixed gdb available long before 2.5 is stable.


Thanks,
Roland

2003-02-16 22:18:31

by Russell King

Subject: Re: Signal/gdb oddity in 2.5.61

On Sun, Feb 16, 2003 at 05:21:04PM -0500, Daniel Jacobowitz wrote:
> On Sun, Feb 16, 2003 at 10:14:54PM +0000, Russell King wrote:
> > On Sun, Feb 16, 2003 at 05:10:23PM -0500, Daniel Jacobowitz wrote:
> > > This is a consequence of ARM's separate get_signal_to_deliver.
> > >
> > > Roland's changes for group stops require code in get_signal_to_deliver,
> > > so if you aren't using the common version, you're out of luck.
> > >
> > > I think you'll have to either update yours to match, or use the new
> > > hooks David Miller added to use the common get_signal_to_deliver.
> >
> > This is using the common version in 2.5.61.
> >
> > You might want to completely review your reply in light of this.
>
> Just checking - do you mean "with a change to 2.5.61 for ARM to use the
> common version"? The copy of 2.5.61 I'm staring at right now has:
>
> include/asm-arm/signal.h:#define HAVE_ARCH_GET_SIGNAL_TO_DELIVER

If Linus pulls my BK tree, then Linus will also have the code.
If you also look at the backtraces I provided, you will also notice
that the functions concerned *are* the generic ones in kernel/signal.c

This /is/ using the generic get_signal_to_deliver() from kernel/signal.c
in Linus' released 2.5.61 kernel tree.

If you don't believe me, please try to reproduce it in an x86 2.5.61
box and report the results.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2003-02-16 23:18:13

by Daniel Jacobowitz

Subject: Re: Signal/gdb oddity in 2.5.61

On Sun, Feb 16, 2003 at 02:28:09PM -0800, Roland McGrath wrote:
> This is a slightly incompatible change in behavior, and I'm sorry I
> forgot to mention it earlier. But I think it's an acceptable and
> appropriate change. It is simple enough to fix gdb. (What it
> currently does is pretty odd: it knows that the first SIGSTOP was
> caused by its PTRACE_ATTACH and so doesn't report it to the user, but
> still records it such that the next PTRACE_CONT or PTRACE_SINGLESTEP
> it does uses SIGSTOP instead of 0.) A fixed gdb will work on older
> kernels as it did before, because they treat SIGSTOP like 0 and the
> fix will be to pass 0 instead of SIGSTOP when 0 is what gdb really
> meant. The gdb maintainers are already aware of the issue.

(Meaning you told someone at Red Hat? It didn't come up on the GDB
list, where some of us GDB maintainers get our information...)

> An old gdb on a new kernel is affected as reported. But the only
> failure mode is the minor annoyance of a spurious SIGSTOP report and
> having to repeat the first cont or step/stepi command after an attach.
> Users can work around this with "handle SIGSTOP nopass" (which has no
> other practical effect, safe to put in your ~/.gdbinit).
>
> This does break things like running an old gdb release's test suite
> on a bleeding-edge kernel (only its "attach" test cases are affected).
> But I think this is a worthwhile bit of suffering for progress.
> There should be a fixed gdb available long before 2.5 is stable.

So that's why CVS gdb's had failures in attach.exp for some time. I
wish whoever you spoke to about it would go ahead and actually fix the
problem in GDB. It seems like a GDB bug. I think it will be a little
tricky to fix, but certainly not overly complex.

That said, I've still got two issues with your change. For one thing,
the version of GDB that Russell was running, you'll note, was 5.0. A
lot of people haven't upgraded GDB in years, and have some dispute with
the present version that means they don't want to upgrade. I've only
just stopped seeing people using 4.18. In conversation with Russell
I've already encountered another reason he doesn't want to upgrade.
And I'm also concerned that other programs may use it.

I've also got a conceptual issue with your change. Continuing a
process normally overrides a pending stop. Why shouldn't this be true
with ptrace too? It used to be - not in the POSIX sense, since we
wouldn't override things like SIGTSTP, but the point holds.

--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer

2003-02-16 23:38:49

by Linus Torvalds

Subject: Re: Signal/gdb oddity in 2.5.61


On Sun, 16 Feb 2003, Daniel Jacobowitz wrote:
>
> I've also got a conceptual issue with your change. Continuing a
> process normally overrides a pending stop. Why shouldn't this be true
> with ptrace too? It used to be - not in the POSIX sense, since we
> wouldn't override things like SIGTSTP, but the point holds.

I do agree. It seems that the old behaviour was more logical than the new
one is.

Linus

2003-02-17 00:50:51

by Roland McGrath

Subject: Re: Signal/gdb oddity in 2.5.61

> That said, I've still got two issues with your change. For one thing,
> the version of GDB that Russell was running, you'll note, was 5.0. A
> lot of people haven't upgraded GDB in years, and have some dispute with
> the present version that means they don't want to upgrade. I've only
> just stopped seeing people using 4.18. In conversation with Russell
> I've already encountered another reason he doesn't want to upgrade.

Anyone who wants to use an old gdb with a new kernel can use "handle
SIGSTOP nopass". Is that a real imposition? Anyway, aside from the test
suite, it only affects gdb users in a way that may confuse them for a few
seconds but doesn't prevent them from debugging normally.

> And I'm also concerned that other programs may use it.

Other programs may use PTRACE_CONT with SIGSTOP and expect it to act like
PTRACE_CONT with 0? It's certainly possible. But since the quirk with
SIGSTOP was so counterintuitive to begin with, it seems unlikely to me that
someone would have expected that behavior in particular. Some programs
like strace are written to treat all signals the same and pass them through
to PTRACE_CONT (actually PTRACE_SYSCALL); they will now cause an endless
stream of SIGSTOP stops until someone uses SIGCONT, instead of swallowing
the SIGSTOP--now they do for SIGSTOP what they've always done for SIGTSTP
et al.

> I've also got a conceptual issue with your change. Continuing a
> process normally overrides a pending stop. Why shouldn't this be true
> with ptrace too? It used to be - not in the POSIX sense, since we
> wouldn't override things like SIGTSTP, but the point holds.

I don't think that point holds at all. It was never consistent with the
SIGCONT sense of continuing, which does discard SIGTSTP et al as well and
moreover clears all pending stop signals regardless of which one you
continued in response to. I think it is specious to talk about a single
"continuing a process" conceptual action. There is SIGCONT, which has its
set of semantics. There is PTRACE_CONT/PTRACE_SYSCALL/PTRACE_SINGLESTEP,
which has its set of semantics (the three are the same for what we're
discussing here and I'll just say PTRACE_CONT to mean all of them).

PTRACE_CONT has never "overridden a pending stop". If someone has called
kill (or now tkill) with SIGSTOP or another stop signal since the task
stopped the first time (reported to ptrace), then one of those pending
stops happens right away and the others remain pending. SIGCONT clears all
pending stop signals, i.e. overrides a pending stop, and always has.
PTRACE_CONT has never done so.
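
The SIGCONT half of this can be observed from user space without ptrace. The following is a minimal sketch, assuming a POSIX system; the function name sigcont_discards_pending_stop() is made up for illustration. It blocks SIGTSTP so that the signal stays pending instead of stopping the process, then generates SIGCONT and checks whether the pending stop signal is still there.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if a blocked, pending SIGTSTP was
 * discarded when SIGCONT was generated, 0 if it survived, and -1 if
 * the signal mask could not be changed. */
int sigcont_discards_pending_stop(void)
{
    sigset_t block, old, pending;
    int before, after;

    /* Block SIGTSTP so it is queued rather than acted upon. */
    sigemptyset(&block);
    sigaddset(&block, SIGTSTP);
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;

    kill(getpid(), SIGTSTP);
    sigpending(&pending);
    before = sigismember(&pending, SIGTSTP);

    /* Generating SIGCONT discards pending stop signals. */
    kill(getpid(), SIGCONT);
    sigpending(&pending);
    after = sigismember(&pending, SIGTSTP);

    /* Restore the original mask; nothing should be pending by now. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    return before == 1 && after == 0;
}
```

Note that the discard happens when SIGCONT is generated, not when it is delivered, which is the "generation-time semantics" referred to elsewhere in this thread.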

If you want to play word games, I'd say that the stop for the ptrace report
has already happened and is not pending, and the stop for the SIGSTOP (or
other stop signal with default action) given as argument to PTRACE_CONT has
not happened until the ptrace call requested it, and so is not pending.

As well as not matching the facts, I don't think it's even desirable to
have a single notion of "continue a process". SIGCONT has pretty
strange and hairy semantics mandated by POSIX (e.g. it now includes
resuming all threads, and all the resumption semantics are magic
generation-time semantics unlike most signal effects). The purpose of
ptrace is to trace exactly what's going on, so it should interact with
other semantics as little as possible. By the nature of the facility,
it perturbs the process by stopping it. But you don't want it to
involve all sorts of other semantics implicitly--then you are perturbing
more than you have to. PTRACE_CONT resumes a single thread (unlike
SIGCONT), and the useful model for it is to pick up exactly where things
left off when the thread stopped except as explicitly changed by the
ptrace calls (including the PTRACE_CONT argument). If the PTRACE_CONT
argument matches the signal that caused the ptrace stop, then it should
do what it would have done in the absence of ptrace (which for a stop
signal is to stop the whole process, aka thread group). If you want a
change in what it was doing, such as to continue regardless of a signal
saying to stop, then you can indicate that explicitly by giving
PTRACE_CONT a different argument (i.e. 0).
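
The difference between the old and new behaviour can be summarised as a small decision table in C. This is a simplified model built from the descriptions in this thread, not kernel code: the enum and the function ptrace_cont_outcome() are hypothetical, and the model only covers signals left at their default disposition.

```c
#include <signal.h>

/* Hypothetical model of what happens to the tracee after PTRACE_CONT. */
enum cont_outcome {
    CONT_RUNS,        /* tracee simply resumes */
    CONT_GROUP_STOP,  /* default stop action: whole thread group stops */
    CONT_DELIVERS     /* signal is delivered like any other */
};

/* sig is the PTRACE_CONT argument; old_kernel selects the behaviour
 * from before the change discussed here. */
enum cont_outcome ptrace_cont_outcome(int sig, int old_kernel)
{
    if (sig == 0)
        return CONT_RUNS;
    if (sig == SIGSTOP)
        /* The old kernel special-cased SIGSTOP and treated it like 0. */
        return old_kernel ? CONT_RUNS : CONT_GROUP_STOP;
    if (sig == SIGTSTP || sig == SIGTTIN || sig == SIGTTOU)
        /* Other stop signals behaved this way before and after. */
        return CONT_GROUP_STOP;
    return CONT_DELIVERS;
}
```

In this model the only cell that differs between kernels is SIGSTOP, which is the special case under discussion.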

I don't have any strenuous objection to reintroducing the special case for
SIGSTOP. Not changing exactly what was seen before just for the sake of
precise compatibility is always a reasonable argument. But I really cannot
see any defensible claim that the old semantics were somehow more
consistent, comprehensible, or useful than the new ones.

Perhaps the greatest net happiness would be if we call this a "bug
compatibility mode" feature that is disabled by setting PTRACE_O_TRACEGOOD
(an option Dan and I have privately discussed adding to regularize the
ptrace event reporting interface). That way nothing changes for old
programs, and programs using new features Dan and I are working on get the
cleaner and consistent behavior for all signals.


Thanks,
Roland

2003-02-17 02:30:38

by Jeff Dike

Subject: Re: Signal/gdb oddity in 2.5.61

[email protected] said:
> Anyone who wants to use an old gdb with a new kernel can use "handle
> SIGSTOP nopass". Is that a real imposition? Anyway, aside from the
> test suite, it only affects gdb users in a way that may confuse them
> for a few seconds but doesn't prevent them from debugging normally.

It may also affect UML, since it has come to know exactly what to expect
from a ptraced process. So, when you have the semantics nailed down and
implemented, can you see if UML still runs?

Not that it's a showstopper if it doesn't, but I'd like to know so I can
fiddle UML so that it continues to run.

Jeff

2003-02-17 02:52:43

by Daniel Jacobowitz

Subject: Re: Signal/gdb oddity in 2.5.61

On Sun, Feb 16, 2003 at 05:00:36PM -0800, Roland McGrath wrote:
> > That said, I've still got two issues with your change. For one thing,
> > the version of GDB that Russell was running, you'll note, was 5.0. A
> > lot of people haven't upgraded GDB in years, and have some dispute with
> > the present version that means they don't want to upgrade. I've only
> > just stopped seeing people using 4.18. In conversation with Russell
> > I've already encountered another reason he doesn't want to upgrade.
>
> Anyone who wants to use an old gdb with a new kernel can use "handle
> SIGSTOP nopass". Is that a real imposition? Anyway, aside from the test
> suite, it only affects gdb users in a way that may confuse them for a few
> seconds but doesn't prevent them from debugging normally.
>
> > And I'm also concerned that other programs may use it.
>
> Other programs may use PTRACE_CONT with SIGSTOP and expect it to act like
> PTRACE_CONT with 0? It's certainly possible. But since the quirk with
> SIGSTOP was so counterintuitive to begin with, it seems unlikely to me that
> someone would have expected that behavior in particular. Some programs
> like strace are written to treat all signals the same and pass them through
> to PTRACE_CONT (actually PTRACE_SYSCALL); they will now cause an endless
> stream of SIGSTOP stops until someone uses SIGCONT, instead of swallowing
> the SIGSTOP--now they do for SIGSTOP what they've always done for SIGTSTP
> et al.

I think I'm convinced. Sorry for wasting your time. If it comes up we
can put it on a GDB FAQ somewhere.

--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer