2008-06-02 01:37:55

by Joe Peterson

Subject: Re: 2.6.25.3: su gets stuck for root

Hi Harald,

I just also discovered this problem independently, and when I tracked it
down to stty and googled for it, I found your post. In my test case, it
seems to get stuck in stty as run from the user's .bashrc (i.e., "su
user", where the user's .bashrc has the stty command). In my case, the
arguments to stty do not seem to matter (well, I've tried "-ixany" and
"echoctl" - same results). Also, the problem is made more reliable if a
sleep is done before the stty. E.g., here's my test .bashrc:

sleep 2
stty -ixany

Note that if run from the console or a tty, having the user logged in
already seems to avoid the hang, but doing it within an xterm shows the
hang. Strange, since with my original [more complex] test case, it
seemed to require *not* running X (tty/console only).

Most recent kernels show the issue - the only one that doesn't is
2.6.25-git17. I am running Gentoo. It does happen in a recent 2.6.26
git (an rc4 git from a couple of days ago).

Doing "ps" while hung shows stty in the "T" state. "killall -9 stty"
releases it.
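
For anyone who wants to check the process state without ps: the same state letter is exposed in /proc/<pid>/stat. The following is a minimal sketch, assuming Linux procfs; the function names stat_line_state and process_state are made up for illustration and are not part of any tool mentioned in this thread.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Extract the state letter from one line of /proc/<pid>/stat.
 * The line looks like "pid (comm) state ...". Since comm may itself
 * contain spaces or parentheses, search for the LAST ')' in the line.
 * Returns the state character, or '?' if the line cannot be parsed. */
char stat_line_state(const char *line)
{
    const char *close = strrchr(line, ')');

    if (close == NULL || close[1] != ' ' || close[2] == '\0')
        return '?';
    return close[2];
}

/* Read /proc/<pid>/stat and return the state letter ('?' on any error). */
char process_state(pid_t pid)
{
    char path[64];
    char line[512];
    FILE *f;

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (f == NULL)
        return '?';
    if (fgets(line, sizeof line, f) == NULL) {
        fclose(f);
        return '?';
    }
    fclose(f);
    return stat_line_state(line);
}
```

Calling process_state() with the pid of the hung stty should return 'T' if it is stopped, which is what ps reports above.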

-Joe

P.S. Please cc my address on reply.


2008-06-02 05:12:28

by Harald Dunkel

Subject: Re: 2.6.25.3: su gets stuck for root

Hi Joe,

Joe Peterson wrote:
> Hi Harold,
>
> Doing "ps" while hung shows stty in the "T" state. "killall -9 stty"
> releases it.
>

Does strace give you the same output if you attach it to the blocking
stty (strace -p $pid)?

I got


:
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo ...}) = ? ERESTARTSYS (To be restarted)
--- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
--- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
:


Regards

Harri

2008-06-02 05:33:25

by Willy Tarreau

Subject: Re: 2.6.25.3: su gets stuck for root

On Mon, Jun 02, 2008 at 07:12:06AM +0200, Harald Dunkel wrote:
> Hi Joe,
>
> Joe Peterson wrote:
> >Hi Harold,
> >
> >Doing "ps" while hung shows stty in the "T" state. "killall -9 stty"
> >releases it.
> >
>
> Does strace give you the same output if you attach it to the blocking
> stty (strace -p $pid)?
>
> I got
>
>
> :
> ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo ...}) =
> ? ERESTARTSYS (To be restarted)
> --- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
> --- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
> :

Guys, you should test if "kill -CONT $pid" wakes the process up.
It might be possible that some obscure bug appeared in the tty
code resulting in SIGTTOU sometimes being sent to the caller,
although that seems rather strange :-/

Willy

2008-06-02 05:42:34

by Joe Peterson

Subject: Re: 2.6.25.3: su gets stuck for root

Harald Dunkel wrote:
> Joe Peterson wrote:
>> Hi Harold,
>>
>> Doing "ps" while hung shows stty in the "T" state. "killall -9 stty"
>> releases it.
>>
>
> Does strace give you the same output if you attach it to the blocking
> stty (strace -p $pid)?
>
> I got
>
>
> :
> ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo ...}) = ? ERESTARTSYS (To be restarted)
> --- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
> --- SIGTTOU (Stopped (tty output)) @ 0 (0) ---

Yep, almost the same. I get (repeating):

ioctl(0, SNDCTL_TMR_STOP or TCSETSW, {B38400 opost isig icanon echo ...}) = ? ERESTARTSYS (To be restarted)
--- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
--- SIGTTOU (Stopped (tty output)) @ 0 (0) ---


-Joe

2008-06-02 05:55:21

by Joe Peterson

Subject: Re: 2.6.25.3: su gets stuck for root

Willy Tarreau wrote:
> Guys, you should test if "kill -CONT $pid" wakes the process up.
> It might be possible that some obscure bug appeared in the tty
> code resulting in SIGTTOU sometimes being sent to the caller,
> although that seems rather strange :-/

Just tried this ("kill -CONT <pid>") - no luck.

BTW, it should be possible, I would think, for others to duplicate this
fairly easily. Just:

1) make a user, "foo", with login shell set to /bin/bash
2) create a .bashrc in foo's home dir with contents:

sleep 2
stty -ixany

3) cp .bashrc .bash_profile (only needed to test "su - foo" too)
4) become root
5) type "su foo" (or "su - foo")

Sometimes it takes a second try to get it to happen. If the su hangs,
check to see if the stty process is in state "T". Also, it may make a
difference if you are logged in already as foo or are using X. I first
noticed this with no users logged in (except root) and no X running (but
I can reproduce with X/xterm as well using this simple test case). It
seems timing is a factor, so it's worth trying various things.

-Joe

2008-06-02 08:26:21

by Alan

Subject: Re: 2.6.25.3: su gets stuck for root

> Guys, you should test if "kill -CONT $pid" wakes the process up.
> It might be possible that some obscure bug appeared in the tty
> code resulting in SIGTTOU sometimes being sent to the caller,
> although that seems rather strange :-/

Not really. The task would get suspended if it attempted to change the
tty settings while not being session leader. This is part of the POSIX
and BSD job control. A race (either kernel or in something like
sshd/bash) would do that and could have been caused by any of the timing
changes recently.

That would also explain why I can't duplicate it, and the sleep
observation.

2008-06-02 09:01:58

by David Newall

Subject: Re: 2.6.25.3: su gets stuck for root

Alan Cox wrote:
> Not really. The task would get suspended if it attempted to change the
> tty settings while not being session leader. This is part of the POSIX
> and BSD job control.

I haven't heard about this new restriction, but it begs the observation
that stty, when forked from a shell (the usual case), is never a session
leader.

2008-06-02 09:36:23

by Alan

Subject: Re: 2.6.25.3: su gets stuck for root

On Mon, 02 Jun 2008 18:31:34 +0930
David Newall <[email protected]> wrote:

> Alan Cox wrote:
> > Not really. The task would get suspended if it attempted to change the
> > tty settings while not being session leader. This is part of the POSIX
> > and BSD job control.
>
> I haven't heard about this new restriction, but it begs the observation
> that stty, when forked from a shell (the usual case), is never a session
> leader.

Sorry I mean part of the current session. I was thinking about the
specific case of bash or the ssh->bash setup where the question would be
whether the shell was session leader.

Someone who can dup this needs to instrument it in tty_ioctl really.

Alan

2008-06-02 10:17:14

by Vegard Nossum

Subject: Re: 2.6.25.3: su gets stuck for root

On Mon, Jun 2, 2008 at 11:20 AM, Alan Cox <[email protected]> wrote:
> On Mon, 02 Jun 2008 18:31:34 +0930
> David Newall <[email protected]> wrote:
>
>> Alan Cox wrote:
>> > Not really. The task would get suspended if it attempted to change the
>> > tty settings while not being session leader. This is part of the POSIX
>> > and BSD job control.
>>
>> I haven't heard about this new restriction, but it begs the observation
>> that stty, when forked from a shell (the usual case), is never a session
>> leader.
>
> Sorry I mean part of the current session. I was thinking about the
> specific case of bash or the ssh->bash setup where the question would be
> whether the shell was session leader.
>
> Someone who can dup this needs to instrument it in tty_ioctl really.

Hi,

I have written a short test program that seems to reproduce it for me
(see attachment), even though the original su/stty stuff wouldn't.

Basically, the strace shows this:
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo ...}) = ? ERESTARTSYS (To be restarted)
--- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
--- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo ...}) = ? ERESTARTSYS (To be restarted)
--- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
--- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
... (repeating)

The exact code path triggering this seems to be:

tcsetattr() -> ioctl(TCSETS) -> set_termios() -> tty_check_change()

This is on a 2.6.24.5-85.fc8 kernel.

I don't know what's wrong, but I hope this helps.
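
The attached reproduce.c is not included in this archive, so here is a separate sketch of how the same situation could be set up. It is not Vegard's program. It assumes Linux: it opens /dev/ptmx directly and uses the Linux-specific TIOCSPTLCK and TIOCGPTN ioctls to get a pty, makes a child the session leader with the pty slave as its controlling terminal, and then lets a grandchild in its own process group call tcsetattr(). The function names are invented for the example.

```c
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <termios.h>
#include <unistd.h>

/* Runs in a forked child. Exit status: 1 = the worker was stopped by
 * SIGTTOU, 0 = it was not, 2 = the terminal could not be set up. */
static void session_leader(const char *slave_path)
{
    int status, fd;
    pid_t worker;

    if (setsid() < 0)
        _exit(2);
    /* A session leader without a controlling terminal acquires the first
     * tty it opens; the TIOCSCTTY call just makes that explicit. */
    fd = open(slave_path, O_RDWR);
    if (fd < 0)
        _exit(2);
    ioctl(fd, TIOCSCTTY, 0);

    worker = fork();
    if (worker < 0)
        _exit(2);
    if (worker == 0) {
        struct termios t;

        signal(SIGTTOU, SIG_DFL);
        setpgid(0, 0);                   /* leave the foreground group */
        if (tcgetattr(fd, &t) == 0)
            tcsetattr(fd, TCSANOW, &t);  /* expected to stop us here */
        _exit(0);
    }
    if (waitpid(worker, &status, WUNTRACED) != worker)
        _exit(2);
    if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGTTOU) {
        kill(worker, SIGKILL);           /* clean up the stopped worker */
        waitpid(worker, NULL, 0);
        _exit(1);
    }
    _exit(0);
}

/* Returns 1 if a background tcsetattr() was stopped by SIGTTOU, 0 if it
 * was not, and -1 if no pty could be set up in this environment. */
int background_tcsetattr_stops(void)
{
    char slave_path[64];
    unsigned int ptn;
    int master, unlock = 0, status;
    pid_t leader;

    master = open("/dev/ptmx", O_RDWR | O_NOCTTY);
    if (master < 0)
        return -1;
    if (ioctl(master, TIOCSPTLCK, &unlock) < 0 ||
        ioctl(master, TIOCGPTN, &ptn) < 0) {
        close(master);
        return -1;
    }
    snprintf(slave_path, sizeof slave_path, "/dev/pts/%u", ptn);

    leader = fork();
    if (leader < 0) {
        close(master);
        return -1;
    }
    if (leader == 0)
        session_leader(slave_path);      /* never returns */

    if (waitpid(leader, &status, 0) != leader || !WIFEXITED(status)) {
        close(master);
        return -1;
    }
    close(master);
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}
```

The worker here plays the role of stty, and the session leader plays the role of whatever process group currently owns the terminal.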


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036


Attachments:
(No filename) (1.78 kB)
reproduce.c (1.05 kB)

2008-06-02 10:39:38

by Vegard Nossum

Subject: Re: 2.6.25.3: su gets stuck for root

On Mon, Jun 2, 2008 at 12:16 PM, Vegard Nossum <[email protected]> wrote:
> On Mon, Jun 2, 2008 at 11:20 AM, Alan Cox <[email protected]> wrote:
>> On Mon, 02 Jun 2008 18:31:34 +0930
>> David Newall <[email protected]> wrote:
>>
>>> Alan Cox wrote:
>>> > Not really. The task would get suspended if it attempted to change the
>>> > tty settings while not being session leader. This is part of the POSIX
>>> > and BSD job control.
>>>
>>> I haven't heard about this new restriction, but it begs the observation
>>> that stty, when forked from a shell (the usual case), is never a session
>>> leader.
>>
>> Sorry I mean part of the current session. I was thinking about the
>> specific case of bash or the ssh->bash setup where the question would be
>> whether the shell was session leader.
>>
>> Someone who can dup this needs to instrument it in tty_ioctl really.
>
> Hi,
>
> I have written a short test program that seems to reproduce it for me
> (see attachment), even though the original su/stty stuff wouldn't.
>
> Basically, the strace shows this:
> ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo
> ...}) = ? ERESTARTSYS (To be restarted)
> --- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
> --- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
> ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo
> ...}) = ? ERESTARTSYS (To be restarted)
> --- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
> --- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
> ... (repeating)
>
> The exact code path triggering this seems to be:
>
> tcsetattr() -> ioctl(TCSETS) -> set_termios() -> tty_check_change()
>
> This is on a 2.6.24.5-85.fc8 kernel.
>
> I don't know what's wrong, but I hope this helps.

The error seems to be that tty_check_change() returns -ERESTARTSYS.
Shouldn't it be EINTR to allow the signal to be processed and let the
process decide whether to retry the tcsetattr()?

Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-06-02 10:51:16

by Alan Cox

Subject: Re: 2.6.25.3: su gets stuck for root

On Mon, Jun 02, 2008 at 12:16:56PM +0200, Vegard Nossum wrote:
> ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo
> ...}) = ? ERESTARTSYS (To be restarted)
> --- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
> --- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
> ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo
> ...}) = ? ERESTARTSYS (To be restarted)
> --- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
> --- SIGTTOU (Stopped (tty output)) @ 0 (0) ---
> ... (repeating)
>
> The exact code path triggering this seems to be:
>
> tcsetattr() -> ioctl(TCSETS) -> set_termios() -> tty_check_change()

This looks correct to me and in fact I see the behaviour you report on 2.6.23
when running it. If I tell it to ignore SIGTTOU that also then behaves as
expected.

If
your pgrp is not the pgrp of the tty
and you are not ignoring TTOU
and you are not orphaned (as a group)

Then we are *supposed* to send you SIGTTOU and kick you back
into touch.


This is so that if you do

someapp
^Z
bg
otherapp

And someapp wants to change the tty settings it blocks back to the shell.

This is correct behaviour and behaviour we've had for years.
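
From user space, a process can check ahead of time whether it is in the terminal's foreground group, which is the first condition in the list above. A minimal sketch; the function name in_foreground_group is hypothetical:

```c
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the caller is in the foreground process group of the
 * terminal open on fd, 0 if it is in a background group, and -1 if
 * tcgetpgrp() fails (for example ENOTTY when fd is not a terminal). */
int in_foreground_group(int fd)
{
    pid_t foreground = tcgetpgrp(fd);

    if (foreground == (pid_t)-1)
        return -1;
    return foreground == getpgrp();
}
```

A result of 0 on stdin means a tcsetattr() on it would be expected to raise SIGTTOU unless the signal is ignored or the group is orphaned.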

Alan

2008-06-02 10:52:36

by Alan Cox

Subject: Re: 2.6.25.3: su gets stuck for root

On Mon, Jun 02, 2008 at 12:39:29PM +0200, Vegard Nossum wrote:
> Shouldn't it be EINTR to allow the signal to be processed and let the
> process decide whether to retry the tcsetattr()?

The signal is processed, and then application retries the tcsetattr and
gets another one. The default TTOU behaviour is to block and then fg
continues the call so RESTARTSYS is both correct and has been used for
years

2008-06-02 10:57:19

by Vegard Nossum

Subject: Re: 2.6.25.3: su gets stuck for root

On Mon, Jun 2, 2008 at 12:52 PM, Alan Cox <[email protected]> wrote:
> On Mon, Jun 02, 2008 at 12:39:29PM +0200, Vegard Nossum wrote:
>> Shouldn't it be EINTR to allow the signal to be processed and let the
>> process decide whether to retry the tcsetattr()?
>
> The signal is processed, and then application retries the tcsetattr and
> gets another one. The default TTOU behaviour is to block and then fg
> continues the call so RESTARTSYS is both correct and has been used for
> years
>

Hm, yes, that seems correct. I'm sorry for the wrong suggestions.

I guess this still doesn't explain why TTOU doesn't block (IOW, stop
the process, right?) in this case, because my test program does not
touch it.


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-06-02 12:28:51

by Alan Cox

Subject: Re: 2.6.25.3: su gets stuck for root

On Mon, Jun 02, 2008 at 12:57:07PM +0200, Vegard Nossum wrote:
> I guess this still doesn't explain why TTOU doesn't block (IOW, stop
> the process, right?) in this case, because my test program does not
> touch it.

I see the parent process sleeping and the child taking TTOU and going to
state T. That again is correct.

alan 3219 0.0 0.0 3652 384 pts/5 S 13:11 0:00 ./repro
alan 3220 0.0 0.0 3652 204 pts/5 T 13:11 0:00 ./repro

If you run it without any straces etc do you see it blocked in T or sitting
in R ?

Alan

2008-06-02 14:31:21

by Vegard Nossum

Subject: Re: 2.6.25.3: su gets stuck for root

On Mon, Jun 2, 2008 at 2:28 PM, Alan Cox <[email protected]> wrote:
> On Mon, Jun 02, 2008 at 12:57:07PM +0200, Vegard Nossum wrote:
>> I guess this still doesn't explain why TTOU doesn't block (IOW, stop
>> the process, right?) in this case, because my test program does not
>> touch it.
>
> I see the parent process sleeping and the child taking TTOU and going to
> state T. That again is correct.
>
> alan 3219 0.0 0.0 3652 384 pts/5 S 13:11 0:00 ./repro
> alan 3220 0.0 0.0 3652 204 pts/5 T 13:11 0:00 ./repro
>
> If you run it without any straces etc do you see it blocked in T or sitting
> in R ?

Without any straces, it is blocked in T. Like Joe's report.

With strace, it's in R.

Exactly as you said, correct and expected behaviour.

So this is not a kernel problem at all.

I'm sorry for having wasted your time :-(


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-06-02 15:26:59

by Joe Peterson

Subject: Re: 2.6.25.3: su gets stuck for root

Alan Cox wrote:
> Someone who can dup this needs to instrument it in tty_ioctl really.

Alan, since I can get it to happen faithfully, I can try this - any
suggestions on where to instrument?

Thanks, Joe

P.S. My stty process sits in "T" - did you say that it would be in "R"
if straced and that is correct?

2008-06-02 15:52:05

by Alan Cox

Subject: Re: 2.6.25.3: su gets stuck for root

On Mon, Jun 02, 2008 at 09:26:48AM -0600, Joe Peterson wrote:
> P.S. My stty process sits in "T" - did you say that it would be in "R"
> if straced and that is correct?

T would be correct. I'll put together a small diff to printk useful stuff
when it happens and send it to you tonight/tomorrow


--
Take control of enterprise infrastructure
Sign up for starfleet academy today

2008-06-02 16:03:28

by Joe Peterson

Subject: Re: 2.6.25.3: su gets stuck for root

Alan Cox wrote:
> On Mon, Jun 02, 2008 at 09:26:48AM -0600, Joe Peterson wrote:
>> P.S. My stty process sits in "T" - did you say that it would be in "R"
>> if straced and that is correct?
>
> T would be correct. I'll put together a small diff to printk useful stuff
> when it happens and send it to you tonight/tomorrow

Awesome; that would be great - thanks!

-Joe

2008-06-04 14:43:20

by Joe Peterson

Subject: Re: 2.6.25.3: su gets stuck for root

Alan Cox wrote:
> On Mon, Jun 02, 2008 at 09:26:48AM -0600, Joe Peterson wrote:
>> P.S. My stty process sits in "T" - did you say that it would be in "R"
>> if straced and that is correct?
>
> T would be correct. I'll put together a small diff to printk useful stuff
> when it happens and send it to you tonight/tomorrow

[Alan, thanks for the tips on where to instrument this]

What I have verified so far is that when the problem occurs, it gets to
this point in [tty_io.c] tty_check_change():

        kill_pgrp(task_pgrp(current), SIGTTOU, 1);
        set_thread_flag(TIF_SIGPENDING);
        ret = -ERESTARTSYS;
out:
        return ret;

So the error that gets returned to set_termios() is -512.

Also, the various checks before this point (of course) did not pass
(current->signal->tty != tty, !tty->pgrp, task_pgrp(current) ==
tty->pgrp, is_ignored(SIGTTOU), is_current_pgrp_orphaned()). I have not
printed out the various values from these - let me know if this would be
helpful. I wanted to pass this info along now in case it is of help.
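
To make the order of those checks easier to follow, here is a user-space model of the decision. It is a sketch, not the kernel source: the struct and function names are invented, the checks are reduced to booleans, and the -EIO result for an orphaned group is an assumption taken from POSIX's description of tcsetattr() rather than from anything shown in this thread. MODEL_ERESTARTSYS is set to 512 only to match the -512 seen above.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_ERESTARTSYS 512  /* kernel-internal value, matches the -512 above */

/* One boolean per check listed above (hypothetical model, not kernel types). */
struct check_inputs {
    bool is_controlling_tty;  /* current->signal->tty == tty */
    bool tty_has_pgrp;        /* tty->pgrp != NULL */
    bool same_pgrp;           /* task_pgrp(current) == tty->pgrp */
    bool sigttou_ignored;     /* is_ignored(SIGTTOU) */
    bool pgrp_orphaned;       /* is_current_pgrp_orphaned() */
};

/* Returns 0 if the change may proceed, -EIO for an orphaned background
 * group (assumption, see text), and -MODEL_ERESTARTSYS when the caller's
 * process group would be sent SIGTTOU. */
int model_tty_check_change(const struct check_inputs *in)
{
    if (!in->is_controlling_tty)
        return 0;
    if (!in->tty_has_pgrp)
        return 0;
    if (in->same_pgrp)
        return 0;
    if (in->sigttou_ignored)
        return 0;
    if (in->pgrp_orphaned)
        return -EIO;
    return -MODEL_ERESTARTSYS;  /* SIGTTOU is sent, syscall restarts later */
}
```

In the hang reported here, every early exit is skipped, so the model ends at the last line, the same place the instrumentation shows.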

-Joe

2008-06-04 15:17:25

by Alan Cox

Subject: Re: 2.6.25.3: su gets stuck for root

On Wed, Jun 04, 2008 at 08:43:00AM -0600, Joe Peterson wrote:
>
> So the error that gets returned to set_termios() is -512.
>
> Also, the various checks before this point (of course) did not pass
> (current->signal->tty != tty, !tty->pgrp, task_pgrp(current) ==
> tty->pgrp, is_ignored(SIGTTOU), is_current_pgrp_orphaned()). I have not
> printed out the various values from these - let me know if this would be
> helpful. I wanted to pass this info along now in case it is of help.

See what tty->pgrp is at that point when it hangs and that might identify
who is owning the tty and tty setup

2008-06-04 16:53:05

by Joe Peterson

Subject: Re: 2.6.25.3: su gets stuck for root

Alan Cox wrote:
> See what tty->pgrp is at that point when it hangs and that might identify
> who is owning the tty and tty setup

tty = current->signal->tty = -142080000 or 0xf7880800
task->pgrg = -142405824 or 0xf7830f40

-Joe

2008-06-04 17:11:27

by Alan Cox

Subject: Re: 2.6.25.3: su gets stuck for root

> tty = current->signal->tty = -142080000 or 0xf7880800
> task->pgrg = -142405824 or 0xf7830f40

task->pgrp is a struct pid - you need the value it holds

2008-06-04 20:33:22

by Joe Peterson

Subject: Re: 2.6.25.3: su gets stuck for root

Alan Cox wrote:
>> tty = current->signal->tty = -142080000 or 0xf7880800
>> task->pgrg = -142405824 or 0xf7830f40
>
> task->pgrp is a struct pid - you need the value it holds

Yeah, I figured later that giving you the addresses was rather useless. :)

Anyway, here is more info:

tty_check_change: current->signal->tty = f7880800
tty_check_change: tty = f7880800
tty_check_change: tty->pgrp = f7b99e40
tty->pgrp->count = 5
tty->pgrp->level = 0
tty->pgrp->numbers[0].nr = 6951
tty_check_change: task_pgrp(current) = f7b99d40
task_pgrp(current)->count = 1
task_pgrp(current)->level = 0
task_pgrp(current)->numbers[0].nr = 6952
tty_check_change: kill_pgrp called; returning -ERESTARTSYS
set_termios: error return value (-512) from tty_check_change
foo 6951 0.0 0.1 2332 1096 tty1 S+ 14:18 0:00 su foo
foo 6952 0.0 0.1 2988 1464 tty1 S 14:18 0:00 bash


So, looks like the tty->pgrp's process is the "su" command itself, and
the task_pgrp(current)'s process is "bash" - the shell started by the su.

-Joe

2008-06-11 14:04:34

by Joe Peterson

Subject: Re: 2.6.25.3: su gets stuck for root

Joe Peterson wrote:
> Anyway, here is more info:
>
> tty_check_change: current->signal->tty = f7880800
> tty_check_change: tty = f7880800
> tty_check_change: tty->pgrp = f7b99e40
> tty->pgrp->count = 5
> tty->pgrp->level = 0
> tty->pgrp->numbers[0].nr = 6951
> tty_check_change: task_pgrp(current) = f7b99d40
> task_pgrp(current)->count = 1
> task_pgrp(current)->level = 0
> task_pgrp(current)->numbers[0].nr = 6952
> tty_check_change: kill_pgrp called; returning -ERESTARTSYS
> set_termios: error return value (-512) from tty_check_change
> foo 6951 0.0 0.1 2332 1096 tty1 S+ 14:18 0:00 su foo
> foo 6952 0.0 0.1 2988 1464 tty1 S 14:18 0:00 bash
>
>
> So, looks like the tty->pgrp's process is the "su" command itself, and
> the task_pgrp(current)'s process is "bash" - the shell started by the su.

If anyone has any tips for my further debugging of this, given the
above, let me know. I'd like to help resolve this.

Thanks! Joe

2008-06-12 11:52:21

by Vegard Nossum

Subject: Re: 2.6.25.3: su gets stuck for root

On Wed, Jun 11, 2008 at 4:04 PM, Joe Peterson <[email protected]> wrote:
> Joe Peterson wrote:
>> Anyway, here is more info:
>>
>> tty_check_change: current->signal->tty = f7880800
>> tty_check_change: tty = f7880800
>> tty_check_change: tty->pgrp = f7b99e40
>> tty->pgrp->count = 5
>> tty->pgrp->level = 0
>> tty->pgrp->numbers[0].nr = 6951
>> tty_check_change: task_pgrp(current) = f7b99d40
>> task_pgrp(current)->count = 1
>> task_pgrp(current)->level = 0
>> task_pgrp(current)->numbers[0].nr = 6952
>> tty_check_change: kill_pgrp called; returning -ERESTARTSYS
>> set_termios: error return value (-512) from tty_check_change
>> foo 6951 0.0 0.1 2332 1096 tty1 S+ 14:18 0:00 su foo
>> foo 6952 0.0 0.1 2988 1464 tty1 S 14:18 0:00 bash
>>
>>
>> So, looks like the tty->pgrp's process is the "su" command itself, and
>> the task_pgrp(current)'s process is "bash" - the shell started by the su.
>
> If anyone has any tips for my further debugging of this, given the
> above, let me know. I'd like to help resolve this.

I think knowing the pgrps of the above processes (there is possibly
one more involved, stty?) would be useful; try:

$ ps -eo pid,pgrp,tpgid,user,args

..as this problem occurs because a process tries to change the
terminal settings (and subsequently gets suspended because of that)
while it's not the owner of the terminal.

This can happen if you fork something off to the background, e.g. like

$ stty 9600 &

(which should immediately give you [1]+ Stopped stty 9600),

so can you please look for anything like that in your login scripts or
shell rc files?

I don't know any other way to debug this further, sorry :-(

Thanks.


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-06-14 01:50:18

by Joe Peterson

Subject: Re: 2.6.25.3: su gets stuck for root

Vegard Nossum wrote:
> I think knowing the pgrps of the above processes (there is possibly
> one more involved, stty?) would be useful; try:
>
> $ ps -eo pid,pgrp,tpgid,user,args

OK, I performed this test again (getting the su to hang), and here is
the info:

tty_check_change: current->signal->tty = f7879800
tty_check_change: tty = f7879800
tty_check_change: tty->pgrp = f78639c0
tty->pgrp->count = 5
tty->pgrp->level = 0
tty->pgrp->numbers[0].nr = 7036
tty_check_change: task_pgrp(current) = f7863f00
task_pgrp(current)->count = 1
task_pgrp(current)->level = 0
task_pgrp(current)->numbers[0].nr = 7037
tty_check_change: kill_pgrp called; returning -ERESTARTSYS
set_termios: error return value (-512) from tty_check_change

scorpius ~ # ps aux | grep 7036
foo 7036 0.0 0.1 2336 1100 tty1 S+ 19:30 0:00 su foo

scorpius ~ # ps aux | grep 7037
foo 7037 0.0 0.1 2988 1460 tty1 S 19:30 0:00 bash

scorpius ~ # ps -eo pid,pgrp,tpgid,user,args | grep 7036
6902 6902 7036 root /bin/login --
6922 6922 7036 root -bash
7036 7036 7036 foo su foo
7037 7037 7036 foo bash
7042 7037 7036 foo stty -ixany

scorpius ~ # ps -eo pid,pgrp,tpgid,user,args | grep 7037
7037 7037 7036 foo bash
7042 7037 7036 foo stty -ixany

scorpius ~ # ps aux | grep 7042
foo 7042 0.0 0.0 1608 376 tty1 T 19:30 0:00 stty -ixany

scorpius ~ # ps -eo pid,pgrp,tpgid,user,args | grep 7042
7042 7037 7036 foo stty -ixany

(I omitted, of course, when grep found itself, and I compressed some
white space to allow lines to fit nicely in the email)

> ..as this problem occurs because a process tries to change the
> terminal settings (and subsequently gets suspended because of that)
> while it's not the owner of the terminal.
>
> This can happen if you fork something off to the background, e.g. like
>
> $ stty 9600 &
>
> (which should immediately give you [1]+ Stopped stty 9600),
>
> so can you please look for anything like that in your login scripts or
> shell rc files?

I do use stty in my .bashrc (that's why this happens), but I do not put
it in the background.

Anyway, hope the additional info above is of help...

Thanks, Joe

2008-06-14 07:45:28

by Vegard Nossum

Subject: Re: 2.6.25.3: su gets stuck for root

On Sat, Jun 14, 2008 at 3:49 AM, Joe Peterson <[email protected]> wrote:
> Vegard Nossum wrote:
>> I think knowing the pgrps of the above processes (there is possibly
>> one more involved, stty?) would be useful; try:
>>
>> $ ps -eo pid,pgrp,tpgid,user,args
>
> OK, I performed this test again (getting the su to hang), and here is
> the info:

[snip]

> scorpius ~ # ps -eo pid,pgrp,tpgid,user,args | grep 7036
> 6902 6902 7036 root /bin/login --
> 6922 6922 7036 root -bash
> 7036 7036 7036 foo su foo
> 7037 7037 7036 foo bash
> 7042 7037 7036 foo stty -ixany

So this clearly shows what's wrong; 7036 is the "controlling process"
group id. But only "su foo" is in this group, the bash and stty
processes have their own group, 7037.
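
The comparison being made on those ps rows (PGRP against TPGID) can be written down as a small helper. A sketch only; the enum and function name are invented, and treating a negative TPGID as "no terminal" is an assumption about ps output that this thread does not show.

```c
#include <stdio.h>

enum tty_role {
    ROLE_FOREGROUND,  /* pgrp equals the terminal's foreground pgrp */
    ROLE_BACKGROUND,  /* pgrp differs: tty changes would raise SIGTTOU */
    ROLE_NO_TTY,      /* assumed: negative tpgid means no terminal */
    ROLE_UNPARSED     /* row did not start with three integers */
};

/* Classify one row of `ps -eo pid,pgrp,tpgid,user,args` output. */
enum tty_role classify_ps_row(const char *row)
{
    int pid, pgrp, tpgid;

    if (sscanf(row, "%d %d %d", &pid, &pgrp, &tpgid) != 3)
        return ROLE_UNPARSED;
    if (tpgid < 0)
        return ROLE_NO_TTY;
    return pgrp == tpgid ? ROLE_FOREGROUND : ROLE_BACKGROUND;
}
```

Fed the rows quoted above, "su foo" comes out as foreground while both bash and stty come out as background.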

On my own system, when I do "su", I get this:
2891 2891 2892 root su temp
2892 2892 2892 temp bash

...and here the "bash" process is in the right group, 2892, while "su"
is the one in the background!

Can you try to run strace on the su to see where things go wrong, i.e.

$ strace -f -e trace=process su foo

...and we're only interested in what happens up to the point where it
hangs. That should hopefully tell us which process is doing the wrong
thing. In either case, as Alan pointed out, this seems unlikely to be
a kernel problem.

[snip]

>> so can you please look for anything like that in your login scripts or
>> shell rc files?
>
> I do use stty in my .bashrc (that's why this happens), but I do not put
> it in the background.

Yeah, most likely the process that calls stty is first put in the
background itself (or never brought to the foreground?). But I don't
know why... when you get the trace, we can compare and find out where
it deviates.


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-06-14 17:43:45

by Joe Peterson

Subject: Re: 2.6.25.3: su gets stuck for root

Vegard Nossum wrote:
> So this clearly shows what's wrong; 7036 is the "controlling process"
> group id. But only "su foo" is in this group, the bash and stty
> processes have their own group, 7037.
>
> On my own system, when I do "su", I get this:
> 2891 2891 2892 root su temp
> 2892 2892 2892 temp bash
>
> ...and here the "bash" process is in the right group, 2892, while "su"
> is the one in the background!

Hmm.

> Can you try to run strace on the su to see where things go wrong, i.e.
>
> $ strace -f -e trace=process su foo
>
> ...and we're only interested in what happens up to the point where it
> hangs. That should hopefully tell us which process is doing the wrong
> thing. In either case, as Alan pointed out, this seems unlikely to be
> a kernel problem.

OK, I attached this as a text file at the end. But (*bummer*), using
strace makes it impossible to reproduce the hang (figures, and I believe
someone earlier in the thread also had this problem).

As for whether the kernel is at fault, not sure (i.e. does this hang
behavior implicate the kernel automatically or can a user-space process
cause itself such an issue?). But I *do* see different behavior
depending on the kernel version. There were a couple of git kernels in
which I could not reproduce it. Still, if it is a race or something, it
might be that the conditions were just slightly perturbed.

I attached the strace log just in case it is of help.

-Joe


Attachments:
su_strace.log (2.44 kB)

2008-06-14 20:35:03

by Vegard Nossum

Subject: Re: 2.6.25.3: su gets stuck for root

On Sat, Jun 14, 2008 at 7:43 PM, Joe Peterson <[email protected]> wrote:
>> Can you try to run strace on the su to see where things go wrong, i.e.
>>
>> $ strace -f -e trace=process su foo
>>
>> ...and we're only interested in what happens up to the point where it
>> hangs. That should hopefully tell us which process is doing the wrong
>> thing. In either case, as Alan pointed out, this seems unlikely to be
>> a kernel problem.
>
> OK, I attached this as a text file at the end. But (*bummer*), using
> strace makes it impossible to reproduce the hang (figures, and I believe
> someone earlier in the thread also had this problem).

Yeah, but doesn't it loop indefinitely calling ioctl() and getting a
SIGTTOU? Tracing up till this point is okay (and what I had in mind).

>
> As for whether the kernel is at fault, not sure (i.e. does this hang
> behavior implicate the kernel automatically or can a user-space process
> cause itself such an issue?). But I *do* see different behavior
> depending on the kernel version. There were a couple of git kernels in
> which I could not reproduce it. Still, if it is a race or something, it
> might be that the conditions were just slightly perturbed.

Yeah, a user-space process can do this, and it's the right behaviour
for the kernel. I did post a program that would "reproduce" what
you're seeing. I do now believe that it's something timing-related, as
Alan suggested initially. (But timing-related with your scripts, that
is. I must say, that "sleep 2" does look a bit suspicious; I have no
idea what that is supposed to do :-))

I suppose it would be more useful to see a trace where you include a
few more system calls, can you try:

# strace -e trace=process,ioctl,setpgid -f su foo

instead?

Just for the record, I'm probably not the best person to debug this,
so I'm just trying to figure it out as we go. On the other hand, I
don't see better suggestions from anybody else. Thank you for
persisting, though! :-)

(And the fact that the results differ with the kernel versions does
make this relevant for LKML still.)


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-06-14 20:52:21

by Joe Peterson

Subject: Re: 2.6.25.3: su gets stuck for root

Vegard Nossum wrote:
> Yeah, a user-space process can do this, and it's the right behaviour
> for the kernel. I did post a program that would "reproduce" what
> you're seeing. I do now believe that it's something timing-related, as
> Alan suggested initially. (But timing-related with your scripts, that
> is. I must say, that "sleep 2" does look a bit suspicious; I have no
> idea what that is supposed to do :-))

Ah, that is something I put in there to artificially make it more
reproducible. Here's the reason: when I first encountered the problem,
it was happening if the home dir of the user was on the "btrfs"
filesystem (the new checksumming one from Oracle). This made me suspect
btrfs initially. But I reproduced the problem [more sporadically] when
the home was on ext3 as well. Since btrfs has a different performance
profile, especially when first accessed after a mount (and it is a
filesystem still under development, so some optimizations are yet to
come), I figured it might be timing-related, and sure enough, adding the
"sleep 2" proved that.

So without the sleep 2 and with a home of ext3, it rarely happens, since
it takes very little time to read the homedir files (.bashrc, etc.).
Putting in the sleep makes it almost always happen. It seems like the
delay invoked by the sleep causes that subsequent stty call to hang.

> I suppose it would be more useful to see a trace where you include a
> few more system calls, can you try:
>
> # strace -e trace=process,ioctl,setpgid -f su foo
>
> instead?

OK, attached.

> Just for the record, I'm probably not the best person to debug this,
> so I'm just trying to figure it out as we go. On the other hand, I
> don't see better suggestions from anybody else. Thank you for
> persisting, though! :-)
>
> (And the fact that the results differ with the kernel versions does
> make this relevant for LKML still.)

Thanks for helping. Yes, this is the kind of nagging issue that really
bugs me, since it is intermittent and makes things feel unstable. If we
determine the problem is in something else (like stty or bash), then at
least I can file a bug with them.

-Joe


Attachments:
strace_su.log (5.55 kB)

2008-06-14 21:33:49

by Vegard Nossum

Subject: Re: 2.6.25.3: su gets stuck for root

On Sat, Jun 14, 2008 at 10:52 PM, Joe Peterson wrote:
> Vegard Nossum wrote:
>> Yeah, a user-space process can do this, and it's the right behaviour
>> for the kernel. I did post a program that would "reproduce" what
>> you're seeing. I do now believe that it's something timing-related, as
>> Alan suggested initially. (But timing-related with your scripts, that
>> is. I must say, that "sleep 2" does look a bit suspicious; I have no
>> idea what that is supposed to do :-))
>
> Ah, that is something I put in there to artificially make it more
> reproducible. Here's the reason: when I first encountered the problem,
> it was happening if the home dir of the user was on the "btrfs"
> filesystem (the new checksumming one from Oracle). This made me suspect
> btrfs initially. But I reproduced the problem [more sporadically] when
> the home was on ext3 as well. Since btrfs has a different performance
> profile, especially when first accessed after a mount (and it is a
> filesystem still under development, so some optimizations are yet to
> come), I figured it might be timing-related, and sure enough, adding the
> "sleep 2" proved that.

I'm not sure it is. Try adding sleep 3 instead. Because I have the
"sleep 2" when I run "su foo" as well, and I _didn't_ put it there:

[pid 6298] execve("/bin/sleep", ["sleep", "2"], [/* 47 vars */]
<unfinished ...>


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-06-14 21:34:51

by Joe Peterson

Subject: Re: 2.6.25.3: su gets stuck for root

Vegard Nossum wrote:
> I'm not sure it is. Try adding sleep 3 instead. Because I have the
> "sleep 2" when I run "su foo" as well, and I _didn't_ put it there:
>
> [pid 6298] execve("/bin/sleep", ["sleep", "2"], [/* 47 vars */]
> <unfinished ...>

Weird! OK, I tried it with "sleep 3" in .bashrc, and it says
"...execve("/usr/bin/sleep", ["sleep", "3"], [/* 30 vars */]) = 0".
This sounds like what I'd expect. I don't understand why you see a
sleep 2 when you did not have one in your config.....

-Joe

2008-06-17 15:33:19

by Joe Peterson

Subject: Re: 2.6.25.3: su gets stuck for root

Alan Cox wrote:
> This looks correct to me and in fact I see the behaviour you report on 2.6.23
> when running it. If I tell it to ignore SIGTTOU that also then behaves as
> expected.
>
> If
> your pgrp is not the pgrp of the tty
> and you are not ignoring TTOU
> and you are not orphaned (as a group)
>
> Then we are *supposed* to send you SIGTTOU and kick you back
> into touch.

OK, I am still baffled. I've thought of several different theories,
wondering if bash does not have the right parent process, how there
could be a race in the kernel or elsewhere, but as far as I can tell,
things are in order. Here's the ps -ax --forest output while hung:

6435 tty3 Ss 0:00 /bin/login --
7954 tty3 S 0:00 \_ -bash
7958 tty3 S+ 0:00 \_ su foo
7959 tty3 S 0:00 \_ bash
7964 tty3 T 0:00 \_ stty -ixany

I had logged into the tty as root (with shell set to bash), then su'd to
foo (with shell set to bash), so this tree makes sense. During the
sleep before the stty, sleep is under the final bash similar to the way
stty is while it is hung.

Note that the stty is a child of bash (which, BTW, sometimes appears as
"-su" instead - I am not clear on that), and they all lead back to the
original tty, which I gather is the session leader (or is it the "su"?).
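
One way to settle who leads the session and which group owns the
terminal is to read the fields straight out of /proc/<pid>/stat on
Linux. The sketch below is only an illustration: parse_proc_stat is a
hypothetical helper, and the sample line is made up to match the ps
output above (its pgrp and tpgid values are assumptions, not measured).

```python
def parse_proc_stat(line):
    """Extract the job-control fields from one line of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split at the last ')'.
    head, _, tail = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    rest = tail.split()
    return {
        "pid": int(pid_text),
        "comm": comm,
        "state": rest[0],
        "ppid": int(rest[1]),
        "pgrp": int(rest[2]),      # process group of this process
        "session": int(rest[3]),   # session ID (pid of the session leader)
        "tty_nr": int(rest[4]),
        "tpgid": int(rest[5]),     # foreground process group of its terminal
    }


# Made-up sample resembling the hung stty; on a live system one would
# read the line from open("/proc/%d/stat" % pid) instead.
sample = "7964 (stty) T 7959 7959 6435 1027 7958 4194304"
info = parse_proc_stat(sample)
print(info["pgrp"], info["session"], info["tpgid"])
```

A process whose pgrp differs from tpgid is in a background process
group for its terminal, which is the situation the hung stty appears to
be in.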

Now, the debugging I did shows that the reason that tty_check_change()
returns an error is that the tty->pgrp != task_pgrp(current). The
former is the "su foo" process, and the latter is the bash child process.
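
The rule Alan describes, together with the comparison that fails here,
can be modelled roughly as below. This is a simplified Python sketch and
not the kernel code: tty_change_outcome and its parameters are
hypothetical names, the pgrp numbers assume that "su foo" and the child
bash each lead their own process group, and the EIO result for an
orphaned group reflects my reading of POSIX rather than anything stated
in this thread.

```python
def tty_change_outcome(task_pgrp, tty_pgrp, ignores_sigttou=False,
                       pgrp_orphaned=False):
    """Model what happens when a process tries to change tty settings.

    Returns "proceed", "EIO" or "SIGTTOU".
    """
    if task_pgrp == tty_pgrp:
        return "proceed"   # caller is in the foreground process group
    if ignores_sigttou:
        return "proceed"   # background caller, but SIGTTOU is ignored
    if pgrp_orphaned:
        return "EIO"       # nobody could resume the group, so fail instead
    return "SIGTTOU"       # stop the caller's whole process group


# Hung case: the tty's foreground pgrp is "su foo" (7958), while stty
# runs in the child bash's pgrp (7959).
print(tty_change_outcome(task_pgrp=7959, tty_pgrp=7958))
# Working case: the child bash's pgrp is the foreground pgrp.
print(tty_change_outcome(task_pgrp=7959, tty_pgrp=7959))
```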

So I guess that when it does work, they are the same process, but why
would they be the same (or not, as it were)? Does something happen
during bash startup that causes bash to become the session leader?

Please, please, someone who understands the mechanics better than I let
me know how I can explore this more deeply.

Thanks, Joe

2008-07-02 18:04:58

by Joe Peterson

Subject: tty session leader issue (was Re: 2.6.25.3: su gets stuck for root)

I have done some more investigation on this problem, and I am posting
here my results in hope that someone can point me in the right direction
for further investigation...

Summary: during the initialization of a new bash shell, the terminal
foreground process group often reverts back to that of the parent of the
bash shell (after being set *to* the bash shell pgrp by bash),
prohibiting commands like stty from being run by the init scripts. The
result is that the execution of these commands will hang until killed,
causing the bash prompt to not appear. Adding a delay in the script
(using sleep) increases the chance of this having time to happen.

For example, putting the following in a user's .bashrc:

sleep 2
stty -ixany

is a good way to reproduce this. Doing "su <user>" from root (note that
the fact that no password is required helps the timing) will then often
hang. Killing -9 stty will allow the bash prompt to appear.

I have instrumented the bash source code in an attempt to see why this
is happening, partly because I suspected a bug in bash. What I have
found is this:

1) bash calls tcsetpgrp() with the pgrp of the bash process (two times)
before starting to execute init scripts. This makes sense, since bash
needs to be the session leader. It is never called again until just
before the bash shell exits normally (at which time it returns control
to the parent).

2) During the processing of the init scripts (sometimes .bashrc, but
sometimes a system script that is processed first), calling tcgetpgrp()
shows that the pgrp has reverted back to the "su <user>" process. It
does not appear that bash reverted it in my testing so far. Running
stty while in the reverted state causes a hang, since bash is not the
session leader.
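
The same tcgetpgrp() comparison can be made without patching bash, for
example by running a small script from the init file just before the
stty. The following is a minimal sketch under that assumption;
foreground_status is a hypothetical name, and it relies on the script
running in the same process group as the shell that sources the init
file.

```python
import os


def foreground_status(tty_path="/dev/tty"):
    """Compare the caller's process group with the tty's foreground group.

    Returns None when tty_path cannot be opened or is not a terminal.
    """
    try:
        fd = os.open(tty_path, os.O_RDONLY | os.O_NOCTTY)
    except OSError:
        return None
    try:
        tty_pgrp = os.tcgetpgrp(fd)
    except OSError:
        return None            # not a terminal, or not our controlling one
    finally:
        os.close(fd)
    my_pgrp = os.getpgrp()
    return {
        "tty_pgrp": tty_pgrp,
        "my_pgrp": my_pgrp,
        "in_foreground": tty_pgrp == my_pgrp,
    }


print(foreground_status())
```

If in_foreground comes back False at that point, a following stty would
be expected to stop with SIGTTOU as described above.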

So here is the question: is there a way/reason the kernel would revert
the pgrp of the session leader after bash sets it? Is there some more
instrumenting in the kernel or in bash that might reveal what is going
on? I have heard yet another report of this happening since I added to
the thread, and I can get it to happen easily on two different machines
(a desktop and a laptop).

Thanks, Joe

2008-07-02 19:22:16

by markus reichelt

Subject: Re: tty session leader issue (was Re: 2.6.25.3: su gets stuck for root)

* Joe Peterson wrote:

> I have done some more investigation on this problem, and I am
> posting here my results in hope that someone can point me in the
> right direction for further investigation...

I cannot reproduce this with 2.6.25.9 (on Slackware 12.0)

--
left blank, right bald


Attachments:
(No filename) (306.00 B)
(No filename) (197.00 B)

2008-07-06 14:14:21

by Tim Connors

Subject: Re: tty session leader issue (was Re: 2.6.25.3: su gets stuck for root)

On Wed, 2 Jul 2008, Joe Peterson wrote:

> I have done some more investigation on this problem, and I am posting
> here my results in hope that someone can point me in the right direction
> for further investigation...
>
> Summary: during the initialization of a new bash shell, the terminal
> foreground process group often reverts back to that of the parent of the
> bash shell (after being set *to* the bash shell pgrp by bash),
> prohibiting commands like stty from being run by the init scripts. The
> result is that the execution of these commands will hang until killed,
> causing the bash prompt to not appear. Adding a delay in the script
> (using sleep) increases the chance of this having time to happen.
...
> So here is the question: is there a way/reason the kernel would revert
> the pgrp of the session leader after bash sets it? Is there some more
> instrumenting in the kernel or in bash that might reveal what is going
> on? I have heard yet another report of this happening since I added to
> the thread, and I can get it to happen easily on two different machines
> (a desktop and a laptop).

In fact, on various laptops (Eee PC, Dell Inspiron 1520, Dell Inspiron
4000), I've got various tty screwups that have been introduced since
circa 2.6.19.

The 6 year old inspiron 4000 gets stuck at stty erase ^? . Randomly, but
most of the time.

All of my machines exhibit the ctrl-C being slower than ctrl-Z discussed
elsewhere (I've almost developed a habit of typing ctrl-Z kill %1 <RET>).
Although even ctrl-Z recently has been reluctant to always work. I wonder
if this is the cause of dpkg recently not responding to ctrl-Z's? (debian
bug #486222). dpkg does respond to kill -STOP

ctrl-s doesn't always work anymore. Again, what prompted me to write this
email, was I couldn't pause dpkg. It's particularly unreliable at
stopping scrolling messages at bootup, and if I press it at the wrong time
at bootup (not a specific place - it can be starting up any number of
scripts), something deadlocks and won't resume upon a ctrl-q.
alt-sysrq-k is enough to kill whatever has deadlocked. I have a feeling,
but don't want to test on this system right now, that pressing scroll-lock
as opposed to ctrl-q once unlocked such a stuck display.

In summary, something in tty is certainly screwed. Does anyone see a
connection between all of these?

--
TimC
> cat ~/.signature
Electromagnetic pulse received (core dumped)

2008-07-06 16:45:44

by Alan Cox

Subject: Re: tty session leader issue (was Re: 2.6.25.3: su gets stuck for root)

On Mon, Jul 07, 2008 at 12:08:58AM +1000, Tim Connors wrote:
> In summary, something in tty is certainly screwed. Does anyone see a
> connection between all of these?

That they don't happen for me at all is the only one I can suggest. Most
of your comments are also not ones I've seen reported before.

Unfortunately 'works for me' doesn't tell me whether that is luck, distribution
specific, user configuration choices, gcc version, bugs in code, or whatever,
and someone who sees the ^C problem is going to have to track it down.

Alan

2008-07-06 18:49:29

by Joe Peterson

Subject: Re: tty session leader issue [cause now known!] (was Re: 2.6.25.3: su gets stuck for root)

Tim Connors wrote:
> On Wed, 2 Jul 2008, Joe Peterson wrote:
>
>> I have done some more investigation on this problem, and I am posting
>> here my results in hope that someone can point me in the right direction
>> for further investigation...
>>
>> Summary: during the initialization of a new bash shell, the terminal
>> foreground process group often reverts back to that of the parent of the
>> bash shell (after being set *to* the bash shell pgrp by bash),
>> prohibiting commands like stty from being run by the init scripts. The
>> result is that the execution of these commands will hang until killed,
>> causing the bash prompt to not appear. Adding a delay in the script
>> (using sleep) increases the chance of this having time to happen.

I have done more investigation, and I now know the cause of the
bash/stty problem. It appears to be a race condition in bash (well,
between two different bash shells, actually). I saw a post from a while
back about something similar by Ingo Molnar, so I have copied him here too.

Here is the ps tree of the test case where stty has hung:

4704 ? S 0:00 \_ xterm
4706 pts/3 Ss 0:00 | \_ -bash
4739 pts/3 S 0:00 | \_ su
4742 pts/3 S 0:00 | \_ bash
4746 pts/3 S+ 0:00 | \_ su foo
4747 pts/3 S 0:00 | \_ bash
4752 pts/3 T 0:00 | \_ stty -ixany

What should happen is: when "su foo" (4746) is run, it spawns a bash
shell (4747) that then makes itself the session leader when it
initializes its job control. The stty command (in the child bash's
.bashrc) will then be able to work (and not hang).

However, the hang happens when the parent bash (4742) interferes by
reverting the tty session leader back to its child (the "su foo"
process: 4746) shortly after the child bash (4747) becomes the leader.
The parent does this when it calls
execute_command_internal()->stop_pipeline()->give_terminal_to(). This
seems to happen at a slightly random time, making the issue intermittent
- it depends which one wins the race.

In summary, when the bug does *not* occur, here is the approximate
sequence (note I am :

1) parent bash (4742) runs 'su foo' (4746)
2) parent bash sets tty leader to 'su' (4746)
3) child bash (4747) initializes and sets itself to be the leader
4) stty command in .bashrc runs successfully

When the bug occurs, here is the sequence:

1) parent bash (4742) runs 'su foo' (4746)
2) child bash (4747) initializes and sets itself to be the leader
3) parent bash sets tty leader *back* to 'su' (4746)
4) stty command runs and fails/hangs because its parent is not leader

The various calls to tcsetpgrp() that do this are interleaved from the
two bash processes, and sometimes the parent does it slightly *after*
the child bash initializes job control - that's when the problem happens.
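
The two orderings can be replayed with a toy model. This is only a
sketch of the race as described above, not code from bash or the
kernel: run_sequence and the event names are hypothetical, and the
process group numbers assume that "su foo" and the child bash each lead
a group equal to their pid.

```python
SU_FOO_PGRP = 4746      # group led by "su foo" (assumed equal to its pid)
CHILD_BASH_PGRP = 4747  # group led by the child bash (assumed likewise)


def run_sequence(events):
    """Apply tcsetpgrp-like events in order and report how stty fares."""
    foreground = None
    for actor in events:
        if actor == "parent":
            foreground = SU_FOO_PGRP       # parent bash gives the tty to its job
        elif actor == "child":
            foreground = CHILD_BASH_PGRP   # child bash claims the tty itself
        elif actor == "stty":
            # stty runs in the child bash's process group
            if foreground == CHILD_BASH_PGRP:
                return "ok"
            return "stopped (SIGTTOU)"
    return "stty never ran"


# Parent wins the race: its tcsetpgrp() lands before the child's.
print(run_sequence(["parent", "child", "stty"]))
# Child wins the race: the parent then takes the terminal back.
print(run_sequence(["child", "parent", "stty"]))
```

The only difference between the two runs is which tcsetpgrp() call
lands last before stty executes, which matches the intermittent nature
of the hang.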

I have not looked further to find a solution (but it's a great start to
know the cause...!). Any further help is welcome.

> The 6 year old inspiron 4000 gets stuck at stty erase ^? . Randomly, but
> most of the time.
>
> All of my machines exhibit the ctrl-C being slower than ctrl-Z discussed
> elsewhere (I've almost developed a habit of typing ctrl-Z kill %1 <RET>).
> Although even ctrl-Z recently has been reluctant to always work. I wonder
> if this is the cause of dpkg recently not responding to ctrl-Z's? (debian
> bug #486222). dpkg does respond to kill -STOP

I doubt that this is related. See the following thread for more info on
this:

http://marc.info/?l=linux-kernel&m=121528829718840&w=2

> ctrl-s doesn't always work anymore. Again, what prompted me to write this
> email, was I couldn't pause dpkg. It's particularly unreliable at
> stopping scrolling messages at bootup, and if I press it at the wrong time
> at bootup (not a specific place - it can be starting up any number of
> scripts), something deadlocks and won't resume upon a ctrl-q.
> alt-sysrq-k is enough to kill whatever has deadlocked. I have a feeling,
> but don't want to test on this system right now, that pressing scroll-lock
> as opposed to ctrl-q once unlocked such a stuck display.

Hmm, not sure; I have not seen that behavior.

> In summary, something in tty is certainly screwed. Does anyone see a
> connection between all of these?

I doubt there is a connection between the bash issue and what you are
seeing with ctrl-C/ctrl-S, etc.

-Joe