2009-09-17 20:01:37

by Peter Volkov

Subject: 2.6.31 regression: system hang after pptp connection established

Hi.

After a pptp connection is established, my 2.6.31 system freezes, while
2.6.30 works as expected. Bisecting gave me the following result:

commit ac89a9174decf343de049a06fad75681f71890eb
Author: Linus Torvalds <[email protected]>
Date: Sat Sep 5 13:27:10 2009 -0700

pty: don't limit the writes to 'pty_space()' inside 'pty_write()'

and it looks like reverting this patch from 2.6.31 fixes the problem.


Other observations: it's hard to say exactly when the system hangs. It
does not hang immediately; sometimes it happens when I try to send some
traffic, sometimes when I switch the ppp connection off. The hang is not
complete: the mouse cursor keeps moving in Xorg, but clicks get no
response and I'm unable to start new programs. In open xconsoles it's
possible to type something, but after I press Enter those consoles hang
too. There is also no way out of X (the Ctrl+Alt+Fn combo does not work),
and ssh sessions connected to this computer hang too (again, it's possible
to type ls there, but after that it hangs). New ssh connections are
impossible; they time out.

Hoping to get an oops, I started netconsole, but no new output appeared
when the system hung. I managed to gather some information with SysRq
(gzipped in the attachment), but I'm not sure how useful it is.

I've tried establishing the pptp connection over both wireless and wired
links and the system hung with both, so it looks like the network drivers
are not the cause here. BTW, I'm using NetworkManager to establish the
connection.

The gzipped kernel config is attached.

Is this problem known? Is anybody else experiencing the same problem?
Do you have a fix? :)

--
Peter.


Attachments:
sysrq.txt.gz (9.89 kB)
config.gz (14.42 kB)

2009-09-17 20:42:09

by Linus Torvalds

Subject: Re: 2.6.31 regression: system hang after pptp connection established



On Thu, 17 Sep 2009, Peter Volkov wrote:
>
> After a pptp connection is established, my 2.6.31 system freezes, while
> 2.6.30 works as expected. Bisecting gave me the following result:
>
> commit ac89a9174decf343de049a06fad75681f71890eb
> Author: Linus Torvalds <[email protected]>
> Date: Sat Sep 5 13:27:10 2009 -0700
>
> pty: don't limit the writes to 'pty_space()' inside 'pty_write()'
>
> and it looks like reverting this patch from 2.6.31 fixes the problem.

Hmm. The only thing it should cause is that pty_write() will effectively
allow a larger buffer for writes (limited to ~64kB rather than 8kB).

But considering how fragile ppp has been, I guess I shouldn't be surprised
that this can cause a hang in itself.

> Hoping to get an oops, I started netconsole, but no new output appeared
> when the system hung. I managed to gather some information with SysRq
> (gzipped in the attachment), but I'm not sure how useful it is.

It's interesting, but I don't know how _useful_ it is.

What's interesting about it is that it shows a problem, but the problem it
shows would seem to have nothing at all to do with ppp or networking or
pty's. The problem seems to be processes stuck in disk-wait:

events/0 D ffff88007d0c7b50 0 7 2 0x00000000
kacpi_notify D ffff88007d2ffbe8 0 170 2 0x00000000
khubd D ffff88007d211ae8 0 260 2 0x00000000
pdflush D ffff88007d26bd40 0 326 2 0x00000000
kjournald D ffff88007b2f3df8 0 3361 2 0x00000000
kjournald D ffff88007c65ddf8 0 3362 2 0x00000000
reiserfs/0 D [<ffffffff810725f7>] ? delayacct_end+0x81/0x8c

which explains your symptoms - hung X (with just cursor moving) and ssh's
hanging.

It's just that while it all explains your symptoms, none of the above
should have anything what-so-ever to do with pty's!

pdflush, for example, seems to be stuck waiting for &jl->j_commit_mutex in
reiserfs. Odd. It really looks like you have something stuck waiting for
IO.

But your CPU 1 backtrace looks relevant, and seems hung on a spinlock in
tty_buffer_request_room() and has that pty_write() thing there. I'm not
seeing why the 'D' states above happen, though.

> I've tried establishing the pptp connection over both wireless and wired
> links and the system hung with both, so it looks like the network drivers
> are not the cause here. BTW, I'm using NetworkManager to establish the
> connection.
>
> The gzipped kernel config is attached.
>
> Is this problem known? Is anybody else experiencing the same problem?
> Do you have a fix? :)

Not a known problem, but it's entirely possible that there is some bug in
the "tty buffer out of memory" handling - that nobody has ever seen
because in practice everybody always hit other limits first.

Let me look at it a bit, and see if I can come up with test patches for
you.

Linus

2009-09-17 21:13:12

by Linus Torvalds

Subject: Re: 2.6.31 regression: system hang after pptp connection established



On Thu, 17 Sep 2009, Linus Torvalds wrote:
>
> What's interesting about it is that it shows a problem, but the problem it
> shows would seem to have nothing at all to do with ppp or networking or
> pty's. The problem seems to be processes stuck in disk-wait:

Ahh. I think I see what may be going on.

Somebody got a filesystem mutex, and then went to sleep due to IO. Then
pptp comes in, and seems to be stuck in a loop in kernel space, and
it seems to be stuck with preemption off.

So one CPU is stuck, and the thing that we want to run is on the same
run-queue, and not preempting. And looking at your CPU#1 trace, it's likely
looping in ppp_async_push().

And that whole loop is insane (and very prone to infinite loops), but it
also depends on that tty wakeup() thing.

Does this patch make a difference? Make sure to _not_ try to do the whole
wakeup thing if we couldn't actually insert anything into the tty buffers.

Linus
---
drivers/char/pty.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/char/pty.c b/drivers/char/pty.c
index b33d668..53761ce 100644
--- a/drivers/char/pty.c
+++ b/drivers/char/pty.c
@@ -120,8 +120,10 @@ static int pty_write(struct tty_struct *tty, const unsigned char *buf, int c)
 		/* Stuff the data into the input queue of the other end */
 		c = tty_insert_flip_string(to, buf, c);
 		/* And shovel */
-		tty_flip_buffer_push(to);
-		tty_wakeup(tty);
+		if (c) {
+			tty_flip_buffer_push(to);
+			tty_wakeup(tty);
+		}
 	}
 	return c;
 }

2009-09-18 11:22:42

by Peter Volkov

Subject: Re: 2.6.31 regression: system hang after pptp connection established

The patch fixes the problem here. Thank you very much.

--
Peter.

On Thu, 17/09/2009 at 14:12 -0700, Linus Torvalds wrote:
>
> On Thu, 17 Sep 2009, Linus Torvalds wrote:
> >
> > What's interesting about it is that it shows a problem, but the problem it
> > shows would seem to have nothing at all to do with ppp or networking or
> > pty's. The problem seems to be processes stuck in disk-wait:
>
> Ahh. I think I see what may be going on.
>
> Somebody got a filesystem mutex, and then went to sleep due to IO. Then
> pptp comes in, and seems to be stuck in a loop in kernel space, and
> it seems to be stuck with preemption off.
>
> So one CPU is stuck, and the thing that we want to run is on the same
> run-queue, and not preempting. And looking at your CPU#1 trace, it's likely
> looping in ppp_async_push().
>
> And that whole loop is insane (and very prone to infinite loops), but it
> also depends on that tty wakeup() thing.
>
> Does this patch make a difference? Make sure to _not_ try to do the whole
> wakeup thing if we couldn't actually insert anything into the tty buffers.
>
> Linus
> ---
> drivers/char/pty.c | 6 ++++--
> 1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/char/pty.c b/drivers/char/pty.c
> index b33d668..53761ce 100644
> --- a/drivers/char/pty.c
> +++ b/drivers/char/pty.c
> @@ -120,8 +120,10 @@ static int pty_write(struct tty_struct *tty, const unsigned char *buf, int c)
> /* Stuff the data into the input queue of the other end */
> c = tty_insert_flip_string(to, buf, c);
> /* And shovel */
> - tty_flip_buffer_push(to);
> - tty_wakeup(tty);
> + if (c) {
> + tty_flip_buffer_push(to);
> + tty_wakeup(tty);
> + }
> }
> return c;
> }

2009-09-18 14:17:06

by Linus Torvalds

Subject: Re: 2.6.31 regression: system hang after pptp connection established


On Fri, 18 Sep 2009, Peter Volkov wrote:
>
> The patch fixes the problem here. Thank you very much.

Hey, thank _you_ for the sysrq output, that made it quite debuggable.

Committed as 202c4675c, and I cc'd stable.

Linus

2009-09-18 16:07:49

by Andrey Rahmatullin

Subject: Re: 2.6.31 regression: system hang after pptp connection established

On Thu, Sep 17, 2009 at 11:59:13PM +0400, Peter Volkov wrote:
> Is this problem known?
Yes, it's described at http://bugzilla.kernel.org/show_bug.cgi?id=14179
since Tuesday.

On Fri, Sep 18, 2009 at 07:16:39AM -0700, Linus Torvalds wrote:
> > The patch fixes the problem here. Thank you very much.
> Hey, thank _you_ for the sysrq output, that made it quite debuggable.
> Committed as 202c4675c, and I cc'd stable.
Thanks for the fix, but should I send bug reports directly here next time
instead of filing a bug in bugzilla.kernel.org and waiting for a response
that will never come?

--
WBR, wRAR (ALT Linux Team)



2009-09-18 15:49:09

by Linus Torvalds

Subject: Re: 2.6.31 regression: system hang after pptp connection established



On Fri, 18 Sep 2009, Andrey Rahmatullin wrote:
>
> On Thu, Sep 17, 2009 at 11:59:13PM +0400, Peter Volkov wrote:
> > Is this problem known?
> Yes, it's described at http://bugzilla.kernel.org/show_bug.cgi?id=14179
> since Tuesday.
>
> On Fri, Sep 18, 2009 at 07:16:39AM -0700, Linus Torvalds wrote:
> > > The patch fixes the problem here. Thank you very much.
> > Hey, thank _you_ for the sysrq output, that made it quite debuggable.
> > Committed as 202c4675c, and I cc'd stable.
> Thanks for the fix, but should I send bug reports directly here next time
> instead of filing a bug in bugzilla.kernel.org and waiting for a response
> that will never come?

Bugzilla is great, but you should _also_ target the maintainers directly
and let them know. And especially if you have bisected things, always cc
everybody that is listed in the commit.

Otherwise, what happens is that other people not directly involved will
eventually look at the regression list, and see it - but that generally
happens much later. So things will get fixed from just the bugzilla report
too, but you'll have a much longer latency than required.

In fact, if you can bisect it to a single commit (especially a small one
like this), then bugzilla is the secondary, rather than the primary place.
Bugzilla is great for keeping track of things and trying to avoid losing
reports, but that comes at the expense of not being very convenient for
short-term stuff.

So if you have a very targeted bug report, and know who to send a report
to, try the direct route first. Then, if nothing happens immediately, open
a bugzilla (or open the bugzilla immediately, just in case, but see it
as a "fallback" thing).

Linus