2011-06-02 07:19:04

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: tty vs workqueue oddities

Hi Alan !

Current upstream (but that's been around for at least 2 or 3 days) seems
to have a strange behaviour on one of my powerbooks. Something like
"dmesg" or "cat" of a large file in an X terminal "hangs" the machine
litterally for minutes. It generally recovers, so not always.

Network is unresponsive as well.

My attempts at stopping it into xmon always landed in process_one_work()
or flush_to_ldisc() from what I can tell, and a simple ftrace run shows
something that looks like an -enormous- lot of:

kworker/0:1-258 [000] 412.105871: flush_to_ldisc <-process_one_work
kworker/0:1-258 [000] 412.105871: tty_ldisc_ref <-flush_to_ldisc
kworker/0:1-258 [000] 412.105872: n_tty_receive_buf <-flush_to_ldisc
kworker/0:1-258 [000] 412.105872: kill_fasync <-n_tty_receive_buf
kworker/0:1-258 [000] 412.105873: __wake_up <-n_tty_receive_buf
kworker/0:1-258 [000] 412.105873: __wake_up_common <-__wake_up
kworker/0:1-258 [000] 412.105874: default_wake_function <-__wake_up_common
kworker/0:1-258 [000] 412.105874: try_to_wake_up <-default_wake_function
kworker/0:1-258 [000] 412.105874: tty_throttle <-n_tty_receive_buf
kworker/0:1-258 [000] 412.105875: mutex_lock <-tty_throttle
kworker/0:1-258 [000] 412.105875: mutex_unlock <-tty_throttle
kworker/0:1-258 [000] 412.105876: schedule_work <-flush_to_ldisc
kworker/0:1-258 [000] 412.105876: queue_work <-schedule_work
kworker/0:1-258 [000] 412.105877: queue_work_on <-queue_work
kworker/0:1-258 [000] 412.105877: __queue_work <-queue_work_on
kworker/0:1-258 [000] 412.105878: insert_work <-__queue_work
kworker/0:1-258 [000] 412.105878: tty_ldisc_deref <-flush_to_ldisc
kworker/0:1-258 [000] 412.105879: put_ldisc <-tty_ldisc_deref
kworker/0:1-258 [000] 412.105879: __wake_up <-put_ldisc
kworker/0:1-258 [000] 412.105880: __wake_up_common <-__wake_up
kworker/0:1-258 [000] 412.105880: cwq_dec_nr_in_flight <-process_one_work
kworker/0:1-258 [000] 412.105880: process_one_work <-worker_thread

and repeat that sequence more/less identical ad nauseum

Sometimes it breaks out and makes progress, usually after a few mn.

2.6.39 is fine. I'm going to attempt a bisection but it's a bit slow on
those machines and I'm running out of time today, so I wanted to shoot
that to you in case it rings a bell.

Cheers,
Ben.


2011-06-02 08:37:34

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: tty breakage in X (Was: tty vs workqueue oddities)

On Thu, 2011-06-02 at 17:17 +1000, Benjamin Herrenschmidt wrote:
> Hi Alan !

Hrm... looks like Alan is innocent ... interesting tho, the culprit
patch looks like something he (or somebody known to understand the tty
code :-) should have reviewed.

So I bisected the problem down to

Commit: b1c43f82c5aa265442f82dba31ce985ebb7aa71c
Author: Felipe Balbi <[email protected]> 2011-03-21 21:25:08
Committer: Greg Kroah-Hartman <[email protected]> 2011-04-23 10:31:53

tty: make receive_buf() return the amout of bytes received

it makes it simpler to keep track of the amount of
bytes received and simplifies how flush_to_ldisc counts
the remaining bytes. It also fixes a bug of lost bytes
on n_tty when flushing too many bytes via the USB
serial gadget driver.

Tested-by: Stefan Bigler <[email protected]>
Tested-by: Toby Gray <[email protected]>
Signed-off-by: Felipe Balbi <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

It looks like the patch is causing some major malfunctions of the X
server for me, possibly related to PTYs. For example, cat'ing a large
file in a gnome terminal hangs the kernel for -minutes- in a loop of
what looks like flush_to_ldisc/workqueue code, (some ftrace data in the
quoted bits further down).

It's pretty gross and it doesn't look powerpc related in any ways (tho I
haven't had a chance to test on an x86 box), on the other hand I'm
surprised nobody else complained :-)

Should it just be reverted ? Is there a fix ?

Hand-reverting it on top of upstream (with some bluetooth manual fixups)
fixes the problems for me, X is back to normal.

Cheers,
Ben.

> Current upstream (but that's been around for at least 2 or 3 days) seems
> to have a strange behaviour on one of my powerbooks. Something like
> "dmesg" or "cat" of a large file in an X terminal "hangs" the machine
> litterally for minutes. It generally recovers, so not always.
>
> Network is unresponsive as well.
>
> My attempts at stopping it into xmon always landed in process_one_work()
> or flush_to_ldisc() from what I can tell, and a simple ftrace run shows
> something that looks like an -enormous- lot of:
>
> kworker/0:1-258 [000] 412.105871: flush_to_ldisc <-process_one_work
> kworker/0:1-258 [000] 412.105871: tty_ldisc_ref <-flush_to_ldisc
> kworker/0:1-258 [000] 412.105872: n_tty_receive_buf <-flush_to_ldisc
> kworker/0:1-258 [000] 412.105872: kill_fasync <-n_tty_receive_buf
> kworker/0:1-258 [000] 412.105873: __wake_up <-n_tty_receive_buf
> kworker/0:1-258 [000] 412.105873: __wake_up_common <-__wake_up
> kworker/0:1-258 [000] 412.105874: default_wake_function <-__wake_up_common
> kworker/0:1-258 [000] 412.105874: try_to_wake_up <-default_wake_function
> kworker/0:1-258 [000] 412.105874: tty_throttle <-n_tty_receive_buf
> kworker/0:1-258 [000] 412.105875: mutex_lock <-tty_throttle
> kworker/0:1-258 [000] 412.105875: mutex_unlock <-tty_throttle
> kworker/0:1-258 [000] 412.105876: schedule_work <-flush_to_ldisc
> kworker/0:1-258 [000] 412.105876: queue_work <-schedule_work
> kworker/0:1-258 [000] 412.105877: queue_work_on <-queue_work
> kworker/0:1-258 [000] 412.105877: __queue_work <-queue_work_on
> kworker/0:1-258 [000] 412.105878: insert_work <-__queue_work
> kworker/0:1-258 [000] 412.105878: tty_ldisc_deref <-flush_to_ldisc
> kworker/0:1-258 [000] 412.105879: put_ldisc <-tty_ldisc_deref
> kworker/0:1-258 [000] 412.105879: __wake_up <-put_ldisc
> kworker/0:1-258 [000] 412.105880: __wake_up_common <-__wake_up
> kworker/0:1-258 [000] 412.105880: cwq_dec_nr_in_flight <-process_one_work
> kworker/0:1-258 [000] 412.105880: process_one_work <-worker_thread
>
> and repeat that sequence more/less identical ad nauseum
>
> Sometimes it breaks out and makes progress, usually after a few mn.
>
> 2.6.39 is fine. I'm going to attempt a bisection but it's a bit slow on
> those machines and I'm running out of time today, so I wanted to shoot
> that to you in case it rings a bell.
>
> Cheers,
> Ben.
>

2011-06-02 09:30:33

by Andreas Schwab

[permalink] [raw]
Subject: Re: tty breakage in X (Was: tty vs workqueue oddities)

Benjamin Herrenschmidt <[email protected]> writes:

> It looks like the patch is causing some major malfunctions of the X
> server for me, possibly related to PTYs. For example, cat'ing a large
> file in a gnome terminal hangs the kernel for -minutes- in a loop of
> what looks like flush_to_ldisc/workqueue code, (some ftrace data in the
> quoted bits further down).

Btw., do you already have d72bce0 ("rcu: Cure load woes")? Although
that is about an independent bug, it appears to amplify this issue a
lot.

Andreas.

--
Andreas Schwab, [email protected]
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2011-06-02 10:02:07

by Alan

[permalink] [raw]
Subject: Re: tty vs workqueue oddities

On Thu, 02 Jun 2011 17:17:25 +1000
Benjamin Herrenschmidt <[email protected]> wrote:

> Hi Alan !
>
> Current upstream (but that's been around for at least 2 or 3 days) seems
> to have a strange behaviour on one of my powerbooks. Something like
> "dmesg" or "cat" of a large file in an X terminal "hangs" the machine
> litterally for minutes. It generally recovers, so not always.
>
> Network is unresponsive as well.
>
> My attempts at stopping it into xmon always landed in process_one_work()
> or flush_to_ldisc() from what I can tell, and a simple ftrace run shows
> something that looks like an -enormous- lot of:
>
> kworker/0:1-258 [000] 412.105871: flush_to_ldisc <-process_one_work
> kworker/0:1-258 [000] 412.105871: tty_ldisc_ref <-flush_to_ldisc
> kworker/0:1-258 [000] 412.105872: n_tty_receive_buf <-flush_to_ldisc
> kworker/0:1-258 [000] 412.105872: kill_fasync <-n_tty_receive_buf
> kworker/0:1-258 [000] 412.105873: __wake_up <-n_tty_receive_buf
> kworker/0:1-258 [000] 412.105873: __wake_up_common <-__wake_up
> kworker/0:1-258 [000] 412.105874: default_wake_function <-__wake_up_common
> kworker/0:1-258 [000] 412.105874: try_to_wake_up <-default_wake_function
> kworker/0:1-258 [000] 412.105874: tty_throttle <-n_tty_receive_buf
> kworker/0:1-258 [000] 412.105875: mutex_lock <-tty_throttle
> kworker/0:1-258 [000] 412.105875: mutex_unlock <-tty_throttle
> kworker/0:1-258 [000] 412.105876: schedule_work <-flush_to_ldisc
> kworker/0:1-258 [000] 412.105876: queue_work <-schedule_work
> kworker/0:1-258 [000] 412.105877: queue_work_on <-queue_work
> kworker/0:1-258 [000] 412.105877: __queue_work <-queue_work_on
> kworker/0:1-258 [000] 412.105878: insert_work <-__queue_work
> kworker/0:1-258 [000] 412.105878: tty_ldisc_deref <-flush_to_ldisc
> kworker/0:1-258 [000] 412.105879: put_ldisc <-tty_ldisc_"Yedvab, Nadav" <[email protected]>

deref
> kworker/0:1-258 [000] 412.105879: __wake_up <-put_ldisc
> kworker/0:1-258 [000] 412.105880: __wake_up_common <-__wake_up
> kworker/0:1-258 [000] 412.105880: cwq_dec_nr_in_flight <-process_one_work
> kworker/0:1-258 [000] 412.105880: process_one_work <-worker_thread
>
> and repeat that sequence more/less identical ad nauseum
>
> Sometimes it breaks out and makes progress, usually after a few mn.
>
> 2.6.39 is fine. I'm going to attempt a bisection but it's a bit slow on
> those machines and I'm running out of time today, so I wanted to shoot
> that to you in case it rings a bell.

Possibly

b1c43f82c5aa265442f82dba31ce985ebb7aa71c

other suspect would be

a5660b41af6a28f8004e70eb261e1202ad55c5e3

but 2.6.39 working suggests its not

2011-06-02 10:05:46

by Alan

[permalink] [raw]
Subject: Re: tty breakage in X (Was: tty vs workqueue oddities)

On Thu, 02 Jun 2011 18:37:01 +1000
Benjamin Herrenschmidt <[email protected]> wrote:

> On Thu, 2011-06-02 at 17:17 +1000, Benjamin Herrenschmidt wrote:
> > Hi Alan !
>
> Hrm... looks like Alan is innocent ... interesting tho, the culprit
> patch looks like something he (or somebody known to understand the tty
> code :-) should have reviewed.

I did review it, and ran it and it worked beautifully on my system 8)

I do wonder if it has an interaction with Linus earlier changes to queue
flushing as I've never tested both together.

2011-06-03 00:56:41

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: tty breakage in X (Was: tty vs workqueue oddities)

On Thu, 2011-06-02 at 11:07 +0100, Alan Cox wrote:
> On Thu, 02 Jun 2011 18:37:01 +1000
> Benjamin Herrenschmidt <[email protected]> wrote:
>
> > On Thu, 2011-06-02 at 17:17 +1000, Benjamin Herrenschmidt wrote:
> > > Hi Alan !
> >
> > Hrm... looks like Alan is innocent ... interesting tho, the culprit
> > patch looks like something he (or somebody known to understand the tty
> > code :-) should have reviewed.
>
> I did review it, and ran it and it worked beautifully on my system 8)
>
> I do wonder if it has an interaction with Linus earlier changes to queue
> flushing as I've never tested both together.

I just noticed it doesn't happen (or if it does, it recovers fast enough
to not be noticable) on an SMP machine (dual G5). However, if I boot the
same machine with maxcpus=1, the problem is back. A simple "dmesg" in
gnome terminal shows it.

However, on that much faster machine, it also recovers a lot faster. On
the powerbook, it hangs a few minutes, on the G5 it hangs a few seconds.

I don't have the bandwidth to dive into the workqueue/tty before this
week-end, I'll give it a shot next week if nobody beats me to it.

Cheers,
Ben.

2011-06-03 06:25:13

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: tty breakage in X (Was: tty vs workqueue oddities)

On Fri, 2011-06-03 at 10:56 +1000, Benjamin Herrenschmidt wrote:

> I just noticed it doesn't happen (or if it does, it recovers fast enough
> to not be noticable) on an SMP machine (dual G5). However, if I boot the
> same machine with maxcpus=1, the problem is back. A simple "dmesg" in
> gnome terminal shows it.
>
> However, on that much faster machine, it also recovers a lot faster. On
> the powerbook, it hangs a few minutes, on the G5 it hangs a few seconds.
>
> I don't have the bandwidth to dive into the workqueue/tty before this
> week-end, I'll give it a shot next week if nobody beats me to it.

Some more data: It -looks- like what happens is that the flush_to_ldisc
work queue entry constantly re-queues itself (because the PTY is full ?)
and the workqueue thread will basically loop forver calling it without
ever scheduling, thus starving the consumer process that could have
emptied the PTY.

At least that's a semi half-assed theory. If I add a schedule() to
process_one_work() after dropping the lock, the problem disappears.

So there's a combination of things here that are quite interesting:

- A lot of work queued for the kworker will essentially go on without
scheduling for as long as it takes to empty all work items. That doesn't
sound very nice latency-wise. At least on a non-PREEMPT kernel.

- flush_to_ldisc seems to be nasty and requeues itself over and over
again from what I can tell, when it can't push the data out, in this
case, I suspect because the PTY is full but I don't know for sure yet.

Cheers,
Ben.

2011-06-03 06:56:54

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: tty breakage in X (Was: tty vs workqueue oddities)

On Fri, 2011-06-03 at 16:17 +1000, Benjamin Herrenschmidt wrote:

> Some more data: It -looks- like what happens is that the flush_to_ldisc
> work queue entry constantly re-queues itself (because the PTY is full ?)
> and the workqueue thread will basically loop forver calling it without
> ever scheduling, thus starving the consumer process that could have
> emptied the PTY.
>
> At least that's a semi half-assed theory. If I add a schedule() to
> process_one_work() after dropping the lock, the problem disappears.
>
> So there's a combination of things here that are quite interesting:
>
> - A lot of work queued for the kworker will essentially go on without
> scheduling for as long as it takes to empty all work items. That doesn't
> sound very nice latency-wise. At least on a non-PREEMPT kernel.
>
> - flush_to_ldisc seems to be nasty and requeues itself over and over
> again from what I can tell, when it can't push the data out, in this
> case, I suspect because the PTY is full but I don't know for sure yet.

Interesting results from x86. I could not initially reproduce there at
all on my little Atom board (the one from kernel summit last year).

Eventually I looked at the kernel config, switched off PREEMPT_VOLUNTARY
and I can now reproduce on x86 too. Again, if you have both threads/core
running, the problem isn't as visible (you do see "hickups" when cat'ing
a large file, the atom is slow enough I suppose).

But offline a cpu (leave only one up) and cat a large file (dmesg is
enough for me to trigger it) and you see the hangs.

So I think my theory stands that flush_to_ldisc constantly reschedule
itself causing the worker thread to eat all CPU and starve the consumer
of the PTY. I won't have time to dig much deeper today nor probably this
week-end so I'm sending this email for others who want to look.

Cheers,
Ben.

2011-06-03 09:37:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: tty breakage in X (Was: tty vs workqueue oddities)

On Fri, Jun 3, 2011 at 3:56 PM, Benjamin Herrenschmidt
<[email protected]> wrote:
>
> So I think my theory stands that flush_to_ldisc constantly reschedule
> itself causing the worker thread to eat all CPU and starve the consumer
> of the PTY. I won't have time to dig much deeper today nor probably this
> week-end so I'm sending this email for others who want to look.

Gaah - that commit is total and utter crap.

I *removed* the whole "reschedule the work queue again" crap. It was
totally wrong. See commit a5660b41af6a.

But then commit b1c43f82c5aa added the crap right back in.

So does it all work if you just remove the broken schedule_work() line
from flush_to_ldisc?

Otherwise we'll need to just revert it. The commit is clearly totally broken.

Linus

2011-06-03 10:23:37

by Milton Miller

[permalink] [raw]
Subject: Re: tty breakage in X (Was: tty vs workqueue oddities)

[not on list so made up cc list]

On Fri Jun 03 2011 about 05:37:52 EST, Linus wrote:
> On Fri, Jun 3, 2011 at 3:56 PM, Benjamin Herrenschmidt wrote:
> > So I think my theory stands that flush_to_ldisc constantly reschedule
> > itself causing the worker thread to eat all CPU and starve the consumer
> > of the PTY. I won't have time to dig much deeper today nor probably this
> > week-end so I'm sending this email for others who want to look.
>
> Gaah - that commit is total and utter crap.
>
> I *removed* the whole "reschedule the work queue again" crap. It was
> totally wrong. See commit a5660b41af6a.
>
> But then commit b1c43f82c5aa added the crap right back in.
>
> So does it all work if you just remove the broken schedule_work() line
> from flush_to_ldisc?
>
> Otherwise we'll need to just revert it. The commit is clearly totally broken.

Its further broken in that the recieve_buf functions return -ERRORS
but the callers don't check for negative values.

milton

2011-06-05 14:37:42

by Guillaume Chazarain

[permalink] [raw]
Subject: Re: tty breakage in X (Was: tty vs workqueue oddities)

On Fri, Jun 3, 2011 at 11:36 AM, Linus Torvalds
<[email protected]> wrote:
> Otherwise we'll need to just revert it. The commit is clearly totally broken.

Even after the revert in http://git.kernel.org/linus/55db4c6 I can
reproduce some pty oddities using the test program in
http://lkml.org/lkml/2005/2/14/212.

Instead of having the output being different from the input, the child
process hangs reading while the parent is done writing.

The hangs should be easily seen with:

$ while :; do date; yes | head -n 5000 | ./a.out > /dev/null; done

--
Guillaume

2011-06-06 14:24:18

by Guillaume Chazarain

[permalink] [raw]
Subject: Re: tty breakage in X (Was: tty vs workqueue oddities)

On Sun, Jun 5, 2011 at 4:37 PM, Guillaume Chazarain <[email protected]> wrote:
> $ while :; do date; yes | head -n 5000 | ./a.out > /dev/null; done

After reverting http://git.kernel.org/linus/a5660b4 "tty: fix endless
work loop when the buffer fills up" I cannot reproduce the hangs on
SMP anymore but it brings back the busy loop on UP.


--
Guillaume

2011-06-08 02:52:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: tty breakage in X (Was: tty vs workqueue oddities)

On Mon, Jun 6, 2011 at 7:24 AM, Guillaume Chazarain <[email protected]> wrote:
>
> After reverting http://git.kernel.org/linus/a5660b4 "tty: fix endless
> work loop when the buffer fills up" I cannot reproduce the hangs on
> SMP anymore but it brings back the busy loop on UP.

Hmm. The n_tty layer has some rather distressing locking, and doesn't
lock "tty->receive_room" changes at all, for example (and uses
multiple locks for some other things).

It may well be that there is some SMP race there. The n_tty line
discipline has its own locking for its counts, and the tty buffer code
has its own locking, and "receive_room" kind o fends up being in the
middle between them.

The sad part is that the patch that made receive_buf() return the
amout of bytes received was actually trying to do the right thing, it
just did it entirely in the wrong way (re-introducing the crazy
re-arming of the workqueue from within itself, and using all the wrong
sign issues).

I'd love to get rid of receive_room entirely - and just letting the
tty line discipline handler say how much it actually received. in
other words, having receive_buf() just tell us how much it used, and
not looking at receive_room in the caller is absolutely the right
thing.

It just needs to be done properly.

Linus

2011-06-08 03:31:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: tty breakage in X (Was: tty vs workqueue oddities)

On Tue, Jun 7, 2011 at 7:44 PM, Linus Torvalds
<[email protected]> wrote:
>
> Hmm. The n_tty layer has some rather distressing locking, and doesn't
> lock "tty->receive_room" changes at all, for example (and uses
> multiple locks for some other things).
>
> It may well be that there is some SMP race there.

Actually, I think it's simpler than that.

Does this patch fix things for you? It just removes the "stop if
you've seen the tail, but somebody added a new buffer in the meantime"
logic.

We might want to keep the "re-arm the work" for just that case, but
let's see what happens if we just remove the logic entirely.

Linus


Attachments:
patch.diff (1.46 kB)

2011-06-08 08:28:36

by Felipe Balbi

[permalink] [raw]
Subject: Re: tty breakage in X (Was: tty vs workqueue oddities)

Hi,

On Tue, Jun 07, 2011 at 07:44:48PM -0700, Linus Torvalds wrote:
> On Mon, Jun 6, 2011 at 7:24 AM, Guillaume Chazarain <[email protected]> wrote:
> >
> > After reverting http://git.kernel.org/linus/a5660b4 "tty: fix endless
> > work loop when the buffer fills up" I cannot reproduce the hangs on
> > SMP anymore but it brings back the busy loop on UP.
>
> Hmm. The n_tty layer has some rather distressing locking, and doesn't
> lock "tty->receive_room" changes at all, for example (and uses
> multiple locks for some other things).
>
> It may well be that there is some SMP race there. The n_tty line
> discipline has its own locking for its counts, and the tty buffer code
> has its own locking, and "receive_room" kind o fends up being in the
> middle between them.
>
> The sad part is that the patch that made receive_buf() return the
> amout of bytes received was actually trying to do the right thing, it
> just did it entirely in the wrong way (re-introducing the crazy
> re-arming of the workqueue from within itself, and using all the wrong
> sign issues).

yeah sorry about that. I originally wrote that patch over one year ago
and had to send it three times until it was finally noticed. Then I had
to do a quick rebase and things went pretty bad.

Should've checked better and forget about loosing another merge window.

> I'd love to get rid of receive_room entirely - and just letting the
> tty line discipline handler say how much it actually received. in
> other words, having receive_buf() just tell us how much it used, and
> not looking at receive_room in the caller is absolutely the right
> thing.

that was the idea, unfortunately I missed the last part. Sorry about
that.

--
balbi


Attachments:
(No filename) (1.67 kB)
signature.asc (490.00 B)
Digital signature
Download all attachments

2011-06-08 08:31:58

by Guillaume Chazarain

[permalink] [raw]
Subject: Re: tty breakage in X (Was: tty vs workqueue oddities)

On Wed, Jun 8, 2011 at 5:31 AM, Linus Torvalds
<[email protected]> wrote:
> Does this patch fix things for you?

Works perfectly on SMP and UP, thanks!

--
Guillaume

2011-06-08 09:10:21

by Alan

[permalink] [raw]
Subject: Re: tty breakage in X (Was: tty vs workqueue oddities)

> I'd love to get rid of receive_room entirely - and just letting the
> tty line discipline handler say how much it actually received. in
> other words, having receive_buf() just tell us how much it used, and
> not looking at receive_room in the caller is absolutely the right
> thing.

Shouldn't be too hard - the only use of receive_room beyond setting it to
'fire at will' is n_tty.