2009-11-21 22:23:21

by Robert Swan

[permalink] [raw]
Subject: [bisected] pty performance problem

I posted this to the kernel-newbies list, but have graduated to the
adults' forum:

! Two C programs are having a query-response conversation through a
! pseudo terminal:
!
! A (client) -- forever { send query; read response }
! B (server) -- forever { read query; send response }
!
! Neither has any I/O apart from the pty conversation, so I'd expect to
! see CPU usage at 100%. When I ran it, the CPU was pretty well idle.
! After a fair bit of fiddling, it turned out that both sides were
! taking about 8ms for their read() calls. At that point it seemed
! pretty clear that this was a delay in the kernel, not the code.
!
[snip]

2.6.31-rc2-00205-gb4b21ca good
2.6.31-rc2-00206-gd945cb9 bad

and still bad with the latest: 2.6.32-rc8-00011-ga8a8a66

the git log says:
! commit d945cb9cce20ac7143c2de8d88b187f62db99bdc
! Author: Alan Cox <[email protected]>
! Date: Tue Jul 7 16:39:41 2009 +0100
!
! pty: Rework the pty layer to use the normal buffering logic
!
! This fixes the ppp problems and various other issues with call locking
! caused by one side of a pty called in one locking context trying to match
! another with differing rules on the other side. We also get a big slack
! space to work with that means we can bury the flow control deadlock case
! for any conceivable real world situation.
!
! Signed-off-by: Alan Cox <[email protected]>
! Signed-off-by: Linus Torvalds <[email protected]>

I can provide reasonably stripped down code which demonstrates the
problem. It has been reproduced by one other person, though his delay
was about 2ms.
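
In case anyone wants to play along before then, here is a minimal
sketch of the shape of the thing (to be clear: this is not my actual
test program - the names and message sizes are made up, and error
checking is stripped):

/* pty-latency.c: time query/response round trips over a pty pair.
 * Sketch only; error checking omitted.
 * Build: cc -o pty-latency pty-latency.c -lrt */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <termios.h>
#include <time.h>

static double now_ms(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

int main(void)
{
	char buf[64];
	struct termios tio;
	double t0;
	int master, slave, i;

	master = posix_openpt(O_RDWR | O_NOCTTY);
	grantpt(master);
	unlockpt(master);
	slave = open(ptsname(master), O_RDWR | O_NOCTTY);

	tcgetattr(slave, &tio);
	cfmakeraw(&tio);		/* raw: no echo, no line buffering */
	tcsetattr(slave, TCSANOW, &tio);

	if (fork() == 0) {		/* B (server): read query, send response */
		close(master);		/* so we see EOF when the client exits */
		while (read(slave, buf, sizeof(buf)) > 0)
			write(slave, "pong", 4);
		_exit(0);
	}
	close(slave);

	for (i = 0; i < 100; i++) {	/* A (client): send query, read response */
		t0 = now_ms();
		write(master, "ping", 4);
		read(master, buf, sizeof(buf));	/* first byte back will do here */
		printf("round trip: %.3f ms\n", now_ms() - t0);
	}
	return 0;
}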

Have fun,

Rob.


2009-11-21 23:21:19

by Alan

[permalink] [raw]
Subject: Re: [bisected] pty performance problem

> I can provide reasonably stripped down code which demonstrates the
> problem. It has been reproduced by one other person, though his delay
> was about 2ms.

I would expect that.

I guess the obvious question would be "why are you using ptys for latency
sensitive communications ?" They now queue like other ttys which fixes a
whole ton of stuff but does mean they have normal tty latencies.

Alan

2009-11-22 00:27:49

by Robert Swan

[permalink] [raw]
Subject: Re: [bisected] pty performance problem

On Sat, Nov 21, 2009 at 11:23:20PM +0000, Alan Cox wrote:
> > I can provide reasonably stripped down code which demonstrates the
> > problem. It has been reproduced by one other person, though his delay
> > was about 2ms.
>
> I would expect that.
>
> I guess the obvious question would be "why are you using ptys for latency
> sensitive communications ?" They now queue like other ttys which fixes a
> whole ton of stuff but does mean they have normal tty latencies.

I have a client program designed to communicate with an external
device through a serial port. It polls this device every 20ms or so.
I also have a server to simulate the physical device which allows me
to replay values captured previously (for testing and tuning the
client). This simulated run doesn't usually need to model the elapsed
time so I can set the inter-polling delay to 0. Obviously the 8ms
delay I now get drastically slows down these replays.

While I could rewrite the communications to optionally use pipes, or
get the client to reprocess a log file locally, what I have does seem
a reasonable use of ptys.

Have fun,

Rob.

2009-11-22 06:39:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: [bisected] pty performance problem


(Cc:-ed Alan and Linus - mail repeated below. eot.)

* Robert Swan <[email protected]> wrote:

> I posted this to the kernel-newbies list, but have graduated to the
> adults forum:
>
> ! Two C programs are having a query-response conversation through a
> ! pseudo terminal:
> !
> ! A (client) -- forever { send query; read response }
> ! B (server) -- forever { read query; send response }
> !
> ! Neither has any I/O apart from the pty conversation, so I'd expect to
> ! see CPU usage at 100%. When I ran it, the CPU was pretty well idle.
> ! After a fair bit of fiddling, it turned out that both sides were
> ! taking about 8ms for their read() calls. At that point it seemed
> ! pretty clear that this was a delay in the kernel, not the code.
> !
> [snip]
>
> 2.6.31-rc2-00205-gb4b21ca good
> 2.6.31-rc2-00206-gd945cb9 bad
>
> and still bad with the latest: 2.6.32-rc8-00011-ga8a8a66
>
> the git log says:
> ! commit d945cb9cce20ac7143c2de8d88b187f62db99bdc
> ! Author: Alan Cox <[email protected]>
> ! Date: Tue Jul 7 16:39:41 2009 +0100
> !
> ! pty: Rework the pty layer to use the normal buffering logic
> !
> ! This fixes the ppp problems and various other issues with call locking
> ! caused by one side of a pty called in one locking context trying to match
> ! another with differing rules on the other side. We also get a big slack
> ! space to work with that means we can bury the flow control deadlock case
> ! for any conceivable real world situation.
> !
> ! Signed-off-by: Alan Cox <[email protected]>
> ! Signed-off-by: Linus Torvalds <[email protected]>
>
> I can provide reasonably stripped down code which demonstrates the
> problem. It has been reproduced by one other person, though his delay
> was about 2ms.
>
> Have fun,
>
> Rob.

2009-11-22 07:13:31

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [bisected] pty performance problem

On Sun, 22 Nov 2009 07:39:26 +0100
Ingo Molnar <[email protected]> wrote:
> > ! Neither has any I/O apart from the pty conversation, so I'd expect to
> > ! see CPU usage at 100%. When I ran it, the CPU was pretty well idle.
> > ! After a fair bit of fiddling, it turned out that both sides were
> > ! taking about 8ms for their read() calls. At that point it seemed
> > ! pretty clear that this was a delay in the kernel, not the code.
> > !
> > [snip]
> >
> > I can provide reasonably stripped down code which demonstrates the
> > problem. It has been reproduced by one other person, though his
> > delay was about 2ms.


let me guess; you have HZ=250 and the other person has HZ=1000...
since this seems consistent with doing a 1 jiffy delay...
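
(Back of the envelope: a jiffy is 1/HZ - 4ms at HZ=250, 1ms at HZ=1000.
The query and the response each sit in a flip buffer for a tick, so each
read() waits out roughly two ticks per round trip: ~8ms vs ~2ms, which
is just what was reported.)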


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-11-22 12:21:30

by Alan

[permalink] [raw]
Subject: Re: [bisected] pty performance problem

On Sun, 22 Nov 2009 07:39:26 +0100
Ingo Molnar <[email protected]> wrote:

>
> (Cc:-ed Alan and Linus - mail repeated below. eot.)

I saw it and replied already. The kernel now queues the work for the pty
as it does normal tty devices. If the CPU load is that low and nothing
else is happening it makes me wonder what the scheduler thinks it is
doing ?

Alan

2009-11-22 12:27:40

by Alan

[permalink] [raw]
Subject: Re: [bisected] pty performance problem

> While I could rewrite the communications to optionally use pipes, or
> get the client to reprocess a log file locally, what I have does seem
> a reasonable use of ptys.

Wouldn't argue with that. The tty layer queues the characters, which at
the moment get processed after the next timer tick rather than when the
CPU is idle. That wants changing; it strikes me as utterly dumb in
general to be deferring work when the CPU is idle.
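
(The specific culprit is tty_flip_buffer_push() in
drivers/char/tty_buffer.c: unless ->low_latency is set, it queues the
flip buffer with schedule_delayed_work(&tty->buf.work, 1), i.e. not
before the next tick.)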

Alan

2009-11-22 12:41:57

by Mike Galbraith

[permalink] [raw]
Subject: Re: [bisected] pty performance problem

On Sun, 2009-11-22 at 12:29 +0000, Alan Cox wrote:
> > While I could rewrite the communications to optionally use pipes, or
> > get the client to reprocess a log file locally, what I have does seem
> > a reasonable use of ptys.
>
> Wouldn't argue with that. The tty layer queues the characters, which at
> the moment get processed after the next timer tick rather than when the
> CPU is idle. That wants changing; it strikes me as utterly dumb in
> general to be deferring work when the CPU is idle.

*blink*, up to 10 ms thumb twiddling session?

I wouldn't mind having a copy of the stripped down test proggy.

-Mike

2009-11-22 12:42:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [bisected] pty performance problem


* Alan Cox <[email protected]> wrote:

> On Sun, 22 Nov 2009 07:39:26 +0100
> Ingo Molnar <[email protected]> wrote:
>
> >
> > (Cc:-ed Alan and Linus - mail repeated below. eot.)
>
> I saw it and replied already. The kernel now queues the work for the
> pty as it does normal tty devices. If the CPU load is that low and
> nothing else is happening it makes me wonder what the scheduler thinks
> it is doing ?

Would be nice to see the testcase, but by the looks of it there appears
to be a jiffy delay somewhere - not a scheduler delay. On an idle box
the scheduler will run any runnable tasks, with no delay.

Ingo

2009-11-22 21:37:44

by Robert Swan

[permalink] [raw]
Subject: Re: [bisected] pty performance problem

On Sun, Nov 22, 2009 at 12:29:42PM +0000, Alan Cox wrote:
> rather than when the CPU is idle. That wants changing; it strikes me as
> utterly dumb in general to be deferring work when the CPU is idle.

That's how it struck me.

Rob.

2009-11-22 22:04:10

by Robert Swan

[permalink] [raw]
Subject: Re: [bisected] pty performance problem

On Sat, Nov 21, 2009 at 11:15:14PM -0800, Arjan van de Ven wrote:
> On Sun, 22 Nov 2009 07:39:26 +0100
> Ingo Molnar <[email protected]> wrote:

> let me guess; you have HZ=250 and the other person has HZ=1000...
> since this seems consistent with doing a 1 jiffy delay...

Right, but you know, I have a strange feeling that wasn't a guess.

Rob.

2009-11-23 05:00:29

by Mike Galbraith

[permalink] [raw]
Subject: Re: [bisected] pty performance problem

On Sun, 2009-11-22 at 12:23 +0000, Alan Cox wrote:
> On Sun, 22 Nov 2009 07:39:26 +0100
> Ingo Molnar <[email protected]> wrote:
>
> >
> > (Cc:-ed Alan and Linus - mail repeated below. eot.)
>
> I saw it and replied already. The kernel now queues the work for the pty
> as it does normal tty devices. If the CPU load is that low and nothing
> else is happening it makes me wonder what the scheduler thinks it is
> doing ?

Hm. Looks to me like it's doing what it was told to do.

diff --git a/drivers/char/tty_buffer.c b/drivers/char/tty_buffer.c
index 66fa4e1..92a0864 100644
--- a/drivers/char/tty_buffer.c
+++ b/drivers/char/tty_buffer.c
@@ -495,7 +495,7 @@ void tty_flip_buffer_push(struct tty_struct *tty)
 	if (tty->low_latency)
 		flush_to_ldisc(&tty->buf.work.work);
 	else
-		schedule_delayed_work(&tty->buf.work, 1);
+		schedule_delayed_work(&tty->buf.work, 0);
 }
 EXPORT_SYMBOL(tty_flip_buffer_push);

Telling it to execute now made the test proggy happy... and likely broke
tons of things that need a delay there. So what's wrong with delaying,
when that's what the customer asked for? /me must be missing something.
Could it know that no delay is needed?
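
(For reference: a delay argument of 1 means the work is deferred to the
next timer tick, so each push can sit for up to 1/HZ; a delay of 0 just
puts it straight on the workqueue, like plain schedule_work().)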

-Mike

2009-11-23 10:26:55

by Alan

[permalink] [raw]
Subject: Re: [bisected] pty performance problem

> Hm. Looks to me like it's doing what it was told to do.

Yes. I realised this morning too.
>
> diff --git a/drivers/char/tty_buffer.c b/drivers/char/tty_buffer.c
> index 66fa4e1..92a0864 100644
> --- a/drivers/char/tty_buffer.c
> +++ b/drivers/char/tty_buffer.c
> @@ -495,7 +495,7 @@ void tty_flip_buffer_push(struct tty_struct *tty)
>  	if (tty->low_latency)
>  		flush_to_ldisc(&tty->buf.work.work);
>  	else
> -		schedule_delayed_work(&tty->buf.work, 1);
> +		schedule_delayed_work(&tty->buf.work, 0);
>  }
>  EXPORT_SYMBOL(tty_flip_buffer_push);
>
> Telling it to execute now made the test proggy happy... and likely broke
> tons of things that need a delay there. So what's wrong with delaying,
> when that's what the customer asked for? /me must be missing something.
> Could it know that no delay is needed?

The old model the tty code used was to queue bytes and process them each
timer tick. The idea is that this avoids thrashing the locks and stuff
gets processed more efficiently.

It's probably completely the wrong model today: removing the delay will
now only hit fine-grained locks, and will give better flow control
behaviour at high speeds.

Try it and see - worst case it becomes some kind of per-tty property.

Alan

2009-11-23 11:29:23

by Alan

[permalink] [raw]
Subject: Re: [bisected] pty performance problem

> diff --git a/drivers/char/tty_buffer.c b/drivers/char/tty_buffer.c
> index 66fa4e1..92a0864 100644
> --- a/drivers/char/tty_buffer.c
> +++ b/drivers/char/tty_buffer.c
> @@ -495,7 +495,7 @@ void tty_flip_buffer_push(struct tty_struct *tty)
>  	if (tty->low_latency)
>  		flush_to_ldisc(&tty->buf.work.work);
>  	else
> -		schedule_delayed_work(&tty->buf.work, 1);
> +		schedule_delayed_work(&tty->buf.work, 0);
>  }
>  EXPORT_SYMBOL(tty_flip_buffer_push);

Another possibility is to do

	if (tty->low_latency)
		schedule_delayed_work(&tty->buf.work, 0);
	else
		schedule_delayed_work(&tty->buf.work, 1);

At the moment the ->low_latency flag is just used by a few drivers that
want to avoid a double delay (e.g. if they process events off a work
queue, not in the IRQ handler), and they have to jump through other hoops
around re-entrancy. Doing it that way might make the processing a
minuscule bit slower for those cases, but it would also let the
low_latency flag do something useful for the many drivers that don't
work a byte at a time and want stuff to be batched up.

The general case is things like UART chips, where you get one character
and one push per byte received, so you usually don't want a run through
the ldisc for each byte when a lot of data is arriving.

If low_latency means "schedule ASAP" rather than "call inline", then pty
can use low_latency and we avoid suddenly changing the behaviour of every
device; instead we convert them as they make sense (e.g. all USB should
probably be low_latency with that queueing change).
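
Under that scheme the pty conversion itself would be about a one-liner,
something like (sketch, somewhere in the pty open path):

	tty->low_latency = 1;	/* flip work scheduled ASAP, not next tick */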

Alan

2009-11-23 11:48:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: [bisected] pty performance problem


* Alan Cox <[email protected]> wrote:

> > diff --git a/drivers/char/tty_buffer.c b/drivers/char/tty_buffer.c
> > index 66fa4e1..92a0864 100644
> > --- a/drivers/char/tty_buffer.c
> > +++ b/drivers/char/tty_buffer.c
> > @@ -495,7 +495,7 @@ void tty_flip_buffer_push(struct tty_struct *tty)
> >  	if (tty->low_latency)
> >  		flush_to_ldisc(&tty->buf.work.work);
> >  	else
> > -		schedule_delayed_work(&tty->buf.work, 1);
> > +		schedule_delayed_work(&tty->buf.work, 0);
> >  }
> >  EXPORT_SYMBOL(tty_flip_buffer_push);
>
> Another possibility is to do
>
> 	if (tty->low_latency)
> 		schedule_delayed_work(&tty->buf.work, 0);
> 	else
> 		schedule_delayed_work(&tty->buf.work, 1);

Flaggery for low latency is kind of lame though - especially if it
defaults to off in most drivers as you say.

Ingo

2009-11-23 11:58:32

by Alan

[permalink] [raw]
Subject: Re: [bisected] pty performance problem

> > Another possibility is to do
> >
> > 	if (tty->low_latency)
> > 		schedule_delayed_work(&tty->buf.work, 0);
> > 	else
> > 		schedule_delayed_work(&tty->buf.work, 1);
>
> Flaggery for low latency is kind of lame though - especially if it
> defaults to off in most drivers as you say.

So you'd prefer to detect devices that are byte based or message based by
what method ?

Alan

2009-11-23 12:04:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: [bisected] pty performance problem


* Alan Cox <[email protected]> wrote:

> > > Another possibility is to do
> > >
> > > 	if (tty->low_latency)
> > > 		schedule_delayed_work(&tty->buf.work, 0);
> > > 	else
> > > 		schedule_delayed_work(&tty->buf.work, 1);
> >
> > Flaggery for low latency is kind of lame though - especially if it
> > defaults to off in most drivers as you say.
>
> So you'd prefer to detect devices that are byte based or message based
> by what method ?

I'd not delay the worklet by default - i.e. I'd do Mike's patch.

Haven't tested all effects of it though - do you have any estimate of
the negative effects of such a change? We do have hard numbers
(latencies in the milliseconds range) from the opposite direction, and
those numbers aren't pretty.

Ingo

2009-11-23 13:32:34

by Alan

[permalink] [raw]
Subject: Re: [bisected] pty performance problem

> > So you'd prefer to detect devices that are byte based or message based
> > by what method ?
>
> I'd not delay the worklet by default - i.e. I'd do Mike's patch.

Certainly stuff like pty should not delay.
>
> Haven't tested all effects of it though - do you have any estimate of
> the negative effects of such a change? We do have hard numbers
> (latencies in the milliseconds range) from the opposite direction, and
> those numbers aren't pretty.

On a PC I'm not too worried - we might burn a bit more CPU, and Arjan
might even manage to measure it somewhere. There is the theoretical bad
case where we end up at 100% CPU because the irq, wake, process-one-char,
irq, wake, process-one-char sequence exactly fits the CPU, so we never
sleep.

Embedded might be more of a concern; the old behaviour comes from the
386/486 days of low CPU power.

USB doesn't worry me - USB devices generally have their own buffering
algorithm and use a timer so that they batch data efficiently into USB
buffers.

The drivers/serial layer is often run with low latency set anyway so that
seems to be ok for the most part.

Give it a go, send the patch to the maintainer, try it in -next and see
if anyone screams.

Alan

2009-11-23 17:48:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [bisected] pty performance problem


* Alan Cox <[email protected]> wrote:

> > > So you'd prefer to detect devices that are byte based or message based
> > > by what method ?
> >
> > I'd not delay the worklet by default - i.e. I'd do Mike's patch.
>
> Certainly stuff like pty should not delay.
> >
> > Haven't tested all effects of it though - do you have any estimate of
> > the negative effects of such a change? We do have hard numbers
> > (latencies in the milliseconds range) from the opposite direction, and
> > those numbers aren't pretty.
>
> On a PC I'm not too worried - we might burn a bit more CPU, and Arjan
> might even manage to measure it somewhere. There is the theoretical bad
> case where we end up at 100% CPU because the irq, wake, process-one-char,
> irq, wake, process-one-char sequence exactly fits the CPU, so we never
> sleep.
>
> Embedded might be more of a concern; the old behaviour comes from the
> 386/486 days of low CPU power.
>
> USB doesn't worry me - USB devices generally have their own buffering
> algorithm and use a timer so that they batch data efficiently into USB
> buffers.
>
> The drivers/serial layer is often run with low latency set anyway so that
> seems to be ok for the most part.
>
> Give it a go, send the patch to the maintainer, try it in -next and see
> if anyone screams.

(Doh, I should have Cc:-ed Greg first time around - fixed that.)

Ingo