LinuxLists.cc - Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

On Tue, 1 Sep 2009, Rafael J. Wysocki wrote:
> On Tuesday 01 September 2009, Mikael Pettersson wrote:
> >
> > Starting with 2.6.31-rc8 and reverting
> >
> > 85dfd81dc57e8183a277ddd7a56aa65c96f3f487 pty: fix data loss when stopped (^S/^Q)
> > d945cb9cce20ac7143c2de8d88b187f62db99bdc pty: Rework the pty layer to use the normal buffering logic
> >
> > in that order gives me a kernel that works on both x86 and powerpc64.
> >
> > So the bug is definitely limited to the pty buffering logic change.
>
> Thanks a lot for this information, adding somme CCs to the list.

Mikael, is there any way to get the gcc testsuite to show the "expected"
vs "result" cases when the failures occur, so that we can see what the
pattern is ("it drops one character every 8kB" or something like that).

However, I get the feeling that it's really the same bug that
OGAWA-san already fixed - and that his fix just doesn't always do a 100%
of the job.

So what Ogawa did was to make sure that we flush any pending data whenever
we;re checking "do we have any data left". He did that by calling out to
tty_flush_to_ldisc(), which should flush the data through to the ldisc.

The keyword here being "should". In flush_to_ldisc(), we have at least one
case where we say "we'll delay it a bit more":

if (!tty->receive_room) {
schedule_delayed_work(&tty->buf.work, 1);
break;
}

and while I think this _should_ be ok (because if there is no
receive-room, then we'll hopefully always return non-zero from
"input_available_p()". However, we do have this really odd case that the
reader side will do "n_tty_set_room()" onlyl _after_ having checked for
input_available_p(), and so maybe we do sometimes trigger the case that

- input_available_p() tries to flush to the input buffer before checking
how much data is available, by calling 'tty_flush_to_ldisc()'

- but 'tty_flush_to_ldisc()' won't do anything, because tty->receive_room
is zero.

- so now input_available_p will say "I don't have any data", even though
there was data in the write buffers.

- we'll notice that the other end has hung up, and return EOF/EIO.

- which is very WRONG, because the other end may have hung up, but before
it did that, it wrote data that is still in the write queues, and we
should have returned that data.

Anyway, I'm not at all sure that the "receive_room == 0" case can happen
at all, but maybe it can. Ogawa-san?

Here's a totally untested trial patch. I only have this dead-slow netbook
for reading email with me, and I don't have a failing test-case anyway,
but if my analysis is right, then the patch might fix it. It just forces
the re-calculation of the receive buffer before flushing the ldisc.

(And btw, from a performance standpoint, it might make more sense to only
do this whole read-room / ldisc-flush thing if we are about to return
zero. If we already have data available, we probably shouldn't waste time
trying to see if we need to do anything fancy like this.)

CAVEAT EMPTOR. Not tested. It compiled for me, but maybe that was due to
me compiling the wrong file or something.

Linus

---
drivers/char/n_tty.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/char/n_tty.c b/drivers/char/n_tty.c
index 973be2f..7fa3452 100644
--- a/drivers/char/n_tty.c
+++ b/drivers/char/n_tty.c
@@ -1583,6 +1583,7 @@ static int n_tty_open(struct tty_struct *tty)

static inline int input_available_p(struct tty_struct *tty, int amt)
{
+ n_tty_set_room(tty);
tty_flush_to_ldisc(tty);
if (tty->icanon) {
if (tty->canon_data)

2009-09-03 11:29:49

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

Linus Torvalds <[email protected]> writes:

> On Tue, 1 Sep 2009, Rafael J. Wysocki wrote:
>> On Tuesday 01 September 2009, Mikael Pettersson wrote:
>> >
>> > Starting with 2.6.31-rc8 and reverting
>> >
>> > 85dfd81dc57e8183a277ddd7a56aa65c96f3f487 pty: fix data loss when stopped (^S/^Q)
>> > d945cb9cce20ac7143c2de8d88b187f62db99bdc pty: Rework the pty layer to use the normal buffering logic
>> >
>> > in that order gives me a kernel that works on both x86 and powerpc64.
>> >
>> > So the bug is definitely limited to the pty buffering logic change.
>>
>> Thanks a lot for this information, adding somme CCs to the list.
>
> Mikael, is there any way to get the gcc testsuite to show the "expected"
> vs "result" cases when the failures occur, so that we can see what the
> pattern is ("it drops one character every 8kB" or something like that).
>
> However, I get the feeling that it's really the same bug that
> OGAWA-san already fixed - and that his fix just doesn't always do a 100%
> of the job.
>
> So what Ogawa did was to make sure that we flush any pending data whenever
> we;re checking "do we have any data left". He did that by calling out to
> tty_flush_to_ldisc(), which should flush the data through to the ldisc.
>
> The keyword here being "should". In flush_to_ldisc(), we have at least one
> case where we say "we'll delay it a bit more":
>
> if (!tty->receive_room) {
> schedule_delayed_work(&tty->buf.work, 1);
> break;
> }
>
> and while I think this _should_ be ok (because if there is no
> receive-room, then we'll hopefully always return non-zero from
> "input_available_p()". However, we do have this really odd case that the
> reader side will do "n_tty_set_room()" onlyl _after_ having checked for
> input_available_p(), and so maybe we do sometimes trigger the case that
>
> - input_available_p() tries to flush to the input buffer before checking
> how much data is available, by calling 'tty_flush_to_ldisc()'
>
> - but 'tty_flush_to_ldisc()' won't do anything, because tty->receive_room
> is zero.
>
> - so now input_available_p will say "I don't have any data", even though
> there was data in the write buffers.
>
> - we'll notice that the other end has hung up, and return EOF/EIO.
>
> - which is very WRONG, because the other end may have hung up, but before
> it did that, it wrote data that is still in the write queues, and we
> should have returned that data.
>
> Anyway, I'm not at all sure that the "receive_room == 0" case can happen
> at all, but maybe it can. Ogawa-san?

If I'm not missing, I think it doesn't have big change with old
code. But I would need to check more deeply.

Um.., If "receive_room == 0 && tty->read_cnt == 0" is possible, I wonder
why reverting buffer handling fixes the problem.

Well, anyway, I'd like to reproduce this on my machine. Could you tell
me the version of tools? I guess gcc testsuite using the gcc's source
(svn revision?), expect, dejagnu, tcl. (BTW, I'm using debian
testing. If it can be reproduced on kvm, I can install distro version
which you are using)

Thanks.
--
OGAWA Hirofumi <[email protected]>

2009-09-03 20:27:40

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

Linus Torvalds writes:
>
>
> On Tue, 1 Sep 2009, Rafael J. Wysocki wrote:
> > On Tuesday 01 September 2009, Mikael Pettersson wrote:
> > >
> > > Starting with 2.6.31-rc8 and reverting
> > >
> > > 85dfd81dc57e8183a277ddd7a56aa65c96f3f487 pty: fix data loss when stopped (^S/^Q)
> > > d945cb9cce20ac7143c2de8d88b187f62db99bdc pty: Rework the pty layer to use the normal buffering logic
> > >
> > > in that order gives me a kernel that works on both x86 and powerpc64.
> > >
> > > So the bug is definitely limited to the pty buffering logic change.
> >
> > Thanks a lot for this information, adding somme CCs to the list.
>
> Mikael, is there any way to get the gcc testsuite to show the "expected"
> vs "result" cases when the failures occur, so that we can see what the
> pattern is ("it drops one character every 8kB" or something like that).

I don't know. It dumps a summary on stdout and saves logs in various
subdirs. Those logs do show the gcc commands executed and the output
from those commands, but they don't show why some test is considered
failed, or the exact boundaries of each fragment of text returned by
read(2) [which I guess may be significant].

What I can say for certain is that the test cases I've seen fail most
frequently deliberately generate lots of diagnostic output from the
compiler, like 100s or 1000s of lines of warning/error messages.
So the comments in the pty changes that talk about using some 8KB
buffer instead of a 64KB tty flip buffer definitely made me nervous.

> However, I get the feeling that it's really the same bug that
> OGAWA-san already fixed - and that his fix just doesn't always do a 100%
> of the job.
>
> So what Ogawa did was to make sure that we flush any pending data whenever
> we;re checking "do we have any data left". He did that by calling out to
> tty_flush_to_ldisc(), which should flush the data through to the ldisc.
>
> The keyword here being "should". In flush_to_ldisc(), we have at least one
> case where we say "we'll delay it a bit more":
>
> if (!tty->receive_room) {
> schedule_delayed_work(&tty->buf.work, 1);
> break;
> }
>
> and while I think this _should_ be ok (because if there is no
> receive-room, then we'll hopefully always return non-zero from
> "input_available_p()". However, we do have this really odd case that the
> reader side will do "n_tty_set_room()" onlyl _after_ having checked for
> input_available_p(), and so maybe we do sometimes trigger the case that
>
> - input_available_p() tries to flush to the input buffer before checking
> how much data is available, by calling 'tty_flush_to_ldisc()'
>
> - but 'tty_flush_to_ldisc()' won't do anything, because tty->receive_room
> is zero.
>
> - so now input_available_p will say "I don't have any data", even though
> there was data in the write buffers.
>
> - we'll notice that the other end has hung up, and return EOF/EIO.
>
> - which is very WRONG, because the other end may have hung up, but before
> it did that, it wrote data that is still in the write queues, and we
> should have returned that data.
>
> Anyway, I'm not at all sure that the "receive_room == 0" case can happen
> at all, but maybe it can. Ogawa-san?
>
> Here's a totally untested trial patch. I only have this dead-slow netbook
> for reading email with me, and I don't have a failing test-case anyway,
> but if my analysis is right, then the patch might fix it. It just forces
> the re-calculation of the receive buffer before flushing the ldisc.
>
> (And btw, from a performance standpoint, it might make more sense to only
> do this whole read-room / ldisc-flush thing if we are about to return
> zero. If we already have data available, we probably shouldn't waste time
> trying to see if we need to do anything fancy like this.)
>
> CAVEAT EMPTOR. Not tested. It compiled for me, but maybe that was due to
> me compiling the wrong file or something.
>
> Linus

Thanks. I'll give this a try tomorrow.

/Mikael

> drivers/char/n_tty.c | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/char/n_tty.c b/drivers/char/n_tty.c
> index 973be2f..7fa3452 100644
> --- a/drivers/char/n_tty.c
> +++ b/drivers/char/n_tty.c
> @@ -1583,6 +1583,7 @@ static int n_tty_open(struct tty_struct *tty)
>
> static inline int input_available_p(struct tty_struct *tty, int amt)
> {
> + n_tty_set_room(tty);
> tty_flush_to_ldisc(tty);
> if (tty->icanon) {
> if (tty->canon_data)
>

2009-09-03 21:00:07

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

OGAWA Hirofumi writes:
> Well, anyway, I'd like to reproduce this on my machine. Could you tell
> me the version of tools? I guess gcc testsuite using the gcc's source
> (svn revision?), expect, dejagnu, tcl. (BTW, I'm using debian
> testing. If it can be reproduced on kvm, I can install distro version
> which you are using)

Nothing fancy needed. You can use the gcc-4.4.1 release tarball and any recent
gcc-4.3 weekly snapshot tarball, like 4.3-20090830.

I've always seen the bogus errors in the C or C++ testsuites, so to save
some time you can just --enable-languages=c,c++ when building gcc.

The machines I've been running the testsuites on are a mix of architectures
running older or stability-oriented distros:

M1: i686 PC, Fedora Core 6, dejagnu-1.4.4-5.1, expect-5.43.0-5.1, tcl-8.4.13-3.fc6
M2: powerpc64 (G5), YellowDog 6.2, dejagnu-1.4.4-5.1, expect-5.43.0-5.1, tcl-8.4.13-3
M3: ARM, FC8-based, dejagnu-1.4.4-12.fc8, expect-5.43.0-9.fc8, tcl-8.4.17-1.fc8

/Mikael

2009-09-04 00:02:24

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

On Thu, 3 Sep 2009, OGAWA Hirofumi wrote:
>
> If I'm not missing, I think it doesn't have big change with old
> code. But I would need to check more deeply.

The thing is, the old pty code pushed _directly_ to the receiving ldisc,
with no buffering. I'm not entirely sure why Alan felt it needed changing,
but moving over to the generic tty buffering code did get rid of some
duplicate logic, and the locking is now done in one place, so that's
probably the main reason.

Anyway, the old pty code would be entirely synchronous, and would do the

ld->ops->receive_buf(to, buf, NULL, c);

to push the data all the way to the receive side frm pty_write(). So with
the old code, the destination "receive_room" was always accurate, because
both the reading side and the writing side basically accessed it directly.

With the new code, it all goes through tty_buffer.c, and the bugs have
been mostly about the receiving side not seeing all the data in the
buffers. And those buffers simply didn't use to exist before.

> Um.., If "receive_room == 0 && tty->read_cnt == 0" is possible, I wonder
> why reverting buffer handling fixes the problem.

In the old code, if 'receive_room' was zero, then the writer would simply
stop writing (no buffers in between). So in the old code, you could never
get into a situation where receive_room was zero and there was still
pending data.

At least that's how I read the situation.

If I'm right, I'm hoping that the patch I sent out fixes it, and if so,
we'll do that for 2.6.31 (and then after that maybe re-think whether the
extra buffering is worth all this pain).

And if it _doesn't_ fix it, then I think we'll just have to revert the
commits in question. We won't have time to root-cause it if the above
isn't it.

Linus

2009-09-04 01:41:24

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

Linus Torvalds <[email protected]> writes:

> On Thu, 3 Sep 2009, OGAWA Hirofumi wrote:
>>
>> If I'm not missing, I think it doesn't have big change with old
>> code. But I would need to check more deeply.
>
> The thing is, the old pty code pushed _directly_ to the receiving ldisc,
> with no buffering.

Yes.

> I'm not entirely sure why Alan felt it needed changing,
> but moving over to the generic tty buffering code did get rid of some
> duplicate logic, and the locking is now done in one place, so that's
> probably the main reason.

IIRC, ppp had the locking issue without that patch?

> Anyway, the old pty code would be entirely synchronous, and would do the
>
> ld->ops->receive_buf(to, buf, NULL, c);
>
> to push the data all the way to the receive side frm pty_write(). So with
> the old code, the destination "receive_room" was always accurate, because
> both the reading side and the writing side basically accessed it directly.
>
> With the new code, it all goes through tty_buffer.c, and the bugs have
> been mostly about the receiving side not seeing all the data in the
> buffers. And those buffers simply didn't use to exist before.

Yes. However, pty_write() checks tty_buffer instead of receive_room. So
I thought, the change of write side is mainly buffer size (receive_room
size + tty_buffer size). It will stop after filling tty_buffer, not
receive_room.

And (I hope) the read side guarantees to consume both buffers. If it is
right, I guessed the change is timing issues with more larger buffer
size.

>> Um.., If "receive_room == 0 && tty->read_cnt == 0" is possible, I wonder
>> why reverting buffer handling fixes the problem.
>
> In the old code, if 'receive_room' was zero, then the writer would simply
> stop writing (no buffers in between). So in the old code, you could never
> get into a situation where receive_room was zero and there was still
> pending data.
>
> At least that's how I read the situation.

Another possibility in my guess is the change of pty_flush_buffer() and
pty_chars_in_buffer(). I'm not sure at all though, especially, I'm
suspecting pty_flush_buffer() may change the behaviors.

> If I'm right, I'm hoping that the patch I sent out fixes it, and if so,
> we'll do that for 2.6.31 (and then after that maybe re-think whether the
> extra buffering is worth all this pain).

I also hope it works.

> And if it _doesn't_ fix it, then I think we'll just have to revert the
> commits in question. We won't have time to root-cause it if the above
> isn't it.

At least for me, it sounds like good if revert works. I have no
preference about it.

FWIW, meanwhile, I'll just try to see the root-cause of this as
another/fallback solution.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2009-09-04 01:52:48

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

On Fri, 4 Sep 2009, OGAWA Hirofumi wrote:
>
> Yes. However, pty_write() checks tty_buffer instead of receive_room. So
> I thought, the change of write side is mainly buffer size (receive_room
> size + tty_buffer size).

The problem has never been the write side. That side works - just with
extra buffering.

> It will stop after filling tty_buffer, not receive_room.

Yes.

> And (I hope) the read side guarantees to consume both buffers. If it is
> right, I guessed the change is timing issues with more larger buffer
> size.

That's the change. The read side only consumes the buffers it _sees_. And
it doesn't look at the buffers that the write side has written, only at
the 'received' buffers. That's why we had to add that 'tty_flush_to_ldisc'
so that the buffers that got written were properly moved to the receive
side.

And that's the part that I suspect is broken - ie tty_flush_to_ldisc
doesn't always guarantee that it moves all the written stuff to the
receive side.

Before, this wasn't an issue, because the writer always filled up the
receive buffers directly, so there was never any flushing issues.

> Another possibility in my guess is the change of pty_flush_buffer() and
> pty_chars_in_buffer(). I'm not sure at all though, especially, I'm
> suspecting pty_flush_buffer() may change the behaviors.

I don't think 'pty_flush_buffer()' is ever called in any normal
circumstances. Afaik, it's only called for a TIOCFLUSH ioctl (or whatever
it's called) when the user asks for all the contents to be thrown away.

> FWIW, meanwhile, I'll just try to see the root-cause of this as
> another/fallback solution.

Absolutely. If you can find some other possibility, that would be great.
I'm not really sure how that 'receive_room == 0' case would ever happen in
practice, so my patch was really based on the assumption that the bug is
in the flushing code.

The bug could easily be elsewhere.

Linus

2009-09-04 13:24:33

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

Linus Torvalds writes:
> Here's a totally untested trial patch. I only have this dead-slow netbook
> for reading email with me, and I don't have a failing test-case anyway,
> but if my analysis is right, then the patch might fix it. It just forces
> the re-calculation of the receive buffer before flushing the ldisc.
>
> (And btw, from a performance standpoint, it might make more sense to only
> do this whole read-room / ldisc-flush thing if we are about to return
> zero. If we already have data available, we probably shouldn't waste time
> trying to see if we need to do anything fancy like this.)
>
> CAVEAT EMPTOR. Not tested. It compiled for me, but maybe that was due to
> me compiling the wrong file or something.
>
> Linus
>
> ---
> drivers/char/n_tty.c | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/char/n_tty.c b/drivers/char/n_tty.c
> index 973be2f..7fa3452 100644
> --- a/drivers/char/n_tty.c
> +++ b/drivers/char/n_tty.c
> @@ -1583,6 +1583,7 @@ static int n_tty_open(struct tty_struct *tty)
>
> static inline int input_available_p(struct tty_struct *tty, int amt)
> {
> + n_tty_set_room(tty);
> tty_flush_to_ldisc(tty);
> if (tty->icanon) {
> if (tty->canon_data)
>

Unfortunately this did not fix the bug. The gcc-4.3 testsuite failed
as usual in gcc.dg/c99-typespec-1.c.

Comparing the gcc outputs for this test case from runs with 2.6.30 and
2.6.31-rc8 shows that 2.6.31-rc8 lost a single newline (\n) byte at byte
offset 131660. So two lines of diagnostics were fused together and the
testsuite framework failed to match the second of those lines.

This is what 2.6.30 output at that place:

/mnt/work1/gcc-4.3-20090830/gcc/testsuite/gcc.dg/c99-typespec-1.c:1143: error: two or more data types in declaration specifiers
/mnt/work1/gcc-4.3-20090830/gcc/testsuite/gcc.dg/c99-typespec-1.c:1144: error: two or more data types in declaration specifiers
/mnt/work1/gcc-4.3-20090830/gcc/testsuite/gcc.dg/c99-typespec-1.c:1145: error: both 'long' and 'short' in declaration specifiers
/mnt/work1/gcc-4.3-20090830/gcc/testsuite/gcc.dg/c99-typespec-1.c:1146: error: two or more data types in declaration specifiers

And this is what 2.6.31-rc8 + the patch output at that place:

/mnt/work1/gcc-4.3-20090830/gcc/testsuite/gcc.dg/c99-typespec-1.c:1143: error: two or more data types in declaration specifiers
/mnt/work1/gcc-4.3-20090830/gcc/testsuite/gcc.dg/c99-typespec-1.c:1144: error: two or more data types in declaration specifiers/mnt/work1/gcc-4.3-20090830/gcc/testsuite/gcc.dg/c99-typespec-1.c:1145: error: both 'long' and 'short' in declaration specifiers
/mnt/work1/gcc-4.3-20090830/gcc/testsuite/gcc.dg/c99-typespec-1.c:1146: error: two or more data types in declaration specifiers

The actual logs use \r\n line endings, so between the diagnostics for source
lines 1144 and 1145 there is now a single \r. Some software will display \r
line ending as \r\n, so a missing \n may not be visible. So I've removed the
\r characters in the text above to avoid affecting how it is presented.
The original logs are available in <http://user.it.uu.se/~mikpe/linux/pty-bug/>
if you need them.

/Mikael

2009-09-04 15:03:16

by Alan

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

> And if it _doesn't_ fix it, then I think we'll just have to revert the
> commits in question. We won't have time to root-cause it if the above
> isn't it.

In which case ppp will no longer work properly in some cases (ditto
other protocols) and things like the pppoe gateway wont work as they
don't in 2.6.30 - you need to go back to somewhere between 2.6.28/29 to
undo this, then apply the alternative locking patches to the
ppp/slip/ax25/etc ldiscs

Is the missing byte always part of the \r\n - that is handled slightly
specially by the n_tty ldisc code and has always been buggy if there is
flow control buffering - back to 1.3 and probably earlier. It happens on
real ttys too given sufficient flow control blockage because the char by
char opost stuff never did space/buffering checks. At least I don't think
the n_tty rework fixed it.

2009-09-04 17:31:55

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

On Fri, 4 Sep 2009, Mikael Pettersson wrote:
>
> Comparing the gcc outputs for this test case from runs with 2.6.30 and
> 2.6.31-rc8 shows that 2.6.31-rc8 lost a single newline (\n) byte at byte
> offset 131660. So two lines of diagnostics were fused together and the
> testsuite framework failed to match the second of those lines.

Goodie. That was the kind of hint I was looking for.

And I suspect that that means that the bug is related to do_output_char()
expanding '\n' into '\r\n'. And the different buffering (and the pty
'space' logic) just means that we now hit a case that we didn't use to
hit. The relevant call chain is

- n_tty handling:
n_tty_write() ->
process_output() ->
do_output_char() ->
tty_put_char(tty, '\r')
tty_put_char(tty, '\n')

I'll see what I can find. But your "loses \n character" does mean that the
'lost bytes at the end when the other end closed it' is probably not the
issue, and we're talking about a different kind of bug entirely.

Linus

2009-09-04 17:34:04

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

On Fri, 4 Sep 2009, Alan Cox wrote:
>
> In which case ppp will no longer work properly in some cases (ditto
> other protocols) and things like the pppoe gateway wont work as they
> don't in 2.6.30 - you need to go back to somewhere between 2.6.28/29 to
> undo this, then apply the alternative locking patches to the
> ppp/slip/ax25/etc ldiscs

It should be fairly trivial to just add the locking to the pty write
routines. That said, we need to fix the 2.6.31 regression, and right now
that is the big one. If we go back to broken 2.6.30 situation, that's way
more acceptable.

But I'm still hoping we can fix this.

Linus

2009-09-04 17:54:21

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

On Fri, 4 Sep 2009, Linus Torvalds wrote:
>
> And I suspect that that means that the bug is related to do_output_char()
> expanding '\n' into '\r\n'. And the different buffering (and the pty
> 'space' logic) just means that we now hit a case that we didn't use to
> hit. The relevant call chain is
>
> - n_tty handling:
> n_tty_write() ->
> process_output() ->
> do_output_char() ->
> tty_put_char(tty, '\r')
> tty_put_char(tty, '\n')

Hmm. I think I have a clue.

process_output() does

space = tty_write_room(tty);
retval = do_output_char(c, tty, space);

so 'space' can never become off-by-one, since it's always re-calculated
just before. And do_output_char() checks that there is room for two
characters, and won't do just the '\r'.

So the fact that you see the '\r' and not the '\n' means that something
dropped the second character _despite_ tty_write_room() saying there was
room for two characters.

Now, with flow control that can in theory happen in case 'tty->stopped'
gets set asynchronously in between, but that's not an issue here.

So the most likely cause is just that the pty_write_room() function is
simply buggered, or at least doesn't work together with the new world
order.

How about something like this? It's way too anal - it says that we can
only write data if there's enough space to always push it all the way to
the receive buffer (including all the data that was already buffered up,
ie the "memory_used" part). But if it finally makes the problem go away,
we have another clue.

Linus

2009-09-04 17:55:55

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

On Fri, 4 Sep 2009, Linus Torvalds wrote:
>
> How about something like this? It's way too anal - it says that we can
> only write data if there's enough space to always push it all the way to
> the receive buffer (including all the data that was already buffered up,
> ie the "memory_used" part). But if it finally makes the problem go away,
> we have another clue.

I forgot to actually include the patch. Duh.

And again - UNTESTED. Maybe this makes the buffering _too_ small (the
'memory_used' thing is not really counted in bytes buffered, it's counted
in how much buffer space we've allocated) and things break even worse and
pty's don't work at all. But I think it might work.

Linus

---
drivers/char/pty.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/char/pty.c b/drivers/char/pty.c
index d083c73..139fa5a 100644
--- a/drivers/char/pty.c
+++ b/drivers/char/pty.c
@@ -91,7 +91,7 @@ static void pty_unthrottle(struct tty_struct *tty)

static int pty_space(struct tty_struct *to)
{
- int n = 8192 - to->buf.memory_used;
+ int n = to->receive_room - to->buf.memory_used;
if (n < 0)
return 0;
return n;

2009-09-04 18:12:11

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

On Fri, 4 Sep 2009, Linus Torvalds wrote:
>
> And again - UNTESTED. Maybe this makes the buffering _too_ small (the
> 'memory_used' thing is not really counted in bytes buffered, it's counted
> in how much buffer space we've allocated) and things break even worse and
> pty's don't work at all. But I think it might work.

Actually, scratch that patch.

After writing the above, the voices in my head started clamoring about
this "space allocated" vs "bytes buffered" thing, which I was obviously
aware of, but hadn't thought about as an issue.

And you know what? The thing about "space allocated" vs "bytes buffered"
is that writing _one_ byte (the '\r') can cause a lot more than one byte
to be allocated for a buffer (we do minimum 256-byte buffers).

So let's say that 'space' was initially 20 - plenty big enough to hold two
characters. But if the '\r' just happened to need a new buffer, it would
actually increase 'memory_used' by 256, and now the next time we call
'pty_space()' it doesn't return 19, but 0 - because now memory_used is
larger than the 8192 we allowed.

So I'm starting to suspect that the real bug is that we do that
'pty_space()' in pty_write() call at all. The _callers_ should already
have done the write_room() check, and if somebody doesn't do it, then the
tty buffering will eventually do a hard limit at the 65kB allocation mark.

So doing it in pty_write() is (a) unnecessary and (b) actively wrong,
because it means that in the situation above, pty_write() won't be allowin
the slop that it _needs_ to allow due to the buffering not being exact
"this many bytes buffered up", but "this many bytes allocated for
buffering".

So rather than the previous patch, try this one instead.

Linus
---
drivers/char/pty.c | 6 ------
1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/drivers/char/pty.c b/drivers/char/pty.c
index d083c73..45a7ca2 100644
--- a/drivers/char/pty.c
+++ b/drivers/char/pty.c
@@ -118,12 +118,6 @@ static int pty_write(struct tty_struct *tty, const unsigned char *buf,
if (tty->stopped)
return 0;

- /* This isn't locked but our 8K is quite sloppy so no
- big deal */
-
- c = pty_space(to);
- if (c > count)
- c = count;
if (c > 0) {
/* Stuff the data into the input queue of the other end */
c = tty_insert_flip_string(to, buf, c);

2009-09-04 19:12:57

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

On Fri, 4 Sep 2009, Linus Torvalds wrote:
>
> So I'm starting to suspect that the real bug is that we do that
> 'pty_space()' in pty_write() call at all. The _callers_ should already
> have done the write_room() check, and if somebody doesn't do it, then the
> tty buffering will eventually do a hard limit at the 65kB allocation mark.

Ok, so the thought was right, but the patch was obviously not even
compiled, because the compiler points out that 'c' was not initialized.

I'm sure you already figured the obvious meaning out, but here's a fixed
version.

Linus
---
drivers/char/pty.c | 10 +---------
1 files changed, 1 insertions(+), 9 deletions(-)

diff --git a/drivers/char/pty.c b/drivers/char/pty.c
index d083c73..b33d668 100644
--- a/drivers/char/pty.c
+++ b/drivers/char/pty.c
@@ -109,21 +109,13 @@ static int pty_space(struct tty_struct *to)
* the other side of the pty/tty pair.
*/

-static int pty_write(struct tty_struct *tty, const unsigned char *buf,
- int count)
+static int pty_write(struct tty_struct *tty, const unsigned char *buf, int c)
{
struct tty_struct *to = tty->link;
- int c;

if (tty->stopped)
return 0;

- /* This isn't locked but our 8K is quite sloppy so no
- big deal */
-
- c = pty_space(to);
- if (c > count)
- c = count;
if (c > 0) {
/* Stuff the data into the input queue of the other end */
c = tty_insert_flip_string(to, buf, c);

2009-09-04 19:19:53

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

On Fri, 4 Sep 2009, Linus Torvalds wrote:
>
> I'm sure you already figured the obvious meaning out, but here's a fixed
> version.

And here's another patch that may also fix this, simply by virtue of
writing the "\r\n" as a single string, rather than as two characters. That
way, we should never get into the situation that th '\r' allocates a new
buffer (larger than one character), and then the later '\n' writing
decides that we've filled up.

Besides, it's a cleanup. An untested one, naturally.

Linus

---
drivers/char/n_tty.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/char/n_tty.c b/drivers/char/n_tty.c
index 973be2f..4e28b35 100644
--- a/drivers/char/n_tty.c
+++ b/drivers/char/n_tty.c
@@ -300,8 +300,7 @@ static int do_output_char(unsigned char c, struct tty_struct *tty, int space)
if (space < 2)
return -1;
tty->canon_column = tty->column = 0;
- tty_put_char(tty, '\r');
- tty_put_char(tty, c);
+ tty->ops->write(tty, "\r\n", 2);
return 2;
}
tty->canon_column = tty->column;

2009-09-04 20:13:19

by Alan

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

> After writing the above, the voices in my head started clamoring about
> this "space allocated" vs "bytes buffered" thing, which I was obviously
> aware of, but hadn't thought about as an issue.
>
> And you know what? The thing about "space allocated" vs "bytes buffered"
> is that writing _one_ byte (the '\r') can cause a lot more than one byte
> to be allocated for a buffer (we do minimum 256-byte buffers)

Doh yes thats an utterly dumb bug on my part - we do have to work on
chars.

100% agree with the diagnosis

2009-09-05 10:46:41

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

Linus Torvalds writes:
>
>
> On Fri, 4 Sep 2009, Linus Torvalds wrote:
> >
> > I'm sure you already figured the obvious meaning out, but here's a fixed
> > version.
>
> And here's another patch that may also fix this, simply by virtue of
> writing the "\r\n" as a single string, rather than as two characters. That
> way, we should never get into the situation that th '\r' allocates a new
> buffer (larger than one character), and then the later '\n' writing
> decides that we've filled up.

Thanks, I'm testing this and the pty_write() fix on i686 and ppc64 now.

Sometimes the bug is difficult to trigger, so I may need to do loads of
testing with different gcc versions before I dare to say that it's fixed.

/Mikael

2009-09-05 17:01:00

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

Linus Torvalds <[email protected]> writes:

> On Fri, 4 Sep 2009, Linus Torvalds wrote:
>>
>> So I'm starting to suspect that the real bug is that we do that
>> 'pty_space()' in pty_write() call at all. The _callers_ should already
>> have done the write_room() check, and if somebody doesn't do it, then the
>> tty buffering will eventually do a hard limit at the 65kB allocation mark.
>
> Ok, so the thought was right, but the patch was obviously not even
> compiled, because the compiler points out that 'c' was not initialized.
>
> I'm sure you already figured the obvious meaning out, but here's a fixed
> version.

This is not meaning to object to your patch though, I think we would be
good to fix pty_space(), not leaving as wrong. With fix it, I guess we
don't get strange behavior in the near of buffer limit.

Also, it seems the non-n_tty path doesn't use tty_write_room() check,
and instead it just try to write and check written bytes which returned
by tty->ops->write().

So, it will use the 64kb limit at least few paths, and I'm not sure
though, non-n_tty path (e.g. ppp) doesn't use tty_write_room() check
always. It may not be consistent if we removed pty_space() in pty_write().

So, it may not be issue though, I made this patch to fix
pty_space(). What do you think?

Well, anyway, I've tested your patch and this patch fixed the
gcc-testsuite on my machine.

Thanks.
--
OGAWA Hirofumi <[email protected]>

Signed-off-by: OGAWA Hirofumi <[email protected]>
---

diff -puN drivers/char/pty.c~tty-debug drivers/char/pty.c
--- linux-2.6/drivers/char/pty.c~tty-debug 2009-09-05 20:10:35.000000000 +0900
+++ linux-2.6-hirofumi/drivers/char/pty.c 2009-09-06 01:50:44.000000000 +0900
@@ -91,7 +91,7 @@ static void pty_unthrottle(struct tty_st

static int pty_space(struct tty_struct *to)
{
- int n = 8192 - to->buf.memory_used;
+ int n = 8192 - tty_buffer_used(to);
if (n < 0)
return 0;
return n;
diff -puN drivers/char/tty_buffer.c~tty-debug drivers/char/tty_buffer.c
--- linux-2.6/drivers/char/tty_buffer.c~tty-debug 2009-09-05 21:02:48.000000000 +0900
+++ linux-2.6-hirofumi/drivers/char/tty_buffer.c 2009-09-05 21:08:47.000000000 +0900
@@ -231,6 +231,30 @@ int tty_buffer_request_room(struct tty_s
EXPORT_SYMBOL_GPL(tty_buffer_request_room);

/**
+ * tty_buffer_used - return used tty buffer size
+ * @tty: tty structure
+ *
+ * Return used tty buffer size.
+ */
+size_t tty_buffer_used(struct tty_struct *tty)
+{
+ size_t size;
+ int left;
+ unsigned long flags;
+
+ spin_lock_irqsave(&tty->buf.lock, flags);
+ if (tty->buf.tail)
+ left = tty->buf.tail->size - tty->buf.tail->used;
+ else
+ left = 0;
+ size = tty->buf.memory_used - left;
+ spin_unlock_irqrestore(&tty->buf.lock, flags);
+
+ return size;
+}
+EXPORT_SYMBOL_GPL(tty_buffer_used);
+
+/**
* tty_insert_flip_string - Add characters to the tty buffer
* @tty: tty structure
* @chars: characters
diff -puN include/linux/tty_flip.h~tty-debug include/linux/tty_flip.h
--- linux-2.6/include/linux/tty_flip.h~tty-debug 2009-09-05 21:06:25.000000000 +0900
+++ linux-2.6-hirofumi/include/linux/tty_flip.h 2009-09-05 21:06:39.000000000 +0900
@@ -2,6 +2,7 @@
#define _LINUX_TTY_FLIP_H

extern int tty_buffer_request_room(struct tty_struct *tty, size_t size);
+extern size_t tty_buffer_used(struct tty_struct *tty);
extern int tty_insert_flip_string(struct tty_struct *tty, const unsigned char *chars, size_t size);
extern int tty_insert_flip_string_flags(struct tty_struct *tty, const unsigned char *chars, const char *flags, size_t size);
extern int tty_prepare_flip_string(struct tty_struct *tty, unsigned char **chars, size_t size);
_

2009-09-05 18:07:33

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

On Sun, 6 Sep 2009, OGAWA Hirofumi wrote:
>
> This is not meaning to object to your patch though, I think we would be
> good to fix pty_space(), not leaving as wrong. With fix it, I guess we
> don't get strange behavior in the near of buffer limit.

I'd actually rather not make that function any more complicated.

Just make the rules be very simple:

- the pty layer has ~64kB buffering, and if you just blindly do a
->write() op, you can see how many characters you were able to write.

- before doing a ->write() op, you can ask how many characters you are
guaranteed to be able to write by doing a "->write_room()" call.

..and then the bug literally was just that "pty_write()" was confused, and
thought that it should do that "write_room()" thing, which it really
shouldn't ever have done.

So I really think that the true fix is to just remove the code from
pty_write(), and not do anything more complicated. I'll also commit the
change to write '\r\n' as one single string, because quite frankly, it's
just stupid to do it as two characters, but at that point it's just a
cleanup.

> Also, it seems the non-n_tty path doesn't use tty_write_room() check,
> and instead it just try to write and check written bytes which returned
> by tty->ops->write().

.. and I think that's fine. I think write_room() should be used sparingly,
and only by code that cares about being able to fit at least 'n'
characters in the tty buffers. In fact, I think even n_tty would likely in
general be better off without it (and just check the return value), but
because of the stateful character translation (that doesn't actually keep
any state around, it just wants to expand things as it goes along), and
because of historical reasons, we'll just keep it using write_room.

Linus

2009-09-05 18:56:04

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

Linus Torvalds <[email protected]> writes:

> On Sun, 6 Sep 2009, OGAWA Hirofumi wrote:
>>
>> This is not meaning to object to your patch though, I think we would be
>> good to fix pty_space(), not leaving as wrong. With fix it, I guess we
>> don't get strange behavior in the near of buffer limit.
>
> I'd actually rather not make that function any more complicated.
>
> Just make the rules be very simple:
>
> - the pty layer has ~64kB buffering, and if you just blindly do a
> ->write() op, you can see how many characters you were able to write.
>
> - before doing a ->write() op, you can ask how many characters you are
> guaranteed to be able to write by doing a "->write_room()" call.
>
> ..and then the bug literally was just that "pty_write()" was confused, and
> thought that it should do that "write_room()" thing, which it really
> shouldn't ever have done.
>
> So I really think that the true fix is to just remove the code from
> pty_write(), and not do anything more complicated. I'll also commit the
> change to write '\r\n' as one single string, because quite frankly, it's
> just stupid to do it as two characters, but at that point it's just a
> cleanup.

But, current write_room() returns almost all wrong value. For example,
if we have the 4kb preallocated buffer in some state and used it,
->memory_used will be 4kb even if we are using only a byte actually.

I thought it's strange/wrong, even if we removed the pty_space() in
pty_write().

>> Also, it seems the non-n_tty path doesn't use tty_write_room() check,
>> and instead it just try to write and check written bytes which returned
>> by tty->ops->write().
>
> .. and I think that's fine. I think write_room() should be used sparingly,
> and only by code that cares about being able to fit at least 'n'
> characters in the tty buffers. In fact, I think even n_tty would likely in
> general be better off without it (and just check the return value), but
> because of the stateful character translation (that doesn't actually keep
> any state around, it just wants to expand things as it goes along), and
> because of historical reasons, we'll just keep it using write_room.

As a bit long term solution, I agree. Current code seems to have fragile
buffer handling about echoes, \n etc. And yes, perhaps, to avoid
write_room() is clean way.

But, I felt 64kb (pty_write) vs 8kb (pty_write_room) sounds strange
currently.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2009-09-05 20:30:03

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

On Sat, 5 Sep 2009, Mikael Pettersson wrote:
>
> Thanks, I'm testing this and the pty_write() fix on i686 and ppc64 now.
>
> Sometimes the bug is difficult to trigger, so I may need to do loads of
> testing with different gcc versions before I dare to say that it's fixed.

Ok, I'm going to commit the two patches, because even if there is some
other bug hiding too (and it doesn't fix your thing - although I think it
will), this is definitely a real fix regardless.

And I want to do a -rc9 and let it get a couple of days of testing, since
I got way more pull requests than I hoped for while I was off diving.

Linus

2009-09-05 20:57:38

by Alan

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

> So, it will use the 64kb limit at least few paths, and I'm not sure
> though, non-n_tty path (e.g. ppp) doesn't use tty_write_room() check
> always. It may not be consistent if we removed pty_space() in pty_write().

The correct behaviour for most network protocols to overflow is to drop
packets so the behaviour of not checking was intentional.

Alan

2009-09-05 22:42:27

[permalink] [raw]

Subject: Re: [Bug #14015] pty regressed again, breaking expect and gcc's testsuite

Linus Torvalds writes:
>
>
> On Sat, 5 Sep 2009, Mikael Pettersson wrote:
> >
> > Thanks, I'm testing this and the pty_write() fix on i686 and ppc64 now.
> >
> > Sometimes the bug is difficult to trigger, so I may need to do loads of
> > testing with different gcc versions before I dare to say that it's fixed.
>
> Ok, I'm going to commit the two patches, because even if there is some
> other bug hiding too (and it doesn't fix your thing - although I think it
> will), this is definitely a real fix regardless.
>
> And I want to do a -rc9 and let it get a couple of days of testing, since
> I got way more pull requests than I hoped for while I was off diving.

With these two fixes my i686 box finished all gcc bootstrap/regtest
cycles I had scheduled with no signs of pty errors. My ppc64 box
isn't done yet, but so far it's not seen any errors either. So I'm
reasonably optimistic about these fixes.

/Mikael

2009-09-05 22:46:56