2012-02-15 18:51:40

by Egmont Koblinger

Subject: PROBLEM: Data corruption when pasting large data to terminal

Hi,

Short summary: When pasting a large amount of data (>4kB) into a
terminal, the data often gets mangled.

How to reproduce:
Create a text file that contains this line about 100 times:
a=(123456789123456789123456789123456789123456789123456789123456789)
(also available at http://pastebin.com/LAH2bmaw for a while)
and then copy-paste its entire contents in one step into a "bash" or
"python" running in a graphical terminal.

Expected result: The interpreter correctly parses these lines and
produces no visible result.
Actual result: They complain about syntax error.
Reproducibility: About 10% of the time on my computer (2.6.38.8);
reportedly 100% on friends' computers running 2.6.37 and 3.1.1.

Why I believe this is a kernel bug:
- Reproducible with any source of copy-pasting (e.g. various
terminals, graphical editors, browsers).
- Reproducible with at least five different popular graphical terminal
emulators where you paste into (xterm, gnome, kde, urxvt, putty).
- Reproducible with at least two applications (bash, python).
- stracing the terminal shows that it does indeed write the correct
copy-paste buffer into /dev/ptmx, and all its writes return the full
amount of bytes requested, i.e. no short write.
- stracing the application clearly shows that it does not receive all
the desired characters from its stdin, some are simply missing, i.e. a
read(0, "3", 1) = 1 is followed by a read(0, "\n", 1) = 1 (with a
write() and some rt_sigprocmask()s in between), although the char '3'
shouldn't be followed by a newline.
- Not reproducible on MacOS.

Additional information:
- On friends' computers the bug always happens from offset 4163, which
is exactly the length of the first line (68 bytes, data immediately
processed by the application) plus the magic 4095 (68 + 4095 = 4163).
The rest of that line, up to the next newline, is cut off.

- On my computer the bug, if it happens, always happens at an offset
beyond this one; moreover, there's a lone digit '3' appearing on the
display on its own line exactly 4095 bytes before the syntax error.
Here's a "screenshot" with "$ " being the bash prompt, and with my
comments after "#":

$ a=(123456789123456789123456789123456789123456789123456789123456789)
# repeated a few, varying number of times
3
# <- notice this lone '3' on the display
$ a=(123456789123456789123456789123456789123456789123456789123456789)
# 60 times, that's 4080 bytes incl. newlines
$ a=(123456789123
> a=(123456789123456789123456789123456789123456789123456789123456789)
bash: syntax error near unexpected token `('
$ a=(123456789123456789123456789123456789123456789123456789123456789)
# a few more times

- I couldn't reproduce it with cat-like applications; I have a feeling
the bug perhaps only occurs in raw terminal mode, but I'm really not
sure about this.


I'd be glad if you could find the time to look at this problem; it's
quite unfortunate that I cannot safely copy-paste large amounts of data
into terminals.
Thanks a lot,
egmont


2012-02-15 23:30:27

by Greg KH

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal

On Wed, Feb 15, 2012 at 07:50:58PM +0100, Egmont Koblinger wrote:
> Hi,
>
> Short summary: When pasting a large amount of data (>4kB) into a
> terminal, the data often gets mangled.
>
> How to reproduce:
> Create a text file that contains this line about 100 times:
> a=(123456789123456789123456789123456789123456789123456789123456789)
> (also available at http://pastebin.com/LAH2bmaw for a while)
> and then copy-paste its entire contents in one step into a "bash" or
> "python" running in a graphical terminal.
>
> Expected result: The interpreter correctly parses these lines and
> produces no visible result.
> Actual result: They complain about syntax error.
> Reproducibility: About 10% of the time on my computer (2.6.38.8);
> reportedly 100% on friends' computers running 2.6.37 and 3.1.1.

Has this ever worked properly for you on older kernels? How about 3.2?
3.3-rc3? Having a "known good" point to work from would be nice.

I can reproduce this using bash, BUT, I can not reproduce it using vim
running in the same window bash was running in.

So, that implies that this is a userspace bug, not a kernel one,
otherwise the results would be the same both times, right?

> Why I believe this is a kernel bug:
> - Reproducible with any source of copy-pasting (e.g. various
> terminals, graphical editors, browsers).

Bugs are common when people start with the same original codebase :)

> - Reproducible with at least five different popular graphical terminal
> emulators where you paste into (xterm, gnome, kde, urxvt, putty).
> - Reproducible with at least two applications (bash, python).

Again, I can't duplicate this with vim in a terminal window, which rules
out the terminal, and points at bash, right?

> - stracing the terminal shows that it does indeed write the correct
> copy-paste buffer into /dev/ptmx, and all its writes return the full
> amount of bytes requested, i.e. no short write.

short writes are legal, but so many userspace programs don't handle them
properly.

> - stracing the application clearly shows that it does not receive all
> the desired characters from its stdin, some are simply missing, i.e. a
> read(0, "3", 1) = 1 is followed by a read(0, "\n", 1) = 1 (with a
> write() and some rt_sigprocmask()s in between), although the char '3'
> shouldn't be followed by a newline.

Perhaps the buffer is overflowing as the program isn't able to keep up
properly? It's not an "endless" buffer, it can overflow if reads don't
keep up.

> - Not reproducible on MacOS.

That means nothing :)

> Additional information:
> - On friends' computers the bug always happens from offset 4163, which
> is exactly the length of the first line (68 bytes, data immediately
> processed by the application) plus the magic 4095 (68 + 4095 = 4163).
> The rest of that line, up to the next newline, is cut off.
>
> - On my computer the bug, if it happens, always happens at an offset
> beyond this one; moreover, there's a lone digit '3' appearing on the
> display on its own line exactly 4095 bytes before the syntax error.
> Here's a "screenshot" with "$ " being the bash prompt, and with my
> comments after "#":
>
> $ a=(123456789123456789123456789123456789123456789123456789123456789)
> # repeated a few, varying number of times
> 3
> # <- notice this lone '3' on the display
> $ a=(123456789123456789123456789123456789123456789123456789123456789)
> # 60 times, that's 4080 bytes incl. newlines
> $ a=(123456789123
> > a=(123456789123456789123456789123456789123456789123456789123456789)
> bash: syntax error near unexpected token `('
> $ a=(123456789123456789123456789123456789123456789123456789123456789)
> # a few more times
>
> - I couldn't reproduce it with cat-like applications; I have a feeling
> the bug perhaps only occurs in raw terminal mode, but I'm really not
> sure about this.

That kind of proves the "there's a problem in the application you are
testing" theory, right?

> I'd be glad if you could find the time to look at this problem; it's
> quite unfortunate that I cannot safely copy-paste large amounts of data
> into terminals.

Works for me, just use an editor to do that...

thanks,

greg k-h

2012-02-15 23:58:18

by Parag Warudkar

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal

On Wed, Feb 15, 2012 at 1:50 PM, Egmont Koblinger <[email protected]> wrote:
> Hi,
>
> Short summary: When pasting a large amount of data (>4kB) into a
> terminal, the data often gets mangled.
>
> How to reproduce:
> Create a text file that contains this line about 100 times:
> a=(123456789123456789123456789123456789123456789123456789123456789)
> (also available at http://pastebin.com/LAH2bmaw for a while)
> and then copy-paste its entire contents in one step into a "bash" or
> "python" running in a graphical terminal.
>

FWIW, this also works fine on cygwin / Windows 7. No errors.

$ bash --version
GNU bash, version 4.1.10(4)-release (i686-pc-cygwin)

Unsure what that means though - probably nothing!

Greg - when you said it works in vim: since of course vim isn't
'parsing' the input, maybe you did not see an error - or did you
actually verify all 4KB somehow? ;)

Parag

2012-02-16 00:10:57

by Greg KH

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal

On Wed, Feb 15, 2012 at 06:58:12PM -0500, Parag Warudkar wrote:
> On Wed, Feb 15, 2012 at 1:50 PM, Egmont Koblinger <[email protected]> wrote:
> > Hi,
> >
> > Short summary: When pasting a large amount of data (>4kB) into a
> > terminal, the data often gets mangled.
> >
> > How to reproduce:
> > Create a text file that contains this line about 100 times:
> > a=(123456789123456789123456789123456789123456789123456789123456789)
> > (also available at http://pastebin.com/LAH2bmaw for a while)
> > and then copy-paste its entire contents in one step into a "bash" or
> > "python" running in a graphical terminal.
> >
>
> FWIW, this also works fine on cygwin / Windows 7. No errors.
>
> $ bash --version
> GNU bash, version 4.1.10(4)-release (i686-pc-cygwin)
>
> Unsure what that means though - probably nothing!
>
> Greg - when you said it works in vim: since of course vim isn't
> 'parsing' the input, maybe you did not see an error - or did you
> actually verify all 4KB somehow? ;)

I verified that the input actually matched the paste buffer. It's
pretty trivial to do so.

Odds are emacs also does this correctly, anyone care to verify that?

thanks,

greg k-h

2012-02-16 00:40:43

by Egmont Koblinger

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal

Hi Greg,

Sorry, I didn't emphasize the point that makes me suspect it's a kernel issue:

- strace reveals that the terminal emulator writes the correct data
into /dev/ptmx, and the kernel reports no short writes(!), all the
write(..., ..., 68) calls actually return 68 (the length of the
example file's lines incl. newline; I'm naively assuming I can trust
strace here.)
- strace reveals that the receiving application (bash) doesn't receive
all the data from /dev/pts/N.
- so: the data gets lost after writing to /dev/ptmx, but before
reading it out from /dev/pts/N.

First I was also hoping for a bug in the terminal emulators not
handling short writes correctly, but it's not the case.

Could you please verify that stracing the terminal and the app shows
the same behavior to you? If it's the same, and if strace correctly
reports the actual number of bytes written, then can it still be an
application bug?

Not being able to reproduce it in vim/whatever doesn't mean too much,
as it seems to be some kind of race condition (it behaves differently
on different machines, and is buggy only ~10% of the time for me); the
actual circumstances that trigger the bug might depend on timing, or
the way the applications read the buffer (byte by byte, or in larger
chunks), or the number of processors, or I don't know what.

Unfortunately I have no "known good" reference point, but I recall
seeing a similar bug a year or two ago, I just didn't pay attention to
it back then. So it's probably not a new one.


thanks a lot,
egmont



2012-02-16 00:58:28

by Greg KH

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal


A: No.
Q: Should I include quotations after my reply?

http://daringfireball.net/2007/07/on_top

On Thu, Feb 16, 2012 at 01:39:59AM +0100, Egmont Koblinger wrote:
> Hi Greg,
>
> Sorry, I didn't emphasize the point that makes me suspect it's a kernel issue:
>
> - strace reveals that the terminal emulator writes the correct data
> into /dev/ptmx, and the kernel reports no short writes(!), all the
> write(..., ..., 68) calls actually return 68 (the length of the
> example file's lines incl. newline; I'm naively assuming I can trust
> strace here.)
> - strace reveals that the receiving application (bash) doesn't receive
> all the data from /dev/pts/N.
> - so: the data gets lost after writing to /dev/ptmx, but before
> reading it out from /dev/pts/N.

Which it will, if the reader doesn't read fast enough, right? Is the
data somewhere guaranteed to never "overrun" the buffer? If so, how do
we handle not just running out of memory?

> First I was also hoping for a bug in the terminal emulators not
> handling short writes correctly, but it's not the case.

Yes, that would make things easier.

> Could you please verify that stracing the terminal and the app shows
> the same behavior to you? If it's the same, and if strace correctly
> reports the actual number of bytes written, then can it still be an
> application bug?

You can do that stracing vim if you want to. I'm on the road at the
moment, and have to give a presentation in a few minutes, so my spare
time for this is a bit limited :)

> Not being able to reproduce it in vim/whatever doesn't mean too much,
> as it seems to be some kind of race condition (it behaves differently
> on different machines, and is buggy only ~10% of the time for me); the
> actual circumstances that trigger the bug might depend on timing, or
> the way the applications read the buffer (byte by byte, or in larger
> chunks), or the number of processors, or I don't know what.

Not being able to reproduce it with a different userspace program is
important, in that there is at least one "known good" userspace program
here that does things correctly.

I bet you can write a simple userspace program that also does this
correctly, have you tried that? That might be the best way to provide
a "tiny" reproducer.

Odds are bash and python don't do things properly, as they aren't
accustomed to such large buffers coming in at this rate of speed. They
aren't designed for this type of thing, while vim is used to it.

> Unfortunately I have no "known good" reference point, but I recall
> seeing a similar bug a year or two ago, I just didn't pay attention to
> it back then. So it's probably not a new one.

If you can trace something down in the kernel to point to where we are
doing something wrong, I would be glad to look at it. But without that,
there's not much I can do here, sorry.

thanks,

greg k-h

2012-02-16 01:13:20

by Egmont Koblinger

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal

On Thu, Feb 16, 2012 at 01:54, Greg KH <[email protected]> wrote:

>> - so: the data gets lost after writing to /dev/ptmx, but before
>> reading it out from /dev/pts/N.
>
> Which it will, if the reader doesn't read fast enough, right?  Is the
> data somewhere guaranteed to never "overrun" the buffer?  If so, how do
> we handle not just running out of memory?

If the buffer is full, the write() into /dev/ptmx should signal it by
returning a smaller-than-requested number (short write) or 0, or it
should block (perhaps depending on whether the fd is in blocking mode)
-- am I correct? If the write() returns 68 whereas it actually stored
15 bytes and threw out the rest, then there's no way applications
could handle this. As I said, stracing the terminal tells me that the
terminal wrote 68 bytes 100 times, and all hundred times the write()
call returned 68, not less.
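
(For reference, this is roughly the handling I'd expect from a writer
on a blocking fd -- just a sketch; and per strace the terminal never
even gets a short count that it would have to resume from:)

#include <errno.h>
#include <unistd.h>

/* keep writing until the whole buffer is accepted; a short return
   value just means "resume from the new offset", never lost data */
static ssize_t write_all(int fd, const char *buf, size_t len)
{
        size_t done = 0;
        while (done < len) {
                ssize_t n = write(fd, buf + done, len - done);
                if (n < 0) {
                        if (errno == EINTR)
                                continue;  /* interrupted, retry */
                        return -1;         /* real error */
                }
                done += n;                 /* n may be less than requested */
        }
        return done;
}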

> I bet you can write a simple userspace program that also does this
> correctly, have you tried that? That might be the best way to provide
> a "tiny" reproducer.
>
> Odds are bash and python don't do things properly, as they aren't
> accustomed to such large buffers coming in at this rate of speed. They
> aren't designed for this type of thing, while vim is used to it.

Again: stracing bash reveals that it reads from its stdin byte by
byte, and at one point it read()s a '3' character followed by a '\n'
with the next read() - something that never occurs in the input.

I wish I had a simpler test case, but I don't have one yet.

Anyway, I'm not looking at the internals of the terminal or the
application, I'm observing the interface where they talk to the pty,
and what I see is that the terminal writes the correct stuff, and the
application receives a garbled one. Whatever bash does with that data
later on is irrelevant, isn't it?


> If you can trace something down in the kernel to point to where we are
> doing something wrong, I would be glad to look at it.  But without that,
> there's not much I can do here, sorry.

I still can't understand why the two strace outputs are not convincing
enough. I'm also uncertain what the desired behavior is if one tries
to write to the ptmx while the kernel's buffer is full.


thanks,
e.

2012-02-16 11:42:50

by Egmont Koblinger

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal

On Thu, Feb 16, 2012 at 01:10, Greg KH <[email protected]> wrote:
>
> Odds are emacs also does this correctly, anyone care to verify that?

Actually, thanks for the hint: emacs is the first editor where I could
easily trigger the bug. I made the input data a bit longer (200
lines), and it gets pasted incorrectly every time.

The strace of xterm contains exactly this, with the "= 68" at the end, 200 times:
write(5, "a=(12345678912345678912345678912"..., 68) = 68

Unlike xterm, gnome-terminal doesn't split the buffer at newlines; it
tries to write to /dev/ptmx in one single step. In this example I
extended the file to 1000 lines (68000 bytes); here's what strace
gnome-terminal says:
open("/dev/ptmx", O_RDWR) = 17
[...]
write(17, "a=(12345678912345678912345678912"..., 68000) = 65280
[...]
write(17, "a=(12345678912345678912345678912"..., 2720) = 2720

So copy-paste starts misbehaving after 4kB, which suggests a buffer of
around that size; on the other hand, a single write to /dev/ptmx can
transfer almost 64kB, which suggests a much larger buffer and makes it
strange that things start misbehaving as early as 4kB.

On a side note, gnome-terminal apparently handles short writes
correctly. (It's a mere accident that the size of the first write,
65280, is divisible by the line length of my example, 68. A different
input shows that the second write indeed continues sending the buffer
from the correct offset, from the middle of a line. The return value
of the first write is the same, 65280, in that case too.)


thx,
egmont

2012-02-17 19:28:32

by Pavel Machek

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal

Hi!

> > Sorry, I didn't emphasize the point that makes me suspect it's a kernel issue:
> >
> > - strace reveals that the terminal emulator writes the correct data
> > into /dev/ptmx, and the kernel reports no short writes(!), all the
> > write(..., ..., 68) calls actually return 68 (the length of the
> > example file's lines incl. newline; I'm naively assuming I can trust
> > strace here.)
> > - strace reveals that the receiving application (bash) doesn't receive
> > all the data from /dev/pts/N.
> > - so: the data gets lost after writing to /dev/ptmx, but before
> > reading it out from /dev/pts/N.
>
> Which it will, if the reader doesn't read fast enough, right? Is the
> data somewhere guaranteed to never "overrun" the buffer? If so, how do
> we handle not just running out of memory?

Start blocking the writer?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2012-02-17 22:04:22

by Bruno Prémont

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal

Hi,

On Fri, 17 February 2012 Pavel Machek <[email protected]> wrote:
> > > Sorry, I didn't emphasize the point that makes me suspect it's a kernel issue:
> > >
> > > - strace reveals that the terminal emulator writes the correct data
> > > into /dev/ptmx, and the kernel reports no short writes(!), all the
> > > write(..., ..., 68) calls actually return 68 (the length of the
> > > example file's lines incl. newline; I'm naively assuming I can trust
> > > strace here.)
> > > - strace reveals that the receiving application (bash) doesn't receive
> > > all the data from /dev/pts/N.
> > > - so: the data gets lost after writing to /dev/ptmx, but before
> > > reading it out from /dev/pts/N.
> >
> > Which it will, if the reader doesn't read fast enough, right? Is the
> > data somewhere guaranteed to never "overrun" the buffer? If so, how do
> > we handle not just running out of memory?
>
> Start blocking the writer?

I quickly wrote a small test program (attached). It forks a reader
child and sends data over to it; at the end both write their copy of
the buffer to a /tmp/ptmx_{in,out}.txt file for manually comparing the
results (in addition to a basic report of the line where the mismatch
starts).
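
(For the impatient, the skeleton is roughly the following; the attached
ptmx.c additionally keeps both copies for comparison and has the
tunable sizes, so details differ:)

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <termios.h>
#include <unistd.h>

#define BUFF_SZ (64 * 1024)

int main(void)
{
        static char data[BUFF_SZ];
        int master, slave, i;
        struct termios t;

        master = posix_openpt(O_RDWR | O_NOCTTY);
        grantpt(master);
        unlockpt(master);
        slave = open(ptsname(master), O_RDWR | O_NOCTTY);

        tcgetattr(slave, &t);           /* echo off: otherwise the unread */
        t.c_lflag &= ~ECHO;             /* echo fills the master side and */
        tcsetattr(slave, TCSANOW, &t);  /* the writer blocks forever      */

        if (fork() == 0) {              /* child: the reader */
                char buf[256];
                size_t total = 0;
                ssize_t r;
                close(master);
                while ((r = read(slave, buf, sizeof buf)) > 0)
                        total += r;     /* the real test stores the data */
                fprintf(stderr, "reader saw %zu of %d bytes\n",
                        total, BUFF_SZ);
                _exit(0);
        }

        close(slave);
        for (i = 0; i < BUFF_SZ; i++)   /* complete 64-byte lines */
                data[i] = ((i + 1) % 64 == 0) ? '\n' : 'x';
        for (i = 0; i < BUFF_SZ; ) {
                ssize_t w = write(master, data + i, BUFF_SZ - i);
                if (w <= 0)
                        break;
                i += w;
        }
        close(master);                  /* NB: hangs up the slave side */
        wait(NULL);
        return 0;
}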

From the time it took the writer to write the larger buffers (as seen
using strace) it seems there *is* some kind of blocking, but it's
either not blocking long enough or unblocking too early when the
reader does not keep up.


For quick and dirty testing of the effects of buffer sizes, tune
"rsz", "wsz" and "line" in main(), as well as the total size with the
BUFF_SZ define.


The effect for me is that the writer writes all the data but the
reader never sees the tail of it (how much is seen varies, probably a
matter of scheduling, frequency scaling and similar racing factors).

My test system is a single-core uniprocessor Centrino laptop (32bit
x86) with a 3.2.5 kernel.

Bruno


Attachments:
ptmx.c (4.54 kB)

2012-02-19 20:56:07

by Egmont Koblinger

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal

Hi Bruno,

Unfortunately the lost tail is a different thing: the terminal is in
cooked mode by default, so the kernel intentionally keeps the data in
its buffer until it sees a complete line. A quick-and-dirty way of
changing to byte-based transmission (I'm too lazy to look up the actual
system calls, apologies for the terribly ugly way of doing this) is:
                 pty = open(ptsdname, O_RDWR);
                 if (pty == -1) { ... }
+                char cmd[100];
+                sprintf(cmd, "stty raw <>%s", ptsdname);
+                system(cmd);
                 ptmx_slave_test(pty, line, rsz);
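
(For the record, I believe the rough equivalent with the actual system
calls is this; untested sketch, and note that cfmakeraw() also turns
off ECHO, which "stty raw" leaves alone:)

struct termios tio;                     /* needs <termios.h> */
if (tcgetattr(pty, &tio) == 0) {
        cfmakeraw(&tio);                /* raw mode, echo off too */
        tcsetattr(pty, TCSANOW, &tio);  /* same as the TCSETS ioctl */
}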

Anyway, thanks very much for your test program, I'll try to modify it
to trigger the data corruption bug.


egmont


2012-02-19 21:14:40

by Bruno Prémont

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal

Hi Egmont,

On Sun, 19 February 2012 Egmont Koblinger <[email protected]> wrote:
> Unfortunately the lost tail is a different thing: the terminal is in
> cooked mode by default, so the kernel intentionally keeps the data in
> its buffer until it sees a complete line. A quick-and-dirty way of
> changing to byte-based transmission (I'm too lazy to look up the actual
> system calls, apologies for the terribly ugly way of doing this) is:
> pty = open(ptsdname, O_RDWR);
> if (pty == -1) { ... }
> + char cmd[100];
> + sprintf(cmd, "stty raw <>%s", ptsdname);
> + system(cmd);
> ptmx_slave_test(pty, line, rsz);
>
> Anyway, thanks very much for your test program, I'll try to modify it
> to trigger the data corruption bug.

Well, I'm not sure, but closing the ptmx on the sender side should
force the kernel to flush whatever remains, independently of
end-of-line (I was thinking I should push an EOF over the ptmx instead
of closing it before waiting for the child process, though I have not
yet looked up how to do so!).
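
(I guess pushing an EOF would mean writing the VEOF character, usually
^D, to the master while the slave is in canonical mode; untested, and
"master" here stands for whatever the ptmx fd is called in the test:)

struct termios tio;                     /* needs <termios.h> */
tcgetattr(pty, &tio);                   /* pty: the slave side fd */
char eof = tio.c_cc[VEOF];
write(master, &eof, 1);                 /* should flush the pending
                                           partial line to the reader */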

The missing tails in my few runs of the test program were of varying
length, but in all cases way more than a single line, so I would hope
it's not line-buffering by the kernel that causes the missing data!

Bruno



2012-02-19 21:33:27

by Alan

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal

> Well, not sure but the closing of ptmx on sender side should force kernel
> to flush whatever is remaining independently on end-of-line (I was
> thinking I should push an EOF over the ptmx instead of closing it before
> waiting for child process though I have not yet looked-up how to do so!).
>

The behaviour for the master side on a close is to hang up the child.

You would normally wait for the child to exit first.

2012-02-19 21:42:31

by Egmont Koblinger

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal

Hi Bruno,

On Sun, Feb 19, 2012 at 22:14, Bruno Prémont <[email protected]> wrote:
> Hi Egmont,
>
> On Sun, 19 February 2012 Egmont Koblinger <[email protected]> wrote:
>> Unfortunately the lost tail is a different thing: the terminal is in
>> cooked mode by default, so the kernel intentionally keeps the data in
>> its buffer until it sees a complete line.  A quick-and-dirty way of
>> changing to byte-based transmission (I'm too lazy to look up the actual
>> system calls, apologies for the terribly ugly way of doing this) is:
>>                  pty = open(ptsdname, O_RDWR);
>>                  if (pty == -1) { ... }
>> +                char cmd[100];
>> +                sprintf(cmd, "stty raw <>%s", ptsdname);
>> +                system(cmd);
>>                  ptmx_slave_test(pty, line, rsz);
>>
>> Anyway, thanks very much for your test program, I'll try to modify it
>> to trigger the data corruption bug.
>
> Well, I'm not sure, but closing the ptmx on the sender side should
> force the kernel to flush whatever remains, independently of
> end-of-line (I was thinking I should push an EOF over the ptmx instead
> of closing it before waiting for the child process, though I have not
> yet looked up how to do so!).

As Alan also pointed out, closing things is not handled very nicely in
the example. However, I didn't face a problem with that: I'm not
particularly interested in whether the application receives all the
data if I kill the underlying terminal. My problem is data corruption
way before the end of the stream, with actually incorrect bytes
received by the application (not just an early EOF due to a closed
terminal). I'm trying hard to reproduce that with a simple example,
but I haven't succeeded so far.

Note that I've triggered the bug with 4 apps so far: emacs (which is
always in char-based input mode), and three readline apps (which keep
switching back and forth between the two modes). I have no clue yet
whether the bug itself is related to raw char-based mode or not, but I
guess switching to this mode might not hurt.


egmont


2012-02-20 17:18:53

by Egmont Koblinger

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal

Further investigation reveals that:

- In the case of emacs, strace shows that it receives the correct data
on its standard input, so it's an emacs bug, not a kernel one. My bad.

- For the other three apps, all readline-based (bash, python, bc),
strace shows that wherever the data is correct, lines are terminated
by '\r' (as seems to be the standard in raw terminal mode; the
terminal emulator always sends this character), whereas as soon as it
gets buggy, the received character becomes a '\n' (as is the way in
cooked terminal mode). Here's an excerpt of 'strace bash', grepping
only the reads from stdin:
read(0, "8", 1) = 1
read(0, "9", 1) = 1
read(0, ")", 1) = 1
read(0, "\r", 1) = 1 <-- everything's fine
read(0, "a", 1) = 1
read(0, "=", 1) = 1
read(0, "(", 1) = 1
read(0, "1", 1) = 1
...
read(0, "2", 1) = 1
read(0, "3", 1) = 1 <-- a line
shouldn't end with '3',
read(0, "\n", 1) = 1 <-- and it's a '\n'
where it's buggy
read(0, "a", 1) = 1
read(0, "=", 1) = 1
read(0, "(", 1) = 1
read(0, "1", 1) = 1
read(0, "2", 1) = 1

- This, in combination with the fact that we haven't been able to
reproduce the bug with a raw-only or cooked-only terminal, suggests
that there's some kind of race condition when writes, reads and
termios changes are all involved.

I'll keep on investigating. There's quite a lot for me to learn, e.g.
I'm wondering if maybe readline uses the TCSETS* ioctls incorrectly.

Right now readline only uses TCSETSW to change the terminal settings;
it toggles back and forth between two states (raw when expecting user
input, cooked when processing a command), and only read()s in the raw
state. Is this the correct behavior? Even if it used the wrong ioctl,
would that explain data missing from the input stream? TCSETSF seems
to be the one that can cause data to be dropped, but according to
strace, readline doesn't use it.
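
(For reference, if I read the libc headers right, these ioctls map to
the tcsetattr() optional actions like this:)

tcsetattr(fd, TCSANOW,   &tio);  /* TCSETS:  apply immediately        */
tcsetattr(fd, TCSADRAIN, &tio);  /* TCSETSW: wait for output to drain */
tcsetattr(fd, TCSAFLUSH, &tio);  /* TCSETSF: drain output, discard
                                    unread input */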

I'm quite new to this area, so any hint from terminal experts on how
it should work would be appreciated.

thanks a lot,
egmont



2012-02-20 17:31:16

by Pavel Machek

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal



> I'm quite new to this area, so any hint from terminal experts on how
> it should work would be appreciated.

One thing worth trying would be some old kernel (like 2.4.0 or 2.6.0);
if readline works as expected there, we have a kernel regression.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2012-02-20 21:11:57

by Egmont Koblinger

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal

Hi,

I attach a simple self-contained test case that triggers the bug most
of the time. Moreover, it turns out that we're facing a data
corruption issue plus a deadlock issue -- the test often randomly
triggers one or the other.

The test is a slight modification of Bruno's example (thanks!). The
most important change is: it emulates a readline app by setting the
terminal to cooked mode and doing some "work" (1 millisecond of sleep)
after every newline, then reverting it to raw mode.
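
(The relevant slave-side loop looks roughly like this, where raw and
cooked are two pre-filled struct termios, and storing each byte for
the final comparison is omitted:)

        char c;
        while (read(pty, &c, 1) == 1) {
                if (c == '\n') {
                        tcsetattr(pty, TCSANOW, &cooked);
                        usleep(1000);   /* 1 ms of "work" */
                        tcsetattr(pty, TCSANOW, &raw);
                }
        }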

Minor changes also include: ignoring the last 100 bytes (potentially
an incomplete line that stays in the kernel's buffer; the slave
doesn't expect it to arrive), plus a long sleep on the master after
writing its output (an ugly hack, but definitely long enough to give
the slave time to read everything).

The behavior is:
- Often: Corrupt data read (\r versus \n changes, as well as actual
loss of data), as reported by the slave.
- Often: Deadlock, the slave hangs in a read() reading from the
terminal, while the master hangs on its write() at the same time.

You can play with parameters like the buffer size, the write size
(wsz), the blocking vs. nonblocking mode of write, TCSETS versus
TCSETSW -- they don't make much of a difference.

What does make a difference though, is the read size (rsz). The bug
is reproducible if and only if the read size is a divisor of the
length of the line excluding the terminating newline (i.e. the length
of the full line minus one); that is, a divisor of 62 in this example.
So a read size of 1 (which is used by readline) triggers the bug with
all kinds of data; larger read sizes only with certain well-crafted
buffers.

Also, the bug is still only reproducible after writing at least 4kB.

This gives me a gut feeling (without actually having studied the
kernel's source) that it might be some circular buffer overrun:
whenever there's only 1 byte left in the buffer, the final newline of
a line, the writer can incorrectly wrap around in a 4k buffer and
overwrite that -- does this make any sense?

Interestingly, the test uses \n and \r reversed compared to real life
(the buffer should contain \r instead of \n, and ICRNL should be used
instead of INLCR) -- for some reason the test didn't trigger the bug
for me after swapping the two; I don't know why.

Anyway, I hope that this test case and my findings about the read size
helps catch and fix the bug.

Thanks a lot,
egmont


Attachments:
ptmx2.c (6.01 kB)

2012-02-20 21:30:19

by Egmont Koblinger

Subject: Re: PROBLEM: Data corruption when pasting large data to terminal

> What does make a difference though, is the read size (rsz).  The bug
> is reproducible if and only if the read size is a divisor of the
> length of the line excluding the terminating newline (i.e. the length
> of the full line minus one); that is, a divisor of 62 in this example.

Errr, forget this paragraph: it was a bug in my code. When trying
to emulate readline, rsz should really be 1. I was not scanning
through the whole buffer looking for a newline, and was looking at an
incorrect offset. Sorry for the PEBKAC.

A fixed version is attached. Anyway, the point is still the same: the
data corruption and the deadlock are still triggered the same way.


thx,
egmont


Attachments:
ptmx3.c (6.01 kB)