2013-06-06 11:54:21

by Markus Trippelsdorf

[permalink] [raw]
Subject: Strange intermittent EIO error when writing to stdout since v3.8.0

Since v3.8.0 several people reported intermittent IO errors that happen
during high system load while using "emerge" under Gentoo:
...
File "/usr/lib64/portage/pym/portage/util/_eventloop/EventLoop.py", line 260, in iteration
if not x.callback(f, event, *x.args):
File "/usr/lib64/portage/pym/portage/util/_async/PipeLogger.py", line 99, in _output_handler
stdout_buf[os.write(stdout_fd, stdout_buf):]
File "/usr/lib64/portage/pym/portage/__init__.py", line 246, in __call__
rval = self._func(*wrapped_args, **wrapped_kwargs)
OSError: [Errno 5] Input/output error

Basically 'emerge' just writes the build output to stdout in a loop:
...
def _output_handler(self, fd, event):

background = self.background
stdout_fd = self.stdout_fd
log_file = self._log_file

while True:
buf = self._read_buf(fd, event)

if buf is None:
# not a POLLIN event, EAGAIN, etc...
break

if not buf:
# EOF
self._unregister()
self.wait()
break

else:
if not background and stdout_fd is not None:
failures = 0
stdout_buf = buf
while stdout_buf:
try:
stdout_buf = \
stdout_buf[os.write(stdout_fd, stdout_buf):]
except OSError as e:
if e.errno != errno.EAGAIN:
raise
...

see: https://bugs.gentoo.org/show_bug.cgi?id=459674

(A similar issue also happens when building Firefox since v3.8.0. But
because Firefox's build process doesn't raise an exception it just dies
at random points without giving a clue.)

Now the question is: Could this be a kernel bug? Maybe in the TTY layer?

Unfortunately the issue is not easily reproducible and a git-bisect is
out of the question.

--
Markus


2013-06-06 14:18:19

by Peter Hurley

[permalink] [raw]
Subject: Re: Strange intermittent EIO error when writing to stdout since v3.8.0

On 06/06/2013 07:54 AM, Markus Trippelsdorf wrote:
> Since v3.8.0 several people reported intermittent IO errors that happen
> during high system load while using "emerge" under Gentoo:
> ...
> File "/usr/lib64/portage/pym/portage/util/_eventloop/EventLoop.py", line 260, in iteration
> if not x.callback(f, event, *x.args):
> File "/usr/lib64/portage/pym/portage/util/_async/PipeLogger.py", line 99, in _output_handler
> stdout_buf[os.write(stdout_fd, stdout_buf):]
> File "/usr/lib64/portage/pym/portage/__init__.py", line 246, in __call__
> rval = self._func(*wrapped_args, **wrapped_kwargs)
> OSError: [Errno 5] Input/output error

Looks to me like a user-space bug: EIO is returned when the other
end of the "pipe" has been closed.

FWIW, I didn't see where the OP tried to revert
'SpawnProcess: stdout_fd FD_CLOEXEC'

The only non-emerge related comment (#21 in the link provided) refers to
'a similar issue sometimes happened when I built Firefox by hand [..snip..]
And it would randomly crash during the build.

Since I've recompiled Python with gcc-4.6 this issue also never occurred
again.'

That comment doesn't really corroborate the reported bug.

Regards,
Peter Hurley

> Basically 'emerge' just writes the build output to stdout in a loop:
> ...
> def _output_handler(self, fd, event):
>
> background = self.background
> stdout_fd = self.stdout_fd
> log_file = self._log_file
>
> while True:
> buf = self._read_buf(fd, event)
>
> if buf is None:
> # not a POLLIN event, EAGAIN, etc...
> break
>
> if not buf:
> # EOF
> self._unregister()
> self.wait()
> break
>
> else:
> if not background and stdout_fd is not None:
> failures = 0
> stdout_buf = buf
> while stdout_buf:
> try:
> stdout_buf = \
> stdout_buf[os.write(stdout_fd, stdout_buf):]
> except OSError as e:
> if e.errno != errno.EAGAIN:
> raise
> ...
>
> see: https://bugs.gentoo.org/show_bug.cgi?id=459674
>
> (A similar issue also happens when building Firefox since v3.8.0. But
> because Firefox's build process doesn't raise an exception it just dies
> at random points without giving a clue.)
>
> Now the question is: Could this be a kernel bug? Maybe in the TTY layer?
>
> Unfortunately the issue is not easily reproducible and a git-bisect is
> out of the question.
>

2013-06-06 14:37:53

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: Strange intermittent EIO error when writing to stdout since v3.8.0

On 2013.06.06 at 10:18 -0400, Peter Hurley wrote:
> On 06/06/2013 07:54 AM, Markus Trippelsdorf wrote:
> > Since v3.8.0 several people reported intermittent IO errors that happen
> > during high system load while using "emerge" under Gentoo:
> > ...
> > File "/usr/lib64/portage/pym/portage/util/_eventloop/EventLoop.py", line 260, in iteration
> > if not x.callback(f, event, *x.args):
> > File "/usr/lib64/portage/pym/portage/util/_async/PipeLogger.py", line 99, in _output_handler
> > stdout_buf[os.write(stdout_fd, stdout_buf):]
> > File "/usr/lib64/portage/pym/portage/__init__.py", line 246, in __call__
> > rval = self._func(*wrapped_args, **wrapped_kwargs)
> > OSError: [Errno 5] Input/output error
>
> Looks to me like a user-space bug: EIO is returned when the other
> end of the "pipe" has been closed.
>
> FWIW, I didn't see where the OP tried to revert
> 'SpawnProcess: stdout_fd FD_CLOEXEC'
>
> The only non-emerge related comment (#21 in the link provided) refers to
> 'a similar issue sometimes happened when I built Firefox by hand [..snip..]
> And it would randomly crash during the build.
>
> Since I've recompiled Python with gcc-4.6 this issue also never occurred
> again.'
>
> That comment doesn't really corroborate the reported bug.

That comment was from me (I use 'octoploid' for blog trolling, etc.) and
is wrong. The Firefox build issue happend again today. See also the rest
of my mail:

> (A similar issue also happens when building Firefox since v3.8.0. But
> because Firefox's build process doesn't raise an exception it just
> dies at random points without giving a clue.)

Please note that both the Firefox build process and Portage (emerge)
are implemented in Python.

--
Markus

2013-06-06 14:44:29

by Mikael Pettersson

[permalink] [raw]
Subject: Re: Strange intermittent EIO error when writing to stdout since v3.8.0

Markus Trippelsdorf writes:
> Since v3.8.0 several people reported intermittent IO errors that happen
> during high system load while using "emerge" under Gentoo:
> ...
> File "/usr/lib64/portage/pym/portage/util/_eventloop/EventLoop.py", line 260, in iteration
> if not x.callback(f, event, *x.args):
> File "/usr/lib64/portage/pym/portage/util/_async/PipeLogger.py", line 99, in _output_handler
> stdout_buf[os.write(stdout_fd, stdout_buf):]
> File "/usr/lib64/portage/pym/portage/__init__.py", line 246, in __call__
> rval = self._func(*wrapped_args, **wrapped_kwargs)
> OSError: [Errno 5] Input/output error
>
> Basically 'emerge' just writes the build output to stdout in a loop:
> ...
> def _output_handler(self, fd, event):
>
> background = self.background
> stdout_fd = self.stdout_fd
> log_file = self._log_file
>
> while True:
> buf = self._read_buf(fd, event)
>
> if buf is None:
> # not a POLLIN event, EAGAIN, etc...
> break
>
> if not buf:
> # EOF
> self._unregister()
> self.wait()
> break
>
> else:
> if not background and stdout_fd is not None:
> failures = 0
> stdout_buf = buf
> while stdout_buf:
> try:
> stdout_buf = \
> stdout_buf[os.write(stdout_fd, stdout_buf):]
> except OSError as e:
> if e.errno != errno.EAGAIN:
> raise
> ...
>
> see: https://bugs.gentoo.org/show_bug.cgi?id=459674
>
> (A similar issue also happens when building Firefox since v3.8.0. But
> because Firefox's build process doesn't raise an exception it just dies
> at random points without giving a clue.)
>
> Now the question is: Could this be a kernel bug? Maybe in the TTY layer?
>
> Unfortunately the issue is not easily reproducible and a git-bisect is
> out of the question.

I'm seeing a similar regression. I do a lot of gcc bootstraps and regression
test suite runs, and for the bootstraps I do

make -jN bootstrap |& tee build-log

(tcsh syntax, adjust as appropriate for your preferred shell) to get a complete
log for later inspection in case of error. N is typically the number of cores
or threads on the machine, e.g. -j8 on my Core-i7 IVB.

Up to the 3.7 kernel this never had any problems. Starting with the 3.8 kernel,
or possibly 3.9-rc1, this usually dies at some random point with an EIO.

I haven't had time to bisect it.

/Mikael

2013-06-06 17:31:13

by David Howells

[permalink] [raw]
Subject: Re: Strange intermittent EIO error when writing to stdout since v3.8.0

Markus Trippelsdorf <[email protected]> wrote:

> OSError: [Errno 5] Input/output error
>
> Basically 'emerge' just writes the build output to stdout in a loop:

I've just upgraded to kernel-3.9.4-200.fc18.x86_64 and I'm now regularly
seeing something very similar from the tee run by "fedpkg local" to log its
output when building a cross-compiler gcc RPM:

tee: standard output: Input/output error
tee: write error

Slapping strace on tee after the "Input/output error" line has occurred shows
that tee is still running and still pushing lines from its stdin to the output
named on its command line, but has ceased writing to its stdout.

The stdout is (or was) a TTY that's either the communications channel to an
xterm or the channel to an emacs buffer. With the xterm, the process can be
suspended (CTRL+Z) stuff can be written to the tty and then the process can be
rstarted. Indeed, stderr for tee goes to the same tty - and must still be
functional otherwise I wouldn't see the aforementioned error messages...

In ps fax, the tee process can be seen at the bottom of the subtree:

5158 ? S 0:06 \_ /usr/bin/xterm
5198 pts/3 Ss+ 0:00 | \_ bash
5386 pts/3 Sl 4:50 | \_ emacs -geometry 100x77
10258 pts/8 SNs+ 0:00 | \_ /usr/bin/python /usr/bin/fedpkg local
10267 pts/8 SN+ 0:00 | \_ rpmbuild --define _sourcedir /dat
10462 pts/8 SN+ 0:00 | | \_ /bin/sh -e /var/tmp/rpm-tmp.x
16248 pts/8 SN+ 0:00 | | \_ make -j12
16249 pts/8 SN+ 0:00 | | \_ /bin/sh -c fail= fail
16253 pts/8 SN+ 0:00 | | \_ /bin/sh -c fail=
16254 pts/8 SN+ 0:00 | | \_ make all-am
16256 pts/8 SN+ 0:00 | | \_ /bin/sh -
16265 pts/8 SN+ 0:00 | | | \_ /bin/
16688 pts/8 SN+ 0:00 | | | \_ /
16700 pts/8 RN+ 0:00 | | |
16704 pts/8 SN+ 0:00 | | |
16267 pts/8 SN+ 0:00 | | \_ /bin/sh -
16279 pts/8 SN+ 0:00 | | | \_ /bin/
16694 pts/8 SN+ 0:00 | | | \_ /
16707 pts/8 RN+ 0:00 | | |
16713 pts/8 SN+ 0:00 | | |
16289 pts/8 SN+ 0:00 | | \_ /bin/sh -
16306 pts/8 SN+ 0:00 | | | \_ /bin/
17045 pts/8 SN+ 0:00 | | | \_ /
17046 pts/8 RN+ 0:00 | | |
17047 pts/8 SN+ 0:00 | | |
16799 pts/8 SN+ 0:00 | | \_ /bin/sh -
16801 pts/8 SN+ 0:00 | | | \_ /bin/
17051 pts/8 SN+ 0:00 | | | \_ /
17053 pts/8 RN+ 0:00 | | |
16819 pts/8 SN+ 0:00 | | \_ /bin/sh -
16822 pts/8 SN+ 0:00 | | | \_ /bin/
16943 pts/8 SN+ 0:00 | | | \_ /
16950 pts/8 RN+ 0:00 | | |
16951 pts/8 SN+ 0:00 | | |
16903 pts/8 SN+ 0:00 | | \_ /bin/sh -
16915 pts/8 SN+ 0:00 | | \_ /bin/
17024 pts/8 SN+ 0:00 | | \_ /
17026 pts/8 RN+ 0:00 | |
17028 pts/8 SN+ 0:00 | |
10268 pts/8 SN+ 0:00 | \_ tee .build-4.8.1-1.1.fc20.log

tee's fds look like:

warthog>ls -l /proc/10268/fd/
total 0
lr-x------. 1 dhowells dhowells 64 Jun 6 18:24 0 -> pipe:[1595614]
lrwx------. 1 dhowells dhowells 64 Jun 6 18:24 1 -> /dev/pts/8
lrwx------. 1 dhowells dhowells 64 Jun 6 18:23 2 -> /dev/pts/8
l-wx------. 1 dhowells dhowells 64 Jun 6 18:24 3 -> /data/fedora/cros

David

2013-06-06 18:41:32

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: Strange intermittent EIO error when writing to stdout since v3.8.0

On 2013.06.06 at 18:31 +0100, David Howells wrote:
> Markus Trippelsdorf <[email protected]> wrote:
>
> > OSError: [Errno 5] Input/output error
> >
> > Basically 'emerge' just writes the build output to stdout in a loop:
>
> I've just upgraded to kernel-3.9.4-200.fc18.x86_64 and I'm now regularly
> seeing something very similar from the tee run by "fedpkg local" to log its
> output when building a cross-compiler gcc RPM:
>
> tee: standard output: Input/output error
> tee: write error
>
> Slapping strace on tee after the "Input/output error" line has occurred shows
> that tee is still running and still pushing lines from its stdin to the output
> named on its command line, but has ceased writing to its stdout.

tee uses the same loop as portage, but keeps going in case of an EIO:

/* In the array of NFILES + 1 descriptors, make
the first one correspond to standard output. */
descriptors[0] = stdout;
files[0] = _("standard output");
setvbuf (stdout, NULL, _IONBF, 0);

for (i = 1; i <= nfiles; i++)
{
descriptors[i] = (STREQ (files[i], "-")
? stdout
: fopen (files[i], mode_string));
if (descriptors[i] == NULL)
{
error (0, errno, "%s", files[i]);
ok = false;
}
else
setvbuf (descriptors[i], NULL, _IONBF, 0);
}

while (1)
{
bytes_read = read (0, buffer, sizeof buffer);
if (bytes_read < 0 && errno == EINTR)
continue;
if (bytes_read <= 0)
break;

/* Write to all NFILES + 1 descriptors.
Standard output is the first one. */
for (i = 0; i <= nfiles; i++)
if (descriptors[i]
&& fwrite (buffer, bytes_read, 1, descriptors[i]) != 1)
{
error (0, errno, "%s", files[i]);
descriptors[i] = NULL;
ok = false;
}
}

--
Markus

2013-06-07 15:39:41

by Peter Hurley

[permalink] [raw]
Subject: Re: Strange intermittent EIO error when writing to stdout since v3.8.0

On 06/06/2013 10:37 AM, Markus Trippelsdorf wrote:
> On 2013.06.06 at 10:18 -0400, Peter Hurley wrote:
>> On 06/06/2013 07:54 AM, Markus Trippelsdorf wrote:
>>> Since v3.8.0 several people reported intermittent IO errors that happen
>>> during high system load while using "emerge" under Gentoo:
>>> ...
>>> File "/usr/lib64/portage/pym/portage/util/_eventloop/EventLoop.py", line 260, in iteration
>>> if not x.callback(f, event, *x.args):
>>> File "/usr/lib64/portage/pym/portage/util/_async/PipeLogger.py", line 99, in _output_handler
>>> stdout_buf[os.write(stdout_fd, stdout_buf):]
>>> File "/usr/lib64/portage/pym/portage/__init__.py", line 246, in __call__
>>> rval = self._func(*wrapped_args, **wrapped_kwargs)
>>> OSError: [Errno 5] Input/output error
>>
>> Looks to me like a user-space bug: EIO is returned when the other
>> end of the "pipe" has been closed.
>>
>> FWIW, I didn't see where the OP tried to revert
>> 'SpawnProcess: stdout_fd FD_CLOEXEC'
>>
>> The only non-emerge related comment (#21 in the link provided) refers to
>> 'a similar issue sometimes happened when I built Firefox by hand [..snip..]
>> And it would randomly crash during the build.
>>
>> Since I've recompiled Python with gcc-4.6 this issue also never occurred
>> again.'
>>
>> That comment doesn't really corroborate the reported bug.
>
> That comment was from me (I use 'octoploid' for blog trolling, etc.) and
> is wrong. The Firefox build issue happend again today. See also the rest
> of my mail:
>
>> (A similar issue also happens when building Firefox since v3.8.0. But
>> because Firefox's build process doesn't raise an exception it just
>> dies at random points without giving a clue.)
>
> Please note that both the Firefox build process and Portage (emerge)
> are implemented in Python.

Based on the other reports from Mikael and David, I suspect this problem
may have to do with my commit 699390354da6c258b65bf8fa79cfd5feaede50b6:

pty: Ignore slave pty close() if never successfully opened

This commit poisons the pty under certain error conditions that may
occur from parallel open()s (or parallel close() with pending write()).

It's unclear to me which error condition is triggered and how user-space
got an open file descriptor but that seems the most likely. Is the problem
reproducible enough that a debug patch would likely trigger?

Regards,
Peter Hurley

2013-06-07 16:07:11

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: Strange intermittent EIO error when writing to stdout since v3.8.0

On 2013.06.07 at 11:39 -0400, Peter Hurley wrote:
> On 06/06/2013 10:37 AM, Markus Trippelsdorf wrote:
> > On 2013.06.06 at 10:18 -0400, Peter Hurley wrote:
> >> On 06/06/2013 07:54 AM, Markus Trippelsdorf wrote:
> >>> Since v3.8.0 several people reported intermittent IO errors that happen
> >>> during high system load while using "emerge" under Gentoo:
> >>> ...
> >>> File "/usr/lib64/portage/pym/portage/util/_eventloop/EventLoop.py", line 260, in iteration
> >>> if not x.callback(f, event, *x.args):
> >>> File "/usr/lib64/portage/pym/portage/util/_async/PipeLogger.py", line 99, in _output_handler
> >>> stdout_buf[os.write(stdout_fd, stdout_buf):]
> >>> File "/usr/lib64/portage/pym/portage/__init__.py", line 246, in __call__
> >>> rval = self._func(*wrapped_args, **wrapped_kwargs)
> >>> OSError: [Errno 5] Input/output error
>
> Based on the other reports from Mikael and David, I suspect this problem
> may have to do with my commit 699390354da6c258b65bf8fa79cfd5feaede50b6:
>
> pty: Ignore slave pty close() if never successfully opened
>
> This commit poisons the pty under certain error conditions that may
> occur from parallel open()s (or parallel close() with pending write()).
>
> It's unclear to me which error condition is triggered and how user-space
> got an open file descriptor but that seems the most likely. Is the problem
> reproducible enough that a debug patch would likely trigger?

A debug patch would be appreciated.
In my case the problem occurs very irregularly; sometimes it happens a
few times per day, at other times I don't see the bug for three weeks.

But if Mikael and David could also run their kernels with the debug
patch applied, hopefully that's enough to trigger the issue more regularly.

--
Markus

2013-06-07 18:25:28

by Mikael Pettersson

[permalink] [raw]
Subject: Re: Strange intermittent EIO error when writing to stdout since v3.8.0

Peter Hurley writes:
> Based on the other reports from Mikael and David, I suspect this problem
> may have to do with my commit 699390354da6c258b65bf8fa79cfd5feaede50b6:
>
> pty: Ignore slave pty close() if never successfully opened
>
> This commit poisons the pty under certain error conditions that may
> occur from parallel open()s (or parallel close() with pending write()).
>
> It's unclear to me which error condition is triggered and how user-space
> got an open file descriptor but that seems the most likely. Is the problem
> reproducible enough that a debug patch would likely trigger?

In my case the problem occurred frequently enough that I've been forced
to change my build procedures to avoid it. I'd welcome a debug patch.

2013-06-07 19:56:39

by David Howells

[permalink] [raw]
Subject: Re: Strange intermittent EIO error when writing to stdout since v3.8.0

Peter Hurley <[email protected]> wrote:

> Based on the other reports from Mikael and David, I suspect this problem
> may have to do with my commit 699390354da6c258b65bf8fa79cfd5feaede50b6:
>
> pty: Ignore slave pty close() if never successfully opened
>
> This commit poisons the pty under certain error conditions that may
> occur from parallel open()s (or parallel close() with pending write()).

In the case that I'm seeing, the tty in question isn't being opened and closed
regularly. A gcc build is dumping a load of stuff out through tee and tee
writes it to the tty it got as fd 1 from the parent shell.

David

2013-06-09 01:59:44

by Rob Landley

[permalink] [raw]
Subject: Re: Strange intermittent EIO error when writing to stdout since v3.8.0

On 06/06/2013 06:54:17 AM, Markus Trippelsdorf wrote:
> Since v3.8.0 several people reported intermittent IO errors that
> happen
> during high system load while using "emerge" under Gentoo:
...
> see: https://bugs.gentoo.org/show_bug.cgi?id=459674
>
> (A similar issue also happens when building Firefox since v3.8.0. But
> because Firefox's build process doesn't raise an exception it just
> dies
> at random points without giving a clue.)
>
> Now the question is: Could this be a kernel bug? Maybe in the TTY
> layer?
>
> Unfortunately the issue is not easily reproducible and a git-bisect is
> out of the question.

I tracked down and fixed something like this in the User Mode Linux tty
implementation many moons ago.

The trick to making it reproducible was to rename a copy of the xterm
binary to somethingunique, run a shell in said xterm (because "cat"
doesn't exercise the tty logic, that's why), run a thing in there
producing test output, and have a loop in another window doing "while
true; do killall -STOP somethingunique; sleep 1; killall -START
somethingunique; sleep 1; done".

This forces the pipe buffer to the pty to fill up and exercise the flow
control logic, plus all the fun "short write and retry" stuff...

Rob-

2013-06-11 22:15:13

by Orion Poplawski

[permalink] [raw]
Subject: Re: Strange intermittent EIO error when writing to stdout since v3.8.0

Peter Hurley <peter <at> hurleysoftware.com> writes:
> Based on the other reports from Mikael and David, I suspect this problem
> may have to do with my commit 699390354da6c258b65bf8fa79cfd5feaede50b6:
>
> pty: Ignore slave pty close() if never successfully opened
>
> This commit poisons the pty under certain error conditions that may
> occur from parallel open()s (or parallel close() with pending write()).
>
> It's unclear to me which error condition is triggered and how user-space
> got an open file descriptor but that seems the most likely. Is the problem
> reproducible enough that a debug patch would likely trigger?

I get this pretty frequently on my machine with rpmbuild -ba 2>&1 | tee.
I'd be happy to help to debug.

2013-06-13 10:39:03

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: Strange intermittent EIO error when writing to stdout since v3.8.0

On 2013.06.11 at 22:14 +0000, Orion Poplawski wrote:
> Peter Hurley <peter <at> hurleysoftware.com> writes:
> > Based on the other reports from Mikael and David, I suspect this problem
> > may have to do with my commit 699390354da6c258b65bf8fa79cfd5feaede50b6:
> >
> > pty: Ignore slave pty close() if never successfully opened
> >
> > This commit poisons the pty under certain error conditions that may
> > occur from parallel open()s (or parallel close() with pending write()).
> >
> > It's unclear to me which error condition is triggered and how user-space
> > got an open file descriptor but that seems the most likely. Is the problem
> > reproducible enough that a debug patch would likely trigger?
>
> I get this pretty frequently on my machine with rpmbuild -ba 2>&1 | tee.
> I'd be happy to help to debug.

Since apparently no debugging patch is forthcoming, maybe it's time to
test the simple revert of commit 699390354da.

So can you guys please apply the following patch and see if the issue
goes away?

Thanks.

diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index 59bfaec..f5801a6 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -38,12 +38,9 @@ static void pty_close(struct tty_struct *tty, struct file *filp)
if (tty->driver->subtype == PTY_TYPE_MASTER)
WARN_ON(tty->count > 1);
else {
- if (test_bit(TTY_IO_ERROR, &tty->flags))
- return;
if (tty->count > 2)
return;
}
- set_bit(TTY_IO_ERROR, &tty->flags);
wake_up_interruptible(&tty->read_wait);
wake_up_interruptible(&tty->write_wait);
tty->packet = 0;
@@ -249,8 +246,6 @@ static int pty_open(struct tty_struct *tty, struct file *filp)
if (!tty || !tty->link)
goto out;

- set_bit(TTY_IO_ERROR, &tty->flags);
-
retval = -EIO;
if (test_bit(TTY_OTHER_CLOSED, &tty->flags))
goto out;
@@ -259,7 +254,6 @@ static int pty_open(struct tty_struct *tty, struct file *filp)
if (tty->driver->subtype == PTY_TYPE_SLAVE && tty->link->count != 1)
goto out;

- clear_bit(TTY_IO_ERROR, &tty->flags);
clear_bit(TTY_OTHER_CLOSED, &tty->link->flags);
set_bit(TTY_THROTTLED, &tty->flags);
retval = 0;

--
Markus

2013-06-13 10:39:53

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: Strange intermittent EIO error when writing to stdout since v3.8.0

On 2013.06.07 at 20:22 +0200, Mikael Pettersson wrote:
> Peter Hurley writes:
> > Based on the other reports from Mikael and David, I suspect this problem
> > may have to do with my commit 699390354da6c258b65bf8fa79cfd5feaede50b6:
> >
> > pty: Ignore slave pty close() if never successfully opened
> >
> > This commit poisons the pty under certain error conditions that may
> > occur from parallel open()s (or parallel close() with pending write()).
> >
> > It's unclear to me which error condition is triggered and how user-space
> > got an open file descriptor but that seems the most likely. Is the problem
> > reproducible enough that a debug patch would likely trigger?
>
> In my case the problem occurred frequently enough that I've been forced
> to change my build procedures to avoid it. I'd welcome a debug patch.

Since apparently no debugging patch is forthcoming, maybe it's time to
test the simple revert of commit 699390354da.

So can you guys please apply the following patch and see if the issue
goes away?

Thanks.

diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index 59bfaec..f5801a6 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -38,12 +38,9 @@ static void pty_close(struct tty_struct *tty, struct file *filp)
if (tty->driver->subtype == PTY_TYPE_MASTER)
WARN_ON(tty->count > 1);
else {
- if (test_bit(TTY_IO_ERROR, &tty->flags))
- return;
if (tty->count > 2)
return;
}
- set_bit(TTY_IO_ERROR, &tty->flags);
wake_up_interruptible(&tty->read_wait);
wake_up_interruptible(&tty->write_wait);
tty->packet = 0;
@@ -249,8 +246,6 @@ static int pty_open(struct tty_struct *tty, struct file *filp)
if (!tty || !tty->link)
goto out;

- set_bit(TTY_IO_ERROR, &tty->flags);
-
retval = -EIO;
if (test_bit(TTY_OTHER_CLOSED, &tty->flags))
goto out;
@@ -259,7 +254,6 @@ static int pty_open(struct tty_struct *tty, struct file *filp)
if (tty->driver->subtype == PTY_TYPE_SLAVE && tty->link->count != 1)
goto out;

- clear_bit(TTY_IO_ERROR, &tty->flags);
clear_bit(TTY_OTHER_CLOSED, &tty->link->flags);
set_bit(TTY_THROTTLED, &tty->flags);
retval = 0;

--
Markus

2013-06-13 14:16:37

by Peter Hurley

[permalink] [raw]
Subject: Re: Strange intermittent EIO error when writing to stdout since v3.8.0

On 06/13/2013 06:39 AM, Markus Trippelsdorf wrote:
> On 2013.06.07 at 20:22 +0200, Mikael Pettersson wrote:
>> Peter Hurley writes:
>> > Based on the other reports from Mikael and David, I suspect this problem
>> > may have to do with my commit 699390354da6c258b65bf8fa79cfd5feaede50b6:
>> >
>> > pty: Ignore slave pty close() if never successfully opened
>> >
>> > This commit poisons the pty under certain error conditions that may
>> > occur from parallel open()s (or parallel close() with pending write()).
>> >
>> > It's unclear to me which error condition is triggered and how user-space
>> > got an open file descriptor but that seems the most likely. Is the problem
>> > reproducible enough that a debug patch would likely trigger?
>>
>> In my case the problem occurred frequently enough that I've been forced
>> to change my build procedures to avoid it. I'd welcome a debug patch.
>
> Since apparently no debugging patch is forthcoming, maybe it's time to
> test the simple revert of commit 699390354da.

I apologize for the delay.

Here's a debug patch which I hope will narrow down the circumstances of
this error condition.

Regards,
Peter Hurley

PS - Don't get excited about EIO from read() which you will see during
boot. That's expected when the slave closes and the master is parked on
a blocking read().

--- >% ---
Subject: [PATCH] tty: Debug EIO from write()


Signed-off-by: Peter Hurley <[email protected]>
---
drivers/tty/pty.c | 22 +++++++++++++++++++---
drivers/tty/tty_io.c | 51 ++++++++++++++++++++++++++++++++++++++++++---------
2 files changed, 61 insertions(+), 12 deletions(-)

diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index 9c2f1bc..2ce2bb2 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -25,6 +25,16 @@
#include <linux/slab.h>
#include <linux/mutex.h>

+#define TTY_DEBUG_EIO 1
+
+#ifdef TTY_DEBUG_EIO
+#define tty_debug_eio(tty, f, args...) ({ \
+ char __b[64]; \
+ printk(KERN_DEBUG "%s: %s: " f, __func__, tty_name(tty, __b), ##args); \
+})
+#else
+#define tty_debug_eio(tty, f, args...)
+#endif

#ifdef CONFIG_UNIX98_PTYS
static struct tty_driver *ptm_driver;
@@ -246,12 +256,18 @@ static int pty_open(struct tty_struct *tty, struct file *filp)
set_bit(TTY_IO_ERROR, &tty->flags);

retval = -EIO;
- if (test_bit(TTY_OTHER_CLOSED, &tty->flags))
+ if (test_bit(TTY_OTHER_CLOSED, &tty->flags)) {
+ tty_debug_eio(tty, "other pty closed (%#lx)\n", tty->flags);
goto out;
- if (test_bit(TTY_PTY_LOCK, &tty->link->flags))
+ }
+ if (test_bit(TTY_PTY_LOCK, &tty->link->flags)) {
+ tty_debug_eio(tty, "ptm still locked\n");
goto out;
- if (tty->driver->subtype == PTY_TYPE_SLAVE && tty->link->count != 1)
+ }
+ if (tty->driver->subtype == PTY_TYPE_SLAVE && tty->link->count != 1) {
+ tty_debug_eio(tty, "ptm open count (%d)\n", tty->link->count);
goto out;
+ }

clear_bit(TTY_IO_ERROR, &tty->flags);
clear_bit(TTY_OTHER_CLOSED, &tty->link->flags);
diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 62e942d..e71c61f 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -105,11 +105,21 @@
#include <linux/kmod.h>
#include <linux/nsproxy.h>

-#undef TTY_DEBUG_HANGUP
+#define TTY_DEBUG_HANGUP 1
+#define TTY_DEBUG_EIO 1

#define TTY_PARANOIA_CHECK 1
#define CHECK_TTY_COUNT 1

+#ifdef TTY_DEBUG_EIO
+#define tty_debug_eio(tty, f, args...) ({ \
+ char __b[64]; \
+ printk(KERN_DEBUG "%s: %s: " f, __func__, tty_name(tty, __b), ##args); \
+})
+#else
+#define tty_debug_eio(tty, f, args...)
+#endif
+
struct ktermios tty_std_termios = { /* for the benefit of tty drivers */
.c_iflag = ICRNL | IXON,
.c_oflag = OPOST | ONLCR,
@@ -424,6 +434,7 @@ int tty_check_change(struct tty_struct *tty)
if (is_ignored(SIGTTOU))
goto out;
if (is_current_pgrp_orphaned()) {
+ tty_debug_eio(tty, "pgrp orphaned?? (%#lx)\n", tty->flags);
ret = -EIO;
goto out;
}
@@ -448,6 +459,9 @@ static ssize_t hung_up_tty_read(struct file *file, char __user *buf,
static ssize_t hung_up_tty_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
+ struct tty_struct *tty = file_tty(file);
+
+ tty_debug_eio(tty, "%#lx\n", tty ? tty->flags : -1L);
return -EIO;
}

@@ -1020,16 +1034,22 @@ static ssize_t tty_read(struct file *file, char __user *buf, size_t count,

if (tty_paranoia_check(tty, inode, "tty_read"))
return -EIO;
- if (!tty || (test_bit(TTY_IO_ERROR, &tty->flags)))
+ if (!tty || (test_bit(TTY_IO_ERROR, &tty->flags))) {
+ tty_debug_eio(tty, "%#lx\n", tty ? tty->flags : -1L);
return -EIO;
+ }

/* We want to wait for the line discipline to sort out in this
situation */
ld = tty_ldisc_ref_wait(tty);
- if (ld->ops->read)
+ if (ld->ops->read) {
i = (ld->ops->read)(tty, file, buf, count);
- else
+ if (i == -EIO)
+ tty_debug_eio(tty, "ldisc error (%#lx)\n", tty->flags);
+ } else {
+ tty_debug_eio(tty, "no ldisc read method???\n");
i = -EIO;
+ }
tty_ldisc_deref(ld);

if (i > 0)
@@ -1197,18 +1217,31 @@ static ssize_t tty_write(struct file *file, const char __user *buf,

if (tty_paranoia_check(tty, file_inode(file), "tty_write"))
return -EIO;
- if (!tty || !tty->ops->write ||
- (test_bit(TTY_IO_ERROR, &tty->flags)))
- return -EIO;
+ if (!tty || !tty->ops->write || (test_bit(TTY_IO_ERROR, &tty->flags))) {
+ if (tty && !tty->ops->write)
+ tty_debug_eio(tty, "no driver write method???\n");
+ else
+ tty_debug_eio(tty, "%#lx\n", tty ? tty->flags : -1L);
+ return -EIO;
+ }
/* Short term debug to catch buggy drivers */
if (tty->ops->write_room == NULL)
printk(KERN_ERR "tty driver %s lacks a write_room method.\n",
tty->driver->name);
ld = tty_ldisc_ref_wait(tty);
- if (!ld->ops->write)
+ if (!ld->ops->write) {
+ tty_debug_eio(tty, "no ldisc write method???\n");
ret = -EIO;
- else
+ } else {
ret = do_tty_write(ld->ops->write, tty, file, buf, count);
+ if (ret == -EIO) {
+ if (tty_hung_up_p(file))
+ tty_debug_eio(tty, "hung up\n");
+ else
+ tty_debug_eio(tty, "ldisc error: flags=%#lx count=%d other=%d\n",
+ tty->flags, tty->count, tty->link ? tty->link->count : -1);
+ }
+ }
tty_ldisc_deref(ld);
return ret;
}
--
1.8.1.2

2013-06-13 14:51:36

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: Strange intermittent EIO error when writing to stdout since v3.8.0

On 2013.06.13 at 10:16 -0400, Peter Hurley wrote:
> On 06/13/2013 06:39 AM, Markus Trippelsdorf wrote:
> > On 2013.06.07 at 20:22 +0200, Mikael Pettersson wrote:
> >> Peter Hurley writes:
> >> > Based on the other reports from Mikael and David, I suspect this problem
> >> > may have to do with my commit 699390354da6c258b65bf8fa79cfd5feaede50b6:
> >> >
> >> > pty: Ignore slave pty close() if never successfully opened
> >> >
> >> > This commit poisons the pty under certain error conditions that may
> >> > occur from parallel open()s (or parallel close() with pending write()).
> >> >
> >> > It's unclear to me which error condition is triggered and how user-space
> >> > got an open file descriptor but that seems the most likely. Is the problem
> >> > reproducible enough that a debug patch would likely trigger?
> >>
> >> In my case the problem occurred frequently enough that I've been forced
> >> to change my build procedures to avoid it. I'd welcome a debug patch.
> >
> > Since apparently no debugging patch is forthcoming, maybe it's time to
> > test the simple revert of commit 699390354da.
>
> I apologize for the delay.
>
> Here's a debug patch which I hope will narrow down the circumstances of
> this error condition.

Unfortunately it doesn't apply:

markus@x4 linux % git describe
v3.10-rc5-167-g26e0446
markus@x4 linux % patch -p1 --dry-run < /home/markus/tty.patch
checking file drivers/tty/pty.c
Hunk #1 succeeded at 25 with fuzz 2.
Hunk #2 FAILED at 256.
1 out of 2 hunks FAILED
checking file drivers/tty/tty_io.c
Hunk #1 FAILED at 105.
Hunk #2 FAILED at 424.
Hunk #3 FAILED at 448.
Hunk #4 FAILED at 1020.
Hunk #5 FAILED at 1197.
5 out of 5 hunks FAILED

--
Markus

2013-06-13 15:27:13

by Peter Hurley

[permalink] [raw]
Subject: [PATCH] tty: Debug EIO from write()


Signed-off-by: Peter Hurley <[email protected]>
---
drivers/tty/pty.c | 22 +++++++++++++++++++---
drivers/tty/tty_io.c | 51 ++++++++++++++++++++++++++++++++++++++++++---------
2 files changed, 61 insertions(+), 12 deletions(-)

diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index 9c2f1bc..2ce2bb2 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -25,6 +25,16 @@
#include <linux/slab.h>
#include <linux/mutex.h>

+#define TTY_DEBUG_EIO 1
+
+#ifdef TTY_DEBUG_EIO
+#define tty_debug_eio(tty, f, args...) ({ \
+ char __b[64]; \
+ printk(KERN_DEBUG "%s: %s: " f, __func__, tty_name(tty, __b), ##args); \
+})
+#else
+#define tty_debug_eio(tty, f, args...)
+#endif

#ifdef CONFIG_UNIX98_PTYS
static struct tty_driver *ptm_driver;
@@ -246,12 +256,18 @@ static int pty_open(struct tty_struct *tty, struct file *filp)
set_bit(TTY_IO_ERROR, &tty->flags);

retval = -EIO;
- if (test_bit(TTY_OTHER_CLOSED, &tty->flags))
+ if (test_bit(TTY_OTHER_CLOSED, &tty->flags)) {
+ tty_debug_eio(tty, "other pty closed (%#lx)\n", tty->flags);
goto out;
- if (test_bit(TTY_PTY_LOCK, &tty->link->flags))
+ }
+ if (test_bit(TTY_PTY_LOCK, &tty->link->flags)) {
+ tty_debug_eio(tty, "ptm still locked\n");
goto out;
- if (tty->driver->subtype == PTY_TYPE_SLAVE && tty->link->count != 1)
+ }
+ if (tty->driver->subtype == PTY_TYPE_SLAVE && tty->link->count != 1) {
+ tty_debug_eio(tty, "ptm open count (%d)\n", tty->link->count);
goto out;
+ }

clear_bit(TTY_IO_ERROR, &tty->flags);
clear_bit(TTY_OTHER_CLOSED, &tty->link->flags);
diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 62e942d..e71c61f 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -105,11 +105,21 @@
#include <linux/kmod.h>
#include <linux/nsproxy.h>

-#undef TTY_DEBUG_HANGUP
+#define TTY_DEBUG_HANGUP 1
+#define TTY_DEBUG_EIO 1

#define TTY_PARANOIA_CHECK 1
#define CHECK_TTY_COUNT 1

+#ifdef TTY_DEBUG_EIO
+#define tty_debug_eio(tty, f, args...) ({ \
+ char __b[64]; \
+ printk(KERN_DEBUG "%s: %s: " f, __func__, tty_name(tty, __b), ##args); \
+})
+#else
+#define tty_debug_eio(tty, f, args...)
+#endif
+
struct ktermios tty_std_termios = { /* for the benefit of tty drivers */
.c_iflag = ICRNL | IXON,
.c_oflag = OPOST | ONLCR,
@@ -424,6 +434,7 @@ int tty_check_change(struct tty_struct *tty)
if (is_ignored(SIGTTOU))
goto out;
if (is_current_pgrp_orphaned()) {
+ tty_debug_eio(tty, "pgrp orphaned?? (%#lx)\n", tty->flags);
ret = -EIO;
goto out;
}
@@ -448,6 +459,9 @@ static ssize_t hung_up_tty_read(struct file *file, char __user *buf,
static ssize_t hung_up_tty_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
+ struct tty_struct *tty = file_tty(file);
+
+ tty_debug_eio(tty, "%#lx\n", tty ? tty->flags : -1L);
return -EIO;
}

@@ -1020,16 +1034,22 @@ static ssize_t tty_read(struct file *file, char __user *buf, size_t count,

if (tty_paranoia_check(tty, inode, "tty_read"))
return -EIO;
- if (!tty || (test_bit(TTY_IO_ERROR, &tty->flags)))
+ if (!tty || (test_bit(TTY_IO_ERROR, &tty->flags))) {
+ tty_debug_eio(tty, "%#lx\n", tty ? tty->flags : -1L);
return -EIO;
+ }

/* We want to wait for the line discipline to sort out in this
situation */
ld = tty_ldisc_ref_wait(tty);
- if (ld->ops->read)
+ if (ld->ops->read) {
i = (ld->ops->read)(tty, file, buf, count);
- else
+ if (i == -EIO)
+ tty_debug_eio(tty, "ldisc error (%#lx)\n", tty->flags);
+ } else {
+ tty_debug_eio(tty, "no ldisc read method???\n");
i = -EIO;
+ }
tty_ldisc_deref(ld);

if (i > 0)
@@ -1197,18 +1217,31 @@ static ssize_t tty_write(struct file *file, const char __user *buf,

if (tty_paranoia_check(tty, file_inode(file), "tty_write"))
return -EIO;
- if (!tty || !tty->ops->write ||
- (test_bit(TTY_IO_ERROR, &tty->flags)))
- return -EIO;
+ if (!tty || !tty->ops->write || (test_bit(TTY_IO_ERROR, &tty->flags))) {
+ if (tty && !tty->ops->write)
+ tty_debug_eio(tty, "no driver write method???\n");
+ else
+ tty_debug_eio(tty, "%#lx\n", tty ? tty->flags : -1L);
+ return -EIO;
+ }
/* Short term debug to catch buggy drivers */
if (tty->ops->write_room == NULL)
printk(KERN_ERR "tty driver %s lacks a write_room method.\n",
tty->driver->name);
ld = tty_ldisc_ref_wait(tty);
- if (!ld->ops->write)
+ if (!ld->ops->write) {
+ tty_debug_eio(tty, "no ldisc write method???\n");
ret = -EIO;
- else
+ } else {
ret = do_tty_write(ld->ops->write, tty, file, buf, count);
+ if (ret == -EIO) {
+ if (tty_hung_up_p(file))
+ tty_debug_eio(tty, "hung up\n");
+ else
+ tty_debug_eio(tty, "ldisc error: flags=%#lx count=%d other=%d\n",
+ tty->flags, tty->count, tty->link ? tty->link->count : -1);
+ }
+ }
tty_ldisc_deref(ld);
return ret;
}
--
1.8.1.2

2013-06-13 16:16:10

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH] tty: Debug EIO from write()

I've reproduced the Mozilla build failure, but there's nothing exiting
in the logs. The only strange thing is the single tty_write at
1507.456817 (roughly at this point the Mozilla build failed):

...
[ 1507.453547] tty_release: pts11 (tty count=4)...
[ 1507.455695] tty_open: opening pts11...
[ 1507.455699] tty_release: pts11 (tty count=4)...
[ 1507.455854] tty_open: opening pts11...
[ 1507.455858] tty_release: pts11 (tty count=4)...
[ 1507.456815] tty_open: opening pts11...
[ 1507.456817] tty_write: pts11: 0xa03
[ 1507.456819] tty_release: pts11 (tty count=4)...
[ 1507.457313] tty_open: opening pts11...
[ 1507.457317] tty_release: pts11 (tty count=4)...
[ 1507.458711] tty_open: opening pts11...
[ 1507.458715] tty_release: pts11 (tty count=4)...
[ 1507.458918] tty_open: opening pts11...
[ 1507.458922] tty_release: pts11 (tty count=4)...
[ 1507.459062] tty_open: opening pts11...
[ 1507.459066] tty_release: pts11 (tty count=4)...
...

Looking through my logs I see a few ldisc errors, but please note that they are
all unrelated to the actual Mozilla failure.

Jun 13 17:31:07 x4 kernel: tty_open: opening pts11...
Jun 13 17:31:07 x4 kernel: tty_release: pts11 (tty count=4)...
Jun 13 17:31:08 x4 kernel: tty_release: ptm1 (tty count=1)...
Jun 13 17:31:08 x4 kernel: tty_read: pts1: ldisc error (0xa05)
Jun 13 17:31:08 x4 kernel: pts1 vhangup...
Jun 13 17:31:08 x4 kernel: hung_up_tty_write: pts1: 0x200a05
Jun 13 17:31:08 x4 kernel: tty_release: pts1 (tty count=2)...
Jun 13 17:31:08 x4 kernel: tty_release: ptm1: final close
Jun 13 17:31:08 x4 kernel: tty_release: ptm1: freeing structure...
Jun 13 17:31:08 x4 kernel: tty_release: ptm0 (tty count=1)...
Jun 13 17:31:08 x4 kernel: pts0 vhangup...
Jun 13 17:31:08 x4 kernel: tty_release: pts0 (tty count=1)...
Jun 13 17:31:08 x4 kernel: tty_release: pts0: final close
Jun 13 17:31:08 x4 kernel: tty_release: pts0: freeing structure...
Jun 13 17:31:08 x4 kernel: tty_release: tty1 (tty count=8)...
Jun 13 17:31:08 x4 kernel: tty_release: tty3 (tty count=1)...
Jun 13 17:31:08 x4 kernel: tty_release: tty3: final close
Jun 13 17:31:08 x4 kernel: tty_release: tty3: freeing structure...
Jun 13 17:31:11 x4 kernel: tty_open: opening tty1...
Jun 13 17:31:11 x4 kernel: tty_open: opening tty1...
Jun 13 17:31:11 x4 kernel: tty_release: tty1 (tty count=9)...
Jun 13 17:31:13 x4 kernel: tty_release: tty1 (tty count=8)...
Jun 13 17:31:13 x4 kernel: tty_release: tty2 (tty count=1)...
Jun 13 17:31:13 x4 kernel: tty_release: tty2: final close
Jun 13 17:31:13 x4 kernel: tty_release: tty2: freeing structure...
Jun 13 17:31:13 x4 kernel: tty_release: pts6 (tty count=2)...
Jun 13 17:31:13 x4 kernel: tty1 vhangup session...
Jun 13 17:31:13 x4 kernel: tty_open: opening pts2...
Jun 13 17:31:13 x4 kernel: tty_release: ptm5 (tty count=1)...
Jun 13 17:31:13 x4 kernel: pts5 vhangup...
Jun 13 17:31:13 x4 kernel: tty_release: ptm6 (tty count=1)...
Jun 13 17:31:13 x4 kernel: pts6 vhangup...
Jun 13 17:31:13 x4 kernel: tty_release: ptm6: final close
Jun 13 17:31:13 x4 kernel: tty_release: ptm6: freeing structure...
Jun 13 17:31:13 x4 kernel: tty_release: ptm7 (tty count=1)...
Jun 13 17:31:13 x4 kernel: pts7 vhangup...
Jun 13 17:31:13 x4 kernel: tty_release: pts5 (tty count=1)...
Jun 13 17:31:13 x4 kernel: tty_release: pts5: final close
Jun 13 17:31:13 x4 kernel: tty_release: pts5: freeing structure...
Jun 13 17:31:13 x4 kernel: tty_read: pts7: ldisc error (0x200805)
Jun 13 17:31:13 x4 kernel: tty_release: ptm8 (tty count=1)...
Jun 13 17:31:13 x4 kernel: pts8 vhangup...
Jun 13 17:31:13 x4 kernel: tty_read: pts8: ldisc error (0x200805)
Jun 13 17:31:13 x4 kernel: tty_release: tty1 (tty count=7)...
Jun 13 17:31:13 x4 kernel: tty_release: tty1 (tty count=6)...
Jun 13 17:31:13 x4 kernel: tty_release: ptm9 (tty count=1)...
Jun 13 17:31:13 x4 kernel: pts9 vhangup...
Jun 13 17:31:13 x4 kernel: tty_release: ptm10 (tty count=1)...
Jun 13 17:31:13 x4 kernel: pts10 vhangup...
Jun 13 17:31:13 x4 kernel: hung_up_tty_write: pts10: 0x40a04
Jun 13 17:31:13 x4 kernel: hung_up_tty_write: pts10: 0x40a04
Jun 13 17:31:13 x4 kernel: tty_open: opening tty1...
Jun 13 17:31:13 x4 kernel: tty_release: ptm3 (tty count=1)...
Jun 13 17:31:13 x4 kernel: pts3 vhangup...
Jun 13 17:31:13 x4 kernel: tty_read: pts3: ldisc error (0x200805)
Jun 13 17:31:13 x4 kernel: tty_release: pts7 (tty count=2)...
Jun 13 17:31:13 x4 kernel: tty_release: pts7 (tty count=1)...
Jun 13 17:31:13 x4 kernel: tty_release: pts7: final close
Jun 13 17:31:13 x4 kernel: tty_release: pts7: freeing structure...
Jun 13 17:31:13 x4 kernel: tty_release: pts4 (tty count=2)...
Jun 13 17:31:13 x4 kernel: tty_open: opening tty2...
Jun 13 17:31:13 x4 kernel: tty_release: tty2 (tty count=1)...
Jun 13 17:31:13 x4 kernel: tty_release: tty2: final close
Jun 13 17:31:13 x4 kernel: tty_release: tty2: freeing structure...
Jun 13 17:31:13 x4 kernel: tty_open: opening tty2...
Jun 13 17:31:13 x4 kernel: tty_release: tty1 (tty count=6)...
Jun 13 17:31:13 x4 kernel: tty_release: ptm4 (tty count=1)...
Jun 13 17:31:13 x4 kernel: pts4 vhangup...
Jun 13 17:31:13 x4 kernel: tty_release: ptm4: final close
Jun 13 17:31:13 x4 kernel: tty_release: ptm4: freeing structure...
Jun 13 17:31:13 x4 kernel: tty_release: ptm11 (tty count=1)...
Jun 13 17:31:13 x4 kernel: pts11 vhangup...
Jun 13 17:31:13 x4 kernel: tty_read: pts11: ldisc error (0x200a05)
Jun 13 17:31:13 x4 kernel: tty_release: pts11 (tty count=3)...
Jun 13 17:31:13 x4 kernel: tty_release: pts11 (tty count=2)...
Jun 13 17:31:13 x4 kernel: tty_release: ptm11: final close
Jun 13 17:31:13 x4 kernel: tty_release: ptm11: freeing structure...
Jun 13 17:31:13 x4 kernel: tty_release: ptm2 (tty count=1)...
Jun 13 17:31:13 x4 kernel: pts2 vhangup...
Jun 13 17:31:13 x4 kernel: tty_read: pts2: ldisc error (0x200a05)
Jun 13 17:31:13 x4 kernel: tty_release: pts2 (tty count=3)...
Jun 13 17:31:13 x4 kernel: tty_release: pts3 (tty count=2)...
Jun 13 17:31:13 x4 kernel: hung_up_tty_write: pts3: 0x40a06
Jun 13 17:31:13 x4 kernel: hung_up_tty_write: pts3: 0x40a06


--
Markus

2013-06-13 16:29:39

by Peter Hurley

[permalink] [raw]
Subject: Re: [PATCH] tty: Debug EIO from write()

On 06/13/2013 12:16 PM, Markus Trippelsdorf wrote:
> I've reproduced the Mozilla build failure, but there's nothing exiting
> in the logs. The only strange thing is the single tty_write at
> 1507.456817 (roughly at this point the Mozilla build failed):
>
> ...
> [ 1507.453547] tty_release: pts11 (tty count=4)...
> [ 1507.455695] tty_open: opening pts11...
> [ 1507.455699] tty_release: pts11 (tty count=4)...
> [ 1507.455854] tty_open: opening pts11...
> [ 1507.455858] tty_release: pts11 (tty count=4)...
> [ 1507.456815] tty_open: opening pts11...
> [ 1507.456817] tty_write: pts11: 0xa03

Totally obvious now, thanks!
Patch coming.

> [ 1507.456819] tty_release: pts11 (tty count=4)...
> [ 1507.457313] tty_open: opening pts11...
> [ 1507.457317] tty_release: pts11 (tty count=4)...
> [ 1507.458711] tty_open: opening pts11...
> [ 1507.458715] tty_release: pts11 (tty count=4)...
> [ 1507.458918] tty_open: opening pts11...
> [ 1507.458922] tty_release: pts11 (tty count=4)...
> [ 1507.459062] tty_open: opening pts11...
> [ 1507.459066] tty_release: pts11 (tty count=4)...
> ...
>
> Looking through my logs I see a few ldisc errors, but please note that they are
> all unrelated to the actual Mozilla failure.
>
> Jun 13 17:31:07 x4 kernel: tty_open: opening pts11...
> Jun 13 17:31:07 x4 kernel: tty_release: pts11 (tty count=4)...
> Jun 13 17:31:08 x4 kernel: tty_release: ptm1 (tty count=1)...
> Jun 13 17:31:08 x4 kernel: tty_read: pts1: ldisc error (0xa05)

These "ldisc error"s are expected. In this case, the master end has closed while
the slave end is reading....

> Jun 13 17:31:08 x4 kernel: pts1 vhangup...
> Jun 13 17:31:08 x4 kernel: hung_up_tty_write: pts1: 0x200a05

and writing.
The others in this snippet are similar.

> Jun 13 17:31:08 x4 kernel: tty_release: pts1 (tty count=2)...
> Jun 13 17:31:08 x4 kernel: tty_release: ptm1: final close
> Jun 13 17:31:08 x4 kernel: tty_release: ptm1: freeing structure...
> Jun 13 17:31:08 x4 kernel: tty_release: ptm0 (tty count=1)...
> Jun 13 17:31:08 x4 kernel: pts0 vhangup...
> Jun 13 17:31:08 x4 kernel: tty_release: pts0 (tty count=1)...
> Jun 13 17:31:08 x4 kernel: tty_release: pts0: final close
> Jun 13 17:31:08 x4 kernel: tty_release: pts0: freeing structure...
> Jun 13 17:31:08 x4 kernel: tty_release: tty1 (tty count=8)...
> Jun 13 17:31:08 x4 kernel: tty_release: tty3 (tty count=1)...
> Jun 13 17:31:08 x4 kernel: tty_release: tty3: final close
> Jun 13 17:31:08 x4 kernel: tty_release: tty3: freeing structure...
> Jun 13 17:31:11 x4 kernel: tty_open: opening tty1...
> Jun 13 17:31:11 x4 kernel: tty_open: opening tty1...
> Jun 13 17:31:11 x4 kernel: tty_release: tty1 (tty count=9)...
> Jun 13 17:31:13 x4 kernel: tty_release: tty1 (tty count=8)...
> Jun 13 17:31:13 x4 kernel: tty_release: tty2 (tty count=1)...
> Jun 13 17:31:13 x4 kernel: tty_release: tty2: final close
> Jun 13 17:31:13 x4 kernel: tty_release: tty2: freeing structure...
> Jun 13 17:31:13 x4 kernel: tty_release: pts6 (tty count=2)...
> Jun 13 17:31:13 x4 kernel: tty1 vhangup session...
> Jun 13 17:31:13 x4 kernel: tty_open: opening pts2...
> Jun 13 17:31:13 x4 kernel: tty_release: ptm5 (tty count=1)...
> Jun 13 17:31:13 x4 kernel: pts5 vhangup...
> Jun 13 17:31:13 x4 kernel: tty_release: ptm6 (tty count=1)...
> Jun 13 17:31:13 x4 kernel: pts6 vhangup...
> Jun 13 17:31:13 x4 kernel: tty_release: ptm6: final close
> Jun 13 17:31:13 x4 kernel: tty_release: ptm6: freeing structure...
> Jun 13 17:31:13 x4 kernel: tty_release: ptm7 (tty count=1)...
> Jun 13 17:31:13 x4 kernel: pts7 vhangup...
> Jun 13 17:31:13 x4 kernel: tty_release: pts5 (tty count=1)...
> Jun 13 17:31:13 x4 kernel: tty_release: pts5: final close
> Jun 13 17:31:13 x4 kernel: tty_release: pts5: freeing structure...
> Jun 13 17:31:13 x4 kernel: tty_read: pts7: ldisc error (0x200805)
> Jun 13 17:31:13 x4 kernel: tty_release: ptm8 (tty count=1)...
> Jun 13 17:31:13 x4 kernel: pts8 vhangup...
> Jun 13 17:31:13 x4 kernel: tty_read: pts8: ldisc error (0x200805)
> Jun 13 17:31:13 x4 kernel: tty_release: tty1 (tty count=7)...
> Jun 13 17:31:13 x4 kernel: tty_release: tty1 (tty count=6)...
> Jun 13 17:31:13 x4 kernel: tty_release: ptm9 (tty count=1)...
> Jun 13 17:31:13 x4 kernel: pts9 vhangup...
> Jun 13 17:31:13 x4 kernel: tty_release: ptm10 (tty count=1)...
> Jun 13 17:31:13 x4 kernel: pts10 vhangup...
> Jun 13 17:31:13 x4 kernel: hung_up_tty_write: pts10: 0x40a04
> Jun 13 17:31:13 x4 kernel: hung_up_tty_write: pts10: 0x40a04
> Jun 13 17:31:13 x4 kernel: tty_open: opening tty1...
> Jun 13 17:31:13 x4 kernel: tty_release: ptm3 (tty count=1)...
> Jun 13 17:31:13 x4 kernel: pts3 vhangup...
> Jun 13 17:31:13 x4 kernel: tty_read: pts3: ldisc error (0x200805)
> Jun 13 17:31:13 x4 kernel: tty_release: pts7 (tty count=2)...
> Jun 13 17:31:13 x4 kernel: tty_release: pts7 (tty count=1)...
> Jun 13 17:31:13 x4 kernel: tty_release: pts7: final close
> Jun 13 17:31:13 x4 kernel: tty_release: pts7: freeing structure...
> Jun 13 17:31:13 x4 kernel: tty_release: pts4 (tty count=2)...
> Jun 13 17:31:13 x4 kernel: tty_open: opening tty2...
> Jun 13 17:31:13 x4 kernel: tty_release: tty2 (tty count=1)...
> Jun 13 17:31:13 x4 kernel: tty_release: tty2: final close
> Jun 13 17:31:13 x4 kernel: tty_release: tty2: freeing structure...
> Jun 13 17:31:13 x4 kernel: tty_open: opening tty2...
> Jun 13 17:31:13 x4 kernel: tty_release: tty1 (tty count=6)...
> Jun 13 17:31:13 x4 kernel: tty_release: ptm4 (tty count=1)...
> Jun 13 17:31:13 x4 kernel: pts4 vhangup...
> Jun 13 17:31:13 x4 kernel: tty_release: ptm4: final close
> Jun 13 17:31:13 x4 kernel: tty_release: ptm4: freeing structure...
> Jun 13 17:31:13 x4 kernel: tty_release: ptm11 (tty count=1)...
> Jun 13 17:31:13 x4 kernel: pts11 vhangup...
> Jun 13 17:31:13 x4 kernel: tty_read: pts11: ldisc error (0x200a05)
> Jun 13 17:31:13 x4 kernel: tty_release: pts11 (tty count=3)...
> Jun 13 17:31:13 x4 kernel: tty_release: pts11 (tty count=2)...
> Jun 13 17:31:13 x4 kernel: tty_release: ptm11: final close
> Jun 13 17:31:13 x4 kernel: tty_release: ptm11: freeing structure...
> Jun 13 17:31:13 x4 kernel: tty_release: ptm2 (tty count=1)...
> Jun 13 17:31:13 x4 kernel: pts2 vhangup...
> Jun 13 17:31:13 x4 kernel: tty_read: pts2: ldisc error (0x200a05)
> Jun 13 17:31:13 x4 kernel: tty_release: pts2 (tty count=3)...
> Jun 13 17:31:13 x4 kernel: tty_release: pts3 (tty count=2)...
> Jun 13 17:31:13 x4 kernel: hung_up_tty_write: pts3: 0x40a06
> Jun 13 17:31:13 x4 kernel: hung_up_tty_write: pts3: 0x40a06
>
>

2013-06-13 19:56:59

by Peter Hurley

[permalink] [raw]
Subject: [PATCH] tty: Fix transient pty write() EIO

Commit 699390354da6c258b65bf8fa79cfd5feaede50b6
('pty: Ignore slave pty close() if never successfully opened')
introduced a bug with ptys whereby a write() in parallel with an
open() on an existing pty could mistakenly indicate an I/O error.

Only indicate an I/O error if the condition on open() actually exists.

Reported-by: Markus Trippelsdorf <[email protected]>
Signed-off-by: Peter Hurley <[email protected]>
---
drivers/tty/pty.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index 9c2f1bc..dd08d3d 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -238,14 +238,9 @@ static void pty_flush_buffer(struct tty_struct *tty)

static int pty_open(struct tty_struct *tty, struct file *filp)
{
- int retval = -ENODEV;
-
if (!tty || !tty->link)
- goto out;
-
- set_bit(TTY_IO_ERROR, &tty->flags);
+ return -ENODEV;

- retval = -EIO;
if (test_bit(TTY_OTHER_CLOSED, &tty->flags))
goto out;
if (test_bit(TTY_PTY_LOCK, &tty->link->flags))
@@ -256,9 +251,11 @@ static int pty_open(struct tty_struct *tty, struct file *filp)
clear_bit(TTY_IO_ERROR, &tty->flags);
clear_bit(TTY_OTHER_CLOSED, &tty->link->flags);
set_bit(TTY_THROTTLED, &tty->flags);
- retval = 0;
+ return 0;
+
out:
- return retval;
+ set_bit(TTY_IO_ERROR, &tty->flags);
+ return -EIO;
}

static void pty_set_termios(struct tty_struct *tty,
--
1.8.1.2

2013-06-16 17:34:59

by Mikael Pettersson

[permalink] [raw]
Subject: Re: [PATCH] tty: Fix transient pty write() EIO

Peter Hurley writes:
> Commit 699390354da6c258b65bf8fa79cfd5feaede50b6
> ('pty: Ignore slave pty close() if never successfully opened')
> introduced a bug with ptys whereby a write() in parallel with an
> open() on an existing pty could mistakenly indicate an I/O error.
>
> Only indicate an I/O error if the condition on open() actually exists.
>
> Reported-by: Markus Trippelsdorf <[email protected]>
> Signed-off-by: Peter Hurley <[email protected]>

I've tested this for two days on 3.9.0 (where the bug reproduces
easily for me, for some reason I can't seem to trigger it with
3.10-rc5), and with the patch I no longer see the bug.

Tested-by: Mikael Pettersson <[email protected]>

> ---
> drivers/tty/pty.c | 13 +++++--------
> 1 file changed, 5 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
> index 9c2f1bc..dd08d3d 100644
> --- a/drivers/tty/pty.c
> +++ b/drivers/tty/pty.c
> @@ -238,14 +238,9 @@ static void pty_flush_buffer(struct tty_struct *tty)
>
> static int pty_open(struct tty_struct *tty, struct file *filp)
> {
> - int retval = -ENODEV;
> -
> if (!tty || !tty->link)
> - goto out;
> -
> - set_bit(TTY_IO_ERROR, &tty->flags);
> + return -ENODEV;
>
> - retval = -EIO;
> if (test_bit(TTY_OTHER_CLOSED, &tty->flags))
> goto out;
> if (test_bit(TTY_PTY_LOCK, &tty->link->flags))
> @@ -256,9 +251,11 @@ static int pty_open(struct tty_struct *tty, struct file *filp)
> clear_bit(TTY_IO_ERROR, &tty->flags);
> clear_bit(TTY_OTHER_CLOSED, &tty->link->flags);
> set_bit(TTY_THROTTLED, &tty->flags);
> - retval = 0;
> + return 0;
> +
> out:
> - return retval;
> + set_bit(TTY_IO_ERROR, &tty->flags);
> + return -EIO;
> }
>
> static void pty_set_termios(struct tty_struct *tty,
> --
> 1.8.1.2
>