2007-01-24 20:40:12

by Lennart Sorensen

[permalink] [raw]
Subject: Strange problem with tty layer

I am seeing a very strange problem which seems to be in the tty layer.

I am using an exar 17D154 based PCI card (like the digi neo style card)
using the jsm driver. Kernel version 2.6.16.25.

My test involves connecting two ports together with a cross over cable
and then sending a test pattern of characters. I am currently using
11520 characters at 230400bps (so 0.5s for the transfer).

Most of the time, it works perfectly, but once in a while, I never
receiver the last few characters (between 1 and 65 or so characters it
seems).

I have confirmed that the driver does in fact receive all the
characters, and that they are correct, and that they are being passed to
the tty layer using tty_insert_flip_string, and that it returns that all
the characters have been passed to the tty layer. The user space
application however still doesn't see the last few characters (when it
fails).

The problem seems to occour every few hours of testing on a 266MHz Geode
SC1200. When I change the clock to 133MHz, it happens every few minutes
instead (so much more frequently). I suspect there is some race
condition that allows the tty layer to not get around to processing all
the data in the buffer, even when asked for data by the application
(which is waiting on the serial port using select, with a 4s timeout).

Any suggestions on where to head next to debug this?

Sending a few more characters when the receive times out before getting
the complete message does not cause the missing data to arive it seems,
only completely closing and starting the test program again seems to
recover it, although I may just not have been patient enough.

--
Len Sorensen


2007-01-24 21:19:56

by Kilau, Scott

[permalink] [raw]
Subject: Re: Strange problem with tty layer

Hi Len,

> I am seeing a very strange problem which seems to be in the tty layer.
>
> I am using an exar 17D154 based PCI card (like the digi neo style
card)
> using the jsm driver. Kernel version 2.6.16.25.

There are a couple interesting things I would do here.

1) The tty "flip" buffer stuff was changed in 2.6.16+.

Maybe you could try going downwards to 2.6.15 or below and see if the
problem
still exists.

Then try going up to 2.6.19, I believe there were some bugs fixed
in 2.6.17/18/19, not sure how many of those made it backported into
2.6.16.25.

2) Try using Digi's "Out-Of-Kernel-Source-Tree"/GPL version of the
driver.
You can grab the latest and greatest beta version of it from here:
ftp://ftp1.digi.com/support/beta/linux/dgnc/
Its in source rpm format, let me know if you need it in a tarball format
instead.

You obviously will have to add in your specific PCI id into the driver,
but that should be a problem.

3) I seem to recall that someone mentioned that the Exar 17D154 PCI
chips were
easily "pushed" into the 8250 driver.
It might be interesting to try that route as well.

Scott Kilau

2007-01-24 21:20:55

by Paul Fulghum

[permalink] [raw]
Subject: Re: Strange problem with tty layer

Lennart Sorensen wrote:
> I have confirmed that the driver does in fact receive all the
> characters, and that they are correct, and that they are being passed to
> the tty layer using tty_insert_flip_string, and that it returns that all
> the characters have been passed to the tty layer. The user space
> application however still doesn't see the last few characters (when it
> fails).
>
> The problem seems to occour every few hours of testing on a 266MHz Geode
> SC1200. When I change the clock to 133MHz, it happens every few minutes
> instead (so much more frequently). I suspect there is some race
> condition that allows the tty layer to not get around to processing all
> the data in the buffer, even when asked for data by the application
> (which is waiting on the serial port using select, with a 4s timeout).

In 2.6.16 the tty buffering pushes data to the line
discipline without regard to tty->receive_room.
If the line discipline can't keep up, the data gets dropped.
I observed this data loss at higher speeds when
placing the system under heavy load.

2.6.18 added code to respect tty->receive_room.

This may or may not be your problem, but you should
be able to check by adding a conditional printk
to drivers/char/tty_io.c:flush_to_ldisc()

If tty->receive_room is less than the size of the buffer
passed to disc->receive_buf() then you are losing data.

--
Paul Fulghum
Microgate Systems, Ltd.

2007-01-24 21:34:48

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange problem with tty layer

On Wed, Jan 24, 2007 at 03:19:53PM -0600, Kilau, Scott wrote:
> There are a couple interesting things I would do here.
>
> 1) The tty "flip" buffer stuff was changed in 2.6.16+.
>
> Maybe you could try going downwards to 2.6.15 or below and see if the
> problem
> still exists.

That looked like a lot of changes to the tty layer. I think going up
sounds better. Finding the bug fix and backporting it would be
preferable of course.

> Then try going up to 2.6.19, I believe there were some bugs fixed
> in 2.6.17/18/19, not sure how many of those made it backported into
> 2.6.16.25.

Well I can certainly compare the code between those versions, and I have
a 2.6.18 kernel I could try out too.

> 2) Try using Digi's "Out-Of-Kernel-Source-Tree"/GPL version of the
> driver.
> You can grab the latest and greatest beta version of it from here:
> ftp://ftp1.digi.com/support/beta/linux/dgnc/
> Its in source rpm format, let me know if you need it in a tarball format
> instead.

Yes, I run Debian. rpm has very little interest. I can convert it
though.

> You obviously will have to add in your specific PCI id into the driver,
> but that should be a problem.

I did that.

> 3) I seem to recall that someone mentioned that the Exar 17D154 PCI
> chips were
> easily "pushed" into the 8250 driver.
> It might be interesting to try that route as well.

I am using the GPIO lines too, and didn't want to mess with the 8250
driver (since I use that for a serial console on a 16550 UART), plus
being able to use the 64byte fifo rather than 16byte 16550 mode fifo
seems nicer. I had to take the exar out of the 8250 driver to make it
not take control of it (although now the eeprom has been added to the
board so it no longer appears as a generic exar chip).

Thanks for the suggestions though. Gives me a few things to try.

--
Len Sorensen

2007-01-24 21:48:42

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange problem with tty layer

On Wed, Jan 24, 2007 at 03:20:53PM -0600, Paul Fulghum wrote:
> In 2.6.16 the tty buffering pushes data to the line
> discipline without regard to tty->receive_room.
> If the line discipline can't keep up, the data gets dropped.
> I observed this data loss at higher speeds when
> placing the system under heavy load.
>
> 2.6.18 added code to respect tty->receive_room.
>
> This may or may not be your problem, but you should
> be able to check by adding a conditional printk
> to drivers/char/tty_io.c:flush_to_ldisc()
>
> If tty->receive_room is less than the size of the buffer
> passed to disc->receive_buf() then you are losing data.

Sounds plausible. Certainly higher cpu load makes the problem occour
more often, and making the system slower made it worse too. I will take
a look at the tty_io.c in 2.6.18 and compare it against 2.6.16.

--
Len Sorensen

2007-01-25 14:31:22

by Russell King

[permalink] [raw]
Subject: Re: Strange problem with tty layer

On Wed, Jan 24, 2007 at 04:34:47PM -0500, Lennart Sorensen wrote:
> I am using the GPIO lines too, and didn't want to mess with the 8250
> driver (since I use that for a serial console on a 16550 UART), plus
> being able to use the 64byte fifo rather than 16byte 16550 mode fifo
> seems nicer.

Note that 8250 will use the deeper fifos if it knows how to.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2007-01-25 15:16:41

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange problem with tty layer

On Wed, Jan 24, 2007 at 03:20:53PM -0600, Paul Fulghum wrote:
> In 2.6.16 the tty buffering pushes data to the line
> discipline without regard to tty->receive_room.
> If the line discipline can't keep up, the data gets dropped.
> I observed this data loss at higher speeds when
> placing the system under heavy load.
>
> 2.6.18 added code to respect tty->receive_room.
>
> This may or may not be your problem, but you should
> be able to check by adding a conditional printk
> to drivers/char/tty_io.c:flush_to_ldisc()
>
> If tty->receive_room is less than the size of the buffer
> passed to disc->receive_buf() then you are losing data.

I am now trying this, which so far seem to help (I had a printk in there
earlier and managed to trigger that).

--- ori/drivers/char/tty_io.c 2007-01-24 18:02:48.000000000 -0500
+++ new/drivers/char/tty_io.c 2007-01-25 09:50:02.000000000 -0500
@@ -2774,6 +2778,14 @@
spin_lock_irqsave(&tty->buf.lock, flags);
while((tbuf = tty->buf.head) != NULL) {
while ((count = tbuf->commit - tbuf->read) != 0) {
+ if (!tty->receive_room) {
+ schedule_delayed_work(&tty->buf.work, 1);
+ spin_unlock_irqrestore(&tty->buf.lock, flags);
+ goto out;
+ }
+ if (count > tty->receive_room) {
+ count = tty->receive_room;
+ }
char_buf = tbuf->char_buf_ptr + tbuf->read;
flag_buf = tbuf->flag_buf_ptr + tbuf->read;
tbuf->read += count;

This appeared to be (essentially) the key change in 2.6.18 related to
the check of tty->receive_room.

I will now run a bunch more tests to see if it manages to keep it from
having any more character losses.

Thank you for the suggestion of where to look.

--
Len Sorensen

2007-01-26 14:20:43

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange problem with tty layer

On Thu, Jan 25, 2007 at 10:16:39AM -0500, Lennart Sorensen wrote:
> I am now trying this, which so far seem to help (I had a printk in there
> earlier and managed to trigger that).
>
> --- ori/drivers/char/tty_io.c 2007-01-24 18:02:48.000000000 -0500
> +++ new/drivers/char/tty_io.c 2007-01-25 09:50:02.000000000 -0500
> @@ -2774,6 +2778,14 @@
> spin_lock_irqsave(&tty->buf.lock, flags);
> while((tbuf = tty->buf.head) != NULL) {
> while ((count = tbuf->commit - tbuf->read) != 0) {
> + if (!tty->receive_room) {
> + schedule_delayed_work(&tty->buf.work, 1);
> + spin_unlock_irqrestore(&tty->buf.lock, flags);
> + goto out;
> + }
> + if (count > tty->receive_room) {
> + count = tty->receive_room;
> + }
> char_buf = tbuf->char_buf_ptr + tbuf->read;
> flag_buf = tbuf->flag_buf_ptr + tbuf->read;
> tbuf->read += count;
>
> This appeared to be (essentially) the key change in 2.6.18 related to
> the check of tty->receive_room.
>
> I will now run a bunch more tests to see if it manages to keep it from
> having any more character losses.
>
> Thank you for the suggestion of where to look.

Well it turns out that didn't help. Neither does 2.6.18 (that one was
the easiest newer one to try). It does seem as if the error rate is
lower with 2.6.18 than with 2.6.16, so perhaps there was more than one
place that could cause losses in the tty buffering. I had only 2
failures in 15 hours with 2.6.18, rather than a whole lot of failures
with 2.6.16. I guess I will have to try 2.6.19 or even something newer.

--
Len Sorensen

2007-01-26 14:50:57

by Paul Fulghum

[permalink] [raw]
Subject: Re: Strange problem with tty layer

Lennart Sorensen wrote:
> Well it turns out that didn't help. Neither does 2.6.18 (that one was
> the easiest newer one to try). It does seem as if the error rate is
> lower with 2.6.18 than with 2.6.16, so perhaps there was more than one
> place that could cause losses in the tty buffering. I had only 2
> failures in 15 hours with 2.6.18, rather than a whole lot of failures
> with 2.6.16. I guess I will have to try 2.6.19 or even something newer.

You can eliminate the tty buffering altogether
by observing what gets passed to the line discipline.

I assume you are using the default line discipline N_TTY.

Look at what is passed to drivers/char/n_tty.c:n_tty_receive_buf()
If all the data gets that far, then there is some issue
with the line discipline or something further downstream.
If not, then the problem is with the tty buffering (assuming
you are correct that all data gets to the tty buffering code
followed by a tty_flip_buffer_push call).

--
Paul Fulghum
Microgate Systems, Ltd.

2007-01-26 14:56:11

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange problem with tty layer

On Fri, Jan 26, 2007 at 08:51:02AM -0600, Paul Fulghum wrote:
> You can eliminate the tty buffering altogether
> by observing what gets passed to the line discipline.

I will have to find where in the code that is happening.

> I assume you are using the default line discipline N_TTY.
>
> Look at what is passed to drivers/char/n_tty.c:n_tty_receive_buf()
> If all the data gets that far, then there is some issue
> with the line discipline or something further downstream.
> If not, then the problem is with the tty buffering (assuming
> you are correct that all data gets to the tty buffering code
> followed by a tty_flip_buffer_push call).

I am not sure actually. I just open /dev/ttyn0 and /dev/ttyn1 and write
to one, and read from the other. I didn't even know about the line
diciplines actually. How do I tell which one I am using?

I have confirmed that all the data is being passed to
tty_insert_flip_string() in jsm_input().

--
Len Sorensen

2007-01-26 15:06:48

by Paul Fulghum

[permalink] [raw]
Subject: Re: Strange problem with tty layer

Lennart Sorensen wrote:
> I am not sure actually. I just open /dev/ttyn0 and /dev/ttyn1 and write
> to one, and read from the other. I didn't even know about the line
> diciplines actually. How do I tell which one I am using?

ioctl(TIOCSETD/TIOCGETD) sets/returns an integer identifier
that can be compared agains the N_XXX macros. If you are
not explicitly setting this then is is probably the default N_TTY.

Also at the application level, look at tcsetattr() for setting
the termios features. Look specifically at the c_cc[VTIME] and c_cc[VMIN]
members of the termios structure. These settings control how
much data must be available before returning data to a read().
Try VTIME=0 and VMIN=1.

Since your 'missing' data is always on the tail end, maybe
VMIN is set to 64 or something.

--
Paul Fulghum
Microgate Systems, Ltd.

2007-01-26 16:08:44

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange problem with tty layer

On Fri, Jan 26, 2007 at 09:06:52AM -0600, Paul Fulghum wrote:
> ioctl(TIOCSETD/TIOCGETD) sets/returns an integer identifier
> that can be compared agains the N_XXX macros. If you are
> not explicitly setting this then is is probably the default N_TTY.

Yes it is N_TTY (value 0). I never set it.

> Also at the application level, look at tcsetattr() for setting
> the termios features. Look specifically at the c_cc[VTIME] and c_cc[VMIN]
> members of the termios structure. These settings control how
> much data must be available before returning data to a read().
> Try VTIME=0 and VMIN=1.
>
> Since your 'missing' data is always on the tail end, maybe
> VMIN is set to 64 or something.

OK, the tty settings according to stty that I am using are:
# stty -F /dev/ttyn0 -a
speed 230400 baud; rows 0; columns 0; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>; eol2 = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R;
werase = ^W; lnext = ^V; flush = ^O; min = 1; time = 0;
-parenb -parodd cs8 hupcl -cstopb cread clocal -crtscts
-ignbrk -brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl -ixon -ixoff -iuclc -ixany -imaxbel
-opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
-isig -icanon iexten -echo -echoe echok -echonl -noflsh -xcase -tostop -echoprt echoctl echoke

So min and time seems good. I am not explicitly setting those in my
test program (unless one of the other settings implies it). I will add
it explicitly just in case too.

--
Len Sorensen

2007-01-29 19:27:13

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange problem with tty layer

On Fri, Jan 26, 2007 at 11:08:42AM -0500, Lennart Sorensen wrote:
> OK, the tty settings according to stty that I am using are:
> # stty -F /dev/ttyn0 -a
> speed 230400 baud; rows 0; columns 0; line = 0;
> intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>; eol2 = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R;
> werase = ^W; lnext = ^V; flush = ^O; min = 1; time = 0;
> -parenb -parodd cs8 hupcl -cstopb cread clocal -crtscts
> -ignbrk -brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl -ixon -ixoff -iuclc -ixany -imaxbel
> -opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
> -isig -icanon iexten -echo -echoe echok -echonl -noflsh -xcase -tostop -echoprt echoctl echoke
>
> So min and time seems good. I am not explicitly setting those in my
> test program (unless one of the other settings implies it). I will add
> it explicitly just in case too.

I have run some tests with 8 patches from Linus's 2.6 tree on top of the
2.6.16.25 along with a bit of debug code in the n_tty.c, which ran
perfectly for 3 days. I now removed the debug code to see if it will
still run perfectly with those 8 patches. The patches I applied in
order are:

commit 70522e121a521aa09bd0f4e62e1aa68708b798e1
commit 41c28ff1635e71af072c4711ff5fadd5855d48e7
commit 1aef821a6b3aeca8c19d06aee012ed9db617d1e3
commit ee37df7877eeaa16d7761cce64854110a7c17ad9
commit e1a2509023785bd3199ac068ab80155aeba01265
commit 817d6d3bceaf34c99f5343820f9b9e6021f0655c
commit 2c3bb20f46709a0adfa7ea408013edbcab945d5a
commit 33b37a33c242542fac2980b8ccd90977388b7a8d

Hopefully it will stay working, and actually be a solution to my tty
problem in 2.6.16.25.

2.6.18.2 (Debian 2.6.18-4) on the other hand was still broken, but I am
not sure which if any of the above commits are not in that version (the
date ranges made me think they all should be, but maybe I got the date
of 2.6.18 wrong, or not all commits in the linux-2.6 tree are actually
part of a given release. I am still figuring out git).

--
Len Sorensen

2007-02-17 16:09:21

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Strange problem with tty layer

On Mon, Jan 29, 2007 at 02:27:07PM -0500, Lennart Sorensen wrote:
> I have run some tests with 8 patches from Linus's 2.6 tree on top of the
> 2.6.16.25 along with a bit of debug code in the n_tty.c, which ran
> perfectly for 3 days. I now removed the debug code to see if it will
> still run perfectly with those 8 patches. The patches I applied in
> order are:
>
> commit 70522e121a521aa09bd0f4e62e1aa68708b798e1
> commit 41c28ff1635e71af072c4711ff5fadd5855d48e7
> commit 1aef821a6b3aeca8c19d06aee012ed9db617d1e3
> commit ee37df7877eeaa16d7761cce64854110a7c17ad9
> commit e1a2509023785bd3199ac068ab80155aeba01265
> commit 817d6d3bceaf34c99f5343820f9b9e6021f0655c
> commit 2c3bb20f46709a0adfa7ea408013edbcab945d5a
> commit 33b37a33c242542fac2980b8ccd90977388b7a8d
>
> Hopefully it will stay working, and actually be a solution to my tty
> problem in 2.6.16.25.
>
> 2.6.18.2 (Debian 2.6.18-4) on the other hand was still broken, but I am
> not sure which if any of the above commits are not in that version (the
> date ranges made me think they all should be, but maybe I got the date
> of 2.6.18 wrong, or not all commits in the linux-2.6 tree are actually
> part of a given release. I am still figuring out git).

2.6.18 does work, it failed in different ways (overruns on the fifo)
which I dealt with by ensuring disk io was synced before testing the
serial port.

--
Len Sorensen