2002-09-10 22:17:39

by Dan Christian

Subject: 2.4.18 serial drops characters with 16654

I've got a 2.4.18-10 (RedHat) running on a 2 processor Athlon (1.5Ghz).

If I send data over a PCI 16654 serial card (Connect Tech Blue Heat) and
RTSCTS flow control is used, characters are dropped. The drops are
pretty consistent. As far as I can tell, the data can only be lost in
the driver (I'm re-trying the write until all the data gets out).
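For readers following along, here is a minimal sketch of the kind of sender described above: RTS/CTS enabled through termios, and a loop that re-tries write() until the kernel has accepted every byte. The helper names enable_rtscts() and write_all() are made up for illustration; this is not the test program used in the report.

```c
#define _DEFAULT_SOURCE   /* exposes CRTSCTS with glibc */
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <termios.h>
#include <unistd.h>

/* Turn on RTS/CTS flow control for an already-open tty.
 * Returns 0 on success, -1 on error (e.g. fd is not a tty). */
int enable_rtscts(int fd)
{
    struct termios tio;

    if (tcgetattr(fd, &tio) < 0)
        return -1;
    tio.c_cflag |= CRTSCTS;
    return tcsetattr(fd, TCSANOW, &tio);
}

/* Keep calling write() until every byte has been accepted by the kernel.
 * Returns 0 on success, -1 on a hard error (errno is set). */
int write_all(int fd, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR || errno == EAGAIN)
                continue;   /* retry; a real program would poll() here */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Once write_all() returns 0, every byte has been handed to the kernel, which is why any loss after that point has to be in the driver, the UART, or on the wire.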

If I use a 16550, then everything is fine. Unfortunately, I can't get rid
of the 16654s.

If I use a 1 processor Athlon running 2.4.9-34 (RedHat), then
everything is fine.

I haven't been able to test the 2.4.18 SMP system in single processor
mode, because the IO-APIC goes nuts. But that's another bug...

Anybody know why the serial driver is losing data?

I'm not on linux-kernel, so please reply directly.

-Dan


2002-09-11 14:24:05

by Ed Vance

Subject: RE: 2.4.18 serial drops characters with 16654

On Tue, September 10, 2002 at 3:22 PM, Dan Christian wrote:
> I've got a 2.4.18-10 (RedHat) running on a 2 processor Athlon (1.5Ghz).
> If I send data over a PCI 16654 serial card (Connect Tech Blue Heat) and
> RTSCTS flow control is used, characters are dropped. The drops are
> pretty consistent. As far as I can tell, the data can only be lost in
> the driver (I'm re-trying the write until all the data gets out).
>
> If I use a 16550, then everything is fine. Unfortunately, I can't get
> rid of the 16654s.
>
> If I use a 1 processor Athlon running 2.4.9-34 (RedHat), then
> everything is fine.
>
> I haven't been able to test the 2.4.18 SMP system in single processor
> mode, because the IO-APIC goes nuts. But that's another bug...
>
> Anybody know why the serial driver is losing data?
>
> I'm not on linux-kernel, so please reply directly.

Hi Dan,

We use Exar ST16C654D chips on a cPCI 16-port mux we build and have not
(yet) had a problem report on it for this. Maybe I can reproduce the symptom
on this board. What vendor marking is on your UARTs? Could you tell me more
about your test setup and specifically how often data is dropped and how
many characters are dropped each time? What kind of device is receiving the
data and how much receive FIFO does it have left when it drops RTS to tell
the Blue Heat to stop?

Best regards,
Ed

----------------------------------------------------------------
Ed Vance serial24 (at) macrolink (dot) com
Macrolink, Inc. 1500 N. Kellogg Dr Anaheim, CA 92807
----------------------------------------------------------------

2002-09-11 23:06:17

by Andreas Steinmetz

Subject: Re: 2.4.18 serial drops characters with 16654

I did see something that looks quite similar to dropped characters on
Redhat and 2.4.9 based UP systems (that's the customer's choice and couldn't
be changed) equipped with a NS-87336.
I can't go into detail but my company did port an application from DOS
to Linux. The application communicates with an electronic cash device
over a serial port. The coding for this communication wasn't modified
during the port from DOS to Linux. It is furthermore important to know
that both DOS and Linux run on exactly the same hardware.
The only thing different between DOS and Linux regarding the serial port
is that the DOS application has assembler code for serial port access
whereas the Linux version uses the standard kernel interface.
What happened was that while the DOS version worked, the Linux version
had communication problems which pointed to single ACK bytes sporadically
not being transmitted. OTOH logging in the application showed that
these ACK bytes were delivered to and accepted by the kernel.
The only way to handle this problem was to implement a protocol based
workaround. I can only state what happened; no testing or analysis is
possible from my side in this case. The same is true for further
information, i.e. I can't go into any more detail. Sorry.

Ed Vance wrote:
> On Tue, September 10, 2002 at 3:22 PM, Dan Christian wrote:
>
>>I've got a 2.4.18-10 (RedHat) running on a 2 processor Athlon (1.5Ghz).
>>If I send data over a PCI 16654 serial card (Connect Tech Blue Heat) and
>>RTSCTS flow control is used, characters are dropped. The drops are
>>pretty consistent. As far as I can tell, the data can only be lost in
>>the driver (I'm re-trying the write until all the data gets out).
>>
>>If I use a 16550, then everything is fine. Unfortunately, I can't get
>>rid of the 16654s.
>>
>>If I use a 1 processor Athlon running 2.4.9-34 (RedHat), then
>>everything is fine.
>>
>>I haven't been able to test the 2.4.18 SMP system in single processor
>>mode, because the IO-APIC goes nuts. But that's another bug...
>>
>>Anybody know why the serial driver is losing data?
>>
>>I'm not on linux-kernel, so please reply directly.
>
>
> Hi Dan,
>
> We use Exar ST16C654D chips on a cPCI 16-port mux we build and have not
> (yet) had a problem report on it for this. Maybe I can reproduce the symptom
> on this board. What vendor marking is on your UARTs? Could you tell me more
> about your test setup and specifically how often data is dropped and how
> many characters are dropped each time? What kind of device is receiving the
> data and how much receive FIFO does it have left when it drops RTS to tell
> the Blue Heat to stop?
>
> Best regards,
> Ed
>
> ----------------------------------------------------------------
> Ed Vance serial24 (at) macrolink (dot) com
> Macrolink, Inc. 1500 N. Kellogg Dr Anaheim, CA 92807
> ----------------------------------------------------------------

--
Andreas Steinmetz
D.O.M. Datenverarbeitung GmbH

2002-09-12 08:15:36

by Alan Cox

Subject: Re: 2.4.18 serial drops characters with 16654

On Thu, 2002-09-12 at 00:12, Andreas Steinmetz wrote:
> I did see something that looks quite similar to dropped characters on
> Redhat and 2.4.9 based UP systems (that's the customer's choice and couldn't
> be changed) equipped with a NS-87336.
> I can't go into detail but my company did port an application from DOS
> to Linux. The application communicates with an electronic cash device

Other than the usual PIO mode IDE suspects I've had no problems going up
to 460800bps with a decent UART (i.e. one with a FIFO). At 920Kbit/sec you
begin to overrun the flip buffers if you run with the usual 100Hz timer
tick.
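The arithmetic behind that threshold can be sketched as follows. The assumptions are mine, not stated in the mail: 10 bits on the wire per character (8N1) and one flip-buffer hand-off per timer tick. The buffer size is left as a parameter; 512 bytes is used below only as an illustrative value.

```c
/* Bytes arriving during one timer tick, assuming 10 bits per character
 * on the wire (start bit + 8 data bits + stop bit, i.e. 8N1). */
unsigned long bytes_per_tick(unsigned long baud, unsigned long hz)
{
    return baud / 10 / hz;
}

/* Nonzero if one tick's worth of received data no longer fits in a
 * buffer of buf_size bytes (buf_size is an assumed, illustrative value). */
int overruns_per_tick(unsigned long baud, unsigned long hz,
                      unsigned long buf_size)
{
    return bytes_per_tick(baud, hz) > buf_size;
}
```

With hz = 100 this gives 460 bytes per tick at 460800bps and 921 bytes per tick at 921600bps, so any buffer between those two sizes would fit the behaviour described: fine at 460800, overrunning at 920Kbit/sec.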

2.4 is a bit worse nowadays because of the ksoftirqd stuff but you could
easily disable that if you think it is triggering.

2002-09-12 14:53:23

by Stuart MacDonald

Subject: Re: 2.4.18 serial drops characters with 16654

Dan, sorry I missed your original message.

From: "Ed Vance" <[email protected]>
> On Tue, September 10, 2002 at 3:22 PM, Dan Christian wrote:
> > I've got a 2.4.18-10 (RedHat) running on a 2 processor Athlon (1.5Ghz).
> > If I send data over a PCI 16654 serial card (Connect Tech Blue Heat) and
> > RTSCTS flow control is used, characters are dropped. The drops are
> > pretty consistent. As far as I can tell, the data can only be lost in
> > the driver (I'm re-trying the write until all the data gets out).

Data loss should not happen with flow control on. Please contact
me directly, or our support ([email protected]) desk to open a call.
Otherwise I'll have to look at it in my CFT.

Things I'll need to know:
- kernel version :: 2.4.18-10
- distribution version :: RedHat ??
- serial driver version :: ??
(this is reported at boot, cat /var/log/messages or dmesg)
- are you using our official driver + patch set?

> We use Exar ST16C654D chips on a cPCI 16-port mux we build and have not
> (yet) had a problem report on it for this. Maybe I can reproduce the symptom
> on this board. What vendor marking is on your UARTs? Could you tell me more
We're using Exars as well, although perhaps not the rev D.

..Stu

--
We make multiport serial boards.
<http://www.connecttech.com>
(800) 426-8979


2002-09-19 17:23:03

by Dan Christian

Subject: Re: 2.4.18 serial drops characters with 16654

The weird thing is that it looks like the 16654 loses data on the
TRANSMIT side. A FIFO underrun on transmit should never lose data. A
16550 works perfectly. I don't think that this is an overrun/underrun
problem.

The problem seems to be related to the RTS/CTS flow control handling.
The 16654 handles flow control in hardware, but the 16550 does it in
software (I've verified this with a digital oscilloscope). I don't
currently have the equipment to compare when the lines drop and which
characters are lost.
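Short of a logic analyser, the CTS line can at least be sampled from user space with the TIOCMGET ioctl. The sketch below is only a rough aid: polling like this is far too coarse to catch the exact byte at which CTS drops, and the helper names are hypothetical.

```c
#include <sys/ioctl.h>

/* Nonzero if the CTS bit is set in a status word returned by TIOCMGET. */
int cts_asserted(int status)
{
    return (status & TIOCM_CTS) != 0;
}

/* Sample the modem-control lines of an open tty.
 * Returns 1 if CTS is asserted, 0 if not, -1 if the ioctl fails. */
int read_cts(int fd)
{
    int status;

    if (ioctl(fd, TIOCMGET, &status) < 0)
        return -1;
    return cts_asserted(status);
}
```

Logging read_cts() alongside the byte count written so far would give a coarse picture of whether drops cluster around CTS transitions.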

-Dan

On Thursday 12 September 2002 01:20, Alan Cox wrote:
> On Thu, 2002-09-12 at 00:12, Andreas Steinmetz wrote:
> > I did see something that looks quite similar to dropped
> > characters on Redhat and 2.4.9 based UP systems (that's the customer's
> > choice and couldn't be changed) equipped with a NS-87336.
> > I can't go into detail but my company did port an application from
> > DOS to Linux. The application communicates with an electronic cash
> > device
>
> Other than the usual PIO mode IDE suspects I've had no problems going
> up to 460800bps with a decent UART (i.e. one with a FIFO). At
> 920Kbit/sec you begin to overrun the flip buffers if you run with the
> usual 100Hz timer tick.
>
> 2.4 is a bit worse nowadays because of the ksoftirqd stuff but you
> could easily disable that if you think it is triggering.

2002-09-19 17:29:18

by Alan Cox

Subject: Re: 2.4.18 serial drops characters with 16654

On Thu, 2002-09-19 at 18:27, Dan Christian wrote:
> The problem seems to be related to the RTS/CTS flow control handling.
> The 16654 handles flow control in hardware, but the 16550 does it in
> software (I've verified this with a digital oscilloscope). I don't
> currently have the equipment to compare when the lines drop and which
> characters are lost.

Actually you can do it in hardware on the 16550 depending on how it's wired.
Take a look at the usenet-2 serial port design some day. The software
mode we use does in theory mean that a heavy delay to the bh handling might
delay the assertion excessively. That I think may be the real
explanation here.

It's:
buffer full
bh handler delayed by bh load (tasklet nowadays I guess I mean)
overrun
overrun
...
ksoftirqd
Oh look I should do carrier

Russell - does that sound reasonable?

If so the answer yet again (as with the gige performance and some
others) might be to make it much much harder for stuff to fall back to
ksoftirqd.




2002-09-19 17:56:08

by Russell King

Subject: Re: 2.4.18 serial drops characters with 16654

On Thu, Sep 19, 2002 at 06:38:52PM +0100, Alan Cox wrote:
> Actually you can do it in hardware on the 16550 depending on how it's wired.
> Take a look at the usenet-2 serial port design some day. The software
> mode we use does in theory mean that a heavy delay to the bh handling might
> delay the assertion excessively. That I think may be the real
> explanation here.
>
> It's:
> buffer full
> bh handler delayed by bh load (tasklet nowadays I guess I mean)
> overrun
> overrun
> ...
> ksoftirqd
> Oh look I should do carrier
>
> Russell - does that sound reasonable?

Hmm, looking at the tty stuff, I'd say it's a distinct possibility. Even
more so since the flip buffer handler is put on tq_timer, which is subject
to ksoftirqd.

However, at the point when we hand data to the tty layer, we should have
2048 bytes left in the flip buffer before we really start soft overrunning
(vs hardware overrunning.) I notice that we don't make any attempt to
report such an event to user space, even when user space wants to know
about overruns.

Christian - what baud rate are you running these uarts at?

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-09-19 18:19:28

by Dan Christian

Subject: Re: 2.4.18 serial drops characters with 16654

This still isn't getting at the core problem. I'm sending data out the
port and dropping characters. The receive works fine.

It can't be a problem with the receiving device being over-run, since
the 16550 works (even though it sends several bytes after CTS drops),
and the 16654 doesn't (it stops after the current byte).

I think that data must be lost when the receiving device drops CTS.
Either this is a hardware flaw (and data is lost from the transmit
FIFO), or there is some kind of race condition between the CTS drop and
re-loading the FIFO.

-Dan

On Thursday 19 September 2002 10:38, Alan Cox wrote:
> On Thu, 2002-09-19 at 18:27, Dan Christian wrote:
> > The problem seems to be related to the RTS/CTS flow control
> > handling. The 16654 handles flow control in hardware, but the 16550
> > does it in software (I've verified this with a digital
> > oscilloscope). I don't currently have the equipment to compare
> > when the lines drop and which characters are lost.
>
> Actually you can do it in hardware on the 16550 depending on how it's
> wired. Take a look at the usenet-2 serial port design some day. The
> software mode we use does in theory mean that a heavy delay to the bh
> handling might delay the assertion excessively. That I think may be
> the real explanation here.
>
> It's:
> buffer full
> bh handler delayed by bh load (tasklet nowadays I guess I mean)
> overrun
> overrun
> ...
> ksoftirqd
> Oh look I should do carrier
>
> Russell - does that sound reasonable?
>
> If so the answer yet again (as with the gige performance and some
> others) might be to make it much much harder for stuff to fall back to
> ksoftirqd.

2002-09-19 18:34:08

by Dan Christian

Subject: Re: 2.4.18 serial drops characters with 16654

On Thursday 19 September 2002 11:01, Russell King wrote:
> Hmm, looking at the tty stuff, I'd say it's a distinct possibility.
> Even more so since the flip buffer handler is put on tq_timer, which
> is subject to ksoftirqd.
>
>
> However, at the point when we hand data to the tty layer, we should
> have 2048 bytes left in the flip buffer before we really start soft
> overrunning (vs hardware overrunning.) I notice that we don't make
> any attempt to report such an event to user space, even when user
> space wants to know about overruns.
>
>
> Christian - what baud rate are you running these uarts at?

I've reproduced the drops at very low baud rates. I didn't take notes,
but I think that I was getting the same sort of behavior at about
19.2 kbps as at 115 kbps.

It really doesn't look like a speed thing. The CPU is >1GHz and the CPU
usage is always below 10%. Since it works with a 16550 (interrupting
every 16? bytes) and not with a 16654 (interrupting every 64? bytes), it
doesn't seem like something is too slow.
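To put rough numbers on that, here is a small sketch of how long a full transmit FIFO takes to drain, which is about how often the transmit interrupt has to be serviced. It assumes 10 bits per character (8N1) and takes the 16 and 64 byte FIFO depths from the guesses above; fifo_drain_us() is a hypothetical helper.

```c
/* Microseconds needed to shift fifo_bytes characters out at the given
 * baud rate, assuming 10 bits per character (8N1). Truncates toward 0. */
unsigned long fifo_drain_us(unsigned long baud, unsigned long fifo_bytes)
{
    unsigned long long bits = (unsigned long long)fifo_bytes * 10ULL;

    return (unsigned long)(bits * 1000000ULL / baud);
}
```

At 115200 baud a 16-byte FIFO drains in about 1.4 ms and a 64-byte FIFO in about 5.6 ms; at 19200 baud the figures are about 8.3 ms and 33 ms. On those numbers the deeper FIFO gives the CPU more slack, not less, which supports the point that this is not a speed problem.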

-Dan

2002-09-19 18:38:08

by Russell King

Subject: Re: 2.4.18 serial drops characters with 16654

On Thu, Sep 19, 2002 at 11:24:24AM -0700, Dan Christian wrote:
> This still isn't getting at the core problem. I'm sending data out the
> port and dropping characters. The receive works fine.
>
> It can't be a problem with the receiving device being over-run, since
> the 16550 works (even though it sends several bytes after CTS drops),
> and the 16654 doesn't (it stops after the current byte).
>
> I think that data must be lost when the receiving device drops CTS.
> Either this is a hardware flaw (and data is lost from the transmit
> FIFO), or there is some kind of race condition between the CTS drop and
> re-loading the FIFO.

Ok, it doesn't sound like FIFO underrun, but FIFO overrun. In theory,
this should never ever happen on the transmit side, however you appear
to be seeing exactly this.

The first thing I'll ask is that you check that the port is being
recognised as an ST16654 and not a 16650V2. The former has 64 bytes of
FIFO, the latter has 256 bytes.

Secondly, how many characters on average do you seem to be dropping in
one go? I'm not expecting an exact figure, just a rough idea will
probably do.
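One way to get a rough figure is to compare the driver's own transmit counter with the number of bytes the receiver saw. The sketch below reads the per-port counters with the TIOCGICOUNT ioctl and takes the difference between two snapshots. It assumes the driver in use fills in the tx field of struct serial_icounter_struct; the helper names are hypothetical.

```c
#include <linux/serial.h>
#include <sys/ioctl.h>

/* Bytes the driver reports having written to the UART between two
 * snapshots. Unsigned subtraction keeps the result right across a wrap. */
unsigned int tx_delta(const struct serial_icounter_struct *before,
                      const struct serial_icounter_struct *after)
{
    return (unsigned int)after->tx - (unsigned int)before->tx;
}

/* Snapshot the per-port counters of an open tty.
 * Returns 0 on success, -1 if the ioctl fails. */
int read_icount(int fd, struct serial_icounter_struct *ic)
{
    return ioctl(fd, TIOCGICOUNT, ic) < 0 ? -1 : 0;
}
```

If tx_delta() matches the number of bytes written but the receiver counts fewer, the characters were lost after the driver loaded them into the UART.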

Thirdly, there is a possibility here that could be causing this, and
it surrounds the following code in transmit_chars() in serial.c:

    count = info->xmit_fifo_size;
    do {
        serial_out(info, UART_TX, info->xmit.buf[info->xmit.tail]);
        info->xmit.tail = (info->xmit.tail + 1) & (SERIAL_XMIT_SIZE-1);
        info->state->icount.tx++;
        if (info->xmit.head == info->xmit.tail)
            break;
    } while (--count > 0);

We always load a full FIFO-size chunk of data into the UART whenever
it says "hey, my transmit holding register is empty" since the FIFO
should be empty. I'm wondering if the ST16654 is giving an early
indication.

Could you try changing the first line in drivers/char/serial.c to:

    count = info->xmit_fifo_size / 2;

to find out whether that improves the situation? Don't worry, I don't
intend this as a fix. It's more to (dis-)prove the point.
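The reasoning behind the experiment can be sketched with a toy model. It assumes, purely for illustration, that bytes written into a full transmit FIFO are simply discarded; what the real chip does in that case is not established here. tx_fifo_lost() is a hypothetical helper, not driver code.

```c
/* Bytes lost if the driver writes `load` bytes into a transmit FIFO of
 * `fifo_size` bytes that still holds `residual` unsent bytes.
 * Toy model: anything beyond the FIFO capacity is assumed to be dropped. */
unsigned int tx_fifo_lost(unsigned int fifo_size, unsigned int residual,
                          unsigned int load)
{
    unsigned int total = residual + load;

    return total > fifo_size ? total - fifo_size : 0;
}
```

If the "empty" indication arrives while 8 bytes are still queued, loading a full 64 bytes loses 8 in this model, while loading 32 loses none. So if halving the count makes the drops disappear, that points at an early indication; if the drops persist, the cause is more likely elsewhere.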

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-09-19 18:32:31

by Alan Cox

Subject: Re: 2.4.18 serial drops characters with 16654

On Thu, 2002-09-19 at 19:24, Dan Christian wrote:
> This still isn't getting at the core problem. I'm sending data out the
> port and dropping characters. The receive works fine.
>
> It can't be a problem with the receiving device being over-run, since
> the 16550 works (even though it sends several bytes after CTS drops),
> and the 16654 doesn't (it stops after the current byte).

What happens if you setserial it to a 16450 ?
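For anyone unfamiliar with what that does under the hood: setserial changes the UART type the driver assumes for a port, roughly via the TIOCGSERIAL and TIOCSSERIAL ioctls. The sketch below shows the idea. set_uart_type() is a hypothetical helper, the call normally needs root, and the fallback value for PORT_16450 is what I believe the kernel headers define.

```c
#include <linux/serial.h>
#include <sys/ioctl.h>

#ifndef PORT_16450
#define PORT_16450 2   /* assumed value, as in the kernel serial headers */
#endif

/* Ask the driver to treat the port behind fd as the given UART type.
 * Returns 0 on success, -1 if either ioctl fails. */
int set_uart_type(int fd, int type)
{
    struct serial_struct ss;

    if (ioctl(fd, TIOCGSERIAL, &ss) < 0)
        return -1;
    ss.type = type;
    return ioctl(fd, TIOCSSERIAL, &ss) < 0 ? -1 : 0;
}
```

Calling set_uart_type(fd, PORT_16450) should make the driver stop using the FIFO, so the transmit path feeds the chip one byte at a time. If the drops stop in that mode, the FIFO handling becomes the prime suspect.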