2000-11-30 03:10:26

by Federico Grau

Subject: rocketport pci question... it stopped working after 250 days uptime


Hello Kernel people,

I am not subscribed to the kernel mailing list, so please cc responses to me.

We have several linux boxes using 8 port rocketport pci multiport serial
cards. Early last week 3 of them stopped working within a 24 hour period.
These three boxes had similar uptimes (since their last kernel rebuild): 249
days, 248 days, 250 days. Comparing the logs of each box, we saw that each
box's rocketport stopped working after approximately 248 days 16 hours uptime.

The rocketports are used both for dialin and also to simply take in streams of
data from an audio encoder. The symptoms from dialin were that mgetty could
not initialize the modems (function mg_init_data timed out when waiting for a
read from the tty). The symptoms from the data stream were simply that no
data stream came in... a simple cat on /dev/ttyR0 did not dump any data.

The problem was bypassed by warmbooting (simply running shutdown -r now) the
boxes.

The hardware is Intel Pentium Pro and Intel Pentium II CPUs on brand name
motherboards (Tyan). The rocketport cards are "Comtrol RocketPort 8 Oct (rev
4)". They are all running kernel version 2.2.14, with the stock rocketport
driver that comes with it. One of the boxes had the rocketport driver
compiled as a module... unloading and reloading the module had no effect on
the problem.



To check whether this bug had been corrected in later versions, I compared
the rocketport code (/usr/src/linux/drivers/char/rocket.c) in 2.2.14 with
2.2.17... the latter only had 2 additional lines calling
wake_up_interruptible(). The 2.2.14 kernel lists the rocketport driver
version as 1.14c. However, comparing it with the 1.14 version at
rocketport.sourceforge.net, it does not appear to match. Some parts are
similar, but I am not clear where they match and where they forked. Regardless,
I saw no mention of a bug like this being corrected in the HISTORY file that
came with version 1.22 from sourceforge.

I spent a couple of hours looking through the rocketport code in the kernel
and could not see what might cause this bug (I have some background with C but
not much with hacking the kernel). I am not even sure if the bug is in the
hardware, the rocketport driver, or possibly someplace else in the kernel
(doubtful, I imagine).


So, my questions are:

- has anyone heard of such a bug before?

- does anyone have any suggestions to find/correct the bug?

- We have another box with the rocketport cards which we expect to reach
this time limit in the next 48 hours (Dec 1st EST)... what kind of
debugging/analysis could I do to help track where the problem might be?




Thanks,
donfede



2000-11-30 04:51:47

by Alan

Subject: Re: rocketport pci question... it stopped working after 250 days uptime

> These three boxes had similar uptimes (since their last kernel rebuild): 249
> days, 248 days, 250 days. Comparing the logs of each box, we saw that each
> box's rocketport stopped working after approximately 248 days 16 hours uptime.
> So, my questions are:
> - has anyone heard of such a bug before?

Yes. Someone is doing signed maths on time stamps (2^31 ticks of 1/100th of a
second).

Ted ?


2000-11-30 04:53:47

by jdow

Subject: Re: rocketport pci question... it stopped working after 250 days uptime

From: "Federico Grau" <[email protected]>

> We have several linux boxes using 8 port rocketport pci multiport serial
> cards. Early last week 3 of them stopped working within a 24 hour period.
> These three boxes had similar uptimes (since their last kernel rebuild): 249
> days, 248 days, 250 days. Comparing the logs of each box, we saw that each
> box's rocketport stopped working after approximately 248 days 16 hours uptime.

If it was 248 days 13 hours 13 minutes 56.48 seconds, this represents a 32 bit
counter on a 5ms clock overflowing. I'd look for that in the RocketPort code,
although I remember Jeff remarking about something else failing at about the
same uptime.

{^_^} Joanne Dow, [email protected]


2000-12-02 09:49:27

by Federico Grau

Subject: Re: rocketport pci question... it stopped working after 250 days uptime

Ok,

I have another box which has had the same failure (rocketport serial ports
stopped working after 248 days and 16 hours). I have about 2 more hours before
I need to reboot the box and get it back into production.

I have both a working and non-working example, where I have re-compiled and
re-loaded the rocketport module with the debugging info turned on.


On my working example:
# when I load the module the init function gets called
int rp_init(void)

# when I cat from the tty device the open function gets called
static int rp_open(struct tty_struct *tty, struct file * filp)

# then somehow automagically the rp_handle_port() gets called by
# rp_do_poll()... and data gets read
void rp_handle_port(struct r_port *info)
static void rp_do_poll(void)

On my non-working example:
# the init and the open seem to happen fine, however the rp_do_poll() never
# gets called?!


I see that in rp_init(), rp_do_poll() gets saved into a "structure" called
timer_table, via "timer_table[COMTROL_TIMER].fn = rp_do_poll;", however I have
yet to find what that is.


Where else could I look to find the problem?

I am working with kernel 2.2.14 (/usr/src/linux/drivers/char/rocket.c), which
except for two lines is the same in 2.2.17. I am not yet subscribed to the
kernel list, so please cc responses to me.

Thanks,
donfede

On Thu, Nov 30, 2000 at 03:27:49AM +0000, Alan Cox wrote:
> > These three boxes had similar uptimes (since their last kernel rebuild): 249
> > days, 248 days, 250 days. Comparing the logs of each box, we saw that each
> > box's rocketport stopped working after approximately 248 days 16 hours uptime.
> > So, my questions are:
> > - has anyone heard of such a bug before?
>
> Yes. Someone is doing signed maths on time stamps (2^31 ticks of 1/100th of a
> second).
>
> Ted ?