2003-02-22 08:39:52

by Marc Haber

Subject: ethernet-ATM-Router freezing

Hi,

We use Linux to terminate ATM PVCs and to move the traffic coming in
on the ATM link to Ethernet. We have two parallel machines with
identical hardware (about two years old), in a Debian GNU/Linux setup
built in November 2002. The current kernel is 2.4.20-ac1, and the setup
hasn't been touched since December 2002. One of the machines does the
work, while the other is waiting to be activated in case of failure.
Max load is about 6 Mbit/s, and the machines are mostly idle. At least
I wouldn't call them loaded.

A few days ago, the "active" machine started freezing spontaneously.
The freezes don't happen at times of especially high or low load, and
they don't happen at the same time of day. The freezes are complete:
- no network response on the ATM link
- no network response on the Ethernet link
- no response at all on the system console
- no error or panic messages on the console
- no response to Magic SysRq
- atsar doesn't show any strange patterns in CPU/memory usage
- syslog doesn't show anything strange
- mrtg doesn't show anything strange in network load

The reset button is needed to revive the frozen box.

These freezes occur on the machine actually doing the work. If I move
the work to the other box, the freezes go with the work. Thus, I am
pretty confident that this is not faulty hardware. I don't believe
either that this is a kernel incompatibility, since the systems
in question had been working in this software configuration for two
months before the problems started.

Are there any known problems in the current ATM code that might cause
these freezes? Any other kernel versions I could try?

I am currently thinking about splitting the load between both boxes,
and downgrading one of them to a 2.4.19 or 2.4.18 kernel, and
upgrading the other one to a 2.4.21pre kernel. Have there been any
relevant changes to the ATM code recently?

Any hints will be appreciated. Thanks!

Cheers
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Karlsruhe, Germany | lose things." Winona Ryder | Fon: *49 721 966 32 15
Nordisch by Nature | How to make an American Quilt | Fax: *49 721 966 31 29


2003-02-22 10:24:11

by Francois Romieu

Subject: Re: ethernet-ATM-Router freezing

Marc Haber <[email protected]> :
[...]
> I am currently thinking about splitting the load between both boxes,
> and downgrading one of them to a 2.4.19 or 2.4.18 kernel, and
> upgrading the other one to a 2.4.21pre kernel. Have there been any
> relevant changes to the ATM code recently?

- What kind of hardware adapter do you have?
- Does the problem disappear with 2.4.20?

--
Ueimor

2003-02-22 11:15:46

by Marc Haber

Subject: Re: ethernet-ATM-Router freezing

On Sat, Feb 22, 2003 at 11:34:16AM +0100, [email protected] wrote:
> Marc Haber <[email protected]> :
> [...]
> > I am currently thinking about splitting the load between both boxes,
> > and downgrading one of them to a 2.4.19 or 2.4.18 kernel, and
> > upgrading the other one to a 2.4.21pre kernel. Have there been any
> > relevant changes to the ATM code recently?
>
> - What kind of hardware adapter do you have?

00:0f.0 ATM network controller: FORE Systems Inc PCA-200E
Flags: bus master, medium devsel, latency 64, IRQ 9
Memory at efa00000 (32-bit, non-prefetchable) [size=2M]
Expansion ROM at effe0000 [disabled] [size=8K]

> - Does the problem disappear with 2.4.20?

I will downgrade one of the boxes to 2.4.20 later today. This is going
to be a busy weekend anyway :-(

Greetings
Marc



2003-02-22 11:36:45

by Benjamin Herrenschmidt

Subject: Re: ethernet-ATM-Router freezing

On Sat, 2003-02-22 at 09:49, Marc Haber wrote:

> These freezes occur on the machine actually doing the work. If I move
> the work to the other box, the freezes go with the work. Thus, I am
> pretty confident that this is not faulty hardware. I don't believe
> either that this is a kernel incompatibility, since the systems
> in question had been working in this software configuration for two
> months before the problems started.

Your reasoning is wrong. It may well be a HW failure; those can be
load-related in various ways (memory failure happening when memory
is actually used, thermal failure happening under CPU load, etc.).

If the exact same setup worked for a while with same/similar loads
and suddenly started to fail, there is a good chance it's actually a
HW failure (possibly RAM).

Ben.

2003-02-22 11:49:21

by Francois Romieu

Subject: Re: ethernet-ATM-Router freezing

Marc Haber <[email protected]> :
[...]
> 00:0f.0 ATM network controller: FORE Systems Inc PCA-200E
> Flags: bus master, medium devsel, latency 64, IRQ 9
> Memory at efa00000 (32-bit, non-prefetchable) [size=2M]
> Expansion ROM at effe0000 [disabled] [size=8K]

Ok, 2.4.20-ac1 made me fear a "guess who modified the iphase driver"
answer :o)

[...]
> I will downgrade one of the boxes to 2.4.20 later today. This is going
> to be a busy weekend anyway :-(

Finding the first non-working kernel may help, but you really should
include some more data describing the setup (lspci -vv, lsmod and
others, as suggested in the REPORTING-BUGS file).
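
For illustration, the data asked for above could be gathered with a short
script along these lines. This is only a sketch: the command list holds the
two tools named here plus uname -a, which is an added assumption, and the
function names collect and format_report are made up for this example.

```python
import shutil
import subprocess

# lspci -vv and lsmod are the commands named in the thread; uname -a is an
# assumption, added only to record the running kernel version.
DEFAULT_COMMANDS = [
    ["uname", "-a"],
    ["lspci", "-vv"],
    ["lsmod"],
]


def collect(commands=None):
    """Run each command and return {command line: output or a short note}."""
    if commands is None:
        commands = DEFAULT_COMMANDS
    report = {}
    for argv in commands:
        key = " ".join(argv)
        # Skip tools that are not installed instead of raising an exception.
        if shutil.which(argv[0]) is None:
            report[key] = "(not installed)"
            continue
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        # Keep stderr when the command fails, so the failure is visible.
        report[key] = result.stdout if result.returncode == 0 else result.stderr
    return report


def format_report(report):
    """Render the collected output as plain text that can be pasted into a mail."""
    parts = []
    for key, output in report.items():
        parts.append("### " + key)
        parts.append(output.rstrip())
    return "\n".join(parts)

# Typical use on the affected box: print(format_report(collect()))
```

Running it on both the active and the standby box and comparing the two
reports would also show whether the machines really are configured
identically.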

Regards

--
Ueimor

2003-02-22 12:56:57

by Marc Haber

Subject: Re: ethernet-ATM-Router freezing

On Sat, Feb 22, 2003 at 12:48:46PM +0100, Benjamin Herrenschmidt wrote:
> Your reasoning is wrong. It may well be a HW failure; those can be
> load-related in various ways (memory failure happening when memory
> is actually used, thermal failure happening under CPU load, etc.).
>
> If the exact same setup worked for a while with same/similar loads
> and suddenly started to fail, there is a good chance it's actually a
> HW failure (possibly RAM).

So you think that we have had two machines going bad on us with the
same kind of failure within just a few days?

Greetings
Marc


2003-02-22 13:05:13

by Marc Haber

Subject: Re: ethernet-ATM-Router freezing

On Sat, Feb 22, 2003 at 12:59:23PM +0100, [email protected] wrote:
> > I will downgrade one of the boxes to 2.4.20 later today. This is going
> > to be a busy weekend anyway :-(
>
> Finding the first non-working kernel may help

The freeze happens about twice a week, so don't expect any results
today.
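
As a rough illustration of what "twice a week" means for testing: if the
freezes are assumed to arrive independently at a constant rate (a Poisson
process, which is an assumption and not something the logs confirm), the
freeze-free period needed before a kernel can be called good with some
confidence can be estimated as below. The function names are hypothetical.

```python
import math


def quiet_days_needed(freezes_per_week, confidence):
    """Freeze-free days needed before a kernel can be called good.

    Assumes freezes form a Poisson process, so the chance of seeing no
    freeze in t days on a still-broken kernel is exp(-rate_per_day * t).
    Solving exp(-rate_per_day * t) = 1 - confidence for t gives the result.
    """
    rate_per_day = freezes_per_week / 7.0
    return -math.log(1.0 - confidence) / rate_per_day


def chance_of_no_freeze(freezes_per_week, days):
    """Probability that a still-broken kernel survives `days` by luck alone."""
    return math.exp(-freezes_per_week / 7.0 * days)


print(round(quiet_days_needed(2, 0.95), 1))   # days to wait for 95% confidence
print(round(chance_of_no_freeze(2, 1), 2))    # chance of a quiet single day
```

Under that assumption, two freezes a week means roughly ten and a half
freeze-free days for 95% confidence, and a still-broken kernel has about a
75% chance of getting through any single day without freezing.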

> but you really should
> include some more data describing the setup (lspci -vv, lsmod and
> others, as suggested in the REPORTING-BUGS file).

If any information is still needed after looking at
http://q.bofh.de/~mh/stuff/typescript-april-failure, please ask.

Greetings
Marc


2003-02-22 16:23:23

by Douglas McNaught

Subject: Re: ethernet-ATM-Router freezing

Marc Haber <[email protected]> writes:

> On Sat, Feb 22, 2003 at 12:48:46PM +0100, Benjamin Herrenschmidt wrote:
> > If the exact same setup worked for a while with same/similar loads
> > and suddenly started to fail, there is a good chance it's actually a
> > HW failure (possibly RAM).
>
> So you think that we have had two machines going bad on us with the
> same kind of failure within just a few days?

If the hardware is from the same batch, it's not impossible. I've
heard stories of two identical drives bought at the same time failing
within hours of each other.

Not saying it *is* hardware, but it could be.

-Doug

2003-02-22 19:04:37

by Marc Haber

Subject: Re: ethernet-ATM-Router freezing

On Sat, Feb 22, 2003 at 11:33:26AM -0500, Doug McNaught wrote:
> Marc Haber <[email protected]> writes:
> > So you think that we have had two machines going bad on us with the
> > same kind of failure within just a few days?
>
> If the hardware is from the same batch, it's not impossible. I've
> heard stories of two identical drives bought at the same time failing
> within hours of each other.

I have seen this with disks, but not with memory, CPUs or boards.

However, I have now shared the load between both machines, and
downgraded one of the boxes to vanilla 2.4.19.

We'll see how they behave, and probably exchange the hardware. The
colocation rack is getting full anyway, and those ServeLinux 1U boxes
are nifty.

Greetings
Marc


2003-02-23 09:46:56

by Benjamin Herrenschmidt

Subject: Re: ethernet-ATM-Router freezing

On Sat, 2003-02-22 at 14:07, Marc Haber wrote:
> On Sat, Feb 22, 2003 at 12:48:46PM +0100, Benjamin Herrenschmidt wrote:
> > Your reasoning is wrong. It may well be a HW failure; those can be
> > load-related in various ways (memory failure happening when memory
> > is actually used, thermal failure happening under CPU load, etc.).
> >
> > If the exact same setup worked for a while with same/similar loads
> > and suddenly started to fail, there is a good chance it's actually a
> > HW failure (possibly RAM).
>
> So you think that we have had two machines going bad on us with the
> same kind of failure within just a few days?

Sorry, my fault, I misread your post and thought only one of the
boxes was freezing.

Ben.

2003-06-08 10:25:22

by Marc Haber

Subject: stability issues with 2.4.20-ac1 (was: ethernet-ATM-Router freezing)

On Sat, Feb 22, 2003 at 09:49:58AM +0100, Marc Haber wrote:
> We use Linux to terminate ATM PVCs and to move the traffic coming in
> on the ATM link to Ethernet. We have two parallel machines with
> identical hardware (about two years old), in a Debian GNU/Linux setup
> built in November 2002. The current kernel is 2.4.20-ac1, and the setup
> hasn't been touched since December 2002. One of the machines does the
> work, while the other is waiting to be activated in case of failure.
> Max load is about 6 Mbit/s, and the machines are mostly idle. At least
> I wouldn't call them loaded.
>
> A few days ago, the "active" machine started freezing spontaneously.
> The freezes don't happen at times of especially high or low load, and
> they don't happen at the same time of day. The freezes are complete:
> - no network response on the ATM link
> - no network response on the Ethernet link
> - no response at all on the system console
> - no error or panic messages on the console
> - no response to Magic SysRq
> - atsar doesn't show any strange patterns in CPU/memory usage
> - syslog doesn't show anything strange
> - mrtg doesn't show anything strange in network load
>
> The reset button is needed to revive the frozen box.

In the meantime, we have started experiencing the same behavior on
routers that have only Ethernet interfaces, on a web server, and on my
personal workstation. All machines run kernel 2.4.20-ac1, and
downgrading to 2.4.19 seems to solve the problem in all cases. I am now
pretty sure that this is not bad hardware.

I am now running my personal workstation with 2.4.21-rc7, but cannot
comment on that kernel at the moment.

Are there any known stability issues with 2.4.20-ac1?

Greetings
Marc
