2002-09-05 17:58:24

by Peter Surda

[permalink] [raw]
Subject: odd network freezing, reboot fixed

Dear kernel developers and other guys with deep linux knowledge,

I am posting here because I think this is an interesting story, and even if
noone can find a answer now, maybe this can serve as a reference for future.

I am not sure what information is relevant to the story, so I'll try to
mention anything I consider relevant.

The story is about what happened yesterday to one of our servers,
morpheus.panorama.sth.ac.at (picture here:
http://morpheus.panorama.sth.ac.at/morpheus.jpg). The machine is a P166/48MB
RAM, 2*13GB Quantum on Soft-Raid-1 (hda and hdc), 2*ne2kpci (Realtech 8029
chip), 3*parallel port, some isa VGA (unused, no X, no fbdev). The board is a
(according to lspci): Host bridge: Intel Corporation 430HX - 82439HX TXC
[Triton II] (rev 03). It runs on RH6.2 with latest updates, however, when the
story was happening, it wasn't running the latest kernel available from RH,
but 2.2.19-6.2.1. (the latest kernel, 2.2.19-6.2.16, was installed but there
was no reboot in between).

There are printers on all 3 LPTs, but only eth1 is connected to the network
(eth0 has no cable connected to it).

The machine runs mainly bind, apache, dhcpd, postgres, sendmail, imapd/ipop3d,
aprwatch, samba and lpd.

Sometime the day before yesterday, its uptime reached 497 days + something and
the uptime counter rolled over. I read about this limit before so I was
expecting that this happens. I know this may be totally unrelated to the whole
phenomenon, but I think I should mention it.

Suddenly, several hours after this rollover happened, following symptoms have
been happening for about 6 hours: for a couple of minutes, the network was
failing completely, and then after some more minutes, it magically "appeared"
again.

Description of what I mean with "failing completely": Active connections
froze, I couldn't even ping and tcpdump on a second computer (actually I tried
several) didn't show any packets from the network card. It may be that there
were SOME packets which a switched ethernet would filter, but at least then
there wouldn't be dozens of
arp who has morpheus tell machine1
arp who has morpheus tell machine1
arp who has morpheus tell machine1
arp who has morpheus tell machine1
arp who has morpheus tell machine1
arp who has morpheus tell machine1
arp who has morpheus tell machine1
arp who has morpheus tell machine1
, would there? (same with machine2 and ...., the name machine is just a bogus
one to demonstrate)

I wasn't present on site during the time, so I was instructing a colleague
over a phone what to do and was trying to ssh into it. He was sitting in front
of the machine and looking at it when this odd stuff was happning. There were
no suspicious physical activities. He also tried replugging the ether cable
(without any effect). There are several more computers connected to the same
switch, all in the same room, and the computers were working normally and
there was nobody playing with the switch.

After an hour I couldn't figure out WTF was happening, I told the guy to pull
the power plug (ctrl-alt-del is disabled, so are the reset and power buttons
and I didn't want to tell the root password over phone). The machine then
loaded the new kernel, after about 10 minutes of fscking started up and since
then there were no problems, the raid arrays took a couple of hours to resync
but didn't report any errors.

So, now to the details on what DIDN'T cause the problem. The machine collects
quite a lot of statistics, and you can check it out under
http://morpheus.panorama.sth.ac.at/mrtg/
(ignore the "ADSL" statistics, they don't have any mean anything useful).
The odd stuff was happening about between about 12:00 and 18:00 local time
yesterday.
As you see:
- no large network activity (the router statistics concur,
http://router.panorama.sth.ac.at/morpheus.html)
- no odd load peaks (those you see are caused by secondary mx flushing queue
when the connection was working temporarily).
- no overly swap usage
- no overly disk or CPU activity

Furthermore, I found out that
- there was plenty of disk space available
- ifconfig didn't show any errors (to be more precise the counters of
errors/overruns/carrier didn't change between the outages).
- there was noone logged on besides me (I also checked this interactively
during the short times the network connection was working).
- other than network outages, the machine was working as usual.

I also checked the logs for hardware failure or intrusion activity (actually I
check them regularly at least once a day) and didn't find anything odd. The
machine is IMHO secured pretty good: all packages updated (I don't think the
old kernel had remotely expoitable bugs), firewall both locally and on the
router, logcheck, viperdb. I don't claim it's 100% secure, just that a real
cracker probably wouldn't cause problems like this and a script kiddie
probably wouldn't get in at all, and if, (s)he'd be detected.

My conclusion is that the most probable cause is a bug in kernel, because I
think the other causes are less probable. Hardware problems wouldn't be fixed
by reboot and there would be something in logs. Userspace programs are
unlikely to cause this. Unauthorised access is also not very likely.

Now that the server works flawlessly again, I don't need to actually "repair"
this, but I think this should be researched in order to prevent it. If it was
you-know-what-operating-system, I'd find it normal that reboot fixes things,
but this is linux and it shouldn't be like this, "what, you mean you had to
reboot your server only after 497 days? that sucks." :-)

If needed I can provide full logs for the last 6 months and logcheck/viperdb
reports from roughly the last 2 years. I can also provide any additional
information if needed.

Please comment on this. I'd also like to ask you to cc me, as I'm not on the
list (I could look up the answers on some web archives but all I found take a
day to update).

Oh, and by "yesterday" I mean 4th September 2002, and by local time CEST.

Bye,

Peter Surda (Shurdeek) <[email protected]>, ICQ 10236103, +436505122023

--
It is easier to fix Unix than to live with NT.


Attachments:
(No filename) (6.02 kB)
(No filename) (232.00 B)
Download all attachments

2002-09-05 20:12:28

by Richard B. Johnson

[permalink] [raw]
Subject: Uptime timer-wrap

On Thu, 5 Sep 2002, Peter Surda wrote:

> Dear kernel developers and other guys with deep linux knowledge,
>
> I am posting here because I think this is an interesting story, and even if
> noone can find a answer now, maybe this can serve as a reference for future.
>
> I am not sure what information is relevant to the story, so I'll try to
> mention anything I consider relevant.
[SNIPPED timer-wrap story]

I tried to simulate your observation by making a driver that
set the 'jiffies' count upon an 'open'. The idea was to get
the jiffies count to something close to wrap so I didn't have to
wait a long time.

Anyway, I found that setting the jiffies count to more than a
few hundred counts into the future, causes the machine to halt
with no interrupts (no Capslock, no NumLock, no network ping, etc).

The machine just stops and I don't understand why.


spin_lock_irqsave(&xlock, flags);
jiffies += 0x1000;
spin_lock_irqrestore(&xlock, flags);

... works just fine, but, changing 0x1000 to 0x7fffffff causes
the machine to stop as reported.

Does anybody have a clue?


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.

2002-09-05 21:23:09

by Andreas Dilger

[permalink] [raw]
Subject: Re: Uptime timer-wrap

On Sep 05, 2002 16:16 -0400, Richard B. Johnson wrote:
> I tried to simulate your observation by making a driver that
> set the 'jiffies' count upon an 'open'. The idea was to get
> the jiffies count to something close to wrap so I didn't have to
> wait a long time.
>
> Anyway, I found that setting the jiffies count to more than a
> few hundred counts into the future, causes the machine to halt
> with no interrupts (no Capslock, no NumLock, no network ping, etc).
>
> The machine just stops and I don't understand why.
>
>
> spin_lock_irqsave(&xlock, flags);
> jiffies += 0x1000;
> spin_lock_irqrestore(&xlock, flags);
>
> ... works just fine, but, changing 0x1000 to 0x7fffffff causes
> the machine to stop as reported.
>
> Does anybody have a clue?

Yes, because now some kernel code is going to wait 147 days - 1s
or something like that to finish.

See Tim Schmielau <[email protected]> patch for testing jiffies
wrap. It _initializes_ jiffies to a high pre-wrap value, maybe 5
minutes before wrap, instead of playing around with the jiffies value
after the system is running.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-09-05 21:39:54

by Peter Surda

[permalink] [raw]
Subject: Re: Uptime timer-wrap

On Thu, Sep 05, 2002 at 04:16:52PM -0400, Richard B. Johnson wrote:
> I tried to simulate your observation by making a driver that
> set the 'jiffies' count upon an 'open'. The idea was to get
> the jiffies count to something close to wrap so I didn't have to
> wait a long time.
>
> Anyway, I found that setting the jiffies count to more than a
> few hundred counts into the future, causes the machine to halt
> with no interrupts (no Capslock, no NumLock, no network ping, etc).
I noticed I perhaps didn't describe the symptoms precisely. It didn't stop
working permanently, the network was going up and down unpredictably, in
intervals of about 5 to 20 minutes. Perhaps there was a pattern, but I hope
you'll excuse me for mot measuring the exact times as the main point was to
get it running, not to write a scientific report about it :-).

If it wasn't fixed by a reboot I'd say it was an odd hardware problem.

Bye,

Peter Surda (Shurdeek) <[email protected]>, ICQ 10236103, +436505122023

--
Where do you think you're going today?


Attachments:
(No filename) (1.04 kB)
(No filename) (232.00 B)
Download all attachments

2002-09-06 12:00:39

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Uptime timer-wrap

On Thu, 5 Sep 2002, Andreas Dilger wrote:

> On Sep 05, 2002 16:16 -0400, Richard B. Johnson wrote:
> > I tried to simulate your observation by making a driver that
> > set the 'jiffies' count upon an 'open'. The idea was to get
> > the jiffies count to something close to wrap so I didn't have to
> > wait a long time.
> >
> > Anyway, I found that setting the jiffies count to more than a
> > few hundred counts into the future, causes the machine to halt
> > with no interrupts (no Capslock, no NumLock, no network ping, etc).
> >
> > The machine just stops and I don't understand why.
> >
> >
> > spin_lock_irqsave(&xlock, flags);
> > jiffies += 0x1000;
> > spin_lock_irqrestore(&xlock, flags);
> >
> > ... works just fine, but, changing 0x1000 to 0x7fffffff causes
> > the machine to stop as reported.
> >
> > Does anybody have a clue?
>
> Yes, because now some kernel code is going to wait 147 days - 1s
> or something like that to finish.
>

Yes. And isn't this a bug? I fear some code is waiting for some
hardware timer value with the interrupts disabled. Since the CPU(s)
are not the only user's of I/O (bus-masters), they may miss completely
whatever they are spinning for. This could be the reason many
machines simply "stop", could it not?

Do you have an idea where to look? I need to prevent the possibility
of waiting forever for an event that may never occur, with interrupts
disabled, on at least one embedded system. Any wait-forever possibility
must be interruptible because any watch-dog timer that re-boots will end
up destroying data that must never be lost.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.

2002-09-06 12:26:40

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: Uptime timer-wrap

On Fri, 6 Sep 2002, Richard B. Johnson wrote:

> Do you have an idea where to look? I need to prevent the possibility
> of waiting forever for an event that may never occur, with interrupts
> disabled, on at least one embedded system. Any wait-forever possibility
> must be interruptible because any watch-dog timer that re-boots will end
> up destroying data that must never be lost.

Remove the 'wait forever' (really if you're waiting forever you have a
bug anyway, be it hardware or otherwise) and break the entire kernel? Or
perhaps sprinkle timeouts in every little crevice of the kernel code.

Zwane

--
function.linuxpower.ca

2002-09-06 12:35:06

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Uptime timer-wrap

On Fri, 6 Sep 2002, Zwane Mwaikambo wrote:

> On Fri, 6 Sep 2002, Richard B. Johnson wrote:
>
> > Do you have an idea where to look? I need to prevent the possibility
> > of waiting forever for an event that may never occur, with interrupts
> > disabled, on at least one embedded system. Any wait-forever possibility
> > must be interruptible because any watch-dog timer that re-boots will end
> > up destroying data that must never be lost.
>
> Remove the 'wait forever' (really if you're waiting forever you have a
> bug anyway, be it hardware or otherwise) and break the entire kernel? Or
> perhaps sprinkle timeouts in every little crevice of the kernel code.
>
> Zwane

The kernel is waiting forever as previously shown.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.