2002-01-23 08:23:32

by H. Peter Anvin

[permalink] [raw]
Subject: kernel.org: setting the record straight

It has come to my attention there are some unfounded rumours about the
kernel.org outage from this past weekend, and I wanted to set the
record straight:

In particular, the outage was *not* caused by any kind of failure in
the new Compaq server hardware. It worries me a bit that this
particular rumour was circulating, since it reflects badly on a donor
who has just provided us with a very nice machine.

The approximate history of the failure is as such:

Back in December, we were already planning to replace the old server
hardware after repeated problems, probably age-related. We were
originally planning to put the new server in production after the
holidays, however, when the old server really started to flake out on
us we decided to push it in service early.

ISC was very accomodating and arranged for us to put it in service on
short notice, despite several logistics problem. One of those
problems was the lack of a configured port for the management card in
the new server. As a result, it did not get wired up at that time.

This past Friday morning, the kernel on the server apparently stopped
servicing user-space processes; the details aren't known, other than
the fact that pings and TCP SYNs received replies, but we couldn't get
any actual data across. Futhermore, on and off there was as much as
95% packet loss in pings, although the machine didn't stop responding
to pings until it was power cycled on Sunday.

Due to a miscommunication between myself and the staff at ISC we
weren't able to get the machine power cycled until Sunday (it did not
return from the power cycle) and the management port connected until
Monday. Once the management port got connected, it was a 5-minute job
to bring the machine back to life.

Finally, I would like to thank the many people and organizations who
have offered to host another kernel.org server in one way or another.
I'm going to be evaluating our options, with the goal of getting at
least one additional server if at all possible. If so, my preference
will definitely be to try to obtain identical hardware with the one we
currently have.

-=hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>


2002-01-23 16:30:27

by Daniel Phillips

[permalink] [raw]
Subject: Re: kernel.org: setting the record straight

On January 23, 2002 09:23 am, H. Peter Anvin wrote:
> This past Friday morning, the kernel on the server apparently stopped
> servicing user-space processes; the details aren't known, other than
> the fact that pings and TCP SYNs received replies, but we couldn't get
> any actual data across. Futhermore, on and off there was as much as
> 95% packet loss in pings, although the machine didn't stop responding
> to pings until it was power cycled on Sunday.

Which kernel was it?

--
Daniel

2002-01-23 17:29:29

by H. Peter Anvin

[permalink] [raw]
Subject: Re: kernel.org: setting the record straight

Daniel Phillips wrote:

>
> Which kernel was it?
>


2.4.16

-hpa