2002-01-30 02:32:35

by H. Peter Anvin

Subject: master.kernel.org situation update

The situation on master.kernel.org is looking pretty grim. We were
trying to add disk capacity (the host uses a DAC960PRL RAID
controller) and the end result seems to be that a Mylex utility called
"ezsetup" has completely trashed the RAID configuration information.
What makes matters worse is that an MIS screwup here means no backups
have been running for a month or so.

Clearly, the archive section of the system is mirrored, and therefore
recoverable, but there are lots of scripts and anything else that
involves automation that may be lost if we cannot recover this data.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt


2002-01-30 03:11:48

by Larry McVoy

Subject: Re: master.kernel.org situation update

On Tue, Jan 29, 2002 at 06:31:32PM -0800, H. Peter Anvin wrote:
> The situation on master.kernel.org is looking pretty grim. We were
> trying to add disk capacity (the host uses a DAC960PRL RAID
> controller) and the end result seems to be that a Mylex utility called
> "ezsetup" has completely trashed the RAID configuration information.
> What makes matters worse is that an MIS screwup here means no backups
> have been running for a month or so.

This doesn't help you now, but what you just hit is why we take a
different approach to backups. For any data we care about, we
stick in a 3ware 6410 controller, run it in JBOD mode, and have
4 drives mounted as
/home
/nightly
/weekly
/monthly
and we copy all the data once a day to the appropriate spot. On top
of that, we run the gzip checksum over all the data and save a database
of
pathname, size, mtime, chksum

tuples. For every entry where path, size, and mtime match, we compare
the chksum, which had better be the same; otherwise the disk, filesystem,
or memory has corrupted your data. That way we get warned before all the
backups are gone.
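
The check described above can be sketched roughly as follows. This is not the actual BitMover script. It assumes "the gzip checksum" means the CRC-32 that gzip stores in its trailer, and it assumes the comparison is made between two snapshots of the same tree. The names crc32_of, snapshot, and silent_corruption are made up for illustration.

```python
import os
import zlib


def crc32_of(path, bufsize=1 << 20):
    """CRC-32 of a file's contents, read in chunks (assumed stand-in for 'gzip checksum')."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def snapshot(root):
    """Map each pathname (relative to root) to a (size, mtime, chksum) tuple."""
    db = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            st = os.stat(full)
            rel = os.path.relpath(full, root)
            db[rel] = (st.st_size, int(st.st_mtime), crc32_of(full))
    return db


def silent_corruption(old, new):
    """Paths whose size and mtime match in both snapshots but whose chksum differs."""
    bad = []
    for path, (size, mtime, chksum) in new.items():
        prev = old.get(path)
        if prev is None:
            continue  # new file, nothing to compare against
        if prev[0] == size and prev[1] == mtime and prev[2] != chksum:
            bad.append(path)
    return sorted(bad)
```

A file that was legitimately edited gets a new mtime and is skipped by the comparison; only files that claim to be unchanged but whose contents differ are reported.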

Using a RAID is a losing proposition - it means you still have exactly
one copy of the data and no way to verify that it is correct. The RAID
just does what the fs/block layer tells it to do and if the upper layers
are handing down bad data, the RAID will faithfully store that bad data.
And you never know until you need it.

The other nice thing about the 4 way mirror is that you can do stuff like

diff foo.c /nightly/$PWD

and try to figure out what you were smoking when you made that change
before the coffee started working.
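
For anyone who wants the same lookup from a script, here is a rough sketch. It assumes a POSIX system and assumes the mirror keeps each file under its full original path (so /home/lm/foo.c is copied to /nightly/home/lm/foo.c), which is what the /nightly/$PWD form implies. The helpers mirror_path and diff_against_mirror are hypothetical names.

```python
import difflib
import os


def mirror_path(path, mirror_root="/nightly"):
    """Where the mirrored copy of path is assumed to live under mirror_root."""
    absolute = os.path.abspath(path)
    # Drop the leading separator so join() does not discard mirror_root.
    return os.path.join(mirror_root, absolute.lstrip(os.sep))


def diff_against_mirror(path, mirror_root="/nightly"):
    """Unified diff from the mirrored copy (old) to the working file (new)."""
    backup = mirror_path(path, mirror_root)
    with open(backup) as f:
        old_lines = f.readlines()
    with open(path) as f:
        new_lines = f.readlines()
    return "".join(
        difflib.unified_diff(old_lines, new_lines, fromfile=backup, tofile=path)
    )
```

Note that this prints the change from last night's copy to the current file, which is the reverse of the argument order in the diff command above.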

I know this doesn't help right now and is probably unwelcome advice, but
I'd encourage you to consider this approach in the future. It's brute
force but has huge advantages over tapes, RAID, etc. You'd be back on
line right now, albeit with maybe 12 hour old data, if you had this.
We have all our scripts in a BitKeeper repository and I'll happily give
them to you if you want them.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-01-30 03:18:18

by H. Peter Anvin

Subject: Re: master.kernel.org situation update

Followup to: <[email protected]>
By author: Larry McVoy <[email protected]>
In newsgroup: linux.dev.kernel
>
> This doesn't help you now...
>

I have gotten several of this type of "helpful advice"... please lay
off. I really don't need this right now.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt