From: Nick Warne <nick@linicks.net>
To: linux-kernel@vger.kernel.org
Subject: Black Friday
Date: Fri, 15 Oct 2004 20:09:49 +0100
User-Agent: KMail/1.7
MIME-Version: 1.0
Content-Type: text/plain;
  charset="utf-8"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200410152009.49873.nick@linicks.net>
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2793
Lines: 61

Hi all,

Here is a story that happened to me today, and a sightly warning for any 
sysadmins that read here.

At 9:50 am on the 1st Oct (2 weeks ago), main file server at work crashed big 
time (a Snapserver 4100).  It trashed the RAID (240GB - 170GB usable [50GB 
free]), but the OS (BSD based, I believe) is pretty good and rebuilt it - 
took 8 hours - no data lost out of 120+ GB

I also have a second Snapserver that runs Quantums own synchronisation 
software, so that at any point it time, server 'b' is an exact copy of server 
'a' - this I set to sync at 1:00 am each day.  The idea being you can 
recover/swap from server to server real time.

The first crash proved problems I never thought off in disaster recover 
options.  The Snapserver synchronisation software doesn't 'sync' directory 
share nor file permissions - just the actual binary data.  So the 'copy' 
server 'b' is not as is server 'a'.

OK, so since that 'black Friday' two weeks ago, I hacked a way to get the file 
permissions from box ''a' to box 'b' replicated manually so at least a quick 
swap over from box 'a' to box 'b' would be possible and the change over for 
the users would be invisible and all file/share permissions are correct.

Today at 9:50 Snapserver box 'a' crashed again.  I suspected that it was dying 
now, and the better move would be to push everybody to the back up box 'b' 
and replace the dodgy box 'a' until I could replace it.

Except box 'b' was AWOL as well (I didn't scream exactly...).  That crashed 
too at 9:50 (yes, in sync with box 'a').  The 'sync' software is bloody 
good ;)

After a lengthy discussion with Snap engineers, it turns out the OS does a 
KERNEL PANIC after a certain number of file opens/shares get accessed on the 
version OS I was running.  It only does this once the server reaches the 
point of whatever threshhold causes it - i.e. a growing, expanding fileserver 
- they didn't really tell me, nor elaborate.

But because two weeks ago, only one server crashed, I put it down to the 
gremlins... but as 'a' had not sync'ed with 'b' yet, that is why only one 
crashed then, and as both are sync'ed since, today both of them.

Two weeks later, both have the same threshhold to cause the kernel crash.

Snap guys gave me new OS firmware to flash - I have done one server, the main 
server  tomorrow.

The gist of this mail?  Never, ever think you are safe.

Nick
P.S still have DLT backup anyway - but users want to USE NOW not wait :(

-- 
"When you're chewing on life's gristle,
Don't grumble, Give a whistle..."
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/