2003-06-12 15:42:41

by Mike Dresser

Subject: 3ware and two drive hardware raid1

If I have a hardware raid1 array of two 120 gig Maxtor DiamondMax 9 drives
on a 3ware 7000-2, failure of one disk should not go all the way up to
the OS and cause the OS to report hard errors and remount the drive as
read-only, right?

My understanding of raid1 was that if there was a disk failure it would
note it, mark the drive as bad, and switch to running off the other drive.
Software raid on Linux does that.

This certainly isn't that!

Jun 12 04:00:00 x kernel: 3w-xxxx: scsi1: Command failed: status = 0xc7, flags = 0x40, unit #0.
Jun 12 04:00:25 x last message repeated 4 times
Jun 12 04:00:25 x kernel: scsi1: ERROR on channel 0, id 0, lun 0, CDB: 0x28 00 00 86 b8 aa 00 00 08 00
Jun 12 04:00:25 x kernel: Info fld=0x0, Current sd08:06: sns = f0 3
Jun 12 04:00:25 x kernel: ASC=11 ASCQ= 0
Jun 12 04:00:25 x kernel: Raw sense data:0xf0 0x00 0x03 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 0x00 0x00 0x11 0x00 0x00 0x00 0x00 0x00
Jun 12 04:00:25 x kernel: I/O error: dev 08:06, sector 41480
Jun 12 04:00:25 x kernel: journal_bmap: journal block not found at offset 5132 on sd(8,6)
Jun 12 04:00:25 x kernel: Aborting journal on device sd(8,6).
Jun 12 04:00:29 x kernel: ext3_abort called.
Jun 12 04:00:29 x kernel: EXT3-fs abort (device sd(8,6)): ext3_journal_start: Detected aborted journal
Jun 12 04:00:29 x kernel: Remounting filesystem read-only
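
For readers decoding the trace above, here is a minimal Python sketch that
unpacks the CDB and raw sense bytes exactly as logged. It assumes the standard
READ(10) and fixed-format sense layouts; the function names and the tiny lookup
tables are illustrative and cover only the values that appear in this log.

```python
# Decode the READ(10) CDB and fixed-format sense data quoted in the log.
# The lookup tables are deliberately tiny: only the values seen in this trace.
SENSE_KEYS = {0x3: "MEDIUM ERROR"}
ASC_ASCQ = {(0x11, 0x00): "UNRECOVERED READ ERROR"}


def parse_hex_bytes(text):
    """Turn '0x28 00 00 86' or '0xf0 0x00 0x03' into a list of ints."""
    return [int(token, 16) for token in text.split()]


def decode_read10(cdb):
    """Extract the starting LBA and block count from a 10-byte READ CDB."""
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    return {
        "lba": int.from_bytes(bytes(cdb[2:6]), "big"),
        "blocks": int.from_bytes(bytes(cdb[7:9]), "big"),
    }


def decode_fixed_sense(sense):
    """Extract sense key and ASC/ASCQ from fixed-format sense data."""
    if sense[0] & 0x7F not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = sense[2] & 0x0F
    asc, ascq = sense[12], sense[13]
    return {
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "additional": ASC_ASCQ.get((asc, ascq), f"ASC={asc:#04x} ASCQ={ascq:#04x}"),
    }


cdb = parse_hex_bytes("0x28 00 00 86 b8 aa 00 00 08 00")
sense = parse_hex_bytes(
    "0xf0 0x00 0x03 0x00 0x00 0x00 0x00 0x0a 0x00 "
    "0x00 0x00 0x00 0x11 0x00 0x00 0x00 0x00 0x00"
)
print(decode_read10(cdb))
print(decode_fixed_sense(sense))
```

The CDB decodes to a READ(10) of 8 blocks starting at LBA 8829098 (0x86b8aa),
and the sense data to sense key 3 (medium error) with ASC/ASCQ 11/00
(unrecovered read error).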

I'll be out at the facility tomorrow to replace the dead drive (appears to
be unit #0), but am extremely curious why the 3ware unit did what it did!

I'm running badblocks on the partition that was mounted read-only to see
if the filesystem is corrupted. (Luckily it's all .tar files, so any
corruption will hopefully be easy to see.)
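
One way to check the .tar files themselves is to read every member of each
archive to the end and see which ones fail. The following is a rough sketch
using Python's standard tarfile module; check_tar is a hypothetical helper
name. Note that tar only checksums headers, so this catches unreadable or
truncated archives, not silently altered file contents.

```python
import tarfile


def check_tar(path, chunk_size=1 << 20):
    """Return None if every member can be read in full, else an error string."""
    try:
        with tarfile.open(path) as tar:
            for member in tar:
                if not member.isfile():
                    continue
                stream = tar.extractfile(member)
                # Read and discard the data; a short or failed read raises.
                while stream.read(chunk_size):
                    pass
    except (tarfile.TarError, OSError, EOFError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def report(paths):
    """Print one status line per archive and return the list of bad paths."""
    bad = []
    for path in paths:
        problem = check_tar(path)
        print(f"{path}: {problem or 'readable'}")
        if problem:
            bad.append(path)
    return bad
```

Calling report() with the list of archive paths on the damaged partition would
give a short list of files to restore from elsewhere.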

Running kernel 2.4.20 on Debian Stable.

Mike


2003-06-13 08:51:51

by Alan

Subject: Re: 3ware and two drive hardware raid1

On Thu, 2003-06-12 at 16:56, Mike Dresser wrote:
> If I have a hardware raid1 array of two 120 gig Maxtor DiamondMax 9 drives
> on a 3ware 7000-2, failure of one disk should not go all the way up to
> the OS and cause the OS to report hard errors and remount the drive as
> read-only, right?

Yes, but that won't help you if you lost both drives, which does happen
now and again - overheating, bad PSU, using two drives from the same
batch together and so on.

The trace looks like you may have lost both drives.

Alan

2003-06-13 14:07:03

by Mike Dresser

Subject: Re: 3ware and two drive hardware raid1

On 13 Jun 2003, Alan Cox wrote:

> On Thu, 2003-06-12 at 16:56, Mike Dresser wrote:
> > If I have a hardware raid1 array of two 120 gig Maxtor DiamondMax 9 drives
> > on a 3ware 7000-2, failure of one disk should not go all the way up to
> > the OS and cause the OS to report hard errors and remount the drive as
> > read-only, right?
>
> Yes, but that won't help you if you lost both drives, which does happen
> now and again - overheating, bad PSU, using two drives from the same
> batch together and so on.
>
> The trace looks like you may have lost both drives.
>
> Alan
>

I'm heading out there today to take a look at the machine and see what
happened. I'm rather disappointed in the 3ware utility: it is inconsistent
about whether both drives are ok (./tw_cli info c1 reports something
different from ./tw_cli info c1 u0).

I was relying on that too much, and ignored the possibility of a two-drive
failure. It looks like both drives would have failed at exactly the same
time, which sounds like a power spike.

I just got a report that a Windows 98 workstation at the same facility is
randomly rebooting after an hour of uptime, so I'm suspecting a hardware
failure, as you did.

I'll see what's up when I get there. Powermax will tell me what's up.

Luckily the damage is contained to the data drive, so I will be able to copy
everything over to a new set of drives and not lose anything that's not
trivially replaceable.

Thank you Alan,

Mike

2003-06-13 16:09:04

by David Rees

Subject: Re: 3ware and two drive hardware raid1

Mike Dresser said:
>
> I'm heading out there today to take a look at the machine and see what
> happened. I'm rather disappointed in the 3ware utility: it is inconsistent
> about whether both drives are ok (./tw_cli info c1 reports something
> different from ./tw_cli info c1 u0).
>
> I was relying on that too much, and ignored the possibility of a two-drive
> failure. It looks like both drives would have failed at exactly the same
> time, which sounds like a power spike.

On the 3ware boxes I use, I set up the 3DM utility to run weekly scans of
the unit to look for bad blocks. Do you do the same thing? I've had the
scan turn up bad disks before.

-Dave


2003-06-13 17:19:41

by Stephan von Krawczynski

Subject: Re: 3ware and two drive hardware raid1

On Fri, 13 Jun 2003 09:23:01 -0700 (PDT)
"David Rees" wrote:

> Mike Dresser said:
> >
> > I'm heading out there today to take a look at the machine and see what
> > happened. I'm rather disappointed in the 3ware utility: it is inconsistent
> > about whether both drives are ok (./tw_cli info c1 reports something
> > different from ./tw_cli info c1 u0).
> >
> > I was relying on that too much, and ignored the possibility of a two-drive
> > failure. It looks like both drives would have failed at exactly the same
> > time, which sounds like a power spike.
>
> On the 3ware boxes I use, I set up the 3DM utility to run weekly scans of
> the unit to look for bad blocks. Do you do the same thing? I've had the
> scan turn up bad disks before.
>
> -Dave

I can confirm that the 3dm daemon is very handy. The media scan in particular
is highly recommended, as it finds problems in areas that hold no production
data yet, so there is a good chance of replacing a drive before it actually fails.

Regards,
Stephan

2003-06-13 17:59:00

by Mike Dresser

Subject: Re: 3ware and two drive hardware raid1

On Fri, 13 Jun 2003, David Rees wrote:

> On the 3ware boxes I use, I set up the 3DM utility to run weekly scans of
> the unit to look for bad blocks. Do you do the same thing? I've had the
> scan turn up bad disks before.

Too much bloat in the 3dm utility; I use the tw_cli util.

I'm on-site right now, and it turns out it's a double drive failure.

One drive is dead to the point of not even being detected, and the other
is damaged enough that powermax was having difficulty running.

Fortunately the system files are completely intact, and it's just a matter
of seeing what backup files are damaged and working around them.

I'll be taking it up with 3ware about why their utility falsely
reported/didn't report the drive failures; according to it, the second
drive was just fine.
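
Since the controller utility did not flag the failure here, one option is to
watch the kernel log as a second signal. A rough Python sketch, assuming the
message formats shown in the log earlier in this thread; the pattern set and
the scan_log name are illustrative, not anything 3ware ships.

```python
import re

# Patterns copied from the messages quoted earlier in this thread.
PATTERNS = {
    "controller_error": re.compile(r"3w-xxxx: scsi\d+: Command failed"),
    "io_error": re.compile(r"I/O error: dev [0-9a-f]{2}:[0-9a-f]{2}, sector \d+"),
    "journal_abort": re.compile(r"Aborting journal on device"),
    "remount_ro": re.compile(r"Remounting filesystem read-only"),
}


def scan_log(lines):
    """Yield (label, line) for each line matching a known trouble pattern."""
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                yield label, line.rstrip("\n")
                break


sample = [
    "Jun 12 04:00:00 x kernel: 3w-xxxx: scsi1: Command failed: "
    "status = 0xc7, flags = 0x40, unit #0.",
    "Jun 12 04:00:25 x last message repeated 4 times",
    "Jun 12 04:00:25 x kernel: I/O error: dev 08:06, sector 41480",
    "Jun 12 04:00:29 x kernel: Remounting filesystem read-only",
]
for label, line in scan_log(sample):
    print(label, "|", line)
```

In practice the lines would come from an open syslog file rather than a list,
and any hit would be mailed or paged out instead of printed.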

Mike

2003-06-13 18:00:59

by Mike Dresser

Subject: Re: 3ware and two drive hardware raid1

On Fri, 13 Jun 2003, Stephan von Krawczynski wrote:

> I can confirm that the 3dm daemon is very handy. The media scan in particular
> is highly recommended, as it finds problems in areas that hold no production
> data yet, so there is a good chance of replacing a drive before it actually fails.

tw_cli has a similar function: you can run maint verify c# u#.
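
A weekly verify like the scans David describes could be driven from cron with
a small wrapper. This is only a sketch: the maint verify c# u# syntax is taken
from this thread, while the binary path, the controller and unit numbers, and
the meaning of the exit status are assumptions to check against the tw_cli
documentation.

```python
import subprocess


def verify_command(controller, unit, tw_cli="./tw_cli"):
    """Build the argument list for 'tw_cli maint verify c# u#'."""
    return [tw_cli, "maint", "verify", f"c{controller}", f"u{unit}"]


def start_verify(controller, unit, tw_cli="./tw_cli"):
    """Run the verify command and return (exit status, combined output).

    How tw_cli reports success or failure is not covered in this thread,
    so the caller should log the output rather than trust the status alone.
    """
    result = subprocess.run(
        verify_command(controller, unit, tw_cli),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode, result.stdout + result.stderr


# Example: the command that would be run for controller 1, unit 0.
print(" ".join(verify_command(1, 0)))
```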

Mike

2003-06-13 20:32:40

by David Rees

Subject: Re: 3ware and two drive hardware raid1

Mike Dresser said:
> On Fri, 13 Jun 2003, Stephan von Krawczynski wrote:
>> I can confirm that the 3dm daemon is very handy. The media scan in
>> particular is highly recommended, as it finds problems in areas that
>> hold no production data yet, so there is a good chance of replacing a
>> drive before it actually fails.
>
> tw_cli has a similar function: you can run maint verify c# u#.

Were you using it to confirm the status of your disks?

-Dave


2003-06-13 20:46:49

by Mike Dresser

Subject: Re: 3ware and two drive hardware raid1

On Fri, 13 Jun 2003, David Rees wrote:

> Mike Dresser said:
> > On Fri, 13 Jun 2003, Stephan von Krawczynski wrote:
> >> I can confirm that the 3dm daemon is very handy. The media scan in
> >> particular is highly recommended, as it finds problems in areas that
> >> hold no production data yet, so there is a good chance of replacing a
> >> drive before it actually fails.
> >
> > tw_cli has a similar function: you can run maint verify c# u#.
>
> Were you using it to confirm the status of your disks?

Yes.

The other Windows PC on-site that I was working on was indeed rebooting
randomly. When I investigated, it turned out to be the battery backup:
unplug the battery backup, and a few seconds later (maybe 10) the
computer would reboot.

Based on that damaged battery backup, and the fact that it just started
doing it yesterday morning, I'd say we had a lightning hit on the two
buildings, especially since we had a storm that night.

Both drives in the backup server were damaged: one was completely dead,
the other was barely running. I was able to copy off all but about 10
files, all of which were easily replaceable anyway.

The drives were functional before the lightning hit.

Mike