2009-06-29 00:38:09

by Krzysztof Kosiński

Subject: Massive corruption on RAID0

Hello

Here is my story: I recently migrated a server from Windows to Ubuntu
9.04. I formatted all disks with ext4. The server has 5 disks: three
SCSI (9GB for /, 2x18GB for /data/small and /home) and two IDE
(2x300GB). I put the IDE disks in a RAID0: each had a single partition
with type set to "fd", and the entire resulting device (/dev/md0) was
formatted with ext4 as:

mkfs.ext4 -b 4096 -E stride=16 /dev/md0
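
(For reference: stride=16 with a 4 KiB block size corresponds to a 64 KiB
md chunk, so on a two-disk RAID0 the full geometry hint would, I believe,
be something like

mkfs.ext4 -b 4096 -E stride=16,stripe_width=32 /dev/md0

but I only passed the stride.)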

All was well until a power outage that left the filesystem on /dev/md0
unmountable (the others were fine after an fsck). I made a backup of
the corrupted array to another disk and ran fsck, but it ended up in
an infinite loop. After some unsuccessful tinkering, I restored the
backup and found out that large portions of the group descriptor table
are filled with random junk. Moreover, all backup superblocks are
either corrupted or zeroed, and I found partial copies of an
identically corrupted table at various weird offsets (including
176888, 600344, 1036536, 1462520, 1887256 and 5326832); none of
these copies was preceded by anything resembling a superblock. Here
is a copy of the first 39 blocks of the corrupted disk:
http://tweenk.artfx.pl/super.bin

The group descriptors that are not filled with random junk all follow
a simple pattern, but filling the group descriptor table with an
extension of it didn't yield anything interesting. A smartctl check
revealed that one of the disks forming the array has
Reallocated_Sector_Ct = 17 and Reallocated_Event_Count = 1
(coincidentally, 17 is also the number of backup superblocks this
device should have), the other has zero; however, this doesn't explain
the massive corruption.
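
In case it helps to reproduce my checks: the locations where the backup
superblocks should live can be listed with a dry-run mkfs, and e2fsck can
be pointed at one of them explicitly, e.g. (32768 is simply the first
backup location for a 4 KiB-block filesystem):

mkfs.ext4 -n -b 4096 -E stride=16 /dev/md0   # -n only prints, writes nothing
dumpe2fs /dev/md0 | grep -i superblock
e2fsck -B 4096 -b 32768 /dev/md0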

Regards, Krzysztof Kosiński

PS Please CC replies to me as I'm not subscribed


2009-06-29 03:30:39

by Eric Sandeen

Subject: Re: Massive corruption on RAID0

Krzysztof Kosiński wrote:
> Hello
>
> Here is my story: I recently migrated a server from Windows to Ubuntu
> 9.04. I formatted all disks with ext4. The server has 5 disks: three
> SCSI (9GB for /, 2x18GB for /data/small and /home) and two IDE
> (2x300GB). I put the IDE disks in a RAID0: each had a single partition
> with type set to "fd", and the entire resulting device (/dev/md0) was
> formatted with ext4 as:
>
> mkfs.ext4 -b 4096 -E stride=16 /dev/md0
>
> All was well until a power outage that left the filesystem on /dev/md0
> unmountable (the others were fine after an fsck). I made a backup of
> the corrupted array to another disk and ran fsck, but it ended up in
> an infinite loop. After some unsuccessful tinkering, I restored the
> backup and found out that large portions of the group descriptor table
> are filled with random junk. Moreover, all backup superblocks are
> either corrupted or zeroed, and I found partial copies of an
> identically corrupted table at various weird offsets (including
> 176888, 600344, 1036536, 1462520, 1887256 and 5326832); none of
> these copies was preceded by anything resembling a superblock. Here
> is a copy of the first 39 blocks of the corrupted disk:
> http://tweenk.artfx.pl/super.bin

It's awfully hard to say what went wrong given this information.
However, a power failure means the contents of the drives' write caches
are lost, and without barriers (which md raid0 won't pass, IIRC) the
journal ordering guarantees are shot, so corruption can happen - but
I would not expect huge swaths of crud sprinkled over the drive.

Is the super.bin above from before the fiddling you did (i.e. right
after the power loss?) The superblock is marked with errors, I wonder
if there were other errors reported on the filesystem prior to the power
loss; you might check your logs ...

-Eric

2009-06-29 15:12:49

by Krzysztof Kosiński

Subject: Re: Massive corruption on RAID0

On 29 June 2009 at 05:30, Eric Sandeen
<[email protected]> wrote:
> Is the super.bin above from before the fiddling you did (i.e. right
> after the power loss?)
Yes, it's before I started my recovery attempts, though mount and
e2fsck -p were run on it during the boot process before I removed it
from fstab.

> The superblock is marked with errors, I wonder
> if there were other errors reported on the filesystem prior to the power
> loss; you might check your logs ...
I checked and it seems there were no errors until the power failure.
By the way, is there some way to have RAID0-like functionality with
write barriers?

Regards, Krzysztof Kosiński

2009-06-30 15:33:59

by Eric Sandeen

Subject: Re: Massive corruption on RAID0

Krzysztof Kosiński wrote:
> On 29 June 2009 at 05:30, Eric Sandeen
> <[email protected]> wrote:
>> Is the super.bin above from before the fiddling you did (i.e. right
>> after the power loss?)
> Yes, it's before I started my recovery attempts, though mount and
> e2fsck -p were run on it during the boot process before I removed it
> from fstab.
>
>> The superblock is marked with errors, I wonder
>> if there were other errors reported on the filesystem prior to the power
>> loss; you might check your logs ...
> I checked and it seems there were no errors until the power failure.
> By the way, is there some way to have RAID0-like functionality with
> write barriers?

Mirrors can pass barriers, IIRC, but not stripes (IIRC...) - I don't
know if any work is being done to address this.

I wonder if there's any chance that your raid was reassembled
incorrectly....
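
If you want to check that, mdadm should show what each member thinks its
role, chunk size and event count are; something like this, with the real
member partitions substituted for my example names:

mdadm --examine /dev/hdc1
mdadm --examine /dev/hdd1
mdadm --detail /dev/md0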

-Eric

2009-06-30 16:46:18

by Eric Sandeen

Subject: Re: Massive corruption on RAID0

Greg Freemyer wrote:
> 2009/6/30 Eric Sandeen <[email protected]>:
>> Krzysztof Kosiński wrote:
> <snip>
>>> By the way, is there some way to have RAID0-like functionality with
>>> write barriers?
>> Mirrors can pass barriers, IIRC, but not stripes (IIRC...) - I don't
>> know if any work is being done to address this.
>>
> I'm pretty sure mdraid is not attempting to address it.
>
> The issue is that barriers with a single drive can simply be sent to
> the drive for it to do the heavy lifting.
>
> With raid-0, it is much more difficult.
>
> i.e. you send a barrier to 2 different drives. One drive takes 30
> milliseconds to flush the pre-barrier queue to disk and then continues
> working on the post-barrier data. The other drive takes 500
> milliseconds to do the same. The end result is out-of-sync barriers,
> not at all what the filesystem expects.
>
> The only reliable solution is to disable write caching on the drives.
> Of course you don't need barriers then.

Agreed this is the best solution at least for now. But, the dm folks
are apparently working on some sort of barrier solution for stripes, I
think. I don't know the details, perhaps I should.... :)
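
If you do go the disable-write-cache route in the meantime, hdparm can
query and clear the setting on each member, roughly like this
(substituting your real devices):

hdparm -W /dev/hdc     # show the current write-cache setting
hdparm -W0 /dev/hdc    # turn the write cache off
hdparm -W0 /dev/hdd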

-Eric

2009-06-30 18:28:08

by Mike Snitzer

Subject: Re: Massive corruption on RAID0

2009/6/30 Eric Sandeen <[email protected]>:
> Greg Freemyer wrote:
>> 2009/6/30 Eric Sandeen <[email protected]>:
>>> Krzysztof Kosiński wrote:
>> <snip>
>>>> By the way, is there some way to have RAID0-like functionality with
>>>> write barriers?
>>> Mirrors can pass barriers, IIRC, but not stripes (IIRC...) - I don't
>>> know if any work is being done to address this.
>>>
>> I'm pretty sure mdraid is not attempting to address it.
>>
>> The issue is that barriers with a single drive can simply be sent to
>> the drive for it to do the heavy lifting.
>>
>> With raid-0, it is much more difficult.
>>
>> i.e. you send a barrier to 2 different drives. One drive takes 30
>> milliseconds to flush the pre-barrier queue to disk and then continues
>> working on the post-barrier data. The other drive takes 500
>> milliseconds to do the same. The end result is out-of-sync barriers,
>> not at all what the filesystem expects.
>>
>> The only reliable solution is to disable write caching on the drives.
>> Of course you don't need barriers then.
>
> Agreed this is the best solution at least for now. But, the dm folks
> are apparently working on some sort of barrier solution for stripes, I
> think. I don't know the details, perhaps I should.... :)

Yes, DM's stripe target has full barrier support in the latest
2.6.31-rc1. You can create a striped LV using:
lvcreate --stripes STRIPES --stripesize STRIPESIZE ...
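
For example, a two-way stripe roughly equivalent to the md RAID0 above
might look like this (device names, volume names and chunk size are only
placeholders):

pvcreate /dev/hdc1 /dev/hdd1
vgcreate vg_data /dev/hdc1 /dev/hdd1
lvcreate --stripes 2 --stripesize 64 --extents 100%FREE --name lv_data vg_data
mkfs.ext4 -b 4096 -E stride=16,stripe_width=32 /dev/vg_data/lv_data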

It should be noted that 2.6.31-rc1 has a small bug that causes
"misaligned" warnings to appear when activating DM devices. These
warnings can be ignored and obviously have nothing to do with barriers
(they are related to DM's new support for the topology
infrastructure). That warnings bug has been fixed and the fix will
likely be pulled in for 2.6.31-rc2.

Mike