2009-08-31 00:54:26

by George Spelvin

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

Actually, there is something the file system can do to make journaling
safe on degraded RAIDs: make the (checksummed) journal blocks equal to
the RAID stripe size. Or, equivalently, pad out to the RAID stripe
size each commit.

This sometimes leads to awkward block sizes, but while writing
to any *one* stripe on a degraded RAID-5 endangers the others, you
can write to *all* of them with the usual semantics.
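
(For concreteness, a minimal sketch of the round-up arithmetic; the
stripe geometry below is made up for illustration:)

    /* Pad a journal commit out to a whole number of RAID stripes. */
    #include <stdio.h>

    static unsigned long pad_to_stripe(unsigned long commit_bytes,
                                       unsigned long stripe_bytes)
    {
        /* round up to the next stripe boundary */
        return ((commit_bytes + stripe_bytes - 1) / stripe_bytes)
               * stripe_bytes;
    }

    int main(void)
    {
        /* e.g. 4 data disks with 64 KiB chunks -> 256 KiB stripes */
        unsigned long stripe = 4 * 65536;

        printf("%lu\n", pad_to_stripe(100000, stripe)); /* prints 262144 */
        return 0;
    }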

That's something that's a good idea for performance anyway, so maybe
ext[34] should be more vociferous about it. E.g. check each mount
and warn if the journal is mis-sized. Or even change the journal
block size on mount if it starts empty.

The other solution, of course, is RAID-1, which I like to use for
performance and simplicity reasons anyway. (It's really something
of a degenerate case of the RAID-[456] rule.)

That's one thing I really like about ZFS: its policy of "don't trust
the disks." If nothing else, simply telling you "your disks f*ed up,
and I caught them doing it", instead of the usual mysterious corruption
detected three months later, is tremendously useful information.


2009-08-31 11:04:48

by Pavel Machek

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

Hi!

> Actually, there is something the file system can do to make journaling
> safe on degraded RAIDs: make the (checksummed) journal blocks equal to
> the RAID stripe size. Or, equivalently, pad out to the RAID stripe
> size each commit.
>
> This sometimes leads to awkward block sizes, but while writing
> to any *one* stripe on a degraded RAID-5 endangers the others, you
> can write to *all* of them with the usual semantics.

Well, that would work... but you'd also have to journal data, with the
same block size. Not exactly fast, but at least safe...

> That's one thing I really like about ZFS: its policy of "don't trust
> the disks." If nothing else, simply telling you "your disks f*ed up,
> and I caught them doing it", instead of the usual mysterious corruption
> detected three months later, is tremendously useful information.

The more I learn about storage, the more I like the idea of zfs. Given the
subtle issues between filesystem and raid layer, integrating them just
makes sense.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-08-31 15:45:38

by David Lang

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

On Mon, 31 Aug 2009, Pavel Machek wrote:

>> Actually, there is something the file system can do to make journaling
>> safe on degraded RAIDs: make the (checksummed) journal blocks equal to
>> the RAID stripe size. Or, equivalently, pad out to the RAID stripe
>> size each commit.
>>
>> This sometimes leads to awkward block sizes, but while writing
>> to any *one* stripe on a degraded RAID-5 endangers the others, you
>> can write to *all* of them with the usual semantics.
>
> Well, that would work... but you'd also have to journal data, with the
> same block size. Not exactly fast, but at least safe...
>
>> That's one thing I really like about ZFS: its policy of "don't trust
>> the disks." If nothing else, simply telling you "your disks f*ed up,
>> and I caught them doing it", instead of the usual mysterious corruption
>> detected three months later, is tremendously useful information.
>
> The more I learn about storage, the more I like the idea of zfs. Given the
> subtle issues between filesystem and raid layer, integrating them just
> makes sense.

note that all that zfs does is tell you that you already lost data (and
then only if the checksum would come out invalid on a blank block
being returned); it doesn't protect your data.

David Lang

2009-09-01 00:56:29

by George Spelvin

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

>From [email protected] Mon Aug 31 15:46:19 2009
Date: Mon, 31 Aug 2009 08:45:38 -0700 (PDT)
From: [email protected]
X-X-Sender: [email protected]
To: Pavel Machek <[email protected]>
cc: George Spelvin <[email protected]>, [email protected],
[email protected], [email protected]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:
In-Reply-To: <[email protected]>
References: <[email protected]> <[email protected]>
User-Agent: Alpine 2.00 (DEB 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed

>>> That's one thing I really like about ZFS: its policy of "don't trust
>>> the disks." If nothing else, simply telling you "your disks f*ed up,
>>> and I caught them doing it", instead of the usual mysterious corruption
>>> detected three months later, is tremendously useful information.
>>
>> The more I learn about storage, the more I like the idea of zfs. Given the
>> subtle issues between filesystem and raid layer, integrating them just
>> makes sense.
>
> Note that all that zfs does is tell you that you already lost data (and
> then only if the checksum would come out invalid on a blank block
> being returned); it doesn't protect your data.

Obviously, there are limits, but it does provide useful protection:
- You know where the missing data is.
- The error isn't amplified by believing corrupted metadata.
- I seem to recall that ZFS does replicate metadata.
- Corrupted replicas can be "scrubbed" and rewritten from uncorrupted ones.
- If you have some storage redundancy, it can try different mirrors
to get the data back.

In particular, on a RAID-5 system, ZFS tries dropping out each data disk
in turn to see if the correct data can be reconstructed from the others
+ parity.
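
(A sketch of that recovery loop, with XOR parity and a stand-in
checksum; this is the shape of the idea, not ZFS's actual code:)

    /* Try each data chunk as the suspect in turn: rebuild it from the
     * other chunks + parity, and accept the combination whose
     * whole-stripe checksum matches the stored one.  Toy geometry;
     * FNV-1a stands in for the real checksum. */
    #include <stdint.h>
    #include <string.h>

    #define NDATA 4
    #define CHUNK 16

    static uint32_t cksum(const uint8_t *b, size_t n)
    {
        uint32_t h = 2166136261u;
        while (n--) { h ^= *b++; h *= 16777619u; }
        return h;
    }

    /* Returns the index of the chunk repaired in place, or -1 if no
     * single-chunk substitution makes the checksum validate. */
    static int repair_stripe(uint8_t data[NDATA][CHUNK],
                             const uint8_t parity[CHUNK], uint32_t want)
    {
        uint8_t rebuilt[CHUNK], stripe[NDATA * CHUNK];
        int drop, i, b;

        for (drop = 0; drop < NDATA; drop++) {
            memcpy(rebuilt, parity, CHUNK);
            for (i = 0; i < NDATA; i++)
                if (i != drop)
                    for (b = 0; b < CHUNK; b++)
                        rebuilt[b] ^= data[i][b];
            for (i = 0; i < NDATA; i++)
                memcpy(stripe + i * CHUNK,
                       i == drop ? rebuilt : data[i], CHUNK);
            if (cksum(stripe, sizeof(stripe)) == want) {
                memcpy(data[drop], rebuilt, CHUNK);
                return drop;
            }
        }
        return -1;
    }

    int main(void)
    {
        uint8_t data[NDATA][CHUNK] = { {1}, {2}, {3}, {4} };
        uint8_t parity[CHUNK], stripe[NDATA * CHUNK];
        uint32_t want;
        int i, b;

        for (b = 0; b < CHUNK; b++)
            parity[b] = data[0][b] ^ data[1][b] ^ data[2][b] ^ data[3][b];
        for (i = 0; i < NDATA; i++)
            memcpy(stripe + i * CHUNK, data[i], CHUNK);
        want = cksum(stripe, sizeof(stripe));

        data[2][0] ^= 0xff;                 /* silent corruption */
        return repair_stripe(data, parity, want) == 2 ? 0 : 1;
    }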

One of ZFS's big performance problems is that currently it only checksums
the entire RAID stripe, so it always has to read every drive, and doesn't
get RAID's IOPS advantage. But that's fairly straightforward to fix.
(It's something of a problem for RAID-5 in general, because reads want
larger chunk sizes to increase the chance that a single read can be
satisfied by one disk, while writes want small chunks so that you can
do whole-stripe writes.)

The fact that the ZFS developers observed drives writing the data to the
wrong location emphasizes the importance of keeping the checksum with
the pointer. An embedded checksum, no matter how good, can't tell you if
the data is stale; you need a way to distinguish versions in the pointer.

2009-09-01 08:36:22

by NeilBrown

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

On Tue, September 1, 2009 10:56 am, George Spelvin wrote:
> The fact that the ZFS developers observed drives writing the data to the
> wrong location emphasizes the importance of keeping the checksum with
> the pointer. An embedded checksum, no matter how good, can't tell you if
> the data is stale; you need a way to distinguish versions in the pointer.

I would disagree with that.
If the embedded checksum is a function of both the data and the address
of the data (in whatever address space seems most appropriate) then it can
still verify that the data found with the checksum is the data that was
expected.
And storing the checksum with the data (where it is practical) means
index blocks can be more dense so on average fewer accesses to storage
are needed.
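
(A sketch of "a function of both the data and the address"; FNV-1a is
just a stand-in hash here:)

    /* The stored checksum covers the block's intended address as well as
     * its payload, so a block that landed at the wrong address fails
     * verification when read back from the place it was meant for. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    static uint32_t fnv1a(uint32_t h, const void *p, size_t n)
    {
        const uint8_t *b = p;
        while (n--) { h ^= *b++; h *= 16777619u; }
        return h;
    }

    static uint32_t block_cksum(uint64_t addr, const void *data, size_t n)
    {
        uint32_t h = fnv1a(2166136261u, &addr, sizeof(addr));
        return fnv1a(h, data, n);
    }

    int main(void)
    {
        char buf[512] = "same payload";

        /* identical payloads at different addresses checksum differently,
         * so a misdirected write is detectable on read-back */
        printf("%s\n",
               block_cksum(100, buf, sizeof(buf)) !=
               block_cksum(356, buf, sizeof(buf))
               ? "addresses distinguished" : "collision");
        return 0;
    }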

NeilBrown


2009-09-01 08:46:55

by Pavel Machek

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

On Tue 2009-09-01 18:36:22, NeilBrown wrote:
> On Tue, September 1, 2009 10:56 am, George Spelvin wrote:
> > The fact that the ZFS developers observed drives writing the data to the
> > wrong location emphasizes the importance of keeping the checksum with
> > the pointer. An embedded checksum, no matter how good, can't tell you if
> > the data is stale; you need a way to distinguish versions in the pointer.
>
> I would disagree with that.
> If the embedded checksum is a function of both the data and the address
> of the data (in whatever address space seems most appropriate) then it can
> still verify that the data found with the checksum is the data that was
> expected.
> And storing the checksum with the data (where it is practical) means
> index blocks can be more dense so on average fewer accesses to storage
> are needed.

Well, storing checksum with the pointer means that you catch dropped
writes, too.

Imagine the disk drive just fails to write block A. Adding a checksum of
the address will not catch that...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-01 11:18:03

by George Spelvin

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

>> An embedded checksum, no matter how good, can't tell you if
>> the data is stale; you need a way to distinguish versions in the pointer.

> I would disagree with that.
> If the embedded checksum is a function of both the data and the address
> of the data (in whatever address space seems most appropriate) then it can
> still verify that the data found with the checksum is the data that was
> expected.
> And storing the checksum with the data (where it is practical) means
> index blocks can be more dense so on average fewer accesses to storage
> are needed.

I must not have been clear. Originally, block 100 has contents version 1.
This includes a correctly computed checksum.

Then you write version 2 of the data there. But there's a bit error in
the address and the write goes to block 256+100 = 356. So block
100 still has the version 1 contents, complete with valid checksum.
(Yes, block 356 is now corrupted, but perhaps it's not even allocated.)

Then we go to read block 100, find a valid checksum, and return incorrect
data. Namely, version 1 data, when we expect and want version 2.

Basically, the pointer has to say which *version* of the data it points to,
not just the block address. Otherwise, it can't detect a missing write.

If density is a big issue, then including a small version field is a
possibility.
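
(A sketch of that, with an invented pointer layout; the checksum in
the pointer covers the version number as well as the payload:)

    /* The block pointer carries a version (generation) number, so
     * stale-but-internally-valid version 1 contents at the right
     * address still fail to verify against the updated pointer. */
    #include <stdint.h>
    #include <stdio.h>

    struct blkptr {
        uint64_t addr;      /* block address */
        uint32_t version;   /* bumped on every rewrite */
        uint32_t cksum;     /* covers version + payload */
    };

    static uint32_t fnv1a(uint32_t h, const void *p, size_t n)
    {
        const uint8_t *b = p;
        while (n--) { h ^= *b++; h *= 16777619u; }
        return h;
    }

    static uint32_t vsum(uint32_t version, const void *data, size_t n)
    {
        return fnv1a(fnv1a(2166136261u, &version, sizeof(version)), data, n);
    }

    int main(void)
    {
        char disk[512] = "contents, version 1";
        char new_data[512] = "contents, version 2";
        struct blkptr bp = { 100, 1, 0 };

        bp.cksum = vsum(bp.version, disk, sizeof(disk));

        /* rewrite block 100: the pointer is updated... */
        bp.version = 2;
        bp.cksum = vsum(bp.version, new_data, sizeof(new_data));
        /* ...but the write itself is silently dropped, so 'disk'
         * still holds version 1, complete with its old valid sum */

        puts(vsum(bp.version, disk, sizeof(disk)) == bp.cksum
             ? "stale data accepted" : "stale data detected");
        return 0;
    }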

2009-09-01 12:35:40

by NeilBrown

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

On Tue, September 1, 2009 9:18 pm, George Spelvin wrote:
>>> An embedded checksum, no matter how good, can't tell you if
>>> the data is stale; you need a way to distinguish versions in the
>>> pointer.
>
>> I would disagree with that.
>> If the embedded checksum is a function of both the data and the address
>> of the data (in whatever address space seems most appropriate) then it
>> can
>> still verify that the data found with the checksum is the data that was
>> expected.
>> And storing the checksum with the data (where it is practical) means
>> index blocks can be more dense so on average fewer accesses to storage
>> are needed.
>
> I must not have been clear. Originally, block 100 has contents version 1.
> This includes a correctly computed checksum.
>
> Then you write version 2 of the data there. But there's a bit error in
> the address and the write goes to block 256+100 = 356. So block
> 100 still has the version 1 contents, complete with valid checksum.
> (Yes, block 356 is now corrupted, but perhaps it's not even allocated.)
>
> Then we go to read block 100, find a valid checksum, and return incorrect
> data. Namely, version 1 data, when we expect and want version 2.
>
> Basically, the pointer has to say which *version* of the data it points
> to,
> not just the block address. Otherwise, it can't detect a missing write.

Agreed. I think the minimum is that the index block must be changed in
some way whenever data that it points to is changed. Exactly how
depends very much on other details of the filesystem layout.
For a copy-on-write filesystem where changed data is always written
to a new location, this is very easy to achieve as the 'physical'
address can probably be used as a version identifier in some way.
For write-in-place you would need the version information
to be more explicit as you say, whether a small version number
or a larger hash of the data.

>
> If density is a big issue, then including a small version field is a
> possibility.
>


NeilBrown


2009-09-01 15:25:55

by David Lang

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

On Tue, 1 Sep 2009, NeilBrown wrote:

> On Tue, September 1, 2009 9:18 pm, George Spelvin wrote:
>>>> An embedded checksum, no matter how good, can't tell you if
>>>> the data is stale; you need a way to distinguish versions in the
>>>> pointer.
>>
>>> I would disagree with that.
>>> If the embedded checksum is a function of both the data and the address
>>> of the data (in whatever address space seems most appropriate) then it
>>> can
>>> still verify that the data found with the checksum is the data that was
>>> expected.
>>> And storing the checksum with the data (where it is practical) means
>>> index blocks can be more dense so on average fewer accesses to storage
>>> are needed.
>>
>> I must not have been clear. Originally, block 100 has contents version 1.
>> This includes a correctly computed checksum.
>>
>> Then you write version 2 of the data there. But there's a bit error in
>> the address and the write goes to block 256+100 = 356. So block
>> 100 still has the version 1 contents, complete with valid checksum.
>> (Yes, block 356 is now corrupted, but perhaps it's not even allocated.)
>>
>> Then we go to read block 100, find a valid checksum, and return incorrect
>> data. Namely, version 1 data, when we expect and want version 2.
>>
>> Basically, the pointer has to say which *version* of the data it points
>> to,
>> not just the block address. Otherwise, it can't detect a missing write.
>
> Agreed. I think the minimum is that the index block must be changed in
> some way whenever data that it points to is changed. Exactly how
> depends very much on other details of the filesystem layout.
> For a copy-on-write filesystem where changed data is always written
> to a new location, this is very easy to achieve as the 'physical'
> address can probably be used as a version identifier in some way.
> For write-in-place you would need the version information
> to be more explicit as you say, whether a small version number
> or a larger hash of the data.

but then don't you have to update the version on the index (and therefore
the pointer to that directory), on up to the point where you update the
root?

David Lang

2009-09-01 16:18:28

by Andreas Dilger

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

On Aug 31, 2009 20:56 -0400, George Spelvin wrote:
> >> The more I learn about storage, the more I like the idea of zfs. Given the
> >> subtle issues between filesystem and raid layer, integrating them just
> >> makes sense.
> >
> > Note that all that zfs does is tell you that you already lost data (and
> > then only if the checksum would come out invalid on a blank block
> > being returned); it doesn't protect your data.
>
> Obviously, there are limits, but it does provide useful protection:
> - You know where the missing data is.
> - The error isn't amplified by believing corrupted metadata.
> - I seem to recall that ZFS does replicate metadata.

ZFS definitely does replicate data. At the lowest level it has RAID-1,
and RAID-Z/Z2, which are pretty close to RAID-5/6 respectively, but with
the important difference that every write is a full-stripe-width write,
so that it is not possible for RAID-Z/Z2 to cause corruption due to a
partially-written RAID parity stripe.

In addition, for internal metadata blocks there are 1 or 2 duplicate
copies written to different devices, so that in case of a fatal device
corruption (e.g. double failure of a RAID-Z device) the metadata tree
is still intact.

> - Corrupted replicas can be "scrubbed" and rewritten from uncorrupted ones.
> - If you have some storage redundancy, it can try different mirrors
> to get the data back.
>
> In particular, on a RAID-5 system, ZFS tries dropping out each data disk
> in turn to see if the correct data can be reconstructed from the others
> + parity.

What else is interesting is that in the case of 1-4-bit errors the
default checksum function can also be used as ECC to recover the correct
data even if there is no replicated copy of the data.
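
(The 1-4-bit figure is Andreas's claim about the fletcher checksums;
the toy below only shows the brute-force form of the idea, for a
single flipped bit, with a stand-in hash:)

    /* If at most one bit is wrong, flipping each bit in turn and
     * re-checksumming finds and repairs it.  A checksum with more
     * algebraic structure can locate the error without the search,
     * and handle more bits. */
    #include <stdint.h>
    #include <stddef.h>

    static uint32_t cksum(const uint8_t *b, size_t n)
    {
        uint32_t h = 2166136261u;
        while (n--) { h ^= *b++; h *= 16777619u; }
        return h;
    }

    /* Repairs buf in place and returns 1 if a single bit-flip makes
     * the checksum validate; returns 0 otherwise. */
    static int correct_one_bit(uint8_t *buf, size_t n, uint32_t want)
    {
        size_t i;
        int bit;

        if (cksum(buf, n) == want)
            return 1;                   /* nothing to fix */
        for (i = 0; i < n; i++)
            for (bit = 0; bit < 8; bit++) {
                buf[i] ^= 1u << bit;
                if (cksum(buf, n) == want)
                    return 1;           /* repaired */
                buf[i] ^= 1u << bit;    /* undo, keep looking */
            }
        return 0;                       /* not a 1-bit error */
    }

    int main(void)
    {
        uint8_t block[64] = "some block contents";
        uint32_t want = cksum(block, sizeof(block));

        block[7] ^= 0x10;               /* inject a single-bit error */
        return correct_one_bit(block, sizeof(block), want) ? 0 : 1;
    }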

> One of ZFS's big performance problems is that currently it only checksums
> the entire RAID stripe, so it always has to read every drive, and doesn't
> get RAID's IOPS advantage.

Or you could call this a drawback of Linux software RAID: it doesn't detect
the case where the parity is bad until there is a second drive failure, and
the bad parity is then used to reconstruct the data block incorrectly (which
will also go undetected, because there is no checksum).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-09-01 21:15:07

by NeilBrown

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

On Wed, September 2, 2009 1:25 am, [email protected] wrote:
> On Tue, 1 Sep 2009, NeilBrown wrote:
>
>> On Tue, September 1, 2009 9:18 pm, George Spelvin wrote:
>>>>> An embedded checksum, no matter how good, can't tell you if
>>>>> the data is stale; you need a way to distinguish versions in the
>>>>> pointer.
>>>
>>>> I would disagree with that.
>>>> If the embedded checksum is a function of both the data and the
>>>> address
>>>> of the data (in whatever address space seems most appropriate) then it
>>>> can
>>>> still verify that the data found with the checksum is the data that
>>>> was
>>>> expected.
>>>> And storing the checksum with the data (where it is practical) means
>>>> index blocks can be more dense so on average fewer accesses to storage
>>>> are needed.
>>>
>>> I must not have been clear. Originally, block 100 has contents version
>>> 1.
>>> This includes a correctly computed checksum.
>>>
>>> Then you write version 2 of the data there. But there's a bit error in
>>> the address and the write goes to block 256+100 = 356. So block
>>> 100 still has the version 1 contents, complete with valid checksum.
>>> (Yes, block 356 is now corrupted, but perhaps it's not even allocated.)
>>>
>>> Then we go to read block 100, find a valid checksum, and return
>>> incorrect
>>> data. Namely, version 1 data, when we expect and want version 2.
>>>
>>> Basically, the pointer has to say which *version* of the data it points
>>> to,
>>> not just the block address. Otherwise, it can't detect a missing
>>> write.
>>
>> Agreed. I think the minimum is that the index block must be changed in
>> some way whenever data that it points to is changed. Exactly how
>> depends very much on other details of the filesystem layout.
>> For a copy-on-write filesystem where changed data is always written
>> to a new location, this is very easy to achieve as the 'physical'
>> address can probably be used as a version identifier in some way.
>> For write-in-place you would need the version information
>> to be more explicit as you say, whether a small version number
>> or a larger hash of the data.
>
> but then don't you have to update the version on the index (and therefore
> the pointer to that directory), on up to the point where you update the
> root?

Yes, all the way to the root. This is true no matter what data verification
scheme you use, if you want to be able to detect silently failing writes.
This makes it a very neat fit for copy-on-write designs, which have to do
that anyway, and a more awkward fit for update-in-place designs.

NeilBrown


2009-09-02 01:10:20

by George Spelvin

[permalink] [raw]
Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

>> - I seem to recall that ZFS does replicate metadata.
>
> ZFS definitely does replicate data. At the lowest level it has RAID-1,
> and RAID-Z/Z2, which are pretty close to RAID-5/6 respectively, but with
> the important difference that every write is a full-stripe-width write,
> so that it is not possible for RAID-Z/Z2 to cause corruption due to a
> partially-written RAID parity stripe.
>
> In addition, for internal metadata blocks there are 1 or 2 duplicate
> copies written to different devices, so that in case of a fatal device
> corruption (e.g. double failure of a RAID-Z device) the metadata tree
> is still intact.

Forgive me for implying by omission that ZFS did not replicate data.
What I was trying to point out is that it replicates metadata *more*,
and you can choose among the redundant backups.

> What else is interesting is that in the case of 1-4-bit errors the
> default checksum function can also be used as ECC to recover the correct
> data even if there is no replicated copy of the data.

Interesting. Do you actually see such low-bit-weight errors in
practice? I had assumed that modern disks were complicated enough
that errors would be high-bit-weight miscorrections.

>> One of ZFS's big performance problems is that currently it only checksums
>> the entire RAID stripe, so it always has to read every drive, and doesn't
>> get RAID's IOPS advantage.
>
> Or you could call this a drawback of Linux software RAID: it doesn't detect
> the case where the parity is bad until there is a second drive failure, and
> the bad parity is then used to reconstruct the data block incorrectly (which
> will also go undetected, because there is no checksum).

Well, all conventional RAID systems lack block checksums (or, more to
the point, rely on the drive's checksumming), and have this problem.

I was pointing out that ZFS currently doesn't support partial-stripe
*reads*, thus limiting IOPS in random-read applications. But that's
an "implementation detail", not a major architectural issue.