2009-09-01 08:39:11

by NeilBrown

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

On Tue, September 1, 2009 10:56 am, George Spelvin wrote:
> The fact that the ZFS developers observed drives writing the data to the
> wrong location emphasizes the importance of keeping the checksum with
> the pointer. An embedded checksum, no matter how good, can't tell you if
> the data is stale; you need a way to distinguish versions in the pointer.

I would disagree with that.
If the embedded checksum is a function of both the data and the address
of the data (in whatever address space seems most appropriate) then it can
still verify that the data found with the checksum is the data that was
expected.
And storing the checksum with the data (where it is practical) means
index blocks can be more dense so on average fewer accesses to storage
are needed.
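
A minimal Python sketch of an embedded checksum that covers both the data
and its address. The names `seal` and `verify`, the 8-byte address encoding,
and the 4-byte CRC32 trailer are assumptions made for illustration, not the
layout of any real filesystem.

```python
import struct
import zlib

def seal(addr: int, payload: bytes) -> bytes:
    """Return payload followed by a CRC32 computed over (address, payload)."""
    crc = zlib.crc32(struct.pack("<Q", addr) + payload)
    return payload + struct.pack("<I", crc)

def verify(addr: int, block: bytes) -> bool:
    """Check that the block was sealed for the address it was read from."""
    payload, stored = block[:-4], struct.unpack("<I", block[-4:])[0]
    return zlib.crc32(struct.pack("<Q", addr) + payload) == stored

disk = {}
disk[100] = seal(100, b"some data")      # normal write
disk[356] = seal(100, b"other data")     # write meant for 100 lands on 356

good = verify(100, disk[100])            # block is where it claims to be
misplaced = verify(356, disk[356])       # sealed for 100, found at 356
```

Here `good` is true and `misplaced` is false: a block that landed at the
wrong address fails verification when it is read from that address.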

NeilBrown


2009-09-01 08:46:56

by Pavel Machek

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

On Tue 2009-09-01 18:36:22, NeilBrown wrote:
> On Tue, September 1, 2009 10:56 am, George Spelvin wrote:
> > The fact that the ZFS developers observed drives writing the data to the
> > wrong location emphasizes the importance of keeping the checksum with
> > the pointer. An embedded checksum, no matter how good, can't tell you if
> > the data is stale; you need a way to distinguish versions in the pointer.
>
> I would disagree with that.
> If the embedded checksum is a function of both the data and the address
> of the data (in whatever address space seems most appropriate) then it can
> still verify that the data found with the checksum is the data that was
> expected.
> And storing the checksum with the data (where it is practical) means
> index blocks can be more dense so on average fewer accesses to storage
> are needed.

Well, storing the checksum with the pointer means that you catch dropped
writes, too.

Imagine the disk drive just fails to write block A. Adding a checksum of
the address will not catch that...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-01 11:18:08

by George Spelvin

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

>> An embedded checksum, no matter how good, can't tell you if
>> the data is stale; you need a way to distinguish versions in the pointer.

> I would disagree with that.
> If the embedded checksum is a function of both the data and the address
> of the data (in whatever address space seems most appropriate) then it can
> still verify that the data found with the checksum is the data that was
> expected.
> And storing the checksum with the data (where it is practical) means
> index blocks can be more dense so on average fewer accesses to storage
> are needed.

I must not have been clear. Originally, block 100 has contents version 1.
This includes a correctly computed checksum.

Then you write version 2 of the data there. But there's a bit error in
the address and the write goes to block 256+100 = 356. So block
100 still has the version 1 contents, complete with valid checksum.
(Yes, block 356 is now corrupted, but perhaps it's not even allocated.)

Then we go to read block 100, find a valid checksum, and return incorrect
data. Namely, version 1 data, when we expect and want version 2.

Basically, the pointer has to say which *version* of the data it points to,
not just the block address. Otherwise, it can't detect a missing write.

If density is a big issue, then including a small version field is a
possibility.
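
The scenario above can be sketched in Python. This is only an illustration:
the one-byte version field, the `seal` and `read` helpers, and the `pointer`
dictionary are hypothetical, and CRC32 stands in for whatever checksum a
real design would use.

```python
import struct
import zlib

def seal(addr, version, payload):
    """Block = version byte + payload + CRC32 over (address, version, payload)."""
    body = struct.pack("<B", version) + payload
    crc = zlib.crc32(struct.pack("<Q", addr) + body)
    return body + struct.pack("<I", crc)

def read(disk, addr, expected_version=None):
    """Verify the embedded checksum, and the version if the caller knows it."""
    block = disk[addr]
    body, stored = block[:-4], struct.unpack("<I", block[-4:])[0]
    if zlib.crc32(struct.pack("<Q", addr) + body) != stored:
        raise ValueError("checksum mismatch")
    if expected_version is not None and body[0] != expected_version:
        raise ValueError("stale block")
    return body[1:]

disk = {100: seal(100, 1, b"old contents")}
pointer = {"addr": 100, "version": 1}

# Version 2 is written, but an address bit error sends it to 256+100 = 356.
disk[256 + 100] = seal(100, 2, b"new contents")
pointer["version"] = 2

# The embedded checksum alone accepts the stale version 1 block.
stale = read(disk, pointer["addr"])

# Comparing against the version held in the pointer exposes the lost write.
try:
    read(disk, pointer["addr"], pointer["version"])
    detected = False
except ValueError:
    detected = True
```

With only the embedded checksum, `stale` comes back as the version 1
contents without complaint; with the version check, `detected` is true.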

2009-09-01 12:38:30

by NeilBrown

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

On Tue, September 1, 2009 9:18 pm, George Spelvin wrote:
>>> An embedded checksum, no matter how good, can't tell you if
>>> the data is stale; you need a way to distinguish versions in the
>>> pointer.
>
>> I would disagree with that.
>> If the embedded checksum is a function of both the data and the address
>> of the data (in whatever address space seems most appropriate) then it
>> can
>> still verify that the data found with the checksum is the data that was
>> expected.
>> And storing the checksum with the data (where it is practical) means
>> index blocks can be more dense so on average fewer accesses to storage
>> are needed.
>
> I must not have been clear. Originally, block 100 has contents version 1.
> This includes a correctly computed checksum.
>
> Then you write version 2 of the data there. But there's a bit error in
> the address and the write goes to block 256+100 = 356. So block
> 100 still has the version 1 contents, complete with valid checksum.
> (Yes, block 356 is now corrupted, but perhaps it's not even allocated.)
>
> Then we go to read block 100, find a valid checksum, and return incorrect
> data. Namely, version 1 data, when we expect and want version 2.
>
> Basically, the pointer has to say which *version* of the data it points
> to,
> not just the block address. Otherwise, it can't detect a missing write.

Agreed. I think the minimum is that the index block must be changed in
some way whenever data that it points to is changed. Exactly how
depends very much on other details of the filesystem layout.
For a copy-on-write filesystem where changed data is always written
to a new location, this is very easy to achieve as the 'physical'
address can probably be used as a version identifier in some way.
For write-in-place you would need the version information
to be more explicit as you say, whether a small version number
or a larger hash of the data.
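
A rough Python sketch of the copy-on-write case, under a strong simplifying
assumption: physical addresses are never reused, so an unwritten location is
simply empty. On a real disk that location would still hold whatever was
there before, so a design would need to avoid reuse or mix in a generation
number. The `CowFile` class and its helpers are hypothetical.

```python
import struct
import zlib

def seal(phys, payload):
    """Payload plus CRC32 over (physical address, payload)."""
    crc = zlib.crc32(struct.pack("<Q", phys) + payload)
    return payload + struct.pack("<I", crc)

def valid(phys, block):
    payload, stored = block[:-4], struct.unpack("<I", block[-4:])[0]
    return zlib.crc32(struct.pack("<Q", phys) + payload) == stored

class CowFile:
    """One logical block whose index entry is just a physical address."""
    def __init__(self):
        self.disk = {}        # physical address -> raw block
        self.next_free = 0    # assumption: addresses are never reused
        self.pointer = None   # index entry: current physical address

    def write(self, payload, dropped=False):
        phys = self.next_free
        self.next_free += 1
        if not dropped:       # a dropped write never reaches the disk
            self.disk[phys] = seal(phys, payload)
        self.pointer = phys   # the index changes on every write

    def read(self):
        block = self.disk.get(self.pointer)
        if block is None or not valid(self.pointer, block):
            raise ValueError("missing or misplaced write")
        return block[:-4]

f = CowFile()
f.write(b"version 1")
first = f.read()
f.write(b"version 2", dropped=True)
try:
    f.read()
    caught = False
except ValueError:
    caught = True
```

Because every write moves the pointer, the old block at the old address is
never consulted again, and the dropped write shows up as a failed read.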

>
> If density is a big issue, then including a small version field is a
> possibility.
>


NeilBrown

2009-09-01 15:26:12

by David Lang

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

On Tue, 1 Sep 2009, NeilBrown wrote:

> On Tue, September 1, 2009 9:18 pm, George Spelvin wrote:
>>>> An embedded checksum, no matter how good, can't tell you if
>>>> the data is stale; you need a way to distinguish versions in the
>>>> pointer.
>>
>>> I would disagree with that.
>>> If the embedded checksum is a function of both the data and the address
>>> of the data (in whatever address space seems most appropriate) then it
>>> can
>>> still verify that the data found with the checksum is the data that was
>>> expected.
>>> And storing the checksum with the data (where it is practical) means
>>> index blocks can be more dense so on average fewer accesses to storage
>>> are needed.
>>
>> I must not have been clear. Originally, block 100 has contents version 1.
>> This includes a correctly computed checksum.
>>
>> Then you write version 2 of the data there. But there's a bit error in
>> the address and the write goes to block 256+100 = 356. So block
>> 100 still has the version 1 contents, complete with valid checksum.
>> (Yes, block 356 is now corrupted, but perhaps it's not even allocated.)
>>
>> Then we go to read block 100, find a valid checksum, and return incorrect
>> data. Namely, version 1 data, when we expect and want version 2.
>>
>> Basically, the pointer has to say which *version* of the data it points
>> to,
>> not just the block address. Otherwise, it can't detect a missing write.
>
> Agreed. I think the minimum is that the index block must be changed in
> some way whenever data that it points to is changed. Exactly how
> depends very much on other details of the filesystem layout.
> For a copy-on-write filesystem where changed data is always written
> to a new location, this is very easy to achieve as the 'physical'
> address can probably be used as a version identifier in some way.
> For write-in-place you would need the version information
> to be more explicit as you say, whether a small version number
> or a larger hash of the data.

but then don't you have to update the version on the index (and therefore
the pointer to that directory), on up to the point where you update the
root?

David Lang

2009-09-01 21:15:10

by NeilBrown

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3:

On Wed, September 2, 2009 1:25 am, David Lang wrote:
> On Tue, 1 Sep 2009, NeilBrown wrote:
>
>> On Tue, September 1, 2009 9:18 pm, George Spelvin wrote:
>>>>> An embedded checksum, no matter how good, can't tell you if
>>>>> the data is stale; you need a way to distinguish versions in the
>>>>> pointer.
>>>
>>>> I would disagree with that.
>>>> If the embedded checksum is a function of both the data and the
>>>> address
>>>> of the data (in whatever address space seems most appropriate) then it
>>>> can
>>>> still verify that the data found with the checksum is the data that
>>>> was
>>>> expected.
>>>> And storing the checksum with the data (where it is practical) means
>>>> index blocks can be more dense so on average fewer accesses to storage
>>>> are needed.
>>>
>>> I must not have been clear. Originally, block 100 has contents version
>>> 1.
>>> This includes a correctly computed checksum.
>>>
>>> Then you write version 2 of the data there. But there's a bit error in
>>> the address and the write goes to block 256+100 = 356. So block
>>> 100 still has the version 1 contents, complete with valid checksum.
>>> (Yes, block 356 is now corrupted, but perhaps it's not even allocated.)
>>>
>>> Then we go to read block 100, find a valid checksum, and return
>>> incorrect
>>> data. Namely, version 1 data, when we expect and want version 2.
>>>
>>> Basically, the pointer has to say which *version* of the data it points
>>> to,
>>> not just the block address. Otherwise, it can't detect a missing
>>> write.
>>
>> Agreed. I think the minimum is that the index block must be changed in
>> some way whenever data that it points to is changed. Exactly how
>> depends very much on other details of the filesystem layout.
>> For a copy-on-write filesystem where changed data is always written
>> to a new location, this is very easy to achieve as the 'physical'
>> address can probably be used as a version identifier in some way.
>> For write-in-place you would need the version information
>> to be more explicit as you say, whether a small version number
>> or a larger hash of the data.
>
> but then don't you have to update the version on the index (and therefore
> the pointer to that directory), on up to the point where you update the
> root?

Yes, all the way to the root. This is true no matter what data verification
scheme you use, if you want to be able to detect silently failing writes.
This makes it a very neat fit for copy-on-write designs, which have to do
that anyway, and a more awkward fit for update-in-place designs.
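
The propagation to the root can be sketched in Python as a small hash tree.
This is an illustration under assumptions, not any filesystem's actual
format: each parent stores a SHA-256 of its two children, and the `build`
and `update` helpers are hypothetical.

```python
import hashlib

def digest(data):
    return hashlib.sha256(data).digest()

def build(leaves):
    """Return a list of levels: levels[0] holds leaf hashes, levels[-1] the root."""
    levels = [[digest(b) for b in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([digest(b"".join(prev[i:i + 2]))
                       for i in range(0, len(prev), 2)])
    return levels

def update(levels, index, new_block):
    """Rewrite one leaf hash and every ancestor; return how many hashes changed."""
    levels[0][index] = digest(new_block)
    changed = 1
    for depth in range(1, len(levels)):
        index //= 2                      # position of the parent in its level
        below = levels[depth - 1]
        levels[depth][index] = digest(b"".join(below[2 * index:2 * index + 2]))
        changed += 1
    return changed

leaves = [b"block%d" % i for i in range(8)]
levels = build(leaves)
old_root = levels[-1][0]

# Changing one data block rewrites one hash per level, up to the root.
touched = update(levels, 5, b"block5 v2")
root_changed = levels[-1][0] != old_root
```

With eight leaves the tree has four levels, so a single data change touches
four hashes, the last of which is the root.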

NeilBrown