2009-09-01 05:19:55

by Martin K. Petersen

Subject: Re: Data integrity built into the storage stack

>>>>> "Greg" == Greg Freemyer <[email protected]> writes:

Greg> We already have the scsi data integrity patches that went in last
Greg> winter and I believe fit into the storage stack below the
Greg> filesystem layer.

The filesystems can actually use it. It's exposed at the bio level.
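
To make that concrete, here is a minimal kernel-style sketch of a
write path that attaches protection information before submission.
bio_integrity_enabled() and bio_integrity_prep() are the real hooks;
fs_submit_write() is a hypothetical wrapper, not code from any
particular filesystem:

#include <linux/bio.h>
#include <linux/fs.h>

/*
 * Hypothetical filesystem write path: if the underlying device
 * has an integrity profile, allocate the integrity payload and
 * generate guard tags for the bio before it goes down the stack.
 */
static void fs_submit_write(struct bio *bio)
{
        if (bio_integrity_enabled(bio))
                bio_integrity_prep(bio);

        submit_bio(WRITE, bio);
}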


Greg> I do believe there is a patch floating around for device mapper to
Greg> add some integrity capability.

The patch is in mainline. It allows passthrough so the filesystems can
access the integrity features. But DM itself doesn't use any of them,
it merely acts as a conduit.

DIF is inherently tied to the storage device's logical blocks. These are
likely to be smaller than the blocks we're interested in protecting.
However, you could conceivably use the application tag space to add a
checksum at filesystem or MD/DM block size granularity. All the
hooks are there.
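
For illustration, a userspace sketch of the idea: struct dif_tuple
mirrors the 8-byte protection information format (on the wire the
fields are big-endian), and scatter_fs_checksum() is a hypothetical
example of spreading a 16-byte filesystem checksum across the app
tags of the eight sectors backing one 4KB block:

#include <stdint.h>

/* T10 DIF protection information: one 8-byte tuple per 512-byte sector. */
struct dif_tuple {
        uint16_t guard_tag;     /* CRC16 of the sector data */
        uint16_t app_tag;       /* free for the initiator to use */
        uint32_t ref_tag;       /* typically low 32 bits of the LBA */
};

/*
 * A 4KB filesystem block spans eight 512-byte sectors, so the eight
 * 16-bit app tags give 16 bytes in which a filesystem or MD/DM
 * checksum can be scattered, one 16-bit chunk per sector.
 */
static void scatter_fs_checksum(struct dif_tuple pi[8],
                                const uint8_t csum[16])
{
        int i;

        for (i = 0; i < 8; i++)
                pi[i].app_tag = (uint16_t)(csum[2 * i] << 8 | csum[2 * i + 1]);
}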

The application tag space is pretty much only available on disk
drives--array vendors use it for internal purposes. But in the MD/DM
case we're likely to run on raw disk so that's probably ok.

That said, I really think btrfs is the right answer to many of the
concerns raised in this thread. Everything is properly checksummed and
can be verified at read time.

The strength of DIX/DIF is that we can detect corruption at write time,
while the buffer we care about is still sitting in memory.
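
Concretely, the guard tag is a 16-bit CRC of each 512-byte sector,
generated while the data is still in memory and verified by the HBA
and the drive on the way down. A bit-at-a-time sketch of that CRC
(T10 polynomial 0x8BB7, initial value 0; the kernel's crc_t10dif()
is the table-driven equivalent):

#include <stddef.h>
#include <stdint.h>

static uint16_t crc_t10dif_sketch(const uint8_t *buf, size_t len)
{
        uint16_t crc = 0;
        size_t i;
        int bit;

        for (i = 0; i < len; i++) {
                /* Feed one byte in, MSB first */
                crc ^= (uint16_t)buf[i] << 8;
                for (bit = 0; bit < 8; bit++)
                        crc = (crc & 0x8000) ? (crc << 1) ^ 0x8BB7
                                             : (crc << 1);
        }
        return crc;
}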

So btrfs and DIX/DIF go hand in hand as far as I'm concerned. They
solve different problems but both are squarely aimed at preventing
silent data corruption.

I do agree that we have to be more prepared for collateral damage
scenarios. As we discussed at LS, we have 4KB drives coming out that can
invalidate previously acknowledged I/Os if they get a subsequent write
failure on a sector. And there's also the issue of fractured writes
when talking to disk arrays. That's really what my I/O topology changes
were all about: correctness. The fact that they may increase
performance is nice, but that was not the main motivator.
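
To show what those changes actually export, a small userspace sketch
that reads the new sysfs attributes ("sda" is just an example device;
error handling kept minimal):

#include <stdio.h>

static long read_attr(const char *path)
{
        FILE *f = fopen(path, "r");
        long val = -1;

        if (f) {
                if (fscanf(f, "%ld", &val) != 1)
                        val = -1;
                fclose(f);
        }
        return val;
}

int main(void)
{
        printf("logical block size:  %ld\n",
               read_attr("/sys/block/sda/queue/logical_block_size"));
        printf("physical block size: %ld\n",
               read_attr("/sys/block/sda/queue/physical_block_size"));
        printf("minimum I/O size:    %ld\n",
               read_attr("/sys/block/sda/queue/minimum_io_size"));
        printf("optimal I/O size:    %ld\n",
               read_attr("/sys/block/sda/queue/optimal_io_size"));
        printf("alignment offset:    %ld\n",
               read_attr("/sys/block/sda/alignment_offset"));
        return 0;
}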

--
Martin K. Petersen Oracle Linux Engineering


2009-09-01 12:44:12

by Pavel Machek

Subject: Re: Data integrity built into the storage stack

Hi!

> I do agree that we have to be more prepared for collateral damage
> scenarios. As we discussed at LS, we have 4KB drives coming out that can
> invalidate previously acknowledged I/Os if they get a subsequent write
> failure on a sector. And there's also the issue of fractured writes

Hmmm, the future will be interesting.

'ext3 expects disks to behave like disks from 1995' (alarming).

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(Czech, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-01 13:18:09

by jim owens

Subject: Re: Data integrity built into the storage stack

Pavel Machek wrote:
> Hi!
>
>> I do agree that we have to be more prepared for collateral damage
>> scenarios. As we discussed at LS, we have 4KB drives coming out that can
>> invalidate previously acknowledged I/Os if they get a subsequent write
>> failure on a sector. And there's also the issue of fractured writes
>
> Hmmm, the future will be interesting.
>
> 'ext3 expects disks to behave like disks from 1995' (alarming).

NO... stop saying "ext3". All file systems expect that
what the disk tells us is the "sector size" (now known by
disk vendors as "block size") is "atomic".

The problem is not when they say "4096 bytes is my block".

The problem Martin is talking about is that since most
filesystems expect and work with legacy 512-byte sectors,
the disk vendors report "512 is my block" and do the
read-modify-write merge to their real 4096-byte physical
sectors themselves.
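
A sketch of what that emulation amounts to; media_read() and
media_write() are stand-ins for the drive's internal media access,
not a real interface:

#include <stdint.h>
#include <string.h>

#define PHYS_SECTOR 4096
#define LOG_SECTOR   512

/*
 * A logical 512-byte write to a 4096-byte physical sector forces a
 * read-modify-write. A power failure between the read and the final
 * write can corrupt the seven untouched logical sectors that share
 * the physical one, i.e. previously acknowledged data.
 */
void emulated_write(uint64_t lba, const uint8_t *data,
                    void (*media_read)(uint64_t, uint8_t *),
                    void (*media_write)(uint64_t, const uint8_t *))
{
        uint8_t phys[PHYS_SECTOR];
        uint64_t phys_no = lba / (PHYS_SECTOR / LOG_SECTOR);
        size_t offset = (size_t)(lba % (PHYS_SECTOR / LOG_SECTOR)) * LOG_SECTOR;

        media_read(phys_no, phys);                /* read the whole 4K sector */
        memcpy(phys + offset, data, LOG_SECTOR);  /* patch in the 512 bytes   */
        media_write(phys_no, phys);               /* write the 4K sector back */
}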

This is not "bad drive vendor" either, it is the price
of progress while supporting legacy expectations.

jim

2009-09-01 20:05:14

by Pavel Machek

Subject: Re: Data integrity built into the storage stack

On Tue 2009-09-01 09:18:07, jim owens wrote:
> Pavel Machek wrote:
>> Hi!
>>
>>> I do agree that we have to be more prepared for collateral damage
>>> scenarios. As we discussed at LS, we have 4KB drives coming out that can
>>> invalidate previously acknowledged I/Os if they get a subsequent write
>>> failure on a sector. And there's also the issue of fractured writes
>>
>> Hmmm, the future will be interesting.
>>
>> 'ext3 expects disks to behave like disks from 1995' (alarming).
>
> NO... stop saying "ext3". All file systems expect that
> what the disk tells us is the "sector size" (now known by
> disk vendors as "block size") is "atomic".

Yep, but ext3 disables barriers by default, so it has more than a
block-size issue :-(.
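
For reference, barriers can be turned on explicitly with the
barrier=1 mount option (device and mount point below are just
examples), or with the equivalent line in /etc/fstab:

        mount -o barrier=1 /dev/sda1 /mnt
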
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(Czech, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html