2009-08-29 21:23:50

by Greg Freemyer

Subject: Data integrity built into the storage stack [was: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)]

I've read a fair amount of the various threads discussing sector /
data corruption of flash and raid devices, but by no means all.

It seems to me the key thing Pavel is highlighting is that many
storage devices / arrays have reasonably common failure modes where
data corruption is silently introduced into stable data. As has been
mentioned elsewhere in the thread, the scsi spec recently gained a
"data integrity" option that would allow these corruptions to at least
be detected if the option were in use.
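
For reference, that option (T10 DIF/DIX) appends 8 bytes of protection
information to every 512-byte sector: a 16-bit guard tag (a CRC of the
sector data), a 16-bit application tag, and a 32-bit reference tag
(normally the low 32 bits of the target LBA). A rough C sketch of the
layout, with field names made up purely for illustration:

/* Illustrative layout of a T10 DIF protection information tuple:
 * 8 bytes appended to every 512-byte sector. */
#include <stdint.h>
#include <stdio.h>

struct dif_tuple {
        uint16_t guard_tag;   /* CRC of the 512 data bytes */
        uint16_t app_tag;     /* free for the initiator/target to use */
        uint32_t ref_tag;     /* normally the low 32 bits of the LBA */
} __attribute__((packed));

int main(void)
{
        printf("protection info per 512-byte sector: %zu bytes\n",
               sizeof(struct dif_tuple));
        return 0;
}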

Regardless, administrators have known forever that a bad cable / bad
ram / bad controller / etc. can cause data written to a hard drive to
be written with corrupted values that will not cause a media error on
read.

But most of us assume once we get data on a storage medium and verify
it, we can read it at some future point and it will either have the
correct data values, or it will have a media error. If there is a
media error we know to get out our backups.

The scary aspect of this to me is that with the failure modes Pavel
has brought up, valid data is being written to the storage medium. It
can be verified via hash, etc. But then at some point in the future,
the data can just change in a more or less random way even though it
is not being modified / overwritten.
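
The only defense I know of at that layer is to record a strong hash
while the data is known to be good and re-check it periodically. A
minimal sketch, assuming OpenSSL's SHA-256 routines are available
(link with -lcrypto); it detects the drift after the fact but does not
prevent it:

/* Hash a file so it can be re-checked later; silent corruption of
 * "stable" data shows up as a digest mismatch on a future run. */
#include <openssl/sha.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        FILE *f = fopen(argv[1], "rb");
        if (!f) {
                perror(argv[1]);
                return 1;
        }

        SHA256_CTX ctx;
        SHA256_Init(&ctx);

        unsigned char buf[65536];
        size_t n;
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
                SHA256_Update(&ctx, buf, n);
        fclose(f);

        unsigned char digest[SHA256_DIGEST_LENGTH];
        SHA256_Final(digest, &ctx);

        for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
                printf("%02x", digest[i]);
        printf("  %s\n", argv[1]);
        return 0;
}

This is essentially what sha256sum already does; the point is that the
check has to be repeated over time, because a successful verification
today says nothing about next year.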

I think a good phrase to describe this is "silent corruption of stable
data on a permanent storage medium". I'm sure that silent data
corruption can happen on a traditional hard drive, but I suspect the
odds are so slim most of us won't encounter it in a lifetime. The
failure modes Pavel is describing seem much more common, and I for one
was not familiar with them prior to this set of threads starting.

It seems to me a file system neutral document describing "silent
corruption of stable data on a permanent storage medium" would be
appropriate. Then the linux kernel can start to be hardened to
properly respond to situations where the data read is not the data
written.

We already have the scsi data integrity patches that went in last
winter and I believe fit into the storage stack below the filesystem
layer.

I don't believe most of the storage stack has been modified to
add any data integrity features that would help to ensure the data
read is the data written, but this whole series of threads highlights
to me that a significant effort should go into increasing the data
integrity capability of the linux storage stack.

I do believe there is a patch floating around for device mapper to add
some integrity capability.

Greg
--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
Preservation and Forensic processing of Exchange Repositories White Paper -
<http://www.norcrossgroup.com/forms/whitepapers/tng_whitepaper_fpe.html>

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com


2009-08-30 00:35:06

by Rob Landley

Subject: Re: Data integrity built into the storage stack [was: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)]

On Saturday 29 August 2009 16:23:50 Greg Freemyer wrote:
> I've read a fair amount of the various threads discussing sector /
> data corruption of flash and raid devices, but by no means all.
>
> It seems to me the key thing Pavel is highlighting is that many
> storage devices / arrays have reasonably common failure modes where
> data corruption is silently introduced into stable data. As has been
> mentioned elsewhere in the thread, the scsi spec recently gained a
> "data integrity" option that would allow these corruptions to at least
> be detected if the option were in use.
>
> Regardless, administrators have known forever that a bad cable / bad
> ram / bad controller / etc. can cause data written to a hard drive to
> be written with corrupted values that will not cause a media error on
> read.

Bad ram can do anything if you don't have ECC memory, sure. In my admittedly
limited experience, bad controllers tend not to fail _quietly_, with a problem
writing just this one sector.

I personally have had a tropism for software raid because over the years I've
seen more than one instance of data loss when a proprietary hardware raid card
went bad after years of service and the company couldn't find a sufficiently
similar replacement for the obsolete part capable of reading that strange
proprietary disk format the card was using. (Dell was notorious for this once
upon a time. These days it's a bit more standardized, but I still want to be
sure that I _can_ get the data off from a straight passthrough arrangement
before being _happy_ about it.)

As for bad cables, I believe ATA/33 and higher have checksummed the data going
across the cable for most of a decade now, at least for DMA transfers. (Don't
ask me about scsi, I mostly didn't use it.)

You've got a 1 in 4 billion chance of a 32 bit checksum magically working out
even with corrupted data, of course, but that's a fluke failure and not a
design problem. And if it got enough failures it would downshift the speed or
drop to PIO mode, so it was possible to detect that your hardware was flaky.
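
The 1 in 4 billion is just 2^-32, for what it's worth. A toy
demonstration using zlib's crc32() (link with -lz); zlib's polynomial
merely stands in for whatever the ATA link layer actually uses:

/* A single flipped bit changes the CRC-32; a random corruption has
 * roughly a 1 in 2^32 chance of keeping it the same. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
        unsigned char sector[512];
        memset(sector, 0xA5, sizeof(sector));

        unsigned long good = crc32(0L, sector, sizeof(sector));

        sector[100] ^= 0x01;    /* simulate one corrupted bit */
        unsigned long bad = crc32(0L, sector, sizeof(sector));

        printf("crc before: %08lx  after bit flip: %08lx  %s\n",
               good, bad, good == bad ? "MISSED" : "detected");
        printf("odds of a fluke match: 1 / 2^32 = %.2e\n",
               1.0 / 4294967296.0);
        return 0;
}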

These days I'm pretty sure SATA and USB2 are both checksumming the data going
across the cable, because the PHY transceivers those use are descended from
the PHY transceivers originally developed for gigabit ethernet.

PC hardware has always been exactly as cheap and crappy as it could get away
with, but that's a lot less crappy at gigabit speeds and terabyte sizes than
it was in the 16 bit ISA days. We'd be overwhelmed with all the failures
otherwise. (Note that the crappiness of USB flash keys is actually
_surprising_ to some of us; the severity and ease of triggering these failure
modes are beyond what we've come to expect.)

> It seems to me a file system neutral document describing "silent
> corruption of stable data on a permanent storage medium" would be
> appropriate. Then the linux kernel can start to be hardened to
> properly respond to situations where the data read is not the data
> written.

According to http://lwn.net/Articles/342892/ some of the cool things about
btrfs are:

A) everything is checksummed, because inodes and dentries and data extents are
all just slightly differently tagged entries in one big tree, and every entry
in the tree is checksummed.

2) It has backreferences so you can find most entries from more than one place
if you have to rebuild a damaged tree.

III) The tree update algorithms are lockless so you can potentially run an
fsck/defrag on the sucker in the background, which can among other things
re-read the old data and make sure the checksums match. (So your recommended
regular fsck can in fact be a low priority cron job.)

That wouldn't prevent you from losing data to this sort of corruption (nothing
would), but it does give you potentially better ways to find and deal with it.
Heck, just looking at the stderr output of a simple:

find / -type f -print0 | xargs -0 -n 1 cat > /dev/null

could potentially tell you something useful if the filesystem is giving you
read errors when the extent checksums don't match. And an rsync could
reliably get read errors (and abort the backup) due to checksum mismatch
instead of copying spans of zeroes over your archival copy.
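
The same verify pass in a few lines of C, for the sake of argument:
read everything back and treat any read error as fatal rather than
skipping it or zero-filling. This is only a sketch, not a backup tool:

/* Re-read files end to end; with a checksumming filesystem underneath,
 * a mismatch between stored and computed checksums surfaces here as a
 * read error instead of silently wrong data. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int verify_readable(const char *path)
{
        int fd = open(path, O_RDONLY);
        if (fd < 0) {
                fprintf(stderr, "%s: open: %s\n", path, strerror(errno));
                return -1;
        }

        char buf[65536];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)
                ;       /* data discarded; we only care that it reads */

        if (n < 0)
                fprintf(stderr, "%s: read: %s\n", path, strerror(errno));
        close(fd);
        return n < 0 ? -1 : 0;
}

int main(int argc, char **argv)
{
        int rc = 0;

        for (int i = 1; i < argc; i++)
                if (verify_readable(argv[i]) < 0)
                        rc = 1; /* flag the failure instead of archiving bad data */
        return rc;
}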

So far this thread reads to _me_ as an implicit endorsement of btrfs. But any
suggestion that one filesystem might handle this problem better than
another has so far been taken as personal attacks against people's babies, so
I haven't asked...

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2009-09-01 05:19:53

by Martin K. Petersen

Subject: Re: Data integrity built into the storage stack

>>>>> "Greg" == Greg Freemyer <[email protected]> writes:

Greg> We already have the scsi data integrity patches that went in last
Greg> winter and I believe fit into the storage stack below the
Greg> filesystem layer.

The filesystems can actually use it. It's exposed at the bio level.


Greg> I do believe there is a patch floating around for device mapper to
Greg> add some integrity capability.

The patch is in mainline. It allows passthrough so the filesystems can
access the integrity features. But DM itself doesn't use any of them;
it merely acts as a conduit.

DIF is inherently tied to the storage device's logical blocks. These are
likely to be smaller than the blocks we're interested in protecting.
However, you could conceivably use the application tag space to add a
checksum with filesystem or MD/DM block size granularity. All the
hooks are there.
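
To make that concrete, here is a purely hypothetical sketch of spreading
a checksum computed over a 4KB block across the 16-bit application tags
of its eight constituent 512-byte sectors. The struct, the names and the
choice of CRC-32 are illustrative only, not the in-kernel interface
(link with -lz):

/* Hypothetical: stash a 4KB-block checksum in the application tags of
 * the block's eight per-sector protection tuples. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define SECTORS_PER_BLOCK 8     /* 4096 / 512 */

struct dif_tuple {              /* 8 bytes per 512-byte sector */
        uint16_t guard_tag;
        uint16_t app_tag;
        uint32_t ref_tag;
};

/* Put a 32-bit block checksum in the first two app tags; the other six
 * could carry a generation number, block address, etc. */
static void tag_block(const unsigned char *block4k,
                      struct dif_tuple pi[SECTORS_PER_BLOCK])
{
        uint32_t csum = (uint32_t)crc32(0L, block4k, 4096);

        pi[0].app_tag = (uint16_t)(csum >> 16);
        pi[1].app_tag = (uint16_t)(csum & 0xffff);
}

int main(void)
{
        unsigned char block[4096];
        struct dif_tuple pi[SECTORS_PER_BLOCK];

        memset(block, 0x5a, sizeof(block));
        memset(pi, 0, sizeof(pi));
        tag_block(block, pi);

        printf("block csum %04x%04x stored in the app tags of a %d-sector block\n",
               (unsigned)pi[0].app_tag, (unsigned)pi[1].app_tag,
               SECTORS_PER_BLOCK);
        return 0;
}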

The application tag space is pretty much only available on disk
drives--array vendors use it for internal purposes. But in the MD/DM
case we're likely to run on raw disk so that's probably ok.

That said, I really think btrfs is the right answer to many of the
concerns raised in this thread. Everything is properly checksummed and
can be verified at read time.

The strength of DIX/DIF is that we can detect corruption at write time,
while the buffer we care about is still sitting in memory.

So btrfs and DIX/DIF go hand in hand as far as I'm concerned. They
solve different problems but both are squarely aimed at preventing
silent data corruption.

I do agree that we do have to be more prepared for collateral damage
scenarios. As we discussed at LS we have 4KB drives coming out that can
invalidate previously acknowledged I/Os if they get a subsequent write
failure on a sector. And there's also the issue of fractured writes
when talking to disk arrays. That's really what my I/O topology changes
were all about: correctness. The fact that they may increase
performance is nice but that was not the main motivator.
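
Those topology hints are exported to user space as sysfs attributes, so
mkfs and partitioning tools can align to the physical block size. A
quick sketch that reads them (the device name is only an example):

/* Read the I/O topology a block device exports via sysfs (2.6.31+).
 * "sda" is just an example device name. */
#include <stdio.h>

static long read_attr(const char *path)
{
        long val = -1;
        FILE *f = fopen(path, "r");

        if (f) {
                if (fscanf(f, "%ld", &val) != 1)
                        val = -1;
                fclose(f);
        }
        return val;
}

int main(void)
{
        printf("logical block size:  %ld\n",
               read_attr("/sys/block/sda/queue/logical_block_size"));
        printf("physical block size: %ld\n",
               read_attr("/sys/block/sda/queue/physical_block_size"));
        printf("alignment offset:    %ld\n",
               read_attr("/sys/block/sda/alignment_offset"));
        return 0;
}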

--
Martin K. Petersen Oracle Linux Engineering

2009-09-01 12:44:04

by Pavel Machek

Subject: Re: Data integrity built into the storage stack

Hi!

> I do agree that we do have to be more prepared for collateral damage
> scenarios. As we discussed at LS we have 4KB drives coming out that can
> invalidate previously acknowledged I/Os if they get a subsequent write
> failure on a sector. And there's also the issue of fractured writes

Hmmm, the future will be interesting.

'ext3 expects disks to behave like disks from 1995' (alarming).

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-01 13:18:07

by jim owens

Subject: Re: Data integrity built into the storage stack

Pavel Machek wrote:
> Hi!
>
>> I do agree that we do have to be more prepared for collateral damage
>> scenarios. As we discussed at LS we have 4KB drives coming out that can
>> invalidate previously acknowledged I/Os if they get a subsequent write
>> failure on a sector. And there's also the issue of fractured writes
>
> Hmmm, the future will be interesting.
>
> 'ext3 expects disks to behave like disks from 1995' (alarming).

NO... stop saying "ext3". All file systems expect that
what the disk tells us is the "sector size" (now known by
disk vendors as "block size") is "atomic".

The problem is not when they say 4096 bytes is my block.

The problem Martin is talking about is that since most
filesystems expect and work with legacy 512-byte sectors,
the disk vendors report "512 is my block" and do the merge
to their real 4096-byte physical sectors themselves.

This is not "bad drive vendor" either; it is the price
of progress while supporting legacy expectations.
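
A toy illustration of the cost: with 512-byte logical sectors emulated
on 4096-byte physical sectors, any write that is not aligned to, and a
whole multiple of, eight logical sectors forces the drive into an
internal read-modify-write of a physical sector. The numbers below are
only examples:

/* Does a write expressed in 512-byte logical sectors force the drive
 * to read-modify-write its 4096-byte physical sectors? */
#include <stdio.h>

#define LOGICAL  512
#define PHYSICAL 4096
#define RATIO    (PHYSICAL / LOGICAL)   /* 8 logical sectors per physical */

static int needs_rmw(unsigned long long lba, unsigned long long nsectors)
{
        /* A misaligned start or a length that is not a whole number of
         * physical sectors touches partial physical sectors. */
        return (lba % RATIO) != 0 || (nsectors % RATIO) != 0;
}

int main(void)
{
        printf("lba 0, 8 sectors: %s\n",
               needs_rmw(0, 8) ? "read-modify-write" : "whole-sector write");
        printf("lba 3, 1 sector:  %s\n",
               needs_rmw(3, 1) ? "read-modify-write" : "whole-sector write");
        return 0;
}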

jim

2009-09-01 13:37:37

by Pavel Machek

Subject: Re: Data integrity built into the storage stack

On Tue 2009-09-01 09:18:07, jim owens wrote:
> Pavel Machek wrote:
>> Hi!
>>
>>> I do agree that we do have to be more prepared for collateral damage
>>> scenarios. As we discussed at LS we have 4KB drives coming out that can
>>> invalidate previously acknowledged I/Os if they get a subsequent write
>>> failure on a sector. And there's also the issue of fractured writes
>>
>> Hmmm, the future will be interesting.
>>
>> 'ext3 expects disks to behave like disks from 1995' (alarming).
>
> NO... stop saying "ext3". All file systems expect that
> what the disk tells us is the "sector size" (now known by
> disk vendors as "block size") is "atomic".

Yep, but ext3 disables barriers by default. So it has more than a
block size issue :-(.
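
For anyone who wants them back, barriers can be turned on with the
barrier=1 mount option, normally via /etc/fstab; a minimal remount
sketch in C, with an example device and mount point:

/* Remount an ext3 filesystem with write barriers enabled.  The device
 * and mount point are examples; normally you would just add
 * "barrier=1" to the options in /etc/fstab. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        if (mount("/dev/sda1", "/mnt/data", "ext3",
                  MS_REMOUNT, "barrier=1") != 0) {
                perror("mount");
                return 1;
        }
        return 0;
}
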
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html