From: Rob Landley
Subject: Re: Data integrity built into the storage stack [was: Re: [testcase] test your fs/storage stack (was Re: [patch] ext2/3: document conditions when reliable operation is possible)]
Date: Sat, 29 Aug 2009 19:35:06 -0500
Message-ID: <200908291935.08831.rob@landley.net>
References: <87f94c370908291423ub92922ft2cceab9e34ac6207@mail.gmail.com>
Mime-Version: 1.0
Content-Type: Text/Plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Cc: david@lang.hm, Pavel Machek, Ric Wheeler, Theodore Tso, Florian Weimer, Goswin von Brederlow, kernel list, Andrew Morton, mtk.manpages@gmail.com, rdunlap@xenotime.net, linux-doc@vger.kernel.org, linux-ext4@vger.kernel.org, corbet@lwn.net, "Martin K. Petersen"
To: Greg Freemyer
In-Reply-To: <87f94c370908291423ub92922ft2cceab9e34ac6207@mail.gmail.com>
Content-Disposition: inline
Sender: linux-doc-owner@vger.kernel.org
List-Id: linux-ext4.vger.kernel.org

On Saturday 29 August 2009 16:23:50 Greg Freemyer wrote:
> I've read a fair amount of the various threads discussing sector /
> data corruption of flash and raid devices, but by no means all.
>
> It seems to me the key thing Pavel is highlighting is that many
> storage devices / arrays have reasonably common failure modes where
> data corruption is silently introduced to stable data. I have seen it
> mentioned, but the scsi spec. recently got a "data integrity" option
> that would allow these corruptions to at least be detected if the
> option were in use.
>
> Regardless, administrators have known forever that a bad cable / bad
> ram / bad controller / etc. can cause data written to a hard drive to
> be written with corrupted values that will not cause a media error on
> read.

Bad ram can do anything if you don't have ECC memory, sure. In my admittedly limited experience, bad controllers tend not to fail _quietly_, with a problem writing just this one sector.
I personally have had a tropism for software raid, because over the years I've seen more than one instance of data loss when a proprietary hardware raid card went bad after years of service and the company couldn't find a sufficiently similar replacement for the obsolete part capable of reading the strange proprietary disk format the card was using. (Dell was notorious for this once upon a time. These days it's a bit more standardized, but I still want to be sure that I _can_ get the data off from a straight passthrough arrangement before being _happy_ about it.)

As for bad cables, I believe ATA/33 and higher have checksummed the data going across the cable for most of a decade now, at least for DMA transfers. (Don't ask me about scsi, I mostly didn't use it.) You've got a 1 in 4 billion chance of a 32 bit checksum magically working out even with corrupted data, of course, but that's a fluke failure and not a design problem. And if the link got enough failures it would downshift the speed or drop to PIO mode, so it was possible to detect that your hardware was flaky.

These days I'm pretty sure SATA and USB2 both checksum the data going across the cable, because the PHY transceivers they use are descended from the PHY transceivers originally developed for gigabit ethernet. PC hardware has always been exactly as cheap and crappy as it could get away with, but that's a lot less crappy at gigabit speeds and terabyte sizes than it was in the 16 bit ISA days. We'd be overwhelmed with all the failures otherwise. (Note that the crappiness of USB flash keys is actually _surprising_ to some of us; the severity and ease of triggering these failure modes are beyond what we've come to expect.)

> It seems to me a file system neutral document describing "silent
> corruption of stable data on permanent storage medium" would be
> appropriate. Then the linux kernel can start to be hardened to
> properly respond to situations where the data read is not the data
> written.
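(To make the "fluke failure" point concrete: a quick sketch of why a CRC catches ordinary corruption. This uses the stock cksum CRC-32 purely as a stand-in for whatever CRC the link layer actually uses, and the /tmp paths are throwaway examples.)

```shell
# Sketch only: cksum's CRC-32 stands in for the link-layer CRC; the
# /tmp/crc-* paths are throwaway example files.
printf 'some payload data' > /tmp/crc-orig
cp /tmp/crc-orig /tmp/crc-corrupt
# Corrupt the first byte in place (conv=notrunc keeps the rest intact).
printf '\377' | dd of=/tmp/crc-corrupt bs=1 count=1 conv=notrunc 2>/dev/null
# The two CRCs differ; corrupted data only sneaks through when it
# happens to hash to the same 32-bit value (~1 in 4 billion).
cksum /tmp/crc-orig /tmp/crc-corrupt
```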
According to http://lwn.net/Articles/342892/ some of the cool things about btrfs are:

A) Everything is checksummed, because inodes and dentries and data extents are all just slightly differently tagged entries in one big tree, and every entry in the tree is checksummed.

2) It has backreferences, so you can find most entries from more than one place if you have to rebuild a damaged tree.

III) The tree update algorithms are lockless, so you can potentially run an fsck/defrag on the sucker in the background, which can among other things re-read the old data and make sure the checksums match. (So your recommended regular fsck can in fact be a low priority cron job.)

That wouldn't prevent you from losing data to this sort of corruption (nothing would), but it does give you potentially better ways to find and deal with it. Heck, just looking at the stderr output of a simple:

  find / -type f -print0 | xargs -0 -n 1 cat > /dev/null

could potentially tell you something useful if the filesystem gives you read errors when the extent checksums don't match. And an rsync could reliably get read errors (and abort the backup) due to a checksum mismatch, instead of copying spans of zeroes over your archival copy.

So far this thread reads to _me_ as an implicit endorsement of btrfs. But any suggestion that one filesystem might handle this problem better than another has so far been taken as a personal attack against people's babies, so I haven't asked...

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds
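[Editorial note: the low-priority cron-job re-read suggested above could be wrapped as a script along these lines. This is a sketch, not anything from the original thread: the scratch directory and log path are arbitrary example names, it's shown against a scratch tree rather than /, and on a checksumming filesystem a checksum mismatch would be assumed to surface as a read error in the log.]

```shell
# Sketch of the "re-read everything in the background" idea: sweep a
# tree at low priority and log any read errors. SCRUB_ROOT and LOG are
# arbitrary example paths; point SCRUB_ROOT at / (and add ionice -c 3
# where available) for a real scrub.
SCRUB_ROOT=${SCRUB_ROOT:-/tmp/scrub-demo}
LOG=${LOG:-/tmp/scrub-errors.log}
mkdir -p "$SCRUB_ROOT"
echo 'readable test file' > "$SCRUB_ROOT/file1"
: > "$LOG"
# Re-read every regular file, discarding the data; only errors survive.
nice -n 19 sh -c \
  "find '$SCRUB_ROOT' -xdev -type f -print0 | xargs -0 -n 1 cat > /dev/null" \
  2>> "$LOG"
if [ -s "$LOG" ]; then
  echo "read errors logged in $LOG"
else
  echo "clean sweep"
fi
```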