Date: Thu, 21 Jun 2007 10:40:50 -0400 (EDT)
From: Justin Piszcz
To: Mattias Wadenstein
cc: Neil Brown, David Chinner, Avi Kivity, david@lang.hm,
    linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org
Subject: Re: limits on raid

On Thu, 21 Jun 2007, Mattias Wadenstein wrote:

> On Thu, 21 Jun 2007, Neil Brown wrote:
>
>> I have that - apparently naive - idea that drives use strong checksums,
>> and will never return bad data, only good data or an error.  If this
>> isn't right, then it would really help to understand what the causes of
>> other failures are before working out how to handle them....
>
> In theory, that's how storage should work.  In practice, silent data
> corruption does happen.  If not from the disks themselves, then somewhere
> along the path of cables, controllers, drivers, buses, etc.  Adding fcal
> brings in even more sources of failure, but usually you can avoid SANs
> (if you care about your data).
>
> Here are a couple of the issues I've seen myself:
>
> A hw-raid controller returning every 64th bit as 0, no matter what's on
> disk, with no error condition at all.  (I've also heard from a colleague
> about the same thing on every 64k, but haven't seen that myself.)
>
> An fcal switch occasionally resetting, garbling the blocks in transit
> with random data.  Lost a few TB of user data that way.
>
> Add to this the random driver breakage that happens now and then.  I've
> also had a few broken filesystems due to in-memory corruption from bad
> RAM; I'm not sure there is much hope of fixing that, though.
>
> Also, this presentation is pretty worrying on the frequency of silent
> data corruption:
>
> https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257
>
> /Mattias Wadenstein

Very interesting slides/presentation; I'm going to watch it shortly.
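
For anyone who wants to poke at this from user space, here is a minimal
sketch (my own illustration, not any existing tool) of end-to-end
verification with zlib's crc32(): print a checksum per 64 KiB block while
the data is known good, keep the output, and diff it against a later run.
Any mismatch is data the storage stack handed back silently changed, no
matter whether the disk, controller, cable, or driver was at fault.

/*
 * blockcrc.c - print a CRC32 per 64 KiB block of a file.
 * Minimal sketch only; assumes zlib is installed.
 * Build with: cc -o blockcrc blockcrc.c -lz
 */
#include <stdio.h>
#include <zlib.h>

#define BLOCK_SIZE (64 * 1024)  /* 64 KiB blocks; arbitrary choice */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 2;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }

    unsigned char buf[BLOCK_SIZE];
    unsigned long block = 0;
    size_t n;

    /* One line per block: "<block number> <crc32>" */
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
        uLong crc = crc32(0L, buf, (uInt)n);
        printf("%lu %08lx\n", block++, crc);
    }

    fclose(f);
    return 0;
}

Something like "./blockcrc /data/archive.tar > archive.crc" at write time,
then "diff <(./blockcrc /data/archive.tar) archive.crc" after a rebuild or
scrub, would flag exactly the kind of every-64th-bit or garbled-block
damage described above, without any help from the drive's own ECC.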