Date: Thu, 21 Jun 2007 16:39:36 +1000
From: David Chinner
To: Neil Brown
Cc: David Chinner , Avi Kivity , david@lang.hm, linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org
Subject: Re: limits on raid
Message-ID: <20070621063936.GT85884050@sgi.com>
References: <18034.479.256870.600360@notabene.brown> <18034.3676.477575.490448@notabene.brown> <467273AB.9010202@argo.co.il> <18035.3009.568832.785308@notabene.brown> <20070618045759.GD85884050@sgi.com> <18041.59628.370832.633244@notabene.brown>
In-Reply-To: <18041.59628.370832.633244@notabene.brown>

On Thu, Jun 21, 2007 at 12:56:44PM +1000, Neil Brown wrote:
> On Monday June 18, dgc@sgi.com wrote:
> > On Sat, Jun 16, 2007 at 07:59:29AM +1000, Neil Brown wrote:
> > > Combining these thoughts, it would make a lot of sense for the
> > > filesystem to be able to say to the block device "That block looks
> > > wrong - can you find me another copy to try?".  That is an example of
> > > the sort of closer integration between filesystem and RAID that would
> > > make sense.
> >
> > I think that this would only be useful on devices that store
> > discrete copies of the blocks on different devices i.e. mirrors. If
> > it's an XOR based RAID, you don't have another copy you can
> > retrieve....
>
> You could reconstruct the block in question from all the other blocks
> (including parity) and see if that differs from the data block read
> from disk...  For RAID6, there would be a number of different ways to
> calculate alternate blocks.  Not convinced that it is actually
> something we want to do, but it is a possibility.

Agreed - it's not as straightforward as a mirror, and it kind of
assumes that you have software RAID.

/me had his head stuck in hw raid land ;)

> I have that - apparently naive - idea that drives use strong checksum,
> and will never return bad data, only good data or an error.  If this
> isn't right, then it would really help to understand what the cause of
> other failures are before working out how to handle them....

The drive is not the only source of errors, though.  You could
have a path problem that is corrupting random bits between the drive
and the filesystem. So the data on the disk might be fine, and reading
it via a redundant path might be all that is needed.

Yeah, so I can see how having a different retry semantic would be a
good idea. i.e. if we do a READ_VERIFY I/O, the underlying device
attempts to verify the data is good in as many ways as possible
before returning the verified data or an error.

I guess a filesystem read would become something like this:

        verified = 0
        error = read(block)
        if (error) {
read_verify:
                error = read_verify(block)
                if (error) {
                        OMG THE SKY IS FALLING
                        return error
                }
                verified = 1
        }
        /* check contents */
        if (contents are bad) {
                if (!verified)
                        goto read_verify
                OMG THE SKY HAS FALLEN
                return -EIO
        }

Is this the sort of error handling and re-issuing of I/O that you had
in mind?

FWIW, I don't think this really removes the need for a filesystem to
be able to keep multiple copies of stuff about. If the copy(s) on a
device are gone, you've still got to have another copy somewhere
else to get it back...

Cheers,

Dave.
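P.S. For anyone who wants to poke at the control flow, the pseudocode above can be written as a small runnable sketch. This is only an illustration under stated assumptions: READ_VERIFY is a proposed semantic, not an existing interface, and the names `fs_read`, `read`, `read_verify` and `contents_ok` are hypothetical stand-ins for the plain read, the verified read and the filesystem's own sanity check. Python is used purely so it can be run; it says nothing about how a real filesystem would implement this.

```python
import errno


def fs_read(block, read, read_verify, contents_ok):
    """Read `block`, escalating to a verified read when needed.

    read, read_verify: hypothetical callables that return the block's
        data or raise OSError on failure.
    contents_ok: hypothetical callable, the filesystem's own check of
        the returned data (e.g. a metadata checksum).
    """
    verified = False
    try:
        data = read(block)
    except OSError:
        # Plain read failed: ask the device to try every way it knows.
        # If this fails as well, the OSError propagates to the caller.
        data = read_verify(block)
        verified = True

    if contents_ok(data):
        return data

    if not verified:
        # Contents look wrong, but only the fast path has been tried.
        data = read_verify(block)
        if contents_ok(data):
            return data

    # Even the verified read returned bad contents: give up with EIO.
    raise OSError(errno.EIO, "block contents bad after verified read")
```

Each block gets at most one verified read, so the loop implied by the goto cannot spin. Once the verified data also fails the content check, the only recovery left is a copy held somewhere else.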
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group