Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754630AbXJZKp6 (ORCPT ); Fri, 26 Oct 2007 06:45:58 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751534AbXJZKpd (ORCPT ); Fri, 26 Oct 2007 06:45:33 -0400 Received: from relay.2ka.mipt.ru ([194.85.82.65]:43587 "EHLO 2ka.mipt.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751398AbXJZKp2 (ORCPT ); Fri, 26 Oct 2007 06:45:28 -0400 Date: Fri, 26 Oct 2007 14:44:59 +0400 From: Evgeniy Polyakov To: Kyle Moffett Cc: Andreas Dilger , Jeff Garzik , netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org Subject: Re: Distributed storage. Move away from char device ioctls. Message-ID: <20071026104459.GA15295@2ka.mipt.ru> References: <20070914185429.GA9439@2ka.mipt.ru> <46EADC02.9070409@garzik.org> <20070915122957.GA25752@2ka.mipt.ru> <20070915172446.GC2990@schatzie.adilger.int> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.9i Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3473 Lines: 75 Returning back to this, since block based storage, which can act as a shared storage/transport layer, is ready with 5'th release of the DST. My couple of notes on proposed data distribution algorithm in FS. On Sun, Sep 16, 2007 at 03:07:11AM -0400, Kyle Moffett (mrmacman_g4@mac.com) wrote: > >I actually think there is a place for this - and improvements are > >definitely welcome. Even Lustre needs block-device level > >redundancy currently, though we will be working to make Lustre- > >level redundancy available in the future (the problem is WAY harder > >than it seems at first glance, if you allow writeback caches at the > >clients and servers). > > I really think that to get proper non-block-device-level filesystem > redundancy you need to base it on something similar to the GIT > model. Data replication is done in specific-sized chunks indexed by > SHA-1 sum and you actually have a sort of "merge algorithm" for when > local and remote changes differ. The OS would only implement a very > limited list of merge algorithms, IE one of: > > (A) Don't merge, each client gets its own branch and merges are manual > (B) Most recent changed version is made the master every X-seconds/ > open/close/write/other-event. > (C) The tree at X (usually a particular client/server) is always > used as the master when there are conflicts. This looks like a good way to work with offline clients (where I recall Coda), after offline node modified data, it should be merged back to the cluster with different algorithms. Data (supposed to be) written to the failed node during its offline time will be resynced from other nodes when failed one is online, there are no problems and/or special algorithms to be used here. Filesystem replication is not a 100% 'git way' - git tree contains already combined objects - i.e. the last blob for given path does not contain its history, only ready to be used data, while filesystem, especially that one which requires simultaneous write from different threads/nodes, should implement copy-on-write semantics, essentially putting all new data (git commit) to the new location and then collect it from different extents to present a ready file. At least that is how I see the filesystem I'm working on. ... > There's a lot of other technical details which would need resolution > in an actual implementation, but this is enough of a summary to give > you the gist of the concept. Most likely there will be some major > flaw which makes it impossible to produce reliably, but the concept > contains the things I would be interested in for a real "networked > filesystem". Git semantics and copy-on-write has actually quite a lot in common (on high enough level of abstraction), but SHA1 index is not a possible issue in filesystem - even besides amount of data to be hashed before key is ready. Key should also contain enough information about what underlying data is - git does not store that information (tree, blob or whatever) in its keys, since it does not require it. At least that is how I see it to be implemented. Overall I see this new project as a true copy-on-write FS. Thanks. > Cheers, > Kyle Moffett -- Evgeniy Polyakov - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/