Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754737AbXFSHz3 (ORCPT ); Tue, 19 Jun 2007 03:55:29 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751410AbXFSHzK (ORCPT ); Tue, 19 Jun 2007 03:55:10 -0400 Received: from host86-137-96-215.range86-137.btcentralplus.com ([86.137.96.215]:47723 "EHLO hawkeye.stone.uk.eu.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752431AbXFSHzI (ORCPT ); Tue, 19 Jun 2007 03:55:08 -0400 Message-ID: <46778A7A.1080403@hawkeye.stone.uk.eu.org> Date: Tue, 19 Jun 2007 08:49:14 +0100 From: Jack Stone User-Agent: Thunderbird 2.0.0.4 (Windows/20070604) MIME-Version: 1.0 To: Kyle Moffett CC: Bryan Henderson , akpm@linux-foundation.org, alan , hpa@zytor.com, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, viro@zeniv.linux.org.uk, git@vger.kernel.org Subject: Re: Versioning file system References: <6E9A6F9E-8948-40F2-9129-1F1491D49D83@mac.com> In-Reply-To: <6E9A6F9E-8948-40F2-9129-1F1491D49D83@mac.com> X-Enigmail-Version: 0.95.1 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 9104 Lines: 173 Kyle Moffett wrote: > On Jun 18, 2007, at 13:56:05, Bryan Henderson wrote: >>> The question remains is where to implement versioning: directly in >>> individual filesystems or in the vfs code so all filesystems can use it? >> >> Or not in the kernel at all. I've been doing versioning of the types >> I described for years with user space code and I don't remember >> feeling that I compromised in order not to involve the kernel. >> >> Of course, if you want to do it with snapshots and COW, you'll have to >> ask where in the kernel to put that, but that's not a file versioning >> question; it's the larger snapshot question. > > What I think would be particularly interesting in this domain is > something similar in concept to GIT, except in a file-system: > 1) Redundancy is easy, you just ensure that you have at least "N" > distributed copies of each object, where "N" is some function of the > object itself. > 2) Network replication is easy, you look up objects based on the SHA-1 > stored in the parent directory entry and cache them where needed (IE: > make the "N" function above dynamic based on frequency of access on a > given computer). > 3) Snapshots are easy and cheap; an RO snapshot is a tag and an RW > snapshot is a branch. These can be easily converted between. > 4) Compression is easy; you can compress objects based on any > arbitrary configurable criteria and the filesystem will record whether > or not an object is compressed. You can also compress differently when > archiving objects to secondary storage. > 5) Streaming fsck-like verification is easy; ensure the hash name > field matches the actual hash of the object. > 6) Fsck is easy since rollback is trivial, you can always revert to an > older tree to boot and start up services before attempting resurrection > of lost objects and trees in the background. > 7) Multiple-drive or multiple-host storage pools are easy: Think the > git "alternates" file. > 8) Network filesystem load-balancing is easy; SHA-1s are essentially > random so you can just assign SHA-1 prefixes to different systems for > data storage and your data is automatically split up. > > > Other issues: > > Q. How do you deal with block allocation? > A. Same way other filesystems deal with block allocation. > Reference-counting gets tricky, especially across a network, but it's > easy to play it safe with simple cross-network refcount-journalling. > Since the _only_ thing that needs journalling is the refcounts and > block-free data, you need at most a megabyte or two of journal. If in > doubt, it's easy to play it safe and keep an extra refcount around for > an in-the-background consistency check later on. When networked-gitfs > systems crash, you just assume they still have all the refcounts they > had at the moment they died, and compare notes when they start back up > again. If a node has a cached copy of data on its local disk then it > can just nonatomically increment the refcount for that object in its own > RAM (ordered with respect to disk-flushes, of course) and tell its peers > at some point. A node should probably cache most of its working set on > local disk for efficiency; it's trivially verified against updates from > other nodes and provides an easy way to keep refcounts for such data. > If a node increments the refcount on such data and dies before getting > that info out to its peers, then when it starts up again its peers will > just be told that it has a "new" node with insufficient replication and > they will clone it out again properly. For networked > refcount-increments you can do one of 2 things: (1) Tell at least X many > peers and wait for them to sync the update out to disk, or (2) Get the > object from any peer (at least one of whom hopefully has it in RAM) and > save it to local disk with an increased refcount. > > Q. How do you actually delete things? > A. Just replace all the to-be-erased tree and commit objects before a > specified point with "History erased" objects with their SHA-1's > magically set to that of the erased objects. If you want you may delete > only the "tree" objects and leave the commits intact. If you delete a > whole linear segment of history then you can just use a single "History > erased" commit object with its parent pointed to the object before the > erased segment. Probably needs some form of back-reference storage to > make it efficient; not sure how expensive that would be. This would > allow making a bunch of snapshots and purging them logarithmically based > on passage of time. For instance, you might have snapshots of every 5 > minutes for the last hour, every 30 minutes for the last day, every 4 > hours for the last week, every day for the last month, once per week for > the last year, once per month for the last 5 years, and once per year > beyond that. > > That's pretty impressive data-recovery resolution, and it accounts for > only 200 unique commits after it's been running for 10 years. > > Q. How do you archive data? > A. Same as deleting, except instead of a "History erased" object you > would use a "History archived" object with a little bit of string data > to indicate which volume it's stored on (and where on the volume). When > you stick that volume into the system you could easily tell the kernel > to use it as an alternate for the given storage group. > > Q. What enforces data integrity? > A. Ensure that a new tree object and its associated sub objects are on > disk before you delete the old one. Doesn't need any actual full syncs > at all, just barriers. If you replace the tree object before write-out > is complete then just skip writing the old one and write the new one in > its place. > > Q. What consists of a "commit"? > A. Anything the administrator wants to define it as. Useful algorithms > include: "Once per x Mbyte of page dirtying", "Once per 5 min", "Only > when sync() or fsync() are called", "Only when gitfs-commit is called". > You could even combine them: "Every x Mbyte of page dirtying or every 5 > minutes, whichever is shorter (or longer, depending on admin > requirements)". There would also be appropriate syscalls to trigger > appropriate git-like behavior. Network-accessible gitfs would want to > have mechanisms to trigger commits based on activity on other systems > (needs more thought). > > Q. How do you access old versions? > A. Mount another instance of the filesystem with an SHA-1 ID, a > tag-name, or a branch-name in a special mount option. Should be user > accessible with some restrictions (needs more thought). > > Q. How do you deal with conflicts on networked filesystems. > A. Once again, however the administrator wants to deal with them. Options: > 1) Forcibly create a new branch for the conflicted tree. > 2) Attempt to merge changes using the standard git-merge semantics > 3) Merge independent changes to different files and pick one for > changes to the same file > 4) Your Algorithm Here(TM). GIT makes it easy to extend > conflict-resolution. > > Q. How do you deal with little scattered changes in big (or sparse) files? > A. Two questions, two answers: For sparse files, git would need > extending to understand (and hash) the nature of the sparse-ness. For > big files, you should be able to introduce a "compound-file" datatype > and configure git to deal with specific X-Mbyte chunks of it > independently. This might not be a bad idea for native git as well. > Would need system-specific configuration. > > Q. How do you prevent massive data consumption by spurious tiny changes > A. You have a few options: > 1) Configure your commit algorithm as above to not commit so often > 2) Configure a stepped commit-discard algorithm as described above > in the "How do you delete things" question > 3) Archive unused data to secondary storage more often > > Q. What about all the unanswered questions? > A. These are all the ones I could think of off the top of my head but > there are at least a hundred more. I'm pretty sure these are some of > the most significant ones. > > Q. That's a great idea and I'll implement it right away! > A. Yay! (but that's not a question :-D) Good luck and happy hacking. > > Q. That's a stupid idea and would never ever work! > A. Thanks for your useful input! (but that's not a question either) I'm > sure anybody who takes up a project like this will consider such opinions. > > Q. *flamage* > A. I'm glad you have such strong opinions, feel free to to continue to > spam my /dev/null device (and that's also not a question). > > All opinions and comments welcomed. > > Cheers, > Kyle Moffett > > It sounds brilliant and I'd love to have a got at implementing it but I don't know enough (yet :-D) about how git works, a little research is called for I think. Jack - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/