Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S261319AbVECCqs (ORCPT ); Mon, 2 May 2005 22:46:48 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S261315AbVECCqs (ORCPT ); Mon, 2 May 2005 22:46:48 -0400 Received: from fire.osdl.org ([65.172.181.4]:61886 "EHLO smtp.osdl.org") by vger.kernel.org with ESMTP id S261314AbVECCqk (ORCPT ); Mon, 2 May 2005 22:46:40 -0400 Date: Mon, 2 May 2005 19:48:29 -0700 (PDT) From: Linus Torvalds To: Matt Mackall cc: Bill Davidsen , Morten Welinder , Sean , linux-kernel , git@vger.kernel.org Subject: Re: Mercurial 0.4b vs git patchbomb benchmark In-Reply-To: <20050503000011.GA22038@waste.org> Message-ID: References: <20050429165232.GV21897@waste.org> <427650E7.2000802@tmr.com> <20050502223002.GP21897@waste.org> <20050503000011.GA22038@waste.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2748 Lines: 65 On Mon, 2 May 2005, Matt Mackall wrote: > > Umm.. I am _not_ calculating the SHA of the delta itself. That'd be > silly. It's not silly. Meta-data consistency is supremely important. If people can corrupt their metadata in strange an unobservable ways, that's almost as bad as corrupting the data itself. In fact, to some degree it's worse, since you make people trust the thing, but you don't actually guarantee it. So how _do_ you guarantee consistency of a tree and the history that led up to it? And by that I don't mean any of the individual blobs - I realize that it's perfectly valid to just check out every single version, and have the sha1 of that. But how do you guarantee that the sha's you check are the sha's that you saved in the first place, and somebody didn't replace something in the middle? In other words, you need to hash the metadata too. Otherwise how do you consistency-check the _collection_ of files? It's absolutely not enough to just protect single-file content. That doesn't help one whit. It's not what a SCM is all about. You have to protect the state of _multiple_ files, ie the metadata has to be verifiable too. If that meta-data is the index, then the index needs to be protected by a SHA1. In git, that's why we don't just sha1 every blob, but every tree and every commit. That's the thing that gets consistency _beyond_ a single file. > As various people have pointed out, you can hack delta transmission > and file revision indexing on top of git. But to do that, you'll need > to build the same indices that Mercurial has. And you'll need to check > their integrity. No, absolutely not. Building indeces on top of git would be stupid. You can _cache_ deltas, but there's a big difference between a index that actually describes how random blobs go together, and a cache of a delta between two well-specified end-points. And in particular, there is no "consistency" to a delta. You don't need it. Why? Because either the delta is correct, or it isn't. If it's correct, the end result will be the right sha1. If it's not, the end result will be something else. So when you do a "pull" from another repository, you can trivially check whether the delta's you got were valid: did applying them result in the same sha1 that the other repository had? So git really validates the _only_ thing that matters: it validates the state of the data. It doesn't validate anything else, but if validates that one thing very completely indeed. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/