Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S261237AbVECAAs (ORCPT ); Mon, 2 May 2005 20:00:48 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S261238AbVECAAs (ORCPT ); Mon, 2 May 2005 20:00:48 -0400 Received: from waste.org ([216.27.176.166]:33707 "EHLO waste.org") by vger.kernel.org with ESMTP id S261237AbVECAAX (ORCPT ); Mon, 2 May 2005 20:00:23 -0400 Date: Mon, 2 May 2005 17:00:12 -0700 From: Matt Mackall To: Linus Torvalds Cc: Bill Davidsen , Morten Welinder , Sean , linux-kernel , git@vger.kernel.org Subject: Re: Mercurial 0.4b vs git patchbomb benchmark Message-ID: <20050503000011.GA22038@waste.org> References: <20050429165232.GV21897@waste.org> <427650E7.2000802@tmr.com> <20050502223002.GP21897@waste.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.6+20040907i Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4858 Lines: 113 On Mon, May 02, 2005 at 03:49:49PM -0700, Linus Torvalds wrote: > > > - you can drop old objects. > > > > You can't drop old objects without dropping all the changesets that > > refer to them or otherwise being prepared to deal with the broken > > links. [...] > I could write this up in ten minutes. It's really simple. It's still simple in Mercurial, but more importantly Mercurial _won't need it_. Dropping history is a work-around, not a feature. > > > delta models very fundamentally don't support this. > > > > The latter can be done in a pretty straightforward manner in mercurial > > with one pass over the data. But I have a goal to make keeping the > > whole history cheap enough that no one balks at it. > > With delta's, you have two choices: > > - change all the sha1 names (ie a pruned tree would no longer be > compatible with a non-pruned one) > - make the delta part not show up as part of the sha1 name (which means > that it's unprotected). > > which one would you have? Umm.. I am _not_ calculating the SHA of the delta itself. That'd be silly. There are an arbitrary number of ways to calculate a delta between two files. Similarly, there are an arbitrary number of ways to compress a file (gzip has at least 9, not counting all the permutations of flush). The only sensible thing to do is store a hash of the raw text and check it against the fully restored text, because that's what you care about being correct. In Mercurial, deltas are just a storage detail and are effectively completely hidden from everything except the innermost part of the back-end. What's important is that Mercurial knows that A is a revision of B in the backend and thus has enough information to opportunistically attempt to calculate a delta. So if the day ever comes when I want to prune the head of a log, I simply reconstruct the first version to keep, store it in a new file, then append all the deltas, unmodified. And fix up the offsets in the indices. None of the hashes change. > > What is a tree re-linker? Finds duplicate files and hard-links them? > > Ok, that makes some sense. But it's a win on one machine and a lose > > everywhere else. > > Where would it be a loss? Esepcially since with git, it's cheap (you don't > need to compare content to find objects to link - you can just compare > filename listings). Git repositories will be 10x larger than Mercurial everywhere that doesn't benefit from this linking of unrelated trees. That is, folks who aren't running gitbits.net. > > I've added an "hg verify" command to Mercurial. It doesn't attempt to > > fix anything up yet, but it can catch a couple things that git > > probably can't (like file revisions that aren't owned by any > > changeset), namely because there's more metadata around to look at. > > git-fsck-cache catches exactly those kinds of things. And since it checks > pretty much every _single_ assumption in git (which is not a lot, since > git doesn't have a lot of assumptions), I guarantee you that you can't > find any more than it does (the filename ordering is the big missing > piece: I _still_ don't verify that trees are ordered. I've been mentioning > it since the beginning, but I'm lazy). > > In other words, your verifier can't verify anything more. It's entirely > possible that more things can go _wrong_, since you have more indexes, so > your verifier will have more to check, but that's not an advantage, that's > a downside. Uh, no. It's just like a filesystem. Redundancy is what lets you recover. The extra indices are also very useful in their own right: - they let you do easily do delta storage - they let you efficiently do delta transmission - they let you find past revisions of a file in O(1) - they let you efficiently do "annotate" - they let you do smarter merge At least the first four seem fairly critical to me. As various people have pointed out, you can hack delta transmission and file revision indexing on top of git. But to do that, you'll need to build the same indices that Mercurial has. And you'll need to check their integrity. Unfortunately, since the git back-end refuses to know anything about the relation between file revisions, this will all happen in the front end, and you'll have done almost all the work needed to do delta storage without actually getting it. How sad. You'll also likely end up with something quite a bit more complicated than Mercurial because of the extra layering. This all strongly suggests to me that the git back-end is just a little bit too simple. -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/