2005-04-06 15:40:23

by Linus Torvalds

[permalink] [raw]
Subject: Kernel SCM saga..


Ok,
as a number of people are already aware (and in some cases have been
aware over the last several weeks), we've been trying to work out a
conflict over BK usage over the last month or two (and it feels like
longer ;). That hasn't been working out, and as a result, the kernel team
is looking at alternatives.

[ And apparently this just hit slashdot too, so by now _everybody_ knows ]

It's not like my choice of BK has been entirely conflict-free ("No,
really? Do tell! Oh, you mean the gigabytes upon gigabytes of flames we
had?"), so in some sense this was inevitable, but I sure had hoped that it
would have happened only once there was a reasonable open-source
alternative. As it is, we'll have to scramble for a while.

Btw, don't blame BitMover, even if that's probably going to be a very
common reaction. Larry in particular really did try to make things work
out, but it got to the point where I decided that I don't want to be in
the position of trying to hold two pieces together that would need as much
glue as it seemed to require.

We've been using BK for three years, and in fact, the biggest problem
right now is that a number of people have gotten very very picky about
their tools after having used the best. Me included, but in fact the
people that got helped most by BitKeeper usage were often the people
_around_ me who had a much easier time merging with my tree and sending
their trees to me.

Of course, there's also probably a ton of people who just used BK as a
nicer (and much faster) "anonymous CVS" client. We'll get that sorted out,
but the immediate problem is that I'm spending most of my time trying to see
what the best way to co-operate is.

NOTE! BitKeeper isn't going away per se. Right now, the only real thing
that has happened is that I've decided to not use BK mainly because I need
to figure out the alternatives, and rather than continuing "things as
normal", I decided to bite the bullet and just see what life without BK
looks like. So far it's a gray and bleak world ;)

So don't take this to mean anything more than it is. I'm going to be
effectively off-line for a week (think of it as a normal "Linus went on a
vacation" event) and I'm just asking that people who continue to maintain
BK trees at least try to also make sure that they can send me the result
as (individual) patches, since I'll eventually have to merge some other
way.

That "individual patches" is one of the keywords, btw. One thing that BK
has been extremely good at, and that a lot of people have come to like
even when they didn't use BK, is how we've been maintaining a much finer-
granularity view of changes. That isn't going to go away.

In fact, one impact BK has had is to very fundamentally make us (and me in
particular) change how we do things. That ranges from the fine-grained
changeset tracking to just how I ended up trusting submaintainers with
much bigger things, and not having to work on a patch-by-patch basis any
more. So the three years with BK are definitely not wasted: I'm convinced
it caused us to do things in better ways, and one of the things I'm
looking at is to make sure that those things continue to work.

So I just wanted to say that I'm personally very happy with BK, and with
Larry. It didn't work out, but it sure as hell made a big difference to
kernel development. And we'll work out the temporary problem of having to
figure out a set of tools to allow us to continue to do the things that BK
allowed us to do.

Let the flames begin.

Linus

PS. Don't bother telling me about subversion. If you must, start reading
up on "monotone". That seems to be the most viable alternative, but don't
pester the developers so much that they don't get any work done. They are
already aware of my problems ;)


2005-04-06 16:01:07

by Greg KH

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Wed, Apr 06, 2005 at 08:42:08AM -0700, Linus Torvalds wrote:
>
> So I just wanted to say that I'm personally very happy with BK, and with
> Larry. It didn't work out, but it sure as hell made a big difference to
> kernel development. And we'll work out the temporary problem of having to
> figure out a set of tools to allow us to continue to do the things that BK
> allowed us to do.

I'd also like to publicly say that BK has helped out immensely in the
past few years with kernel development, and has been one of the main
reasons we have been able to keep up such a high patch rate over such a
long period of time. Larry, and his team, have been nothing but great
in dealing with all of the crap that we have been flinging at them due to
the very odd demands that such a large project as the kernel creates. And
I definitely owe him a beer the next time I see him.

thanks,

greg k-h

2005-04-06 16:09:23

by Daniel Phillips

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Wednesday 06 April 2005 11:42, Linus Torvalds wrote:
> it got to the point where I decided that I don't want to be in
> the position of trying to hold two pieces together that would need as much
> glue as it seemed to require.

Hi Linus,

Well I'm really pleased to hear that you won't be drinking this koolaid any
more. This is a really uplifting development for me, thanks.

Regards,

Daniel

2005-04-06 19:07:13

by [email protected]

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Apr 6, 2005 11:42 AM, Linus Torvalds <[email protected]> wrote:
> So I just wanted to say that I'm personally very happy with BK, and with
> Larry. It didn't work out, but it sure as hell made a big difference to
> kernel development. And we'll work out the temporary problem of having to
> figure out a set of tools to allow us to continue to do the things that BK
> allowed us to do.

Larry has stated several times that most of his revenue comes from
Windows. Has OSDL approached BitMover about simply buying out the
source rights for the Linux version? From my experience in the
industry a fair price would probably be around $2M, but that should be
within OSDL's capabilities. OSDL could then GPL the code and quiet the
critics.

--
Jon Smirl
[email protected]

2005-04-06 19:24:42

by Matan Peled

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Jon Smirl wrote:
> OSDL could then GPL the code and quiet the
> critics.

And also cause said GPL'ed code to be immediately ported over to Windows. I don't
think BitMover could ever agree to that.

--
[Name ] :: [Matan I. Peled ]
[Location ] :: [Israel ]
[Public Key] :: [0xD6F42CA5 ]
[Keyserver ] :: [keyserver.kjsl.com]
encrypted/signed plain text preferred


Attachments:
signature.asc (189.00 B)
OpenPGP digital signature

2005-04-06 19:39:24

by Paul Komkoff

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Replying to Linus Torvalds:
> Ok,
> as a number of people are already aware (and in some cases have been

Actually, I'm very disappointed that things have gone in such a
counter-productive way. All along, I was against Larry's opponents, but in
the end, they turned out to be right. That's a pity. To quote Vin Diesel's
character Riddick, "there's no such word as friend", or something.

Anyway, it seems the folks at Canonical were aware of this, and here's
the result of that awareness: http://bazaar-ng.org/
It needs some testing though, along with the really hard part - transferring
all the history, which is nonlinear ... I don't know how anyone can do this
by 1 Jul 2005, sorry :(

> PS. Don't bother telling me about subversion. If you must, start reading
> up on "monotone". That seems to be the most viable alternative, but don't
> pester the developers so much that they don't get any work done. They are
> already aware of my problems ;)

Monotone is good, but I don't really know the limits of sqlite3 wrt the kernel
case. And again, there's the question of what we need to do to retain history ...


--
Paul P 'Stingray' Komkoff Jr // http://stingr.net/key <- my pgp key
This message represents the official view of the voices in my head

2005-04-06 19:49:19

by [email protected]

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Apr 6, 2005 3:24 PM, Matan Peled <[email protected]> wrote:
> Jon Smirl wrote:
> > OSDL could then GPL the code and quiet the
> > critics.
>
> And also cause said GPL'ed code to be immediately ported over to Windows. I don't
> think BitMover could ever agree to that.

Windows Bitkeeper licenses are not that expensive, wouldn't you rather
keep your source in a licensed supported version? Who is going to do
this backport, then support it and track new releases? Why do people
pay for RHEL when they can get it for free? They want support and a
guarantee that their data won't be lost. Even with a GPL'd Linux
Bitkeeper I'll bet half of the existing Linux paying customers will
continue to use a paid version.

There is a large difference in the behavior of corporations with huge
source bases and college students with no money. The corporations will
pay to have someone responsible for ensuring that the product works.

--
Jon Smirl
[email protected]

2005-04-06 20:35:34

by Hua Zhong (hzhong)

[permalink] [raw]
Subject: RE: Kernel SCM saga..

> Even with a GPL'd Linux Bitkeeper I'll bet half of the existing Linux
> paying customers will continue to use a paid version.

By what? How much do you plan to put down to pay Larry in case you lose your
bet?

Hua

2005-04-06 21:39:46

by kfogel

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus Torvalds wrote:
> PS. Don't bother telling me about subversion. If you must, start reading
> up on "monotone". That seems to be the most viable alternative, but don't
> pester the developers so much that they don't get any work done. They are
> already aware of my problems ;)

By the way, the Subversion developers have no argument with the claim
that Subversion would not be the right choice for Linux kernel
development. We've written an open letter entitled "Please Stop
Bugging Linus Torvalds About Subversion" to explain why:

http://subversion.tigris.org/subversion-linus.html

Best,
-Karl Fogel (on behalf of the Subversion team)

2005-04-06 22:39:36

by Jeff Garzik

[permalink] [raw]
Subject: Re: Kernel SCM saga..

[email protected] wrote:
> Linus Torvalds wrote:
>
>>PS. Don't bother telling me about subversion. If you must, start reading
>>up on "monotone". That seems to be the most viable alternative, but don't
>>pester the developers so much that they don't get any work done. They are
>>already aware of my problems ;)
>
>
> By the way, the Subversion developers have no argument with the claim
> that Subversion would not be the right choice for Linux kernel
> development. We've written an open letter entitled "Please Stop
> Bugging Linus Torvalds About Subversion" to explain why:
>
> http://subversion.tigris.org/subversion-linus.html

A thoughtful post. Thanks for writing this.

Jeff


2005-04-06 23:22:33

by Jon Masters

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Apr 6, 2005 4:42 PM, Linus Torvalds <[email protected]> wrote:

> as a number of people are already aware (and in some
> cases have been aware over the last several weeks), we've
> been trying to work out a conflict over BK usage over the last
> month or two (and it feels like longer ;). That hasn't been
> working out, and as a result, the kernel team is looking at
> alternatives.

What about the 64K changeset limitation in current releases?

Did I miss something (like the fixes promised) or is there going to be
another interim release before the end of support?

Jon.

P.S. Apologies if this already got addressed.

2005-04-07 01:31:42

by Christoph Lameter

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Wed, 6 Apr 2005, Jon Smirl wrote:

> There is a large difference in the behavior of corporations with huge
> source bases and college students with no money. The corporations will
> pay to have someone responsible for ensuring that the product works.

Or they will merge with some other entity on the whim of some executive,
and the corporation then decides to kill the product for good without
releasing the source, leaving you out in the cold.

2005-04-07 01:41:21

by Martin Pool

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Wed, 06 Apr 2005 23:39:11 +0400, Paul P Komkoff Jr wrote:

> http://bazaar-ng.org/

I'd like bazaar-ng to be considered too. It is not ready for adoption
yet, but I am working (more than) full time on it and hope to have it
be usable in a couple of months.

bazaar-ng is trying to integrate a lot of the work done in other systems
to make something that is simple to use but also fast and powerful enough
to handle large projects.

The operations that are already done are pretty fast: ~60s to import a
kernel tree, ~10s to import a new revision from a patch.

Please check it out and do pester me with any suggestions about whatever
you think it needs to suit your work.

--
Martin


2005-04-07 01:47:42

by Jeff Garzik

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, Apr 07, 2005 at 11:40:23AM +1000, Martin Pool wrote:
> On Wed, 06 Apr 2005 23:39:11 +0400, Paul P Komkoff Jr wrote:
>
> > http://bazaar-ng.org/
>
> I'd like bazaar-ng to be considered too. It is not ready for adoption
> yet, but I am working (more than) full time on it and hope to have it
> be usable in a couple of months.
>
> bazaar-ng is trying to integrate a lot of the work done in other systems
> to make something that is simple to use but also fast and powerful enough
> to handle large projects.
>
> The operations that are already done are pretty fast: ~60s to import a
> kernel tree, ~10s to import a new revision from a patch.

By "importing", are you saying that importing all 60,000+ changesets of
the current kernel tree took only 60 seconds?

Jeff



2005-04-07 02:26:43

by Martin Pool

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Wed, 06 Apr 2005 21:47:27 -0400, Jeff Garzik wrote:

>> The operations that are already done are pretty fast: ~60s to import a
>> kernel tree, ~10s to import a new revision from a patch.
>
> By "importing", are you saying that importing all 60,000+ changesets of
> the current kernel tree took only 60 seconds?

Now that would be impressive.

No, I mean this:

% bzcat ../linux.pkg/patch-2.5.14.bz2| patch -p1

% time bzr add -v .
(find any new non-ignored files; deleted files automatically noticed)
6.06s user 0.41s system 89% cpu 7.248 total

% time bzr commit -v -m 'import 2.5.14'
7.71s user 0.71s system 65% cpu 12.893 total

(OK, a bit slower in this case but it wasn't all in core.)

This is only v0.0.3, but I think the interface simplicity and speed
compare well.

I haven't tested importing all 60,000+ changesets of the current bk tree,
partly because I don't *have* all those changesets. (Larry said
previously that someone (not me) tried to pull all of them using bkclient,
and he considered this abuse and blacklisted them.)

I have been testing pulling in release and rc patches, and it scales to
that level. It probably could not handle 60,000 changesets yet, but there
is a plan to get there. In the interim, although it cannot handle the
whole history forever, it can handle large trees with moderate numbers of
commits -- perhaps as many as you might deal with in developing a feature
over the course of a few months.

The most sensible place to try out bzr, if people want to, is as a way to
keep your own revisions before mailing a patch to Linus or the subsystem
maintainer.

--
Martin


2005-04-07 02:34:07

by David Lang

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, 7 Apr 2005, Martin Pool wrote:

> I haven't tested importing all 60,000+ changesets of the current bk tree,
> partly because I don't *have* all those changesets. (Larry said
> previously that someone (not me) tried to pull all of them using bkclient,
> and he considered this abuse and blacklisted them.)

Pull the patches from the BK2CVS server. Yes, some patches are combined,
but it will get you in the ballpark.

David Lang

> I have been testing pulling in release and rc patches, and it scales to
> that level. It probably could not handle 60,000 changesets yet, but there
> is a plan to get there. In the interim, although it cannot handle the
> whole history forever, it can handle large trees with moderate numbers of
> commits -- perhaps as many as you might deal with in developing a feature
> over the course of a few months.
>
> The most sensible place to try out bzr, if people want to, is as a way to
> keep your own revisions before mailing a patch to Linus or the subsystem
> maintainer.
>
> --
> Martin

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2005-04-07 03:34:22

by Daniel Phillips

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Wednesday 06 April 2005 21:40, Martin Pool wrote:
> On Wed, 06 Apr 2005 23:39:11 +0400, Paul P Komkoff Jr wrote:
> > http://bazaar-ng.org/
>
> I'd like bazaar-ng to be considered too. It is not ready for adoption
> yet, but I am working (more than) full time on it and hope to have it
> be usable in a couple of months.
>
> bazaar-ng is trying to integrate a lot of the work done in other systems
> to make something that is simple to use but also fast and powerful enough
> to handle large projects.
>
> The operations that are already done are pretty fast: ~60s to import a
> kernel tree, ~10s to import a new revision from a patch.

Hi Martin,

When I tried it, it took 13 seconds to 'bzr add' the 2.6.11.3 tree on a
relatively slow machine.

Regards,

Daniel

2005-04-07 05:39:48

by Martin Pool

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Wed, 2005-04-06 at 19:32 -0700, David Lang wrote:
> On Thu, 7 Apr 2005, Martin Pool wrote:
>
> > I haven't tested importing all 60,000+ changesets of the current bk tree,
> > partly because I don't *have* all those changesets. (Larry said
> > previously that someone (not me) tried to pull all of them using bkclient,
> > and he considered this abuse and blacklisted them.)
>
> pull the patches from the BK2CVS server. yes some patches are combined,
> but it will get you in the ballpark.

OK, I just tried that. I know there are scripts to resynthesize
changesets from the CVS info but I skipped that for now and just pulled
each day's work into a separate bzr revision. It's up to the end of
March and still running.

Importing the first snapshot (2004-01-01) took 41.77s user, 1:23.79
total. Each subsequent day takes about 10s user, 30s elapsed to commit
into bzr. The speeds are comparable to CVS or a bit faster, and may be
faster than other distributed systems. (This on a laptop with a 5400rpm
disk.) Pulling out a complete copy of the tree as it was on a previous
date takes about 14s user, 60s elapsed.
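
[ For illustration, the day-by-day import loop described above would look
roughly like this; the CVS server path and module name here are made up: ]

% cvs -d :pserver:[email protected]:/cvsroot co linux-2.6
% cd linux-2.6
% bzr init && bzr add -v .                    # (in practice you'd also want
% bzr commit -m 'import 2004-01-01 snapshot'  #  to ignore the CVS/ admin dirs)
% for day in $(cat days); do                  # 'days' lists one date per line
>   cvs update -dP -D "$day"                  # move the checkout to that day
>   bzr add -v .                              # pick up newly-added files
>   bzr commit -m "import $day snapshot"
> done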

I don't want to get too distracted by benchmarks now because there are
more urgent things to do and anyhow there is still lots of scope for
optimization. I wouldn't be at all surprised if those times could be
more than halved. I just wanted to show it is in (I hope) the right
ballpark.

--
Martin


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2005-04-07 06:36:58

by bert hubert

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Wed, Apr 06, 2005 at 11:39:11PM +0400, Paul P Komkoff Jr wrote:

> Monotone is good, but I don't really know the limits of sqlite3 wrt the kernel
> case. And again, there's the question of what we need to do to retain history ...

I wouldn't fret over that :-) The big issue I have with sqlite3 is that it
interacts horribly with ext3, resulting in dismal journalled write
performance compared to ext2. I do not know if this is a sqlite3 or an ext3
problem though.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2005-04-07 06:49:09

by Paul Mackerras

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus,

> That "individual patches" is one of the keywords, btw. One thing that BK
> has been extremely good at, and that a lot of people have come to like
> even when they didn't use BK, is how we've been maintaining a much finer-
> granularity view of changes. That isn't going to go away.

Are you happy with processing patches + descriptions, one per mail?
Do you have it automated to the point where processing emailed patches
involves little more overhead than doing a bk pull? If so, then your
mailbox (or patch queue) becomes a natural serialization point for the
changes, and the need for a tool that can handle a complex graph of
changes is much reduced.

> In fact, one impact BK has had is to very fundamentally make us (and me in
> particular) change how we do things.

From my point of view, the benefits that flowed from your using BK
were:

* Visibility into what you had accepted and committed to your
repository
* Lower latency of patches going into your repository
* Much reduced rate of patches being dropped

Those things are what have enabled us PPC developers to move away from
having our own trees (with all the synchronization problems that
entailed) and work directly with your tree. I don't see that it is
the distinctive features of BK (such as the ability to do merges
between peer repositories) that are directly responsible for producing
those benefits, so I have hope that things can work just as well with
some other system.

Paul.

2005-04-07 07:19:49

by David Woodhouse

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Wed, 2005-04-06 at 08:42 -0700, Linus Torvalds wrote:
> PS. Don't bother telling me about subversion. If you must, start reading
> up on "monotone". That seems to be the most viable alternative, but don't
> pester the developers so much that they don't get any work done. They are
> already aware of my problems ;)

One feature I'd want to see in a replacement version control system is
the ability to _re-order_ patches, and to cherry-pick patches from my
tree to be sent onwards. The lack of that capability is the main reason
I always hated BitKeeper.

--
dwmw2

2005-04-07 07:44:25

by Jan Hudec

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Wed, Apr 06, 2005 at 08:42:08 -0700, Linus Torvalds wrote:
> PS. Don't bother telling me about subversion. If you must, start reading
> up on "monotone". That seems to be the most viable alternative, but don't
> pester the developers so much that they don't get any work done. They are
> already aware of my problems ;)

I have looked at most of the systems currently available. I would suggest
the following for a closer look:

1) GNU Arch/Bazaar. They use the same archive format, are simple, and have
   the concepts right. It may need some scripts or add-ons. When Bazaar-NG
   is ready, it will be able to read the GNU Arch/Bazaar archives, so
   switching should be easy.
2) SVK. True, it is built on Subversion, but it adds all the distributed
   features necessary. It keeps a mirror of the repository locally (but it
   can mirror just some branches), but BitKeeper did that too. It just hit
   1.0beta1, but the development is progressing rapidly. There was lately
   a post on their mailing list about the ability to track changeset
   dependencies.

I have looked at Monotone too, of course, but I did not find any way of
doing cherry-picking (i.e. skipping some changes and pulling others) in
it, and I feel it will need more rework of the meta-data before that is
possible. As for the sqlite backend, I'd not consider that a problem.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>


Attachments:
(No filename) (1.40 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2005-04-07 07:49:03

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, 2005-04-07 at 16:51 +1000, Paul Mackerras wrote:
> Linus,
>
> > That "individual patches" is one of the keywords, btw. One thing that BK
> > has been extremely good at, and that a lot of people have come to like
> > even when they didn't use BK, is how we've been maintaining a much finer-
> > granularity view of changes. That isn't going to go away.
>
> Are you happy with processing patches + descriptions, one per mail?
> Do you have it automated to the point where processing emailed patches
> involves little more overhead than doing a bk pull? If so, then your
> mailbox (or patch queue) becomes a natural serialization point for the
> changes, and the need for a tool that can handle a complex graph of
> changes is much reduced.

Alternatively, you could send an mbox with your series in it... that has a
natural sequence ;)

2005-04-07 07:51:04

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Wed, 6 Apr 2005, Jeff Garzik wrote:

> On Thu, Apr 07, 2005 at 11:40:23AM +1000, Martin Pool wrote:
> > On Wed, 06 Apr 2005 23:39:11 +0400, Paul P Komkoff Jr wrote:
> >
> > > http://bazaar-ng.org/
> >
> > I'd like bazaar-ng to be considered too. It is not ready for adoption
> > yet, but I am working (more than) full time on it and hope to have it
> > be usable in a couple of months.
> >
> > bazaar-ng is trying to integrate a lot of the work done in other systems
> > to make something that is simple to use but also fast and powerful enough
> > to handle large projects.
> >
> > The operations that are already done are pretty fast: ~60s to import a
> > kernel tree, ~10s to import a new revision from a patch.
>
> By "importing", are you saying that importing all 60,000+ changesets of
> the current kernel tree took only 60 seconds?

Probably `cvs import` equivalent.

2005-04-07 08:17:32

by Magnus Damm

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Apr 7, 2005 4:32 AM, David Lang <[email protected]> wrote:
> On Thu, 7 Apr 2005, Martin Pool wrote:
>
> > I haven't tested importing all 60,000+ changesets of the current bk tree,
> > partly because I don't *have* all those changesets. (Larry said
> > previously that someone (not me) tried to pull all of them using bkclient,
> > and he considered this abuse and blacklisted them.)
>
> pull the patches from the BK2CVS server. yes some patches are combined,
> but it will get you in the ballpark.

While at it, is there any ongoing effort to convert/export the kernel
BK repository to some well known format like broken out patches and a
series file? I think keeping the complete repository public in a well
known format is important regardless of SCM taste.

/ magnus
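
[ The "broken out patches and a series file" format mentioned above is the
layout quilt uses: a directory of plain diffs plus a series file listing
them in apply order. The patch names here are made up: ]

patches/
    01-sched-fix-foo.patch        # ordinary diff(1) output
    02-mm-add-bar.patch
    series

% cat patches/series
01-sched-fix-foo.patch
02-mm-add-bar.patch

[ Anyone can then reconstruct the tree with "quilt push -a", or with plain
"patch -p1" applied in series order. ]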

2005-04-07 08:50:33

by Andrew Morton

[permalink] [raw]
Subject: Re: Kernel SCM saga..

David Woodhouse <[email protected]> wrote:
>
> One feature I'd want to see in a replacement version control system is
> the ability to _re-order_ patches, and to cherry-pick patches from my
> tree to be sent onwards.

You just described quilt & patch-scripts.

The problem with those is letting other people get access to it. I guess
that could be fixed with a bit of scripting and rsyncing.
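
[ A minimal sketch of that "scripting and rsyncing", assuming a
quilt-managed tree; the host and paths are made up: ]

% cd ~/linux-2.6
% quilt refresh                       # make sure the top patch is written out
% rsync -az --delete patches/ example.org:public_html/patches/

[ Anyone else can then mirror patches/ (including the series file) back
down with the reverse rsync and apply it with "quilt push -a". ]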

(I don't do that for -mm because -mm basically doesn't work for 99% of the
time. Takes 4-5 hours to get a release out assuming that nothing's busted,
and usually something is).

2005-04-07 09:19:57

by Paul Mackerras

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Andrew Morton writes:

> The problem with those is letting other people get access to it. I guess
> that could be fixed with a bit of scripting and rsyncing.

Yes.

> (I don't do that for -mm because -mm basically doesn't work for 99% of the
> time. Takes 4-5 hours to get a release out assuming that nothing's busted,
> and usually something is).

With -mm we get those nice little automatic emails saying you've put
the patch into -mm, which removes one of the main reasons for wanting
to be able to get an up-to-date image of your tree. The other reason,
of course, is to be able to see if a patch I'm about to send conflicts
with something you have already taken, and rebase it if necessary.

Paul.

2005-04-07 09:25:28

by David Woodhouse

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, 2005-04-07 at 01:50 -0700, Andrew Morton wrote:
> (I don't do that for -mm because -mm basically doesn't work for 99% of
> the time. Takes 4-5 hours to get a release out assuming that
> nothing's busted, and usually something is).

On the subject of -mm: are you going to keep doing the BK imports to
that for the time being, or would it be better to leave the BK trees
alone now and send you individual patches?

For that matter, will there be a brief amnesty after 2.6.12 where Linus
will use BK to pull those trees which were waiting for that, or will we
all need to export from BK manually?

--
dwmw2

2005-04-07 09:25:07

by Sergei Organov

[permalink] [raw]
Subject: Re: Kernel SCM saga..

David Woodhouse <[email protected]> writes:

> On Wed, 2005-04-06 at 08:42 -0700, Linus Torvalds wrote:
> > PS. Don't bother telling me about subversion. If you must, start reading
> > up on "monotone". That seems to be the most viable alternative, but don't
> > pester the developers so much that they don't get any work done. They are
> > already aware of my problems ;)
>
> One feature I'd want to see in a replacement version control system is
> the ability to _re-order_ patches, and to cherry-pick patches from my
> tree to be sent onwards. The lack of that capability is the main reason
> I always hated BitKeeper.

darcs? <http://www.abridgegame.org/darcs/>

2005-04-07 09:41:01

by David Vrabel

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Andrew Morton wrote:
> David Woodhouse <[email protected]> wrote:
>
>> One feature I'd want to see in a replacement version control system is
>> the ability to _re-order_ patches, and to cherry-pick patches from my
>> tree to be sent onwards.
>
> You just described quilt & patch-scripts.
>
> The problem with those is letting other people get access to it. I guess
> that could be fixed with a bit of scripting and rsyncing.

Where I work we've been using quilt for a while now and storing the
patch-set in CVS. To limit the number of potential stuff-ups due to two
people working on the same patch at the same time (the chance that CVS's
merge will get it right is zero) we use CVS's locking feature to ensure
that only one person can edit/update a patch or the series file at any
one time. It seems to work quite well (though admittedly there are only
two developers working on the patch-set and it currently contains a mere
61 patches).

We also have a few scripts to ensure we always do the correct locking.
The main ones are:

qec -- to edit a file either as part of the top 'working' patch or as an
existing patch. It does the quilt push which I always forget to do
otherwise.

qrefr -- like quilt refresh only it locks the patch first.

qimport -- like quilt import only it locks the series file first.
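
[ A minimal sketch of what a wrapper like qrefr might look like, assuming
the locking is done with RCS-style "cvs admin -l" locks; the real scripts
in the tarball below may well differ: ]

#!/bin/sh
# qrefr <patch> - take the CVS lock on a patch, then refresh it
p="$1"
cvs admin -l "patches/$p" || exit 1   # bail out if someone else holds the lock
quilt refresh "$p"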

You can grab a tarball of these (and other, less interesting ones) from

http://www.davidvrabel.org.uk/quilt-n-cvs-scripts-1.tar.gz

Note that I'm providing this purely on an as-is basis in case any one is
interested.

And I've just realized I can't remember how exactly to set up the CVS
repository of the patch-set. I think you need to do a cvs watch on it when
it's checked out.

David Vrabel

2005-04-07 09:46:27

by Andrew Morton

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Paul Mackerras <[email protected]> wrote:
>
> With -mm we get those nice little automatic emails saying you've put
> the patch into -mm, which removes one of the main reasons for wanting
> to be able to get an up-to-date image of your tree.

Should have done that ages ago..

> The other reason,
> of course, is to be able to see if a patch I'm about to send conflicts
> with something you have already taken, and rebase it if necessary.

<hack, hack>

How's this?


This is a note to let you know that I've just added the patch titled

ppc32: Fix AGP and sleep again

to the -mm tree. Its filename is

ppc32-fix-agp-and-sleep-again.patch

Patches currently in -mm which might be from yourself are

add-suspend-method-to-cpufreq-core.patch
ppc32-fix-cpufreq-problems.patch
ppc32-fix-agp-and-sleep-again.patch
ppc32-fix-errata-for-some-g3-cpus.patch
ppc64-fix-semantics-of-__ioremap.patch
ppc64-improve-mapping-of-vdso.patch
ppc64-detect-altivec-via-firmware-on-unknown-cpus.patch
ppc64-remove-bogus-f50-hack-in-promc.patch

2005-04-07 09:49:59

by Andrew Morton

[permalink] [raw]
Subject: Re: Kernel SCM saga..

David Woodhouse <[email protected]> wrote:
>
> On Thu, 2005-04-07 at 01:50 -0700, Andrew Morton wrote:
> > (I don't do that for -mm because -mm basically doesn't work for 99% of
> > the time. Takes 4-5 hours to get a release out assuming that
> > nothing's busted, and usually something is).
>
> On the subject of -mm: are you going to keep doing the BK imports to
> that for the time being, or would it be better to leave the BK trees
> alone now and send you individual patches?

I really don't know - I'll continue to pull the bk trees for a while, until
we work out what the new (probably interim) regime looks like.

> For that matter, will there be a brief amnesty after 2.6.12 where Linus
> will use BK to pull those trees which were waiting for that, or will we
> all need to export from BK manually?
>

I think Linus has stopped using bk already.

2005-04-07 09:56:41

by Russell King

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, Apr 07, 2005 at 10:25:18AM +0100, David Woodhouse wrote:
> On Thu, 2005-04-07 at 01:50 -0700, Andrew Morton wrote:
> > (I don't do that for -mm because -mm basically doesn't work for 99% of
> > the time. Takes 4-5 hours to get a release out assuming that
> > nothing's busted, and usually something is).
>
> On the subject of -mm: are you going to keep doing the BK imports to
> that for the time being, or would it be better to leave the BK trees
> alone now and send you individual patches?
>
> For that matter, will there be a brief amnesty after 2.6.12 where Linus
> will use BK to pull those trees which were waiting for that, or will we
> all need to export from BK manually?

Linus indicated (maybe privately) that the end of his BK usage would
be immediately after the -rc2 release. I'm taking that to mean "no
more BK usage from Linus, period."

Thinking about it a bit, if you're asking Linus to pull your tree,
Linus would then have to extract the individual change sets as patches
to put into his newfangled patch management system. Is that a
reasonable expectation?

However, it's ultimately up to Linus to decide. 8)

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-04-07 10:11:49

by David Woodhouse

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, 2005-04-07 at 10:55 +0100, Russell King wrote:
> Thinking about it a bit, if you're asking Linus to pull your tree,
> Linus would then have to extract the individual change sets as patches
> to put into his new fangled patch management system. Is that a
> reasonable expectation?

I don't know if it's a reasonable expectation; that's why I'm asking.

I could live with having to export everything to patches; it's not so
hard. It's just that if the export to whatever ends up replacing BK can
be done in a way (or at a time) which allows the existing forest of BK
trees to be pulled from one last time, that may save a fair amount of
work all round, so I thought it was worth mentioning.

--
dwmw2

2005-04-07 10:30:40

by Matthias Andree

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, 07 Apr 2005, Sergei Organov wrote:

> David Woodhouse <[email protected]> writes:
>
> > On Wed, 2005-04-06 at 08:42 -0700, Linus Torvalds wrote:
> > > PS. Don't bother telling me about subversion. If you must, start reading
> > > up on "monotone". That seems to be the most viable alternative, but don't
> > > pester the developers so much that they don't get any work done. They are
> > > already aware of my problems ;)
> >
> > One feature I'd want to see in a replacement version control system is
> > the ability to _re-order_ patches, and to cherry-pick patches from my
> > tree to be sent onwards. The lack of that capability is the main reason
> > I always hated BitKeeper.
>
> darcs? <http://www.abridgegame.org/darcs/>

Close. Some things:

1. It's rather slow, quite CPU-consuming, and certainly I/O-consuming at
   times. To try it out, I keep leafnode-2 in a DARCS repo, which has a
   mere 20,000 lines in 140 files, with 1,436 changes so far, on a RAID-1
   with two 7200 rpm disk drives and an Athlon XP 2500+ with 512 MB RAM.
   The repo has 1,700 files in 11.5 MB; the source itself is 189 files in
   1.8 MB.

Example: darcs annotate nntpd.c takes 23 s. (2,660 lines, 60 kByte)

The maintainer himself states that there's still optimization required.

2. It has an impressive set of dependencies around the Glasgow Haskell
Compiler. I don't personally have issues with that, but I can already
hear the moaning and bitching.

3. DARCS is written in Haskell. This is not a problem either, but I'd
think there are fewer people who can hack Haskell than people who
can hack C, C++, Java, Python or similar. It is still better than
BitKeeper from the hacking POV as the code is available and under an
acceptable license.

Getting DARCS up to the task would probably require some polishing, and
should probably be discussed with the DARCS maintainer before making
this decision.

Don't get me wrong, DARCS looks promising, but I'm not convinced it's
ready for the linux kernel yet.

--
Matthias Andree

2005-04-07 10:42:45

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, 7 Apr 2005, Paul Mackerras wrote:
> Andrew Morton writes:
> > The problem with those is letting other people get access to it. I guess
> > that could be fixed with a bit of scripting and rsyncing.
>
> Yes.

Me too ;-)

> > (I don't do that for -mm because -mm basically doesn't work for 99% of the
> > time. Takes 4-5 hours to out a release out assuming that nothing's busted,
> > and usually something is).
>
> With -mm we get those nice little automatic emails saying you've put
> the patch into -mm, which removes one of the main reasons for wanting
> to be able to get an up-to-date image of your tree. The other reason,

FYI, for Linus' BK tree, procmail was telling me if it encountered a patch on
the commits list which was signed off by me.
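
[ For illustration, a procmail recipe in that spirit - the commits list
address and the name being matched are placeholders: ]

:0:
* ^TO_bk-commits-head@vger\.kernel\.org
* B ?? Signed-off-by:.*Geert Uytterhoeven
commits-with-my-signoff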

> of course, is to be able to see if a patch I'm about to send conflicts
> with something you have already taken, and rebase it if necessary.

And yet another reason: to monitor if files/subsystems I'm interested in are
changed.

Summarized: I'd be happy with a mailing list that would send out all patches
(incl. full comment headers, cfr. bk-commit) that Linus commits.

An added bonus would be that people would really be able to reconstruct the
full tree from the mails, unlike with bk-commits (due to `strange' csets caused
by merges). Just make sure there are strictly monotone sequence numbers in the
individual mails.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2005-04-07 10:54:17

by Andrew Walrond

[permalink] [raw]
Subject: Re: Kernel SCM saga..

I recently switched from bk to darcs (actually looked into it after the author
mentioned on LKML that he had imported the kernel tree). Very impressed so
far, but as you say,

> 1. It's rather slow, quite CPU-consuming, and certainly I/O-consuming

I expect something as large as the kernel tree would cause problems in this
respect.

> 2. It has an impressive set of dependencies around the Glasgow Haskell
> Compiler. I don't personally have issues with that, but I can already
> hear the moaning and bitching.

:) I try to build everything from the original source, but in this case I
couldn't. GHC needs GHC + some GHC add-ons in order to compile
itself...

>
> 3. DARCS is written in Haskell. This is not a problem either, but I'd
> think there are fewer people who can hack Haskell than people who
> can hack C, C++, Java, Python or similar. It is still better than

True, though as you say, not a show-stopper.

From a functionality standpoint, darcs seems very similar to monotone, with a
couple of minor trade-offs in either direction.

I wonder if Linus would mind publishing his feature requests to the monotone
developers, so that other projects, like darcs, would know what needs working
on.

Andrew Walrond

2005-04-07 10:58:53

by Andrew Walrond

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Wednesday 06 April 2005 16:42, Linus Torvalds wrote:
>
> PS. Don't bother telling me about subversion. If you must, start reading
> up on "monotone". That seems to be the most viable alternative, but don't
> pester the developers so much that they don't get any work done. They are
> already aware of my problems ;)

Care to share your monotone wishlist?

Andrew Walrond

2005-04-07 11:17:36

by Paul Mackerras

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Andrew Morton writes:

> > The other reason,
> > of course, is to be able to see if a patch I'm about to send conflicts
> > with something you have already taken, and rebase it if necessary.
>
> <hack, hack>
>
> How's this?

Nice; but in fact I meant that I want to be able to see if a patch of
mine conflicts with one from somebody else.

Paul.

2005-04-07 15:07:38

by Daniel Phillips

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Wednesday 06 April 2005 23:35, Daniel Phillips wrote:
> When I tried it, it took 13 seconds to 'bzr add' the 2.6.11.3 tree on a
> relatively slow machine.

Oh, and 135 seconds to commit, so 148 seconds overall. Versus 87 seconds
to bunzip the tree in the first place. So far, you are in the ballpark.

Regards,

Daniel

2005-04-07 15:11:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Thu, 7 Apr 2005, Paul Mackerras wrote:
>
> Are you happy with processing patches + descriptions, one per mail?

Yes. That's going to be my interim solution; I was just hoping that with
2.6.12-rc2 out the door, and us in a "calming down" period, I could afford
to not even do that for a while.

The real problem with the email thing is that it ends up piling up: what
BK did in this respect was that anything that piled up in a BK repository
ended up still being there, and a single "bk pull" got it anyway - so if
somebody got ignored because I was busy with something else, it didn't add
any overhead. The queue didn't get "congested".

And that's a big thing. It comes from the "Linus pulls" model where people
just told me that they were ready, instead of the "everybody pushes to
Linus" model, where the destination gets congested at times.

So I do not want the "send Linus email patches" (whether mboxes or a
single patch per email) to be a very long-term strategy. We can handle it
for a while (in particular, I'm counting on it working up to the real
release of 2.6.12, since we _should_ be in the calm period for the next
month anyway), but it doesn't work in the long run.

> Do you have it automated to the point where processing emailed patches
> involves little more overhead than doing a bk pull?

It's more overhead, but not a lot. Especially nice numbered sequences like
Andrew sends (where I don't have to manually try to get the dependencies
right by trying to figure them out and hope I'm right, but instead just
sort by Subject: line) are not a lot of overhead. I can process a hundred
emails almost as easily as one, as long as I trust the maintainer (which,
when it's used as a BK replacement, I obviously do).
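
[ For illustration, the "sort by Subject: and apply" flow with stock tools -
formail (from the procmail suite) splits the mbox, and GNU patch skips the
mail headers as leading junk. Filenames are made up, and this assumes one
Subject: line per message file: ]

% formail -ds sh -c 'cat > msg.$FILENO' < series.mbox   # one file per mail
% grep '^Subject:' msg.* | sort -t: -k2 | cut -d: -f1 > order
% for m in $(cat order); do
>   patch -p1 < "$m" || break       # stop at the first patch that fails
> done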

However, the SCM's I've looked at make this hard. One of the things (the
main thing, in fact) I've been working at is to make that process really
_efficient_. If it takes half a minute to apply a patch and remember the
changeset boundary etc (and quite frankly, that's _fast_ for most SCM's
around for a project the size of Linux), then a series of 250 emails
(which is not unheard of at all when I sync with Andrew, for example)
takes two hours. If one of the patches in the middle doesn't apply, things
are bad bad bad.

Now, BK wasn't a speed daemon either (actually, compared to everything
else, BK _is_ a speed daemon, often by one or two orders of magnitude),
and took about 10-15 seconds per email when I merged with Andrew. HOWEVER,
with BK that wasn't as big of an issue, since the BK<->BK merges were so
easy, so I never had the slow email merges with any of the other main
developers. So a patch-application-based SCM "merger" actually would need
to be _faster_ than BK is. Which is really really really hard.

So I'm writing some scripts to try to track things a whole lot faster.
Initial indications are that I should be able to do it almost as quickly
as I can just apply the patch, but quite frankly, I'm at most half done,
and if I hit a snag maybe that's not true at all. Anyway, the reason I can
do it quickly is that my scripts will _not_ be an SCM, they'll be a very
specific "log Linus' state" kind of thing. That will make the linear patch
merge a lot more time-efficient, and thus possible.

(If a patch apply takes three seconds, even a big series of patches is not
a problem: if I get notified within a minute or two that it failed
half-way, that's fine, I can then just fix it up manually. That's why
latency is critical - if I'd have to do things effectively "offline",
I'd by definition not be able to fix it up when problems happen).

> If so, then your mailbox (or patch queue) becomes a natural
> serialization point for the changes, and the need for a tool that can
> handle a complex graph of changes is much reduced.

Yes. In the short term. See above why I think the congestion issue will
really mean that we want to have parallel merging in the not _too_
distant future.

NOTE! I detest the centralized SCM model, but if push comes to shove, and
we just _can't_ get a reasonable parallel merge thing going in the short
timeframe (ie month or two), I'll use something like SVN on a trusted site
with just a few committers, and at least try to distribute the merging out
over a few people rather than making _me_ be the throttle.

The reason I don't really want to do that is once we start doing it that
way, I suspect we'll have a _really_ hard time stopping. I think it's a
broken model. So I'd much rather try to have some pain in the short run
and get a better model running, but I just wanted to let people know that
I'm pragmatic enough that I realize that we may not have much choice.

> * Visibility into what you had accepted and committed to your
> repository
> * Lower latency of patches going into your repository
> * Much reduced rate of patches being dropped

Yes.

Linus

2005-04-07 15:31:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Thu, 7 Apr 2005, David Woodhouse wrote:
>
> On Wed, 2005-04-06 at 08:42 -0700, Linus Torvalds wrote:
> > PS. Don't bother telling me about subversion. If you must, start reading
> > up on "monotone". That seems to be the most viable alternative, but don't
> > pester the developers so much that they don't get any work done. They are
> > already aware of my problems ;)
>
> One feature I'd want to see in a replacement version control system is
> the ability to _re-order_ patches, and to cherry-pick patches from my
> tree to be sent onwards. The lack of that capability is the main reason
> I always hated BitKeeper.

I really disliked that in BitKeeper too originally. I argued with Larry
about it, but Larry (correctly, I believe) argued that efficient and
reliable distribution really requires the concept of "history is
immutable". It makes replication much easier when you know that the known
subset _never_ shrinks or changes - you only add on top of it.

And that implies no cherry-picking.

Also, there's actually a second reason why I've decided that cherry-
picking is wrong, and it's non-technical.

The thing is, cherry-picking very much implies that the people "up" the
foodchain end up editing the work of the people "below" them. The whole
reason you want cherry-picking is that you want to fix up somebody else's
mistakes, ie something you disagree with.

That sounds like an obviously good thing, right? Yes it does.

The problem is, it actually results in the wrong dynamics and psychology
in the system. First off, it makes the implicit assumption that there is
an "up" and "down" in the food-chain, and I think that's wrong. It's
increasingly a "network" in the kernel. I'm less and less "the top", as
much as a "fairly central" person. And that is how it should be. I used to
think of kernel development as a hierarchy, but I long since switched to
thinking about it as a fairly arbitrary network.

The other thing it does is that it implicitly puts the burden of quality
control at the upper-level maintainer ("I'll pick the good things out of
your tree"), while _not_ being able to cherry-pick means that there is
pressure in both directions to keep the tree clean.

And that is IMPORTANT. I realize that not cherry-picking means that people
who want to merge upstream (or sideways or anything) are now forced to do
extra work in trying to keep their tree free of random crap. And that's a
HUGELY IMPORTANT THING! It means that the pressure to keep the tree clean
flows in all directions, and takes pressure off the "central" point. In
other words, it distributes the pain of maintenance.

In other words, somebody who can't keep their act together, and creates
crappy trees because he has random pieces of crud in it, quite
automatically gets actively shunned by others. AND THAT IS GOOD! I've
pushed back on some BK users to clean up their trees, to the point where
we've had a number of "let's just re-do that" over the years. That's
WONDERFUL. People are irritated at first, but I've seen what the end
result is, and the end result is a much better maintainer.

Some people actually end up doing the cleanup different ways. For example,
Jeff Garzik kept many separate trees, and had a special merge thing.
Others just kept a messy tree for development, and when they are happy,
they throw the messy tree away and re-create a cleaner one. Either is fine
- the point is, different people like to work different ways, and that's
fine, but making _everybody_ work at being clean means that there is no
train wreck down the line when somebody is forced to try to figure out
what to cherry-pick.

So I've actually changed from "I want to cherry-pick" to "cherry-picking
between maintainers is the wrong workflow". Now, as part of cleaning up,
people may end up exporting the "ugly tree" as patches and re-importing it
into the clean tree as the fixed clean series of patches, and that's
"cherry-picking", but it's not between developers.

NOTE! The "no cherry-picking" model absolutely also requires a model of
"throw-away development trees". The two go together. BK did both, and an
SCM that does one but not the other would be horribly broken.

(This is my only real conceptual gripe with "monotone". I like the model,
but they make it much harder than it should be to have throw-away trees
due to the fact that they seem to be working on the assumption of "one
database per developer" rather than "one database per tree". You don't
have to follow that model, but it seems to be what the setup is geared
for, and together with their "branches" it means that I think a monotone
database easily gets very cruddy. The other problem with monotone is
just performance right now, but that's hopefully not _too_ fundamental).

Linus

2005-04-07 16:41:02

by Rik van Riel

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Wed, 6 Apr 2005, Greg KH wrote:

> the very odd demands that such a large project as the kernel creates. And
> I definitely owe him a beer the next time I see him.

Seconded. Besides, now that the code won't be on bkbits
any more, it's safe to get Larry drunk ;)

Larry, thanks for the help you have given us by making
bitkeeper available for all these years.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2005-04-07 16:59:52

by Daniel Phillips

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thursday 07 April 2005 11:10, Linus Torvalds wrote:
> On Thu, 7 Apr 2005, Paul Mackerras wrote:
> > Do you have it automated to the point where processing emailed patches
> > involves little more overhead than doing a bk pull?
>
> It's more overhead, but not a lot. Especially nice numbered sequences like
> Andrew sends (where I don't have to manually try to get the dependencies
> right by trying to figure them out and hope I'm right, but instead just
> sort by Subject: line)...

Hi Linus,

In that case, a nice refinement is to put the sequence number at the end of
the subject line so patch sequences don't interleave:

Subject: [PATCH] Unbork OOM Killer (1 of 3)
Subject: [PATCH] Unbork OOM Killer (2 of 3)
Subject: [PATCH] Unbork OOM Killer (3 of 3)
Subject: [PATCH] Unbork OOM Killer (v2, 1 of 3)
Subject: [PATCH] Unbork OOM Killer (v2, 2 of 3)
...

Regards,

Daniel

2005-04-07 17:08:16

by Daniel Phillips

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thursday 07 April 2005 11:32, Linus Torvalds wrote:
> On Thu, 7 Apr 2005, David Woodhouse wrote:
> > On Wed, 2005-04-06 at 08:42 -0700, Linus Torvalds wrote:
> > > PS. Don't bother telling me about subversion. If you must, start
> > > reading up on "monotone". That seems to be the most viable alternative,
> > > but don't pester the developers so much that they don't get any work
> > > done. They are already aware of my problems ;)
> >
> > One feature I'd want to see in a replacement version control system is
> > the ability to _re-order_ patches, and to cherry-pick patches from my
> > tree to be sent onwards. The lack of that capability is the main reason
> > I always hated BitKeeper.
>
> I really disliked that in BitKeeper too originally. I argued with Larry
> about it, but Larry (correctly, I believe) argued that efficient and
> reliable distribution really requires the concept of "history is
> immutable". It makes replication much easier when you know that the known
> subset _never_ shrinks or changes - you only add on top of it.

However, it would be easy to allow reordering before "publishing" a revision,
which would preserve immutability for all published revisions while allowing
the patch _author_ the flexibility of reordering/splitting/joining patches
when creating them. In other words, a virtuous marriage of the BK model with
Andrew's Quilt.

Regards,

Daniel

2005-04-07 17:10:19

by Al Viro

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, Apr 07, 2005 at 08:32:04AM -0700, Linus Torvalds wrote:
> Also, there's actually a second reason why I've decided that cherry-
> picking is wrong, and it's non-technical.
>
> The thing is, cherry-picking very much implies that the people "up" the
> foodchain end up editing the work of the people "below" them. The whole
> reason you want cherry-picking is that you want to fix up somebody else's
> mistakes, ie something you disagree with.

No. There's another reason - when you are cherry-picking and reordering
*your* *own* *patches*. That's what I had been unable to explain to
Larry and that's what made BK unusable for me.

As for the immutable history... Ever had to read or grade students'
homework?
* the dumbest kind: "here's an answer <expression>, whaddya
mean 'where's the solution'?".
* next one: "here's how I've solved the problem: <pages of text
documenting the attempts, with many 'oops, there had been a mistake,
here's how we fix it'>".
* what you really want to see: series of steps leading to answer,
with clean logical structure that allows to understand what's being
done and verify correctness.

The first corresponds to "here's a half-meg of patch, it fixes everything".
The second is chronological history (aka "this came from our CVS, all bugs
are fixed by now, including those introduced in the middle of it; see
CVS history for details"). The third is a decent patch series.

And to get from "here's how I came up to solution" to "here's a clean way
to reach the solution" you _have_ to reorder. There's also "here are
misc notes from today, here are misc notes from yesterday, etc." and to
get that into sane shape you will need to split, reorder and probably
collapse several into a combined delta (possibly getting an empty delta
as the result, if later ones negate the prior).

The point being, both the history and, well, the publishable result can be
expressed as a series of small steps, but they are not the same thing. So
far everything I've seen in the area (and that includes BK) is heavily
biased towards the history part, and attempts to use this stuff for
manipulating patch series turn into fighting the tool.

I'd *love* to see something that can handle both - preferably with
history of reordering, etc. being kept. IOW, not just a tree of changesets
but a lattice - with multiple paths leading to the same node. So far
I've seen nothing of that kind ;-/

2005-04-07 17:36:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Thu, 7 Apr 2005, Daniel Phillips wrote:
>
> In that case, a nice refinement is to put the sequence number at the end of
> the subject line so patch sequences don't interleave:

No. That makes it unsortable, and also much harder to pick out which part
of the subject line is the explanation, and which part is just metadata
for me.

So my preference is _overwhelmingly_ for the format that Andrew uses (which
is partly explained by the fact that I am used to it, but also by the fact
that I've asked for Andrew to make trivial changes to match my usage).

That canonical format is:

Subject: [PATCH 001/123] [<area>:] <explanation>

together with the first line of the body being a

From: Original Author <[email protected]>

followed by an empty line and then the body of the explanation.

After the body of the explanation comes the "Signed-off-by:" lines, and
then a simple "---" line, and below that comes the diffstat of the patch
and then the patch itself.

That's the "canonical email format", and it's that because my normal
scripts (in BK/tools, but now I'm working on making them more generic)
take input that way. It's very easy to sort the emails alphabetically by
subject line - pretty much any email reader will support that - since the
sequence number is zero-padded, the numerical and alphabetic sort is the
same.
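A minimal sketch of that sort-and-strip step (this is not Linus's actual
BK/tools script, which isn't shown here, just the same idea in Python):

import re

subjects = [
    "[PATCH 002/123] x86: fix eflags tracking",
    "[PATCH 001/123] sparc: fix ptrace",
]

for s in sorted(subjects):    # zero-padding makes alphabetic == numeric order
    # drop the bracketed metadata, keep "[PATCH]" plus the explanation
    print(re.sub(r'^\[[^]]*\]', '[PATCH]', s))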

If you send several sequences, you either send a simple explaining email
before the second sequence (hey, it's not like I'm a machine - I can use
my brains too, and in particular if the final number of patches in each
sequence is different, even if the sequences got re-ordered and are
overlapping, I can still just extract one from the other by selecting for
"/123] " in the subject line), or you modify the Subject: line subtly to
still sort uniquely and alphabetically in-order, ie the subject lines for
the second series might be

Subject: [PATCHv2 001/207] x86: fix eflags tracking
...

All very unambiguous, and my scripts already remove everything inside the
brackets and will just replace it with "[PATCH]" in the final version.

Linus

2005-04-07 17:45:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Thu, 7 Apr 2005, Al Viro wrote:
>
> No. There's another reason - when you are cherry-picking and reordering
> *your* *own* *patches*.

Yes. I agree. There should be some support for cherry-picking in between a
temporary throw-away tree and a "cleaned-up-tree". However, it should be
something you really do need to think about, and in most cases it really
does boil down to "export as patch, re-import from patch". Especially
since you potentially want to edit things in between anyway when you
cherry-pick.

(I do that myself: If I have been a messy boy, and committed mixed-up
things as one commit, I export it as a patch, and then I split the patch
by hand into two or more pieces - sometimes by just editing the patch
directly, but sometimes with a combination of applying it, editing the
result, and then re-exporting it as the new version).

And in the cases where this happens, you in fact often have unrelated
changes to the _same_file_, so you really do end up having that middle
step.

In other words, this cherry-picking can generally be scripted and done
"outside" the SCM (you can trivially have a script that takes a revision
from one tree and applies it to the other). I don't believe that the SCM
needs to support it in any fundamentally inherent manner. After all, why
should it, when it really boils down to

(cd old-tree ; scm export-as-patch-plus-comments) |
(cd new-tree ; scm import-patch-plus-comments)

where the "patch-plus-comments" part is just basically an extended patch
(including rename information etc, not just the comments).
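A minimal sketch of such a script in Python, with the two "scm" subcommands
above still purely hypothetical:

import subprocess

def cherry_pick(old_tree, new_tree, rev):
    # export the revision as an extended patch (comments, renames, ...)
    exported = subprocess.run(
        ['scm', 'export-as-patch-plus-comments', rev],
        cwd=old_tree, check=True, capture_output=True).stdout
    # ... possibly edit it here ... then import it into the other tree
    subprocess.run(['scm', 'import-patch-plus-comments'],
                   cwd=new_tree, input=exported, check=True)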

Btw, this method of cherry-picking again requires two _separate_ active
trees at the same time. BK is great at that, and really, that's what
distributed SCM's should be all about anyway. It's not just distributed
between different machines, it's literally distributed even on the same
machine, and it's actively _used_ that way.

Linus

2005-04-07 17:48:05

by Chris Wedgwood

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, Apr 07, 2005 at 10:38:06AM -0700, Linus Torvalds wrote:

> So my preference is _overwhelmingly_ for the format that Andrew uses
> (which is partly explained by the fact that I am used to it, but
> also by the fact that I've asked for Andrew to make trivial changes
> to match my usage).
>
> That canonical format is:
>
> Subject: [PATCH 001/123] [<area>:] <explanation>
>
> together with the first line of the body being a
>
> From: Original Author <[email protected]>
>
> followed by an empty line and then the body of the explanation.

Having a script that checks people get this right before sending it via
email would be a nice thing to put into scripts/ or perhaps
Documentation/.

Does such a thing already exist?
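A rough sketch of what such a checker might look like, testing only the
rules Linus spelled out above (Python, reading one patch mail on stdin):

import re, sys, email

msg = email.message_from_file(sys.stdin)
body = msg.get_payload()
problems = []
if not re.match(r'\[PATCH \d+/\d+\] ', msg.get('Subject', '')):
    problems.append('Subject lacks a "[PATCH nnn/NNN] " prefix')
if 'Signed-off-by:' not in body:
    problems.append('no Signed-off-by: line')
if '\n---\n' not in body:
    problems.append('no "---" line before the diffstat and patch')
for p in problems:
    sys.stderr.write('warning: %s\n' % p)
sys.exit(1 if problems else 0)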

Subject: Re: Kernel SCM saga..

On Apr 7, 2005 7:10 PM, Al Viro <[email protected]> wrote:
> On Thu, Apr 07, 2005 at 08:32:04AM -0700, Linus Torvalds wrote:
> > Also, there's actually a second reason why I've decided that cherry-
> > picking is wrong, and it's non-technical.
> >
> > The thing is, cherry-picking very much implies that the people "up" the
> > foodchain end up editing the work of the people "below" them. The whole
> > reason you want cherry-picking is that you want to fix up somebody else's
> > mistakes, ie something you disagree with.
>
> No. There's another reason - when you are cherry-picking and reordering
> *your* *own* *patches*. That's what I had been unable to explain to
> Larry and that's what made BK unusable for me.

Yep, I missed this in BK a lot.

There is another situation in which cherry-picking is very useful:
even if you have a clean tree, it may still contain bugfixes mixed with
unrelated cleanups, and sometimes you want to apply only the bugfixes.

Bartlomiej

2005-04-07 17:55:29

by Daniel Phillips

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thursday 07 April 2005 13:10, Al Viro wrote:
> The point being, both the history and, well, the publishable result can be
> expressed as a series of small steps, but they are not the same thing. So
> far everything I've seen in the area (and that includes BK) is heavily
> biased towards the history part, and attempts to use this stuff for
> manipulating patch series turn into fighting the tool.
>
> I'd *love* to see something that can handle both - preferably with
> history of reordering, etc. being kept. IOW, not just a tree of changesets
> but a lattice - with multiple paths leading to the same node. So far
> I've seen nothing of that kind ;-/

Which is a perfect demonstration of why the scm tool has to be free/open
source. We should never have had to plead with BitMover to extend BK in a
direction like that; we should instead have been able to just get the
source and make it do it, like any other open source project.

Three years ago, there was no fully working open source distributed scm code
base to use as a starting point, so extending BK would have been the only
easy alternative. But since then the situation has changed. There are now
several working code bases to provide a good starting point: Monotone, Arch,
SVK, Bazaar-ng and others.

Sure, there are quibbles about all of those, but right now is not the time for
quibbling, because a functional replacement for BK is needed in roughly two
months, capable of losslessly importing the kernel version graph. It only
has to support a subset of BK functionality, e.g., pulling and cloning. It
is ok to be a little slow so long as it is not pathetically slow. The
purpose of the interim solution is just to get the patch flow process back
online.

The key is the _lossless_ part. So long as the interim solution imports the
metadata losslessly, we have the flexibility to switch to a better solution
later, on short notice and without much pain.

So I propose that everybody who is interested, pick one of the above projects
and join it, to help get it to the point of being able to losslessly import
the version graph. Given the importance, I think that _all_ viable
alternatives need to be worked on in parallel, so that two months from now we
have several viable options.

Regards,

Daniel

2005-04-07 18:04:21

by Jörn Engel

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, 7 April 2005 10:47:18 -0700, Linus Torvalds wrote:
> On Thu, 7 Apr 2005, Al Viro wrote:
> >
> > No. There's another reason - when you are cherry-picking and reordering
> > *your* *own* *patches*.
>
> Yes. I agree. There should be some support for cherry-picking in between a
> temporary throw-away tree and a "cleaned-up-tree". However, it should be
> something you really do need to think about, and in most cases it really
> does boil down to "export as patch, re-import from patch". Especially
> since you potentially want to edit things in between anyway when you
> cherry-pick.

For reordering, using patcher, you can simply edit the sequence file
and move lines around. Nice and simple interface.

There is no checking involved, though. If you move dependent patches,
you end up with a mess and either throw it all away or seriously
scratch your head. So a serious SCM might do something like this:

$ cp series new_series
$ vi new_series
$ SCM --reorder new_series
# essentially "mv new_series series", if no checks fail

Merging patches isn't that hard either. Splitting them would remain
manual, as you described it.
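A sketch of the checking such an "SCM --reorder" could do, in Python;
the paths here are purely illustrative, and a scratch copy of the tree is
assumed for testing the new order:

import os, shutil, subprocess, sys

old = open('series').read().split()
new = open('new_series').read().split()
if sorted(old) != sorted(new):
    sys.exit('new_series adds or drops patches; only reordering is allowed')

shutil.rmtree('scratch', ignore_errors=True)
shutil.copytree('tree', 'scratch')          # throwaway copy to apply against
for p in new:
    if subprocess.call(['patch', '-p1', '-s', '-i', os.path.abspath(p)],
                       cwd='scratch') != 0:
        sys.exit('%s no longer applies at its new position' % p)

os.rename('new_series', 'series')           # every check passed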

> Btw, this method of cherry-picking again requires two _separate_ active
> trees at the same time. BK is great at that, and really, that's what
> distributed SCM's should be all about anyway. It's not just distributed
> between different machines, it's literally distributed even on the same
> machine, and it's actively _used_ that way.

Amen!

Jörn

--
He who knows that enough is enough will always have enough.
-- Lao Tsu

2005-04-07 18:07:01

by Magnus Damm

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Apr 7, 2005 7:38 PM, Linus Torvalds <[email protected]> wrote:
> So my preference is _overwhelmingly_ for the format that Andrew uses (which
> is partly explained by the fact that I am used to it, but also by the fact
> that I've asked for Andrew to make trivial changes to match my usage).
>
> That canonical format is:
>
> Subject: [PATCH 001/123] [<area>:] <explanation>
>
> together with the first line of the body being a
>
> From: Original Author <[email protected]>
>
> followed by an empty line and then the body of the explanation.
>
> After the body of the explanation comes the "Signed-off-by:" lines, and
> then a simple "---" line, and below that comes the diffstat of the patch
> and then the patch itself.

While specifying things, wouldn't it be useful to have a line
containing tags that specify whether the patch contains new features, a
bug fix or a high-priority security fix? Then that information could
be used to find patches for the sucker-tree.

/ magnus

2005-04-07 18:15:33

by Dmitry Yusupov

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, 2005-04-07 at 13:54 -0400, Daniel Phillips wrote:
> Three years ago, there was no fully working open source distributed scm code
> base to use as a starting point, so extending BK would have been the only
> easy alternative. But since then the situation has changed. There are now
> several working code bases to provide a good starting point: Monotone, Arch,
> SVK, Bazaar-ng and others.

Right. For example, SVK is a pretty mature project, very close to its 1.0
release now. And it supports all kinds of merges, including Cherry-Picking
Mergeback:

http://svk.elixus.org/?MergeFeatures

Dmitry

2005-04-07 18:25:57

by Daniel Phillips

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thursday 07 April 2005 14:04, Jörn Engel wrote:
> On Thu, 7 April 2005 10:47:18 -0700, Linus Torvalds wrote:
>> ... There should be some support for cherry-picking in between
> > a temporary throw-away tree and a "cleaned-up-tree". However, it should
> > be something you really do need to think about, and in most cases it
> > really does boil down to "export as patch, re-import from patch".
> > Especially since you potentially want to edit things in between anyway
> > when you cherry-pick.
>
> For reordering, using patcher, you can simply edit the sequence file
> and move lines around. Nice and simple interface.
>
> There is no checking involved, though. If you move dependent patches,
> you end up with a mess and either throw it all away or seriously
> scratch your head. So a serious SCM might do something like this:
>
> $ cp series new_series
> $ vi new_series
> $ SCM --reorder new_series
> # essentially "mv new_series series", if no checks fail
>
> Merging patches isn't that hard either. Splitting them would remain
> manual, as you described it.

Well, it's clear that adding cherry-picking, patch reordering, splitting and
merging (two patches into one) is not even hard; it's just a matter of making
it convenient by _building it into the tool_. Now, can we just pick a tool
and do it, please? :-)

Regards,

Daniel

2005-04-07 18:28:13

by Daniel Phillips

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thursday 07 April 2005 14:13, Dmitry Yusupov wrote:
> On Thu, 2005-04-07 at 13:54 -0400, Daniel Phillips wrote:
> > Three years ago, there was no fully working open source distributed scm
> > code base to use as a starting point, so extending BK would have been the
> > only easy alternative. But since then the situation has changed. There
> > are now several working code bases to provide a good starting point:
> > Monotone, Arch, SVK, Bazaar-ng and others.
>
> Right. For example, SVK is a pretty mature project, very close to its 1.0
> release now. And it supports all kinds of merges, including Cherry-Picking
> Mergeback:
>
> http://svk.elixus.org/?MergeFeatures

So for an interim way to get the patch flow back online, SVK is ready to try
_now_, and we only need a way to import the version graph? (true/false)

Regards,

Daniel

2005-04-07 18:35:03

by Daniel Phillips

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thursday 07 April 2005 13:38, Linus Torvalds wrote:
> On Thu, 7 Apr 2005, Daniel Phillips wrote:
> > In that case, a nice refinement is to put the sequence number at the end
> > of the subject line so patch sequences don't interleave:
>
> No. That makes it unsortable, and also much harder to pick out which part
> of the subject line is the explanation, and which part is just metadata
> for me.

Well, my list in the parent post _was_ sorted by subject. But that is a
quibble, the important point is that you just officially defined the
canonical format, which everybody should stick to for now:

> That canonical format is:
>
> Subject: [PATCH 001/123] [<area>:] <explanation>
>
> together with the first line of the body being a
>
> From: Original Author <[email protected]>
>
> followed by an empty line and then the body of the explanation.
>
> After the body of the explanation comes the "Signed-off-by:" lines, and
> then a simple "---" line, and below that comes the diffstat of the patch
> and then the patch itself.
>
> That's the "canonical email format", and it's that because my normal
> scripts (in BK/tools, but now I'm working on making them more generic)
> take input that way. It's very easy to sort the emails alphabetically by
> subject line - pretty much any email reader will support that - since the
> sequence number is zero-padded, the numerical and alphabetic sort is the
> same.
>
> If you send several sequences, you either send a simple explaining email
> before the second sequence (hey, it's not like I'm a machine - I can use
> my brains too, and in particular if the final number of patches in each
> sequence is different, even if the sequences got re-ordered and are
> overlapping, I can still just extract one from the other by selecting for
> "/123] " in the subject line), or you modify the Subject: line subtly to
> still sort uniquely and alphabetically in-order, ie the subject lines for
> the second series might be
>
> Subject: [PATCHv2 001/207] x86: fix eflags tracking
> ...
>
> All very unambiguous, and my scripts already remove everything inside the
> brackets and will just replace it with "[PATCH]" in the final version.

Regards,

Daniel

2005-04-07 19:56:06

by Sam Ravnborg

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, Apr 07, 2005 at 01:00:51PM -0400, Daniel Phillips wrote:
> On Thursday 07 April 2005 11:10, Linus Torvalds wrote:
> > On Thu, 7 Apr 2005, Paul Mackerras wrote:
> > > Do you have it automated to the point where processing emailed patches
> > > involves little more overhead than doing a bk pull?
> >
> > It's more overhead, but not a lot. Especially nice numbered sequences like
> > Andrew sends (where I don't have to manually try to get the dependencies
> > right by trying to figure them out and hope I'm right, but instead just
> > sort by Subject: line)...
>
> Hi Linus,
>
> In that case, a nice refinement is to put the sequence number at the end of
> the subject line so patch sequences don't interleave:
>
> Subject: [PATCH] Unbork OOM Killer (1 of 3)
> Subject: [PATCH] Unbork OOM Killer (2 of 3)
> Subject: [PATCH] Unbork OOM Killer (3 of 3)
> Subject: [PATCH] Unbork OOM Killer (v2, 1 of 3)
> Subject: [PATCH] Unbork OOM Killer (v2, 2 of 3)

This breaks the rule of a descriptive subject for each patch.
Consider 30 subjects all telling you "Subject: PCI updates [001/030]".
That is not good.

Sam

2005-04-07 20:54:56

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, 2005-04-07 at 20:04 +0200, Jörn Engel wrote:
> On Thu, 7 April 2005 10:47:18 -0700, Linus Torvalds wrote:
> > On Thu, 7 Apr 2005, Al Viro wrote:
> > >
> > > No. There's another reason - when you are cherry-picking and reordering
> > > *your* *own* *patches*.
> >
> > Yes. I agree. There should be some support for cherry-picking in between a
> > temporary throw-away tree and a "cleaned-up-tree". However, it should be
> > something you really do need to think about, and in most cases it really
> > does boil down to "export as patch, re-import from patch". Especially
> > since you potentially want to edit things in between anyway when you
> > cherry-pick.
>
> For reordering, using patcher, you can simply edit the sequence file
> and move lines around. Nice and simple interface.
>
> There is no checking involved, though. If you move dependent patches,
> you end up with a mess and either throw it all away or seriously
> scratch your head. So a serious SCM might do something like this:


just fyi, patchutils has a tool that can "flip" the order of two patches
even if they patch the same line of code in the files... with it you
can do a "bubble sort" to move stuff about safely...


2005-04-07 23:28:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Thu, 7 Apr 2005, Martin Pool wrote:
>
> Importing the first snapshot (2004-01-01) took 41.77s user, 1:23.79
> total. Each subsequent day takes about 10s user, 30s elapsed to commit
> into bzr. The speeds are comparable to CVS or a bit faster, and may be
> faster than other distributed systems. (This on a laptop with a 5400rpm
> disk.) Pulling out a complete copy of the tree as it was on a previous
> date takes about 14s user, 60s elapsed.

If you have an exportable tree, can you just make it pseudo-public, tell
me where to get a buildable system that works well enough, point me to
some documentation, and maybe I can get some feel for it?

Linus

2005-04-07 23:27:22

by Dave Airlie

[permalink] [raw]
Subject: Re: Kernel SCM saga..

> > Are you happy with processing patches + descriptions, one per mail?
>
> Yes. That's going to be my interim, I was just hoping that with 2.6.12-rc2
> out the door, and us in a "calming down" period, I could afford to not
> even do that for a while.
>
> The real problem with the email thing is that it ends up piling up: what
> BK did in this respect was that anything that piled up in a BK repository
> ended up still being there, and a single "bk pull" got it anyway - so if
> somebody got ignored because I was busy with something else, it didn't add
> any overhead. The queue didn't get "congested".
>
> And that's a big thing. It comes from the "Linus pulls" model where people
> just told me that they were ready, instead of the "everybody pushes to
> Linus" model, where the destination gets congested at times.

Something I think we'll miss in the long run is bkbits.net. Being able
to just push all patches for Linus to a tree and then forget about
that tree until Linus pulled from it was invaluable.. the fact that
this tree was online the whole time, and that you didn't queue up huge
mails for Linus's INBOX that could be missed, meant a lot to me compared
to pre-bk workings..

Maybe now that kernel.org has been 'pimped out' we could set some sort
of system up where maintainers can drop a big load of patchsets or
even one big patch into some sort of public area, say these are my
diffs for Linus for his next pull, and let Linus pull them at his
leisure... some kinda rsync'y type thing comes to mind ...

so I can mail Linus and say hey Linus please grab
rsync://pimpedout.kernel.org/airlied/drm-linus and you grab everything
in there and I get notified perhaps or just a log like the bkbits
stats page, and Andrew can grab the patchsets the same as he does for
bk-drm now ... and I can have airlied/drm-2.6 where I can queue stuff
for -mm then just re-generate the patches for drm-linus later on..

Dave.

2005-04-08 01:00:08

by Ian Wienand

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Wed, Apr 06, 2005 at 08:42:08AM -0700, Linus Torvalds wrote:
> If you must, start reading up on "monotone".

One slightly annoying thing is that monotone doesn't appear to have a
web interface. I used to use the bk one a lot when tracking down
bugs, because it was really fast to have a web browser window open and
click through the revisions of a file reading checkin comments, etc.
Does anyone know if one is being worked on?

The bazaar-ng folks at least mention that this is important in their design
docs, and arch has one in development too.

-i
[email protected]
http://www.gelato.unsw.edu.au



2005-04-08 00:58:26

by Jesse Barnes

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thursday, April 7, 2005 9:40 am, Rik van Riel wrote:
> Larry, thanks for the help you have given us by making
> bitkeeper available for all these years.

A big thank you from me too, I've really enjoyed using BK and I think it's
made me much more productive than I would have been otherwise. I don't envy
you having to put up with the frequent flamefests...

Jesse

2005-04-08 03:35:56

by Jeff Garzik

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus Torvalds wrote:
>
> On Thu, 7 Apr 2005, Daniel Phillips wrote:
>
>>In that case, a nice refinement is to put the sequence number at the end of
>>the subject line so patch sequences don't interleave:
>
>
> No. That makes it unsortable, and also much harder to pick out which part
> of the subject line is the explanation, and which part is just metadata
> for me.
>
> So my preference is _overwhelmingly_ for the format that Andrew uses (which
> is partly explained by the fact that I am used to it, but also by the fact
> that I've asked for Andrew to make trivial changes to match my usage).
>
> That canonical format is:
>
> Subject: [PATCH 001/123] [<area>:] <explanation>
>
> together with the first line of the body being a
>
> From: Original Author <[email protected]>


Nod. For future reference, people can refer to

http://linux.yyz.us/patch-format.html
and/or
http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt

Jeff


2005-04-08 03:42:03

by Jeff Garzik

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus Torvalds wrote:
> In other words, this cherry-picking can generally be scripted and done
> "outside" the SCM (you can trivially have a script that takes a revision
> from one tree and applies it to the other). I don't believe that the SCM
> needs to support it in any fundamentally inherent manner. After all, why
> should it, when it really boils down to
>
> (cd old-tree ; scm export-as-patch-plus-comments) |
> (cd new-tree ; scm import-patch-plus-comments)
>
> where the "patch-plus-comments" part is just basically an extended patch
> (including rename information etc, not just the comments).


Not that it matters anymore, but that's precisely what the script
Documentation/BK-usage/cpcset
did, for BitKeeper.

Jeff


2005-04-08 04:13:50

by Chris Wedgwood

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Wed, Apr 06, 2005 at 08:42:08AM -0700, Linus Torvalds wrote:

> PS. Don't bother telling me about subversion. If you must, start reading
> up on "monotone". That seems to be the most viable alternative, but don't
> pester the developers so much that they don't get any work done. They are
> already aware of my problems ;)

I'm playing with monotone right now. Superficially it looks like it
has tons of gee-whiz neato stuff... however, it's *agonizingly* slow.
I mean glacial. A heavily sedated sloth with no legs is probably
faster.

Using monotone to pull itself took over 2 hours wall-time and 71
minutes of CPU time.

Arguably brand-new CPUs are probably about 2x the speed of what I have
now and there might have been networking funnies --- but that's still
35 minutes to get ~40MB of data.

The kernel is ten times larger, so does that mean to do a clean pull
of the kernel we are looking at (71/2*10) ~ 355 minutes or 6 hours of
CPU time?

2005-04-08 04:40:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Thu, 7 Apr 2005, Chris Wedgwood wrote:
>
> I'm playing with monotone right now. Superficially it looks like it
> has tons of gee-whiz neato stuff... however, it's *agonizingly* slow.
> I mean glacial. A heavily sedated sloth with no legs is probably
> faster.

Yes. The silly thing is, at least in my local tests it doesn't actually
seem to be _doing_ anything while it's slow (there are no system calls
except for a few memory allocations and de-allocations). It seems to have
some exponential function on the number of pathnames involved etc.

I'm hoping they can fix it, though. The basic notions do not sound wrong.

In the meantime (and because monotone really _is_ that slow), here's a
quick challenge for you, and any crazy hacker out there: if you want to
play with something _really_ nasty (but also very _very_ fast), take a
look at kernel.org:/pub/linux/kernel/people/torvalds/.

First one to send me the changelog tree of sparse-git (and a tool to
commit and push/pull further changes) gets a gold star, and an honorable
mention. I've put a hell of a lot of clues in there (*).

I've worked on it (and little else) for the last two days. Time for
somebody else to tell me I'm crazy.

Linus

(*) It should be easier than it sounds. The database is designed so that
you can do the equivalent of a nonmerging (ie pure superset) push/pull
with just plain rsync, so replication really should be that easy (if
somewhat bandwidth-intensive due to the whole-file format).

Never mind merging. It's not an SCM, it's a distribution and archival
mechanism. I bet you could make a reasonable SCM on top of it, though.
Another way of looking at it is to say that it's really a content-
addressable filesystem, used to track directory trees.
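The content-addressable core is small enough to sketch. This is only the
idea, not git's actual object format (which also compresses and type-tags
each object):

import hashlib, os

def put(data, root='objects'):
    # the object's name is the hash of its contents, so identical data
    # lands at the same path on every replica; a pure-superset push/pull
    # is then nothing more than copying over the files the other side lacks
    name = hashlib.sha1(data).hexdigest()
    path = os.path.join(root, name[:2], name[2:])
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'wb') as f:
            f.write(data)
    return name

def get(name, root='objects'):
    with open(os.path.join(root, name[:2], name[2:]), 'rb') as f:
        return f.read()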

2005-04-08 05:05:03

by Chris Wedgwood

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, Apr 07, 2005 at 09:42:04PM -0700, Linus Torvalds wrote:

> Yes. The silly thing is, at least in my local tests it doesn't
> actually seem to be _doing_ anything while it's slow (there are no
> system calls except for a few memory allocations and
> de-allocations). It seems to have some exponential function on the
> number of pathnames involved etc.

I see lots of brk calls changing the heap size, up, down, up, down,
over and over.

This smells a bit like c++ new/delete behavior to me.

2005-04-08 05:15:24

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Followup to: <[email protected]>
By author: Chris Wedgwood <[email protected]>
In newsgroup: linux.dev.kernel
>
> On Thu, Apr 07, 2005 at 09:42:04PM -0700, Linus Torvalds wrote:
>
> > Yes. The silly thing is, at least in my local tests it doesn't
> > actually seem to be _doing_ anything while it's slow (there are no
> > system calls except for a few memory allocations and
> > de-allocations). It seems to have some exponential function on the
> > number of pathnames involved etc.
>
> I see lots of brk calls changing the heap size, up, down, up, down,
> over and over.
>
> This smells a bit like c++ new/delete behavior to me.
>

Hmmm... can glibc be clued in to do some hysteresis on the memory
allocation?

-hpa

2005-04-08 05:57:46

by Martin Pool

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, 2005-04-07 at 16:27 -0700, Linus Torvalds wrote:
>
> On Thu, 7 Apr 2005, Martin Pool wrote:
> >
> > Importing the first snapshot (2004-01-01) took 41.77s user, 1:23.79
> > total. Each subsequent day takes about 10s user, 30s elapsed to commit
> > into bzr. The speeds are comparable to CVS or a bit faster, and may be
> > faster than other distributed systems. (This on a laptop with a 5400rpm
> > disk.) Pulling out a complete copy of the tree as it was on a previous
> > date takes about 14s user, 60s elapsed.
>
> If you have an exportable tree, can you just make it pseudo-public, tell
> me where to get a buildable system that works well enough, point me to
> some documentation, and maybe I can get some feel for it?

Hi,

There is a "stable" release here:
http://www.bazaar-ng.org/pkg/bzr-0.0.3.tgz

All you should need to do is unpack that and symlink bzr onto your path.

You can get the current bzr development tree, stored in itself, by
rsync:

rsync -av ozlabs.org::mbp/bzr/dev ~/bzr.dev

Inside that directory you can run 'bzr info', 'bzr status --all', 'bzr
unknowns', 'bzr log', 'bzr ignored'.

Repeated rsyncs will bring you up to date with what I've done -- and
will of course overwrite any local changes.

If someone was going to do development on this, the method would
typically be to have two copies of the tree, one tracking my version and
another for your own work -- much as with bk. In your own tree, you can
do 'bzr add', 'bzr remove', 'bzr diff', 'bzr commit'.

At the moment all you can do is diff against the previous revision, or
manually diff the two trees, or use quilt, so it is just an archival
system, not a full SCM system. In the near future there will be some
code to extract the differences as changesets to be mailed off.

I have done a rough-as-guts import from bkcvs into this, and I can
advertise that when it's on a server that can handle the load.

At a glance this looks very similar to git -- I can go into the
differences and why I did them the other way if you want.

--
Martin



2005-04-08 06:15:09

by Matthias Urlichs

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Hi, Jan Hudec wrote on Thu, 07 Apr 2005 09:44:08 +0200:

> 1) GNU Arch/Bazaar. They use the same archive format, simple, have the
> concepts right. It may need some scripts or add ons. When Bazaar-NG is
> ready, it will be able to read the GNU Arch/Bazaar archives so
> switching should be easy.

Plus Bazaar has multiple implementations (C and Python). Plus arch can
trivially export single patches. Plus ... well, you get the idea. ;-)

Linus: Care to share your SCM feature requirement list?

--
Matthias Urlichs | {M:U} IT Design @ m-u-it.de | [email protected]


2005-04-08 06:40:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Fri, 8 Apr 2005, Martin Pool wrote:
>
> You can get the current bzr development tree, stored in itself, by
> rsync:

I was thinking more of an exportable kernel tree in addition to the tool.

The reason I mention that is just that I know several SCM's bog down under
load horribly, so it actually matters what the size of the tree is.

And I'm absolutely _not_ asking you for the 60,000 changesets that are in
the BK tree, I'd be perfectly happy with a 2.6.12-rc2-based one for
testing.

I know I can import things myself, but the reason I ask is because I've
got several SCM's I should check out _and_ I've been spending the last two
days writing my own fallback system so that I don't get screwed if nothing
out there works right now.

Which is why I'd love to hear from people who have actually used various
SCM's with the kernel. There's bound to be people who have already tried.

I've gotten a lot of email of the kind "I love XYZ, you should try it
out", but so far I've not seen anybody say "I've tracked the kernel with
XYZ, and it does ..."

So, this is definitely not a "Martin Pool should do this" kind of issue:
I'd like many people to test out many alternatives, to get a feel for
where they are especially for a project the size of the kernel..

Linus

2005-04-08 07:06:31

by Rogan Dawes

[permalink] [raw]
Subject: Re: Kernel SCM saga..

H. Peter Anvin wrote:
> Followup to: <[email protected]>
> By author: Chris Wedgwood <[email protected]>
> In newsgroup: linux.dev.kernel
>
>>On Thu, Apr 07, 2005 at 09:42:04PM -0700, Linus Torvalds wrote:
>>
>>
>>>Yes. The silly thing is, at least in my local tests it doesn't
>>>actually seem to be _doing_ anything while it's slow (there are no
>>>system calls except for a few memory allocations and
>>>de-allocations). It seems to have some exponential function on the
>>>number of pathnames involved etc.
>>
>>I see lots of brk calls changing the heap size, up, down, up, down,
>>over and over.
>>
>>This smells a bit like c++ new/delete behavior to me.
>>
>
>
> Hmmm... can glibc be clued in to do some hysteresis on the memory
> allocation?
>
> -hpa

Take a look at
http://www.linuxshowcase.org/2001/full_papers/ezolt/ezolt_html/

Abstract

GNU libc's default setting for malloc can cause a significant
performance penalty for applications that use it extensively, such as
Compaq's high performance extended math library, CXML. The default
malloc tuning can cause a significant number of minor page faults, and
result in application performance of only half of the true potential.
This paper describes how to remove the performance penalty using
environmental variables and the method used to discover the cause of the
malloc performance penalty.
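The tuning in question comes down to environment variables that glibc's
malloc honours (documented with mallopt(3)); illustratively, wrapping a
child process from Python, with the values picked only for the sake of
example:

import os, subprocess

env = dict(os.environ,
           MALLOC_TRIM_THRESHOLD_='-1',            # never trim the heap back
           MALLOC_TOP_PAD_=str(16 * 1024 * 1024))  # grow the heap in big steps
subprocess.call(['monotone', '--version'], env=env)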

Regards,

Rogan

2005-04-08 07:17:33

by ross

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, Apr 07, 2005 at 09:42:04PM -0700, Linus Torvalds wrote:
> In the meantime (and because monotone really _is_ that slow), here's a
> quick challenge for you, and any crazy hacker out there: if you want to
> play with something _really_ nasty (but also very _very_ fast), take a
> look at kernel.org:/pub/linux/kernel/people/torvalds/.

Interesting. I like it, with one modification (see below).

> First one to send me the changelog tree of sparse-git (and a tool to
> commit and push/pull further changes) gets a gold star, and an honorable
> mention. I've put a hell of a lot of clues in there (*).

Here's a partial solution. It does depend on a modified version of
cat-file that behaves like cat. I found it easier to have cat-file
just dump the object indicated on stdout. Trivial patch for that is included.

Two scripts are included:

1) makechlog.sh takes an object and generates a ChangeLog file
consisting of all the parents of the given object. It's probably
breakable, but correctly outputs the sparse-git changes when run on
HEAD. Handles multiple parents and breaks cycles.

This adds a line to each object "me <sha1>". This lets a change
identify itself.

It takes 35 seconds to produce all the change history on my box. It
produces a single file named "ChangeLog".

2) chkchlog.sh uses the "me" entries to verify that #1 didn't miss any
parents. It's mostly to prove my solution reasonably correct :-)
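The attached scripts aren't reproduced inline; a rough Python equivalent
of the parent walk in #1 (assuming the patched cat-file from the patch
below, and the "parent <sha1>" lines of the commit text) would be:

import subprocess

def cat(sha1):
    # relies on the modified cat-file that dumps the object on stdout
    return subprocess.check_output(['./cat-file', sha1]).decode()

def changelog(head):
    seen, todo, log = set(), [head], []
    while todo:
        sha1 = todo.pop()
        if sha1 in seen:               # breaks cycles, tolerates merges
            continue
        seen.add(sha1)
        text = cat(sha1)
        log.append(text)
        for line in text.splitlines():
            if line.startswith('parent '):
                todo.append(line.split()[1])
    return '\n'.join(log)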

The patch is below, the scripts are attached, and everything is
available here:

http://lug.udel.edu/~ross/git/

Now to see what I come up with for commit, push, and pull...

--
Ross Vandegrift
[email protected]

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37


--- cat-file.orig.c 2005-04-08 01:53:54.000000000 -0400
+++ cat-file.c 2005-04-08 01:57:51.000000000 -0400
@@ -11,18 +11,11 @@
char type[20];
void *buf;
unsigned long size;
- char template[] = "temp_git_file_XXXXXX";
- int fd;

if (argc != 2 || get_sha1_hex(argv[1], sha1))
usage("cat-file: cat-file <sha1>");
buf = read_sha1_file(sha1, type, &size);
if (!buf)
exit(1);
- fd = mkstemp(template);
- if (fd < 0)
- usage("unable to create tempfile");
- if (write(fd, buf, size) != size)
- strcpy(type, "bad");
- printf("%s: %s\n", template, type);
+ printf ("%s", buf);
}


Attachments: makechlog.sh (981 B), chkchlog.sh (199 B)

2005-04-08 07:19:46

by Daniel Phillips

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Friday 08 April 2005 03:05, Rogan Dawes wrote:
> Take a look at
> http://www.linuxshowcase.org/2001/full_papers/ezolt/ezolt_html/
>
> Abstract
>
> GNU libc's default setting for malloc can cause a significant
> performance penalty for applications that use it extensively, such as
> Compaq's high performance extended math library, CXML. The default
> malloc tuning can cause a significant number of minor page faults, and
> result in application performance of only half of the true potential.

This does not smell like an n*2 suckage, more like n^something suckage.
Finding the elephant under the rug should not be hard. Profile?

Regards,

Daniel

2005-04-08 07:23:13

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, Apr 07, 2005 at 09:42:04PM -0700, Linus Torvalds wrote:
> play with something _really_ nasty (but also very _very_ fast), take a
> look at kernel.org:/pub/linux/kernel/people/torvalds/.

Why not use sql as the backend instead of the tree of directories? That
solves userland journaling too (one still has to be careful to know the
read-committed semantics of sql, which is not obvious stuff, but 99% of
common cases like this one just work safely and automatically, since all
inserts/deletes/updates are atomic).

You can keep the design of your db exactly the same, and even the command
line of your script the same, except you won't have to deal with the
implementation of it anymore, and the end result may run even faster with
proper btrees. You won't have scalability issues if the directory of hashes
fills up, and it'll get userland journaling, live backups, runtime analyses
of your queries with genetic algorithms (pgsql 8 seems to have it) etc...

I seem to recall there's a way to do delayed commits too, so you won't
be synchronous, but you'll still have journaling. You clearly don't care
about synchronous writes; all you care about is that the commit is
either committed completely or not committed at all (i.e. not a half
write of the patch that leaves your db corrupt).

Example:

CREATE TABLE patches (
patch BIGSERIAL PRIMARY KEY,

commiter_name VARCHAR(32) NOT NULL CHECK(commiter_name != ''),
commiter_email VARCHAR(32) NOT NULL CHECK(commiter_email != ''),

md5 CHAR(32) NOT NULL CHECK(md5 != ''),
len INTEGER NOT NULL CHECK(len > 0),
UNIQUE(md5, len),

payload BYTEA NOT NULL,

timestamp TIMESTAMP NOT NULL
);
CREATE INDEX patches_md5_index ON patches (md5);
CREATE INDEX patches_timestamp_index ON patches (timestamp);

s/md5/sha1/, no difference.

This will automatically spawn fatal errors if there are hash collisions and it
enforces a bit of checking.

Then you need a few lines of python to insert/lookup. Example for psycopg2:

import pwd, os, socket, md5

[..]

patch = {'commiter_name': pwd.getpwuid(os.getuid())[4],
 'commiter_email': pwd.getpwuid(os.getuid())[0] + '@' + socket.getfqdn(),
 'md5': md5.new(data).hexdigest(), 'len': len(data),
 'payload': data, 'timestamp': 'now'}
curs.execute("""INSERT INTO patches
 (commiter_name, commiter_email, md5, len, payload, timestamp)
 VALUES (%(commiter_name)s, %(commiter_email)s,
 %(md5)s, %(len)s, %(payload)s, %(timestamp)s)""", patch)

('now' will be evaluated by the sql server, who knows about the time too)
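The lookup side is symmetric (again psycopg2, reusing curs, data and md5
from the snippet above; the UNIQUE(md5, len) pair is the natural key):

curs.execute("""SELECT payload FROM patches
                WHERE md5 = %(md5)s AND len = %(len)s""",
             {'md5': md5.new(data).hexdigest(), 'len': len(data)})
row = curs.fetchone()    # None if the patch isn't in the db yet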

The speed I don't know for sure, but especially with lots of data the sql way
should at least not be significantly slower, pgsql scales with terabytes
without apparent problems (modulo the annoyance of running vacuum once per day
in cron, to avoid internal sequence number overflows after >4 giga
commits, and once per day the analyser too so it learns about your
usage patterns and can optimize the disk format for it).

For sure the python part isn't going to be noticeable, you can still write it
in C if you prefer (it'll clearly run faster if you want to run tons of
inserts for a benchmark), so then everything will run at bare-hardware
speed and there will be no time wasted interpreting bytecode (only the
sql commands have to be interpreted).

The backup should also be tiny (the runtime size is going to be somewhat
larger due to the extra data structures it has; how much larger I don't
know). I know for sure
this kind of setup works like a charm on ppc64 (32bit userland), and x86 (32bit
and 64bit userland).

monotone using sqlite sounds like a good idea, in fact (IMHO they could use
a real dbms too, so that you also get parallelism and could attach another
app to the backing store at the same time, or run a live backup, and get
all the other high-end performance features).

If you feel this is too bloated feel free to ignore this email of course! If
instead you'd like to give this a spin, let me know and I can help to
set it up quickly (either today or from Monday).

I also like quick dedicated solutions, and I was about to write a backing
store with a tree of dirs + hashes similar to yours for a similar
problem, but I gave it up while planning the userland journaling part,
and even worse the userland fs locking with live backups, when a DBMS
gets everything right, including live backups (and it provides an async
interface too, via sockets). OTOH for this usage journaling and locking
aren't a big issue, since you can hash the patch by hand to find
any potentially half-corrupted bits after reboot, and you probably run it
serially.

About your compression of the data, I don't think you want to do that.
The size of the live image isn't the issue, the issue is the size of the
_backups_, and you want to compress one huge thing (i.e. the tarball of
the cleartext, or the sql cleartext backup), not many tiny patches.

Comparing the size of the repositories isn't interesting, the
interesting thing is to compare the size of the backups.

BTW, this fixed compilation for my system.

--- ./Makefile.orig 2005-04-08 09:07:17.000000000 +0200
+++ ./Makefile 2005-04-08 08:52:35.000000000 +0200
@@ -8,7 +8,7 @@ all: $(PROG)
install: $(PROG)
install $(PROG) $(HOME)/bin/

-LIBS= -lssl
+LIBS= -lssl -lz

init-db: init-db.o


Thanks.

2005-04-08 07:34:51

by Marcel Lanz

[permalink] [raw]
Subject: Re: Kernel SCM saga..

git on sarge

--- git-0.02/Makefile.orig 2005-04-07 23:06:19.000000000 +0200
+++ git-0.02/Makefile 2005-04-08 09:24:28.472672224 +0200
@@ -8,7 +8,7 @@ all: $(PROG)
install: $(PROG)
install $(PROG) $(HOME)/bin/

-LIBS= -lssl
+LIBS= -lssl -lz

init-db: init-db.o

2005-04-08 07:50:26

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Daniel Phillips wrote:
> On Friday 08 April 2005 03:05, Rogan Dawes wrote:
>
>>Take a look at
>>http://www.linuxshowcase.org/2001/full_papers/ezolt/ezolt_html/
>>
>>Abstract
>>
>>GNU libc's default setting for malloc can cause a significant
>>performance penalty for applications that use it extensively, such as
>>Compaq's high performance extended math library, CXML. The default
>>malloc tuning can cause a significant number of minor page faults, and
>>result in application performance of only half of the true potential.
>
>
> This does not smell like an n*2 suckage, more like n^something suckage.
> Finding the elephant under the rug should not be hard. Profile?
>

Lack of hysteresis can do that, with large swaths of memory constantly
being claimed and returned to the system. One way to implement
hysteresis would be a decaying peak-based threshold; unfortunately, for
optimal performance that requires the C runtime to have a notion of
time, and in extreme cases even to be able to do asynchronous
deallocation, but in reality one can probably assume that the rate of
malloc/free is roughly constant over time.
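A toy model of that decaying-peak policy (only a sketch of the idea, in
Python; glibc's real knobs are static thresholds):

class DecayingPeak:
    def __init__(self, decay=0.99):
        self.peak = 0.0
        self.decay = decay

    def note_usage(self, live_bytes):
        # remember a peak that slowly decays toward current usage
        self.peak = max(float(live_bytes), self.peak * self.decay)

    def releasable(self, arena_bytes):
        # only the pages above the decayed peak are worth giving back
        return max(0, arena_bytes - int(self.peak))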

-hpa

2005-04-08 08:43:22

by Matt Johnston

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, Apr 07, 2005 at 09:42:04PM -0700, Linus Torvalds wrote:
>
> On Thu, 7 Apr 2005, Chris Wedgwood wrote:
> >
> > I'm playing with monotone right now. Superficially it looks like it
> > has tons of gee-whiz neato stuff... however, it's *agonizingly* slow.
> > I mean glacial. A heavily sedated sloth with no legs is probably
> > faster.
>
> Yes. The silly thing is, at least in my local tests it doesn't actually
> seem to be _doing_ anything while it's slow (there are no system calls
> except for a few memory allocations and de-allocations). It seems to have
> some exponential function on the number of pathnames involved etc.
>
> I'm hoping they can fix it, though. The basic notions do not sound wrong.

That is indeed correct wrt pathnames. The current head of
monotone is a lot better in this regard (on the order of 2-3
minutes for "monotone import" on a 2.6 linux untar). The
basic problem is that in the last release (0.17), a huge
amount of sanity checking code was added to ensure that
inconsistent or generally bad revisions can never be
written/received/transmitted.

The focus is now on speeding that up - there's a _lot_ of
low hanging fruit for us to look at.

Matt

2005-04-08 08:45:26

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, Apr 07, 2005 at 11:41:29PM -0700, Linus Torvalds wrote:
> I know I can import things myself, but the reason I ask is because I've
> got several SCM's I should check out _and_ I've been spending the last two
> days writing my own fallback system so that I don't get screwed if nothing
> out there works right now.

I tend to like bzr too (and I tend to like too many things ;), but even
if an export of the data were available, it still seems too early in
development to be able to help you this week; it seems to lack any form
of network export too.

> I'd like many people to test out many alternatives, to get a feel for
> where they are especially for a project the size of the kernel..

The huge number of changesets is the crucial point; there are good
distributed SCMs already, but they are apparently not efficient enough at
handling 60k changesets.

We'd need a regenerated coherent copy of BKCVS to pipe into those SCMs to
evaluate how well they scale.

OTOH if your git project already allows storing the data in there,
that looks nice ;). I don't yet fully understand how the algorithms of
the trees are meant to work (I only understand the backing store well,
and I tend to prefer a DBMS over a tree of dirs with hashes). So I've no
idea how it can plug in well as a SCM replacement, or how you want to
use it. It seems to be a kind of fully lockless thing where you can merge
from one tree to the other without locks, and where you can make quick
diffs. It looks similar to a diff -ur of two hardlinked trees, except this
one can save a lot of bandwidth when copying over the network with rsync
(i.e. hardlinks become worthless after an rsync over the network, but
hashes do not). Clearly a DBMS couldn't use the rsync binary to distribute
the objects, but a network protocol could do the same thing rsync does.

Thanks.

2005-04-08 09:23:42

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Fri, 8 Apr 2005, Marcel Lanz wrote:
> git on sarge
>
> --- git-0.02/Makefile.orig 2005-04-07 23:06:19.000000000 +0200
> +++ git-0.02/Makefile 2005-04-08 09:24:28.472672224 +0200
> @@ -8,7 +8,7 @@ all: $(PROG)
> install: $(PROG)
> install $(PROG) $(HOME)/bin/
>
> -LIBS= -lssl
> +LIBS= -lssl -lz
>
> init-db: init-db.o
>

I found a few more `issues' after adding `-O3 -Wall'.
Most are cosmetic, but the missing return value in remove_file_from_cache() is
a real bug. Hmm, on closer inspection the caller uses its return value in a weird
way, so another bug may be hiding in add_file_to_cache().

Caveat: everything is untested, besides compilation ;-)

diff -purN git-0.02.orig/Makefile git-0.02/Makefile
--- git-0.02.orig/Makefile 2005-04-07 23:06:19.000000000 +0200
+++ git-0.02/Makefile 2005-04-08 11:02:02.000000000 +0200
@@ -1,4 +1,4 @@
-CFLAGS=-g
+CFLAGS=-g -O3 -Wall
CC=gcc

PROG=update-cache show-diff init-db write-tree read-tree commit-tree cat-file
@@ -8,7 +8,7 @@ all: $(PROG)
install: $(PROG)
install $(PROG) $(HOME)/bin/

-LIBS= -lssl
+LIBS= -lssl -lz

init-db: init-db.o

diff -purN git-0.02.orig/cat-file.c git-0.02/cat-file.c
--- git-0.02.orig/cat-file.c 2005-04-07 23:15:17.000000000 +0200
+++ git-0.02/cat-file.c 2005-04-08 11:07:28.000000000 +0200
@@ -5,6 +5,8 @@
*/
#include "cache.h"

+#include <string.h>
+
int main(int argc, char **argv)
{
unsigned char sha1[20];
@@ -25,4 +27,5 @@ int main(int argc, char **argv)
if (write(fd, buf, size) != size)
strcpy(type, "bad");
printf("%s: %s\n", template, type);
+ exit(0);
}
diff -purN git-0.02.orig/commit-tree.c git-0.02/commit-tree.c
--- git-0.02.orig/commit-tree.c 2005-04-07 23:15:17.000000000 +0200
+++ git-0.02/commit-tree.c 2005-04-08 11:06:08.000000000 +0200
@@ -6,6 +6,7 @@
#include "cache.h"

#include <pwd.h>
+#include <string.h>
#include <time.h>

#define BLOCKING (1ul << 14)
diff -purN git-0.02.orig/init-db.c git-0.02/init-db.c
--- git-0.02.orig/init-db.c 2005-04-07 23:15:17.000000000 +0200
+++ git-0.02/init-db.c 2005-04-08 11:07:33.000000000 +0200
@@ -5,10 +5,12 @@
*/
#include "cache.h"

+#include <string.h>
+
int main(int argc, char **argv)
{
char *sha1_dir = getenv(DB_ENVIRONMENT), *path;
- int len, i, fd;
+ int len, i;

if (mkdir(".dircache", 0700) < 0) {
perror("unable to create .dircache");
@@ -25,7 +27,7 @@ int main(int argc, char **argv)
if (sha1_dir) {
struct stat st;
if (!stat(sha1_dir, &st) < 0 && S_ISDIR(st.st_mode))
- return;
+ exit(1);
fprintf(stderr, "DB_ENVIRONMENT set to bad directory %s: ", sha1_dir);
}

diff -purN git-0.02.orig/read-cache.c git-0.02/read-cache.c
--- git-0.02.orig/read-cache.c 2005-04-07 23:23:43.000000000 +0200
+++ git-0.02/read-cache.c 2005-04-08 11:07:37.000000000 +0200
@@ -5,6 +5,8 @@
*/
#include "cache.h"

+#include <string.h>
+
const char *sha1_file_directory = NULL;
struct cache_entry **active_cache = NULL;
unsigned int active_nr = 0, active_alloc = 0;
@@ -89,7 +91,7 @@ void * read_sha1_file(unsigned char *sha
z_stream stream;
char buffer[8192];
struct stat st;
- int i, fd, ret, bytes;
+ int fd, ret, bytes;
void *map, *buf;
char *filename = sha1_file_name(sha1);

@@ -173,7 +175,7 @@ int write_sha1_file(char *buf, unsigned
int write_sha1_buffer(unsigned char *sha1, void *buf, unsigned int size)
{
char *filename = sha1_file_name(sha1);
- int i, fd;
+ int fd;

fd = open(filename, O_WRONLY | O_CREAT | O_EXCL, 0666);
if (fd < 0)
diff -purN git-0.02.orig/read-tree.c git-0.02/read-tree.c
--- git-0.02.orig/read-tree.c 2005-04-08 04:58:44.000000000 +0200
+++ git-0.02/read-tree.c 2005-04-08 11:07:41.000000000 +0200
@@ -5,6 +5,8 @@
*/
#include "cache.h"

+#include <string.h>
+
static void create_directories(const char *path)
{
int len = strlen(path);
@@ -72,7 +74,6 @@ static int unpack(unsigned char *sha1)

int main(int argc, char **argv)
{
- int fd;
unsigned char sha1[20];

if (argc != 2)
diff -purN git-0.02.orig/show-diff.c git-0.02/show-diff.c
--- git-0.02.orig/show-diff.c 2005-04-07 23:15:17.000000000 +0200
+++ git-0.02/show-diff.c 2005-04-08 11:07:44.000000000 +0200
@@ -5,6 +5,8 @@
*/
#include "cache.h"

+#include <string.h>
+
#define MTIME_CHANGED 0x0001
#define CTIME_CHANGED 0x0002
#define OWNER_CHANGED 0x0004
@@ -60,7 +62,6 @@ int main(int argc, char **argv)
struct stat st;
struct cache_entry *ce = active_cache[i];
int n, changed;
- unsigned int mode;
unsigned long size;
char type[20];
void *new;
diff -purN git-0.02.orig/update-cache.c git-0.02/update-cache.c
--- git-0.02.orig/update-cache.c 2005-04-07 23:15:17.000000000 +0200
+++ git-0.02/update-cache.c 2005-04-08 11:08:55.000000000 +0200
@@ -5,6 +5,8 @@
*/
#include "cache.h"

+#include <string.h>
+
static int cache_name_compare(const char *name1, int len1, const char *name2, int len2)
{
int len = len1 < len2 ? len1 : len2;
@@ -50,6 +52,7 @@ static int remove_file_from_cache(char *
if (pos < active_nr)
memmove(active_cache + pos, active_cache + pos + 1, (active_nr - pos - 1) * sizeof(struct cache_entry *));
}
+ return 0;
}

static int add_cache_entry(struct cache_entry *ce)
@@ -250,4 +253,5 @@ int main(int argc, char **argv)
return 0;
out:
unlink(".dircache/index.lock");
+ exit(0);
}
diff -purN git-0.02.orig/write-tree.c git-0.02/write-tree.c
--- git-0.02.orig/write-tree.c 2005-04-07 23:15:17.000000000 +0200
+++ git-0.02/write-tree.c 2005-04-08 11:07:51.000000000 +0200
@@ -5,6 +5,8 @@
*/
#include "cache.h"

+#include <string.h>
+
static int check_valid_sha1(unsigned char *sha1)
{
char *filename = sha1_file_name(sha1);
@@ -31,7 +33,7 @@ static int prepend_integer(char *buffer,

int main(int argc, char **argv)
{
- unsigned long size, offset, val;
+ unsigned long size, offset;
int i, entries = read_cache();
char *buffer;


Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2005-04-08 11:42:36

by Catalin Marinas

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Chris Wedgwood <[email protected]> wrote:
> I'm playing with monotone right now. Superficially it looks like it
> has tons of gee-whiz neato stuff... however, it's *agonizingly* slow.
> I mean glacial. A heavily sedated sloth with no legs is probably
> faster.

I tried some time ago to import the BKCVS revisions since Linux 2.6.9
into a monotone-0.16 repository. I later tried to upgrade the database
(repository) to monotone version 0.17. The result: converting ~3500
revisions would have taken more than *one year*, a fact confirmed by the
monotone developers. The bottleneck seemed to be the big size of the
manifest (which stores the file names and the corresponding SHA1
values) and all the validation performed when converting. The unsafe
workaround is to disable the revision checks in monotone, but you can
end up with an inconsistent repository (I haven't tried this).

--
Catalin

2005-04-08 12:02:55

by Matthias Andree

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Andrea Arcangeli wrote on 2005-04-08:

> On Thu, Apr 07, 2005 at 09:42:04PM -0700, Linus Torvalds wrote:
> > play with something _really_ nasty (but also very _very_ fast), take a
> > look at kernel.org:/pub/linux/kernel/people/torvalds/.
>
> Why not use sql as the backend instead of the tree of directories? That
> solves userland journaling too (one still has to be careful to know the
> read-committed semantics of sql, which is not obvious stuff, but 99% of
> common cases like this one just work safely and automatically, since all
> inserts/deletes/updates are atomic).
>
> You can keep the design of your db exactly the same, and even the command
> line of your script the same, except you won't have to deal with the
> implementation of it anymore, and the end result may run even faster with
> proper btrees. You won't have scalability issues if the directory of
> hashes fills up, and it'll get userland journaling, live backups, runtime
> analyses of your queries with genetic algorithms (pgsql 8 seems to have
> it) etc...
>
> I seem to recall there's a way to do delayed commits too, so you won't
> be synchronous, but you'll still have journaling. You clearly don't care
> about synchronous writes; all you care about is that the commit is
> either committed completely or not committed at all (i.e. not a half
> write of the patch that leaves your db corrupt).
>
> Example:
>
> CREATE TABLE patches (
> patch BIGSERIAL PRIMARY KEY,
>
> commiter_name VARCHAR(32) NOT NULL CHECK(commiter_name != ''),
> commiter_email VARCHAR(32) NOT NULL CHECK(commiter_email != ''),

The length is too optimistic and insufficient to import the current BK
stuff. I'd vote for 64 or at least 48 for each, although 48 is going to
be a tight fit. It costs a bit but considering the expected payload
size it's irrelevant.

Committer (double t) email is up to 43 characters at the moment and the
name up to 36 characters, according to an analysis of the shortlog
script's address table with this little Perl snippet:

------------------------------------------------------------------------
# %addresses maps committer email (key) to name (value), as built by
# the shortlog script; find the longest of each.
while (($k, $v) = each %addresses) {
    $lk = length $k;
    $lv = length $v;
    if ($lk > $mk) { $mk = $lk; }
    if ($lv > $mv) { $mv = $lv; }
}
print "max key len $mk, max val len $mv\n";
------------------------------------------------------------------------

which prints: (key is the email, val the name)

max key len 43, max val len 36

--
Matthias Andree

2005-04-08 12:21:39

by Florian Weimer

[permalink] [raw]
Subject: Re: Kernel SCM saga..

* Matthias Andree:

>> commiter_name VARCHAR(32) NOT NULL CHECK(commiter_name != ''),
>> commiter_email VARCHAR(32) NOT NULL CHECK(commiter_email != ''),
>
> The length is too optimistic and insufficient to import the current BK
> stuff. I'd vote for 64 or at least 48 for each, although 48 is going to
> be a tight fit. It costs a bit but considering the expected payload
> size it's irrelevant.

You should also check your database documentation to see whether
VARCHAR(n) is actually implemented in the same way as TEXT (or whatever
the unbounded string type is called), plus an additional length check. It
doesn't make much sense to use VARCHAR if there isn't a performance
(or disk space) benefit, IMHO, especially for such data.

2005-04-08 14:24:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Fri, 8 Apr 2005, Andrea Arcangeli wrote:
>
> Why not use sql as a backend instead of the tree of directories?

Because it sucks?

I can come up with millions of ways to slow things down on my own. Please
come up with ways to speed things up instead.

Linus

2005-04-08 15:48:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Fri, 8 Apr 2005 [email protected] wrote:
>
> Here's a partial solution. It does depend on a modified version of
> cat-file that behaves like cat. I found it easier to have cat-file
> just dump the object indicated on stdout. Trivial patch for that is included.

Your trivial patch is trivially incorrect, though. First off, some files
may be binary (and definitely are - the "tree" type object contains
pathnames, and in order to avoid having to worry about special characters
they are NUL-terminated), and your modified "cat-file" breaks that.

Secondly, it doesn't check or print the tag.

That said, I think I agree with your concern, and cat-file should not use
a temp-file. I'll fix it, but I'll also make it verify the tag (so you'd
now have to know the tag in advance if you want to cat the data).

Something like

cat-file -t <sha1> # output the tag
cat-file <tag> <sha1> # output the data

or similar. Easy enough. That way you can do

torvalds@ppc970:~/git> ./cat-file -t `cat .dircache/HEAD `
commit

and

torvalds@ppc970:~/git> ./cat-file commit `cat .dircache/HEAD `

tree ca30cdf8df2f31545cc1f2c1be62619111b6f6aa
parent c2474b336d7a96fb4e03e65d229bcddc62b244fc
author Linus Torvalds <[email protected]> Fri Apr 8 08:16:38 2005
committer Linus Torvalds <[email protected]> Fri Apr 8 08:16:38 2005

Make "cat-file" output the file contents to stdout.

New syntax: "cat-file -t <sha1>" shows the tag, while "cat-file <tag> <sha1>"
outputs the file contents after checking that the supplied tag matches.

I'll rsync the .dircache directory to kernel.org. You'll need to update
your scripts.

> Now to see what I come up with for commit, push, and pull...

A "commit" (*) looks roughly like this:

# check with "show-diff" what has changed, and check if
# you need to add any files..

update-cache <list of files that have been changed/added/deleted>

# check with "show-diff" that it all looks right

oldhead=$(cat .dircache/HEAD)
newhead=$(commit-tree $(write-tree) -p $oldhead < commit-message)

# update the head information
if [ "$newhead" != "" ] ; then echo $newhead > .dircache/HEAD; fi

(*) I call this "commit", but it's really something much simpler. It's
really just a "I now have <this directory state>, I got here from
<collection of previous directory states> and the reason was <reason>".

The "push" I use is

rsync -avz --exclude index .dircache/ <destination-dir>

and you can pull the same way, except when you pull you should save _your_
HEAD file first (and then you're screwed. There's no way to merge. If
you've made changes and committed them, your changes are still there, but
they are now on a different HEAD than the new one).
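
A minimal sketch of such a pull, using the same rsync conventions as the
push above (the OLDHEAD name is made up here; it is not part of git):

	# save your own head first - the incoming HEAD will overwrite it
	cp .dircache/HEAD .dircache/OLDHEAD

	# union in the remote objects (and the remote HEAD)
	rsync -avz --exclude index <source-dir>/ .dircache/

	# your own commits are still reachable through .dircache/OLDHEAD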

That, btw, is kind of the design. "git" really doesn't care about things
like merges. You can use _any_ SCM to do a merge. What "git" does is track
directory state (and how you got to that state), and nothing else. It
doesn't merge, it doesn't really do a whole lot of _anything_.

So when you "pull" or "push" on a git archive, you get the "union" of all
directory states in the destination. The HEAD thing is _one_ pointer into
the "sea of directory states", but you really have to use something else
to merge two directory states together.

Linus

2005-04-08 16:17:15

by Matthias-Christian Ott

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus Torvalds wrote:

>On Fri, 8 Apr 2005, Andrea Arcangeli wrote:
>
>
>>Why not use sql as a backend instead of the tree of directories?
>>
>>
>
>Because it sucks?
>
>I can come up with millions of ways to slow things down on my own. Please
>come up with ways to speed things up instead.
>
> Linus
SQL databases like SQLite aren't slow.
But maybe a Berkeley DB (version 4) database is a better solution.

Matthias-Christian Ott

2005-04-08 16:47:50

by Catalin Marinas

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus Torvalds <[email protected]> wrote:
> Which is why I'd love to hear from people who have actually used various
> SCM's with the kernel. There's bound to be people who have already
> tried.

I (successfully) tried GNU Arch with the Linux kernel. I mirrored all
the BKCVS changesets since Linux 2.6.9 (5300+ changesets) using this
script:
http://wiki.gnuarch.org/BKCVS_20to_20Arch_20Script_20for_20Linux_20Kernel

My repository size is 1.1GB but this is because the script I use
creates a snapshot (i.e. a full tarball) of every main and -rc
release. For each individual changeset, an arch repository has a
patch-xxx directory with a compressed tarball containing the patch, a
log file and a checksum file.

GNU Arch may have some annoying things (file naming, long commands, a
harder learning curve, imposed version naming) and I won't try to
defend them, but, for me, it looked like the best (free) option
available regarding both features and speed. Being changeset oriented
also has some advantages from my point of view. Being distributed
means that you can create a branch on your local repository from a
tree stored on a (read-only) remote repository (hosted on an ftp/http
server).

I can't compare it with BK since I haven't used it.

The way I use it:

- a main repository tracking all the changes to the bk-head,
linux--main--2.6 (for those that never read/heard about arch, a tree
name has the form "name--branch--version")
- my main branch from the mainline tree, linux-arm--main--2.6, that
was integrating my patches and was periodically merging the latest
changes in linux--main--2.6
- different linux-arm--platformX--2.6 or linux-arm--deviceX--2.6 trees
that were eventually merged into the linux-arm--main--2.6 tree

The main merge algorithm is called star-merge and does a three-way
merge between the local tree, the remote one and the common ancestor
of these. Cherry picking is also supported for those that like it (I
found it very useful if, for example, I fix a general bug in a branch
that should be integrated in the main tree but the branch is not yet
ready for inclusion).

All the standard commands like commit, diff, status etc. are supported
by arch. A useful command is "missing" which shows what changes are
present in a tree and not in the current one. It is handy to see a
summary of the remote changes before doing a merge (and faster than a
full diff). It also supports file/directory renaming.
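
For illustration, one round of the merge workflow described above might
look like this (tla is the arch command-line client; exact option
spellings may vary between tla releases, so treat this as a sketch):

	cd ~/linux-arm--main--2.6               # working copy of my branch
	tla missing                             # summarize unmerged changes
	tla star-merge linux--main--2.6         # three-way merge from mainline
	# fix any conflicts, then record the merge as a single commit:
	tla commit -s "merge from linux--main--2.6"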

To speed things up, arch uses a revision library with a directory for
every revision, the files being hard-linked between revisions to save
space. You can also hard-link the working tree to the revision library
(which speeds the tree diff operation) but you need to make sure that
your editor renames the original file before saving a copy.

Having snapshots takes space, but they are useful both for getting a
revision quickly and for creating a revision in the library.

A diff command usually takes around 1 min (on a P4 at 2.5GHz with IDE
drives) if the current revision is in the library. The tree diff is
the main time consuming operation when committing small changes. If
the revision is not in the library, it will try to create it by
hard-linking with a previous one and applying the corresponding
patches (later version I think can reverse-apply patches from newer
revisions).

The merge operation might take some time (minutes, even 10-20 minutes
for 1000+ changesets) depending on the number of changesets and
whether the revisions are already in the revision library. You can
specify a three-way merge that places conflict markers in the file
(like diff3 or cvs) or a two-way merge which is equivalent to applying
a patch (if you prefer a two-way merge, the "replay" command is
actually the fastest, it takes ~2 seconds to apply a small changeset
and doesn't need to go to the revision library). Once a merge operation
completes, you would need to fix the conflicts and commit the
changes. All the logs are preserved but the newly merged individual
changes are seen as a single commit in the local tree.

In the way I use it (with a linux--main--2.6 tree similar to bk-head)
I think arch would get slow with time as changesets accumulate. The
way its developers advise using it is to work, for example, on a
linux--main--2.6.12 tree for preparing this release and, once it is
ready, seal it (commit --seal). Further commits need to have a --fix
option and they should mainly be bug fixes. At this point you can
branch the linux--main--2.6.13 and start working on it. This new tree
can easily merge the bug fixes applied to the previous version. Arch
developers also recommend using a new repository every year,
especially if there are many changesets.

A problem I found is that, even though the library revisions are
hard-linked, they still take a lot of space and should be cleaned
periodically (a cron script that checks the last access time is
available).

By default, arch also complains (and exits) about unknown files in the
working tree. Its developer(s) believe that compilation should be
done in a different directory. I didn't find this a problem, even
though I use the same tree to compile for several platforms: arch can
be configured to ignore such files, based on regexps.

I also tried monotone and darcs (since these two, unlike svn, can do
proper merging and preserve the merge history) but arch was by far the
fastest (CVS/RCS are hard to beat on speed).

Unfortunately, I can't make my repository public because of IT desk
issues but let me know if you'd like me to benchmark different
operations (or if you'd like a simple list of commands to create your
own).

Hope you find this useful.

--
Catalin

2005-04-08 17:12:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Fri, 8 Apr 2005, Matthias-Christian Ott wrote:
>
> SQL Databases like SQLite aren't slow.

After applying a patch, I can do a complete "show-diff" on the kernel tree
to see the effect of it in about 0.15 seconds.

Also, I can use rsync to efficiently replicate my database without having
to re-send the whole crap - it only needs to send the new stuff.

You do that with an sql database, and I'll be impressed.

Linus

2005-04-08 17:19:59

by Chris Wedgwood

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Fri, Apr 08, 2005 at 10:14:22AM -0700, Linus Torvalds wrote:

> After applying a patch, I can do a complete "show-diff" on the kernel tree
> to see the effect of it in about 0.15 seconds.

How does that work? Can you stat the entire tree in that time? I
measure it as being higher than that.

2005-04-08 17:26:15

by Jon Masters

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Apr 7, 2005 6:54 PM, Daniel Phillips <[email protected]> wrote:

> So I propose that everybody who is interested, pick one of the above projects
> and join it, to help get it to the point of being able to losslessly import
> the version graph. Given the importance, I think that _all_ viable
> alternatives need to be worked on in parallel, so that two months from now we
> have several viable options.

What about BitKeeper licensing constraints on such involvement?

Jon.

2005-04-08 17:28:43

by Matthias-Christian Ott

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus Torvalds wrote:

>On Fri, 8 Apr 2005, Matthias-Christian Ott wrote:
>
>
>>SQL Databases like SQLite aren't slow.
>>
>>
>
>After applying a patch, I can do a complete "show-diff" on the kernel tree
>to see the effect of it in about 0.15 seconds.
>
>Also, I can use rsync to efficiently replicate my database without having
>to re-send the whole crap - it only needs to send the new stuff.
>
>You do that with an sql database, and I'll be impressed.
>
> Linus
>
>
>
Ok, but if you want to search for information in such big text files it
is slow, because you do a linear search -- most databases use faster
search algorithms like hashing. And if you have multiple files (I don't
know if your system uses multiple files, like bitkeeper, or not), each
of which needs a system call to be opened, this will be very slow,
because system calls themselves are slow. And using rsync is also still
possible, because most databases store their information as plain text
with meta information.

Matthias-Christian Ott

2005-04-08 17:37:16

by Jeff Garzik

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus Torvalds wrote:
>
> On Fri, 8 Apr 2005, Matthias-Christian Ott wrote:
>
>>SQL Databases like SQLite aren't slow.
>
>
> After applying a patch, I can do a complete "show-diff" on the kernel tree
> to see the effect of it in about 0.15 seconds.
>
> Also, I can use rsync to efficiently replicate my database without having
> to re-send the whole crap - it only needs to send the new stuff.
>
> You do that with an sql database, and I'll be impressed.

Well... it took me over 30 seconds just to 'rm -rf' the unpacked
tarballs of git and sparse-git, over my LAN's NFS.

Granted that this sort of stuff works well with (a) rsync and (b)
hardlinks, but it's still punishment on the i/dcache.

Jeff



2005-04-08 17:45:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Fri, 8 Apr 2005, Chris Wedgwood wrote:
> On Fri, Apr 08, 2005 at 10:14:22AM -0700, Linus Torvalds wrote:
>
> > After applying a patch, I can do a complete "show-diff" on the kernel tree
> > to see the effect of it in about 0.15 seconds.
>
> How does that work? Can you stat the entire tree in that time? I
> measure it as being higher than that.

I can indeed stat the entire tree in that time (assuming it's in memory,
of course, but my kernel trees are _always_ in memory ;), but in order to
do so, I have to be good at finding the names to stat.

In particular, you have to be extremely careful. You need to make sure
that you don't stat anything you don't need to. We're not talking just
blindly recursing the tree here, and that's exactly the point. You have to
know what you're doing, but the whole point of keeping track of directory
contents is that dammit, that's your whole job.

Anybody who can't list the files they work on _instantly_ is doing
something damn wrong.

"git" is really trivial, written in four days. Most of that was not
actually spent coding, but thinking about the data structures.

Linus


2005-04-08 18:06:14

by Chris Wedgwood

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Fri, Apr 08, 2005 at 10:46:40AM -0700, Linus Torvalds wrote:

> I can indeed stat the entire tree in that time (assuming it's in memory,
> of course, but my kernel trees are _always_ in memory ;), but in order to
> do so, I have to be good at finding the names to stat.

<pause ... tapity tap>

I just tested this (I wanted to be sure you didn't have some 47GHz
LiHe cooled Xeon or something).

On my somewhat slowish machine[1] (by today's standards anyhow) I can
stat a checked out tree (ie. the source files and not SCM files) in
about 0.10s it seems and 0.26s for an entire tree with BK files in it.

> In particular, you have to be extremely careful. You need to make
> sure that you don't stat anything you don't need to.

Actually, I could probably make this *much* faster still with a
caveat. Given that my editor, when I write a file, will write a
temporary file and rename it, for files in directories where nlink==2
I can check that first and skip the stat of the individual files.

And I guess if I was bored I could have my editor or some daemon
sitting in the background intelligently using dnotify to have this
information on-hand more or less instantly. For this purpose though
that seems like a lot of effort for no real gain right now.

> Anybody who can't list the files they work on _instantly_ is doing
> something damn wrong.

Well, I do like to do "bk sfiles -x" fairly often. But then again I
can stat dirs and compare against a cache to make that fast too.


[1] Dual AthlonMP 2200

2005-04-08 18:12:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Fri, 8 Apr 2005, Matthias-Christian Ott wrote:
>
> Ok, but if you want to search for information in such big text files it
> is slow, because you do a linear search

No I don't. I don't search for _anything_. I have my own
content-addressable filesystem, and I guarantee you that it's faster than
mysql, because it depends on the kernel doing the right thing (which it
does).

I never do a single "readdir". It's all direct data lookup, no "searching"
anywhere.

Databases aren't magical. Quite the reverse. They easily end up being
_slower_ than doing it by hand, simply because they have to solve a much
more generic issue. If you design your data structures and abstractions
right, a database is pretty much guaranteed to only incur overhead.

The advantage of a database is the abstraction and management it gives
you. But I did my own special-case abstraction in git.

Yeah, I bet "git" might suck if your OS sucks. I definitely depend on name
caching at an OS level so that I know that opening a file is fast. In
other words, there _is_ an indexing and caching database in there, and
it's called the Linux VFS layer and the dentry cache.

The proof is in the pudding. git is designed for _one_ thing, and one
thing only: tracking a series of directory states in a way that can be
replicated. It's very very fast at that. A database with a more flexible
abstraction might be faster at other things, but the fact is, you do take a
hit.

The problems with databases are:

- they are damn hard to just replicate wildly and without control. The
database backing file inherently has a lot of internal state. You may
be able to "just copy it", but you have to copy the whole damn thing.

In "git", the data is all there in immutable blobs that you can just
rsync. In fact, you don't even need rsync: you can just look at the
filenames, and anything new you copy. No need for any fancy "read the
files to see that they match". They _will_ match, or you can tell
immediately that a file is corrupt.

Look at this:

torvalds@ppc970:~/git> sha1sum .dircache/objects/e7/bfaadd5d2331123663a8f14a26604a3cdcb678
e7bfaadd5d2331123663a8f14a26604a3cdcb678 .dircache/objects/e7/bfaadd5d2331123663a8f14a26604a3cdcb678

see a pattern anywhere? Imagine that you know the list of files you
have, and the list of files the other side has (never mind the
contents), and how _easy_ it is to synchronize. Without ever having to
even read the remote files that you know you already have.

How do you replicate your database incrementally? I've given you enough
clues to do it for "git" in probably five lines of perl (a sketch along
those lines follows after this list).

- they tend to take time to set up and prime.

In contrast, the filesystem is always there. Sure, you effectively have
to "prime" that one too, but the thing is, if your OS is doing its job,
you basically only need to prime it once per reboot. No need to prime
it for each process you start or play games with connecting to servers
etc. It's just there. Always.
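
A minimal sketch of that incremental replication, in shell rather than
perl, assuming the .dircache/objects/xx/<sha1> layout shown above:

	# copy over only the objects the local side doesn't have yet.
	# objects are immutable and named by their contents, so comparing
	# file names is enough - the data never has to be read at all.
	cd /mirror/of/remote/.dircache/objects
	for f in */*; do
		dest="$HOME/git/.dircache/objects/$f"
		if [ ! -f "$dest" ]; then
			mkdir -p "${dest%/*}"
			cp "$f" "$dest"
		fi
	done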

So if you think of the filesystem as a database, you're all set. If you
design your data structure so that there is just one index, you make that
the name, and the kernel will do all the O(1) hashed lookups etc for you.
You do have to limit yourself in some ways.

Oh, and you have to be willing to waste diskspace. "git" is _not_
space-efficient. The good news is that it is cache-friendly, since it is
also designed to never actually look at any old files that aren't part of
the immediate history, so while it wastes diskspace, it does not waste the
(much more precious) page cache.

IOW big file-sets are always bad for performance if you need to traverse
them to get anywhere, but if you index things so that you only read the
stuff you really really _need_ (which git does), big file-sets are just an
excuse to buy a new disk ;)

Linus

2005-04-08 18:29:26

by [email protected]

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Apr 8, 2005 2:14 PM, Linus Torvalds <[email protected]> wrote:
> How do you replicate your database incrementally? I've given you enough
> clues to do it for "git" in probably five lines of perl.

Efficient database replication is achieved by copying the transaction
logs and then replaying them. Most mid to high end databases support
this. You only need to copy the parts of the logs that you don't
already have.

--
Jon Smirl
[email protected]

2005-04-08 18:45:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Fri, 8 Apr 2005, Jeff Garzik wrote:
>
> Well... it took me over 30 seconds just to 'rm -rf' the unpacked
> tarballs of git and sparse-git, over my LAN's NFS.

Don't use NFS for development. It sucks for BK too.

That said, normal _use_ should actually be pretty efficient even over NFS.
It will "stat" a hell of a lot of files to do the "show-diff", but that
part you really can't avoid unless you depend on all the tools marking
their changes somewhere. Which BK does, actually, but that was pretty
painful, and means that bk needed to re-implement all the normal ops (ie
"bk patch").

What's also nice is that exactly because "git" depends on totally
immutable files, they actually cache very well over NFS. Even if you were
to share a database across machines (which is _not_ what git is meant to
do, but it's certainly possible).

So I actually suspect that if you actually _work_ with a tree in "git",
you will find performance very good indeed. The fact that it creates a
number of files when you pull in a new repository is a different thing.

> Granted that this sort of stuff works well with (a) rsync and (b)
> hardlinks, but it's still punishment on the i/dcache.

Actually, it's not. Not once it is set up. Exactly because "git" doesn't
actually access those files unless it literally needs the data in one
file, and then it's always set up so that it needs either none or _all_ of
the file. There is no data sharing anywhere, so you are never in the
situation where it needs "ten bytes from file X" and "25 bytes from file
Y".

For example, if you don't have any changes in your tree, there is exactly
_one_ file that a "show-diff" will read: the .dircache/index file. That's
it. After that, it will "stat()" exactly the files you are tracking, and
nothing more. It will not touch any internal "git" data AT ALL. That
"stat" will be somewhat expensive unless your client caches stat data too,
but that's it.

And if it turns out that you have changed a file (or even just touched it,
so that the data is the same, but the index file can no longer guarantee
it with just a single "stat()"), then git will open exactly _one_
file (no searching, no messing around), which contains absolutely nothing
except for the compressed (and SHA1-signed) old contents of the file. It
obviously _has_ to do that, because in order to know whether you've
changed it, it needs to now compare it to the original.

IOW, "git" will literally touch the minimum IO necessary, and absolutely
minimum cache-footprint.

The fact is, when tracking the 17,000 files in the kernel directory, most
of them are never actually changed. They literally are "free". They aren't
brought into the cache by "git" - not the file itself, not the backing
store. You set up the index file once, and you never ever touch them
again. You could put the sha1 files on a tape, for all git cares.

The one exception obviously being when you actually instantiate the git
archive for the first time (or when you throw it away). At that time you
do touch all of the data, but that should be the only time.

THAT is what git is good at. It is optimized for the "not a lot of changes"
things, and pretty much all the operations are O(n) in the "size of
change", not in "size of repo".

That includes even things like "give me the diff between the top of tree
and the tree 10 days ago". If you know what your head was 10 days ago,
"git" will open exactly _four_ small files for this operation (the current
"top" commit, the commit file of ten days ago, and the two "tree" files
associated with those). It will then need to open the backing store files
for the files that are different between the two versions, but IT WILL
NEVER EVEN LOOK at the files that it immediately sees are the same.

And that's actually true whether we're talking about the top-of-tree or
not. If I had the kernel history in git format (I don't - I estimate that
it would be about 1.5GB - 2GB in size, and would take me about ten days to
extract from BK ;), I could do a diff between _any_ tagged version (and I
mention "tagged" only as a way to look up the commit ID - it doesn't have
to be tagged if you know it some other way) in O(n) where 'n' is the
number of files that have changed between the revisions.

Number of changesets doesn't matter. Number of files doesn't matter. The
_only_ thing that matters is the size of the change.

Btw, I don't actually have a git command to do this yet. A bit of
scripting required to do it, but it's pretty trivial: you open the two
"commit" files that are the beginning/end of the thing, you look up what
the tree state was at each point, you open up the two tree files involved,
and you ignore all entries that match.

Since the tree files are already sorted, that "ignoring matches" is
basically free (technically that's O(n) in the number of files described,
but we're talking about something that even a slow machine can do so fast
you probably can't even time it with a stop-watch). You now have the
complete list of files that have been changed (removed, added or "exists
in both trees, but different contents"), and you can thus trivially create
the whole tree with opening up _only_ the indexes for those files.

Ergo: O(n) in size of change. Both in work and in disk/cache access (where
the latter is often the more important one). Absolutely _zero_ indirection
anywhere apart from the initial stage to go from "commit" to "tree", so
there's no seeking except to actually read the files once you know what
they are (and since you know them up-front and there are no dependencies
at that point, you could even tell the OS to prefetch them if you really
cared about getting minimal disk seeks).

Linus

2005-04-08 18:56:29

by Chris Wedgwood

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Fri, Apr 08, 2005 at 11:47:10AM -0700, Linus Torvalds wrote:

> Don't use NFS for development. It sucks for BK too.

Some times NFS is unavoidable.

In the best case (see previous email wrt only stat'ing the parent
directories when you can) for a current kernel you can get away
with 894 stats --- over NFS that would probably be tolerable.

After claiming such an optimization is probably not worthwhile, I'm
now thinking that for network filesystems it might be.

2005-04-08 18:58:09

by Florian Weimer

[permalink] [raw]
Subject: Re: Kernel SCM saga..

* Jon Smirl:

> On Apr 8, 2005 2:14 PM, Linus Torvalds <[email protected]> wrote:
>> How do you replicate your database incrementally? I've given you enough
>> clues to do it for "git" in probably five lines of perl.
>
> Efficient database replication is achieved by copying the transaction
> logs and then replaying them.

Works only if the databases are in sync. Even if the transaction logs
are pretty high-level, you risk violating constraints specified by the
application. General multi-master replication is an unsolved problem.

2005-04-08 19:02:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Fri, 8 Apr 2005, Chris Wedgwood wrote:
>
> Actually, I could probably make this *much* faster still with a
> caveat. Given that my editor, when I write a file, will write a
> temporary file and rename it, for files in directories where nlink==2
> I can check that first and skip the stat of the individual files.

Yes, doing the stat just on the directory (on leaf directories only, of
course, but nlink==2 does say that on most filesystems) is indeed a huge
potential speedup.

It doesn't matter so much for the cached case, but it _does_ matter for
the uncached one. Makes a huge difference, in fact (I was playing with
exactly that back when I started doing "bkr" in BK/tools - three years
ago).

It turns out that I expect to cache my source tree (at least the main
outline), and that guides my optimizations, but yes, your dir stat does
help in the case of "occasionally working with lots of large projects"
rather than "mostly working on the same ones with enough RAM to cache it
all".

And "git" is actually fairly anal in this respect: it not only stats all
files, but the index file contains a lot more of the stat info than you'd
expect. So for example, it checks both ctime and mtime to the nanosecond
(did I mention that I didn't worry too much about portability?) exactly so
that it can catch any changes except for actively malicious things.

And if you do actively malicious things in your own directory, you get
what you deserve. It's actually _hard_ to try to fool git into believing a
file hasn't changed: you need to not only replace it with the exact same
file length and ctime/mtime, you need to reuse the same inode/dev numbers
(again - I didn't worry about portability, and filesystems where those
aren't stable are a "don't do that then") and keep the mode the same. Oh,
and uid/gid, but that was just me being silly.
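
To make that concrete, even a content-preserving rewrite of a file is
caught, and show-diff then has to fall back to comparing contents (a
small illustration):

	# same bytes as before, but a new inode and new ctime/mtime:
	cp somefile somefile.tmp && mv somefile.tmp somefile

	# the cached stat information for "somefile" no longer matches,
	# so this re-checks the contents against the stored blob:
	show-diff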

Linus

2005-04-08 19:16:49

by Matthias-Christian Ott

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus Torvalds wrote:

>On Fri, 8 Apr 2005, Matthias-Christian Ott wrote:
>
>
>>Ok, but if you want to search for information in such big text files it
>>is slow, because you do a linear search
>>
>>
>
>No I don't. I don't search for _anything_. I have my own
>content-addressable filesystem, and I guarantee you that it's faster than
>mysql, because it depends on the kernel doing the right thing (which it
>does).
>
>
>
I'm not talking about mysql, I'm talking about fast databases like
sqlite or db4.

>I never do a single "readdir". It's all direct data lookup, no "searching"
>anywhere.
>
>Databases aren't magical. Quite the reverse. They easily end up being
>_slower_ than doing it by hand, simply because they have to solve a much
>more generic issue. If you design your data structures and abstractions
>right, a database is pretty much guaranteed to only incur overhead.
>
>The advantage of a database is the abstraction and management it gives
>you. But I did my own special-case abstraction in git.
>
>Yeah, I bet "git" might suck if your OS sucks. I definitely depend on name
>caching at an OS level so that I know that opening a file is fast. In
>other words, there _is_ an indexing and caching database in there, and
>it's called the Linux VFS layer and the dentry cache.
>
>The proof is in the pudding. git is designed for _one_ thing, and one
>thing only: tracking a series of directory states in a way that can be
>replicated. It's very very fast at that. A database with a more flexible
>abstraction might be faster at other things, but the fact is, you do take a
>hit.
>
>The problems with databases are:
>
> - they are damn hard to just replicate wildly and without control. The
> database backing file inherently has a lot of internal state. You may
> be able to "just copy it", but you have to copy the whole damn thing.
>
>
This is _not_ true for every database (especially plain-text databases
with meta information).

> In "git", the data is all there in immutable blobs that you can just
> rsync. In fact, you don't even need rsync: you can just look at the
> filenames, and anything new you copy. No need for any fancy "read the
> files to see that they match". They _will_ match, or you can tell
> immediately that a file is corrupt.
>
> Look at this:
>
> torvalds@ppc970:~/git> sha1sum .dircache/objects/e7/bfaadd5d2331123663a8f14a26604a3cdcb678
> e7bfaadd5d2331123663a8f14a26604a3cdcb678 .dircache/objects/e7/bfaadd5d2331123663a8f14a26604a3cdcb678
>
> see a pattern anywhere? Imagine that you know the list of files you
> have, and the list of files the other side has (never mind the
> contents), and how _easy_ it is to synchronize. Without ever having to
> even read the remote files that you know you already have.
> How do you replicate your database incrementally? I've given you enough
> clues to do it for "git" in probably five lines of perl.
>
>
I replicate my database incrementally by using a hash list like you do
(the client sends its hash list, the server compares the lists and
tells the client which data (data = hash + data) has to be added; this
is like your solution -- you also submit the data and the location (you
have directories too, right?)). A database is in some cases (like this
one) like a filesystem, but it's built on top of a better filesystem
like xfs, reiser4 or ext3, which support features like LVM, quotas or
journaling (is your filesystem also built on top of an existing
filesystem? I don't think so, because you're talking about vfs
operations on the filesystem).

> - they tend to take time to set up and prime.
>
> In contrast, the filesystem is always there. Sure, you effectively have
> to "prime" that one too, but the thing is, if your OS is doing its job,
> you basically only need to prime it once per reboot. No need to prime
> it for each process you start or play games with connecting to servers
> etc. It's just there. Always.
>
>
The database -- single file (sqlite or db4) -- is always there too
because it's on the filesystem and doesn't need a server.

>So if you think of the filesystem as a database, you're all set. If you
>design your data structure so that there is just one index, you make that
>the name, and the kernel will do all the O(1) hashed lookups etc for you.
>You do have to limit yourself in some ways.
>
>
But as mentioned you need to _open_ each file (it doesn't matter if it's
cached -- that only speeds up reading it -- you need a _slow_ system call
and _very slow_ hardware access anyway).
Have a look at this comparison:
If you have one big chest and lots of small chests containing the same
amount of gold, it's more work to collect the gold from the small chests
than from the big one. You can find your gold faster because you don't
have to walk to the other chests, and you don't have to open as many
lids, which also saves time.

>Oh, and you have to be willing to waste diskspace. "git" is _not_
>space-efficient. The good news is that it is cache-friendly, since it is
>also designed to never actually look at any old files that aren't part of
>the immediate history, so while it wastes diskspace, it does not waste the
>(much more precious) page cache.
>
>IOW big file-sets are always bad for performance if you need to traverse
>them to get anywhere, but if you index things so that you only read the
>stuff you really really _need_ (which git does), big file-sets are just an
>excuse to buy a new disk ;)
>
> Linus
>
>
>
I hope my idea/opinion is clear now.

Matthias-Christian

2005-04-08 19:17:00

by Chris Wedgwood

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Fri, Apr 08, 2005 at 12:03:49PM -0700, Linus Torvalds wrote:

> Yes, doing the stat just on the directory (on leaf directories only, of
> course, but nlink==2 does say that on most filesystems) is indeed a huge
> potential speedup.

Here I measure about 6ms for the cached case --- essentially below the
noise threshold for something that does real work.

> It doesn't matter so much for the cached case, but it _does_ matter
> for the uncached one.

Doing the minimal stat cold-cache here is about 6s for local disk.
I'm somewhat surprised it's that bad actually.

2005-04-08 19:30:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Fri, 8 Apr 2005, Matthias-Christian Ott wrote:
>
> But as mentioned you need to _open_ each file (it doesn't matter if it's
> cached -- that only speeds up reading it -- you need a _slow_ system call
> and _very slow_ hardware access anyway).

Nope. System calls aren't slow. What crappy OS are you running?

> I hope my idea/opinion is clear now.

Numbers talk. I've got something that you can test ;)

Linus

2005-04-08 19:38:16

by Florian Weimer

[permalink] [raw]
Subject: Re: Kernel SCM saga..

* Chris Wedgwood:

>> It doesn't matter so much for the cached case, but it _does_ matter
>> for the uncached one.
>
> Doing the minimal stat cold-cache here is about 6s for local disk.

Does sorting by inode number make a difference?

2005-04-08 19:37:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Fri, 8 Apr 2005, Chris Wedgwood wrote:
>
> > It doesn't matter so much for the cached case, but it _does_ matter
> > for the uncached one.
>
> Doing the minimal stat cold-cache here is about 6s for local disk.
> I'm somewhat surprised it's that bad actually.

One of the reasons I do inode numbers in the "index" file (apart from
checking that the inode hasn't changed) is in fact that "stat()" is damn
slow if it causes seeks. Since your stat loop is entirely

You can optimize your stat() patterns on traditional unix-like filesystems
by just sorting the stats by inode number (since the inode number is
historically a special index into the inode table - even when filesystems
distribute the inodes over several tables, sorting will generally do the
right thing from a seek perspective). It's a disgusting hack, but it
literally gets you orders-of-magnitude performance improvements in many
real-life cases.

It does have some downsides:
- it buys you nothing when it's cached (and obviously you have the
sorting overhead, although that's pretty cheap)
- on other filesystems it can make things slower.

But if the cold-cache case actually is a concern, I do have the solution
for it. Just a simple "prime-cache" program that does a qsort on the index
file entries and does the stat() on them all will bring the numbers down.
Those 6 seconds you see are the disk head seeking around like mad.
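
Such a "prime-cache" program could be as simple as this sketch, where
"list-index" stands in for a small helper (not written yet) that prints
one "<inode> <path>" line per index entry - the index already records
the inode, so producing that list needs no filesystem access:

	# stat every tracked file in inode-table order, so the disk head
	# sweeps the inode table once instead of seeking randomly
	list-index | sort -n | cut -d' ' -f2- |
	while read f; do
		stat -- "$f" > /dev/null
	done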

Linus

2005-04-08 19:45:44

by Matthias-Christian Ott

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus Torvalds wrote:

>On Fri, 8 Apr 2005, Matthias-Christian Ott wrote:
>
>
>>But as mentioned you need to _open_ each file (it doesn't matter if it's
>>cached -- that only speeds up reading it -- you need a _slow_ system call
>>and _very slow_ hardware access anyway).
>>
>>
>
>Nope. System calls aren't slow. What crappy OS are you running?
>
>
>
But they're slower because there are several layers of checking involved.

>>I hope my idea/opinion is clear now.
>>
>>
>
>Numbers talk. I've got something that you can test ;)
>
>
This doesn't mean it's better just because you had the time to develop
it ;). But anyhow, people need something they can test to see whether
it's good or not; most don't believe in concepts.

> Linus
>
>
>
We will see which solution wins the "race". But I think your solution
will "win", because you're Linus Torvalds -- the "Boss" of Linux, who
has to work with this system every day (usually people use what they
have developed :)) -- and I don't have the time to develop a
database-based solution (maybe someone else is interested in developing it).

Matthias-Christian

2005-04-08 19:49:01

by Chris Wedgwood

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Fri, Apr 08, 2005 at 09:38:09PM +0200, Florian Weimer wrote:

> Does sorting by inode number make a difference?

It almost certainly would. But I can sort more intelligently than
that even (all the world isn't ext2/3).

2005-04-08 20:12:27

by Ragnar Kjørstad

[permalink] [raw]
Subject: Uncached stat performace [ Was: Re: Kernel SCM saga.. ]

On Fri, Apr 08, 2005 at 12:39:26PM -0700, Linus Torvalds wrote:
> One of the reasons I do inode numbers in the "index" file (apart from
> checking that the inode hasn't changed) is in fact that "stat()" is damn
> slow if it causes seeks. Since your stat loop is entirely
>
> You can optimize your stat() patterns on traditional unix-like filesystems
> by just sorting the stats by inode number (since the inode number is
> historically a special index into the inode table - even when filesystems
> distribute the inodes over several tables, sorting will generally do the
> right thing from a seek perspective). It's a disgusting hack, but it
> literally gets you orders-of-magnitude performance improvements in many
> real-life cases.

It does, so why isn't there a way to do this without the disgusting
hack? (Your words, not mine :) )

E.g., wouldn't an aio_stat() allow similar or better speedups in a way
that doesn't depend on ext2/3 internals?

I bet it would make a significant difference for things like "ls -l" in
large uncached directories and imap servers with maildir?



--
Ragnar Kjørstad
Software Engineer
Scali - http://www.scali.com
Scaling the Linux Datacenter

2005-04-08 20:14:34

by Chris Wedgwood

[permalink] [raw]
Subject: Re: Uncached stat performace [ Was: Re: Kernel SCM saga.. ]

On Fri, Apr 08, 2005 at 10:11:51PM +0200, Ragnar Kjørstad wrote:

> It does, so why isn't there a way to do this without the disgusting
> hack? (Your words, not mine :) )

inode sorting is probably a good guess for a number of filesystems; you
can map the blocks used to do better still (somewhat fs specific)

you can do better still if you do multiple stats in parallel (up to a
point) and let the elevator sort things out

> I bet it would make a significant difference from things like "ls -l" in
> large uncached directories and imap-servers with maildir?

sort + concurrent stats would help here i think

i'm not sure i like the idea of ls using lots of threads though :)
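
(combining the two ideas above in shell form - inode order plus a
handful of concurrent stats so the elevator can reorder the seeks;
"list-index" is the same assumed index-listing helper as in the
prime-cache sketch earlier, and -P is GNU xargs)

	list-index | sort -n | cut -d' ' -f2- | xargs -n 32 -P 8 stat > /dev/null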

2005-04-08 20:50:24

by Luck

[permalink] [raw]
Subject: Re: Kernel SCM saga..

It looks like an operation like "show me the history of mm/memory.c" will
be pretty expensive using git. I'd need to look at the current tree, and
then trace backwards through all 60,000 changesets to see which ones had
actual changes to this file. Could you expand the tuple in the tree object
to include a back pointer to the previous tree in which the tuple changed?
Or does adding history to the tree violate other goals of the tree type?

-Tony

2005-04-08 21:26:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Fri, 8 Apr 2005 [email protected] wrote:
>
> It looks like an operation like "show me the history of mm/memory.c" will
> be pretty expensive using git.

Yes. Per-file history is expensive in git, because of the way it is
indexed. Things are indexed by tree and by changeset, and there are no
per-file indexes.

You could create per-file _caches_ (*) on top of git if you wanted to make
it behave more like a real SCM, but yes, it's all definitely optimized for
the things that _I_ tend to care about, which is the whole-repository
operations.

Linus

(*) Doing caching on that level is probably fine, especially since most
people really tend to want it for just the relatively few files that they
work on anyway. Limiting the caches to a subset of the tree should be
quite effective.
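
One way to build such a per-file cache (a sketch only; "dump-tree"
stands in for an assumed helper, not part of git as posted, that prints
one "<path> <sha1>" line per tree entry, only the first parent of each
commit is followed, and paths with spaces are ignored for simplicity):

	# walk the commit chain from HEAD, printing every commit in which
	# the blob sha1 recorded for one path changes
	path=mm/memory.c
	head=$(cat .dircache/HEAD)
	prev=
	while [ -n "$head" ]; do
		tree=$(cat-file commit "$head" | sed -n 's/^tree //p')
		cur=$(dump-tree "$tree" | awk -v p="$path" '$1 == p { print $2 }')
		if [ "$cur" != "$prev" ]; then
			echo "$head $cur"	# this commit touched $path
		fi
		prev=$cur
		head=$(cat-file commit "$head" | sed -n 's/^parent //p' | head -1)
	done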

2005-04-08 22:04:17

by Daniel Phillips

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Friday 08 April 2005 13:24, Jon Masters wrote:
> On Apr 7, 2005 6:54 PM, Daniel Phillips <[email protected]> wrote:
> > So I propose that everybody who is interested, pick one of the above
> > projects and join it, to help get it to the point of being able to
> > losslessly import the version graph. Given the importance, I think that
> > _all_ viable alternatives need to be worked on in parallel, so that two
> > months from now we have several viable options.
>
> What about BitKeeper licensing constraints on such involvement?

They don't apply to me, for one.

Regards,

Daniel

by Rajesh Venkatasubramanian

Subject: Re: Kernel SCM saga..

Linus wrote:
>> It looks like an operation like "show me the history of mm/memory.c" will
>> be pretty expensive using git.
>
> Yes. Per-file history is expensive in git, because of the way it is
> indexed. Things are indexed by tree and by changeset, and there are no
> per-file indexes.

Although directory changes are tracked using change-sets, there
seems to be no easy way to answer "give me the diff corresponding to
the commit (change-set) object <sha1>". That would be really helpful
for reviewing the changes.

Rajesh

2005-04-08 22:52:51

by Roman Zippel

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Hi,

On Thu, 7 Apr 2005, Linus Torvalds wrote:

> I really disliked that in BitKeeper too originally. I argued with Larry
> about it, but Larry (correctly, I believe) argued that efficient and
> reliable distribution really requires the concept of "history is
> immutable". It makes replication much easier when you know that the known
> subset _never_ shrinks or changes - you only add on top of it.

The problem is you pay a price for this. There must be a reason developers
were adding another GB of memory just to run BK.
Preserving the complete merge history does indeed make repeated merges
simpler, but it builds up complex meta data, which has to be managed
forever. I doubt that this is really an advantage in the long term. I
expect that we would be better off serializing changesets in the main
repository. For example bk does something like this:

A1 -> A2 -> A3 -> BM
  \-> B1 -> B2 --^

and instead of creating the merge changeset, one could merge them like
this:

A1 -> A2 -> A3 -> B1 -> B2

This results in a simpler repository, which is more scalable and which
is easier for users to work with (e.g. binary bug search).
The disadvantage is that it would cause more minor conflicts when
changes are pulled back into the original tree, but these should be
easily resolvable most of the time.
I'm not saying that the bk model is bad, but I think it's a
problem if it's the only model applied to everything.

> The thing is, cherry-picking very much implies that the people "up" the
> foodchain end up editing the work of the people "below" them. The whole
> reason you want cherry-picking is that you want to fix up somebody elses
> mistakes, ie something you disagree with.
>
> That sounds like an obviously good thing, right? Yes it does.
>
> The problem is, it actually results in the wrong dynamics and psychology
> in the system. First off, it makes the implicit assumption that there is
> an "up" and "down" in the food-chain, and I think that's wrong.

These dynamics do exist and our tools should be able to represent them.
For example when people post patches, they get reviewed and often need
more changes and bk doesn't really help them to redo the patches.
Bk helped you to offload the cherry-picking process to other people, so
that you only had to do cherry-collecting very efficiently.
Another prime example of cherry-picking is Andrew's mm tree: he picks a
number of patches which are ready for merging and forwards them to you.
Our current basic development model (at least until a few days ago) looks
something like this:

linux-mm -> linux-bk -> linux-stable

Ideally most changes would get into the tree via linux-mm and,
depending on various conditions (e.g. urgency, review state), they
would get into the stable tree. In practice linux-mm is more an
aggregation of
patches which need testing and since most bk users were developing
against linux-bk, it got a lot less testing and a lot of problems are
only caught at the next stage. Changes from the stable tree would even
flow in the opposite direction.
Bk supports certain aspects of the kernel development process very well,
but due to its closed nature it was practically impossible to really
integrate it fully into this process (at least for anyone outside BM).
In the short term we probably are in for a tough ride and we take whatever
works best for you, but in the long term we need to think about how SCM
fits into our kernel development model, which includes development,
review, testing and releasing of kernel changes. This is more than just
pulling and merging kernel trees. I'm aiming at a tool that can also
support Andrew's work, so that he can also better offload some of this
work (and take a break sometimes :) ). Unfortunately every existing tool I
know of is lacking in its own way, so we still have some way to go...

bye, Roman

2005-04-08 23:27:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Fri, 8 Apr 2005, Rajesh Venkatasubramanian wrote:
>
> Although directory changes are tracked using change-sets, there
> seems to be no easy way to answer "give me the diff corresponding to
> the commit (change-set) object <sha1>". That will be really helpful to
> review the changes.

Actually, it is very easy indeed. Here's what you do:

- look up the commit object ("cat-file commit <sha1>")

This object starts out with "tree <sha1>", followed by a list of
parent commit objects: "parent <sha1>"

Remember the tree object (it defines what the tree looks like at
the time of the commit). Pick the parent object you want to diff
against (normally the first one).

Also, print the commit message at the end of the commit object.

- look up the parent object ("cat-file commit <parentsha1>")

Here you have the same kind of object, but this time you don't care
about going deeper, you just pick up the tree <sha1> that describes
the tree at the parent.

- look up the two tree objects. Unlike a commit object, a tree object
is a binary data blob, but the format is an _extremely_ simple table
of these guys:

<ascii octal filemode> <space> <pathname> <NUL character> <20-byte sha1>

and the reason it's binary is really that that way "git" doesn't end
up having any issues with strange pathnames. If you want to have spaces
and newlines in your pathname, go wild.

In particular, the tree object is also _sorted_ by the pathname. This
makes things simple, because you now have two sorted trees, and the
first thing you do is just walk the two trees in lock-step, which is
trivial thanks to the sorted nature of the tree "array".

So now you have three cases:
- you have the same name, and the same sha1

ignore it - the file didn't change, you don't even have to look
at the contents (although if the file mode changed you might
want to note that)

- you have the same name in parent and child tree lists, but the
sha differs. Now you just need to do a "cat-file" on both of the
SHA1 values, and do a "diff -u" between them.

- you have the filename in only parent or only child. Do a
"create" or "delete" diff with the content of the sha1 file.

See? Very efficient. For any files that didn't change, you didn't have to
do anything at all - you didn't even have to look at their data.
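
In shell, that lock-step walk could be approximated like this (a sketch
only: "dump-tree" stands in for an assumed helper that prints one
"<path> <sha1>" line per tree entry, and "blob" is assumed to be the
tag for file contents):

	# the tree ids of the two commits being compared
	tree1=$(cat-file commit $commit1 | sed -n 's/^tree //p')
	tree2=$(cat-file commit $commit2 | sed -n 's/^tree //p')

	# entries identical on both sides (same path, same sha1) drop
	# out; whatever is left was created, deleted or modified
	dump-tree $tree1 | sort > /tmp/t1
	dump-tree $tree2 | sort > /tmp/t2
	comm -3 /tmp/t1 /tmp/t2

	# for a path present on both sides with different sha1s, the
	# actual diff is then just:
	cat-file blob $old_sha > /tmp/old
	cat-file blob $new_sha > /tmp/new
	diff -u /tmp/old /tmp/new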

Also note that the above algorithm really works for _any_ two commit
points (apart for the two first steps, which are obviously all about
finding the parent tree when you want to diff against a predecessor).

It doesn't have to be parent and child. Pick any commit you have. And pick
them in the other order, and you'll automatically get the reverse diff.

You can even do diffs between unrelated projects this way if you use the
shared sha1 directory model, although that obviously doesn't tend to be
all that sensible ;)

Linus

2005-04-08 23:38:28

by Daniel Phillips

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Friday 08 April 2005 04:38, Andrea Arcangeli wrote:
> On Thu, Apr 07, 2005 at 11:41:29PM -0700, Linus Torvalds wrote:
> The huge number of changesets is the crucial point, there are good
> distributed SCM already but they are apparently not efficient enough at
> handling 60k changesets.
>
> We'd need a regenerated coherent copy of BKCVS to pipe into those SCM to
> evaluate how well they scale.
>
> OTOH if your git project already allows storing the data in there,
> that looks nice ;).

Hi Andrea,

For the immediate future, all we need is something that can _losslessly_
capture the new metadata that's being generated. That buys time to bring one
of the promising open source candidates up to full speed.

By the way, which one are you working on? :-)

Regards,

Daniel

2005-04-08 23:47:21

by Tupshin Harper

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Roman Zippel wrote:

>Preserving the complete merge history does indeed make repeated merges
>simpler, but it builds up complex meta data, which has to be managed
>forever. I doubt that this is really an advantage in the long term. I
>expect that we would be better off serializing changesets in the main
>repository. For example bk does something like this:
>
> A1 -> A2 -> A3 -> BM
>   \-> B1 -> B2 --^
>
>and instead of creating the merge changeset, one could merge them like
>this:
>
> A1 -> A2 -> A3 -> B1 -> B2
>
>This results in a simpler repository, which is more scalable and which
>is easier for users to work with (e.g. binary bug search).
>The disadvantage is that it would cause more minor conflicts when
>changes are pulled back into the original tree, but these should be
>easily resolvable most of the time.
>
Both darcs and arch (and arch's siblings) have ways of maintaining the
complete history but speeding up operations.

Arch uses revision libraries:
http://www.gnu.org/software/gnu-arch/tutorial/revision-libraries.html
though I'm not all that up on arch, so I'll just leave it at that.

Darcs uses "darcs optimize --checkpoint"
http://darcs.net/manual/node7.html#SECTION00764000000000000000
which "allows for users to retrieve a working repository with limited
history with a savings of disk space and bandwidth." In darcs case, you
can pull a partial repository by doing "darcs get --partial", in which
case you only grab the state at the point that the repository was
optimized and subsequent patches, and all operations only need to work
against the set of patches since that optimize.
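
For instance (these are the darcs commands named above; the repository
URL is made up):

	darcs optimize --checkpoint	# record a checkpoint in this repository
	darcs get --partial http://example.org/repo	# checkpoint + later patches only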

Note that I'm not promoting darcs for kernel usage, because of speed (or
the lack thereof) but I am curious why Linus would consider monotone
given its speed issues but not consider darcs.

-Tupshin

2005-04-09 00:11:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Fri, 8 Apr 2005, Andrea Arcangeli wrote:
>
> We'd need a regenerated coherent copy of BKCVS to pipe into those SCM to
> evaluate how well they scale.

Yes, that makes most sense, I believe. Especially as BKCVS does the
linearization that makes other SCM's _able_ to take the data in the first
place. Few enough SCM's really understand the BK merge model, although the
distributed ones obviously have to do something similar.

> OTOH if your git project already allows storing the data in there,
> that looks nice ;).

I can express the data, and I did a sparse .git archive to prove the
concept. It doesn't even try to save BK-specific details, but as far as I
can tell, my git-conversion did capture all the basic things (ie not just
the actual source tree, but hopefully all the "who did what" parts too).

Of course, my git visualization tools are so horribly crappy that it is
hard to make sure ;)

Also, I suspect that BKCVS actually bothers to get more details out of a
BK tree than I cared about. People have pestered Larry about it, so BKCVS
exports a lot of the nitty-gritty (per-file comments etc) that just
doesn't actually _matter_, but people whine about. Me, I don't care. My
sparse-conversion just took the important parts.

> I don't yet fully understand how the algorithms of the trees are meant
> to work

Well, actually merging two git trees is not even something git tries
to do. It leaves that to somebody else - you can see what the
relationship is, and you can see all the data, but as far as I'm
concerned, git is really a "filesystem". It's a way of expressing
revisions, but it's not a way of creating them.

> It looks similar to a diff -ur of two hardlinked trees

Yes. You could really think of it that way. It's not really about
hardlinking, but the fact that objects are named by their content does
mean that two objects (regardless of their type) can be seen as
"hardlinked" whenever their contents match.

But the more interesting part is the hierarchical virtual format it has,
ie it is not only hardlinked, but it also has the three different levels
of "views" into those hardlinked objects ("blob", "tree", "revision").

So even though the hash tree looks flat in the _physical_ filesystem, it
definitely isn't flat in its own virtual world. It's just flattened to fit
in a normal filesystem ;)

[ There's also a fourth level view in "trust", but that one hasn't been
implemented yet since I think it might as well be done at a higher
level. ]

Btw, the sha1 file format isn't actually designed for "rsync", since rsync
is really a hell of a lot more capable than my format needs. The format is
really designed for something like an offline http grabber, in that you can
just grab files purely by filename (and verify that you got them right by
running sha1sum on the resulting local copy). So think "wget".
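
To illustrate the point, a dumb fetch-and-verify could look something
like this (a rough sketch in Python, not anything that exists in git;
the flat URL layout is my assumption):

import hashlib
import urllib.request

def fetch_object(base_url, name):
    # grab an object file purely by filename, the way any http
    # grabber would, and verify it by checksumming the local copy
    data = urllib.request.urlopen(base_url + "/" + name).read()
    if hashlib.sha1(data).hexdigest() != name:
        raise IOError("object %s fails its sha1 check" % name)
    with open(name, "wb") as f:
        f.write(data)

Any mirroring tool that just copies files by name gets this integrity
check for free.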

Linus

2005-04-09 00:27:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Fri, 8 Apr 2005, Linus Torvalds wrote:
>
> Also note that the above algorithm really works for _any_ two commit
> points (apart for the two first steps, which are obviously all about
> finding the parent tree when you want to diff against a predecessor).

Btw, if you want to try this, you should get an updated copy. I've pushed
a "raw" git archive of both git and sparse (the latter is much more
interesting from an archive standpoint, since it actually has 1400
changesets in it) to kernel.org, but I'm not convinced it gets mirrored
out. I think the mirror scripts may mirror only things they understand.

I've also added a partial "fsck" for the "git filesystem". It doesn't do
the connectivity analysis yet, but that should be pretty straightforward
to add - it already parses all the data, it just doesn't save it away (and
the connectivity analysis will automatically show how many "root"
changesets you have, and what the different HEADs are).
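
The connectivity pass itself is conceptually simple once the parent
links are parsed out. A sketch (Python, assuming the parent links have
already been collected into a dict; this is my own illustration, not
the fsck code):

def roots_and_heads(parents):
    # parents: commit sha1 -> list of parent sha1s for every commit
    referenced = set()
    for plist in parents.values():
        referenced.update(plist)
    # roots have no parents; HEADs are referenced by nobody
    roots = [c for c, plist in parents.items() if not plist]
    heads = [c for c in parents if c not in referenced]
    return roots, heads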

I'll make a tar-file (git-0.03), although at this point I've actually been
maintaining it in itself, so to some degree it would almost be easier if I
just had a place to rsync it..

Linus

2005-04-09 01:00:52

by Roman Zippel

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Hi,

On Fri, 8 Apr 2005, Tupshin Harper wrote:

> > A1 -> A2 -> A3 -> B1 -> B2
> >
> > This results in a simpler repository, which is more scalable and which is
> > easier for users to work with (e.g. binary bug search).
> > The disadvantage would be it will cause more minor conflicts, when changes
> > are pulled back into the original tree, but which should be easily
> > resolvable most of the time.
> >
> Both darcs and arch (and arch's siblings) have ways of maintaining the
> complete history but speeding up operations.

Please show me how you would do a binary search with arch.

I don't really like the arch model, it's far too restrictive and it's
jumping through hoops to get to an acceptable speed.
What I expect from a SCM is that it maintains both a version index of the
directory structure and a version index of the individual files. Arch
makes it especially painful to extract this data quickly. For the common
cases it throws disk space at the problem and does a lot of caching, but
there are still enough problems (e.g. annotate), which require scanning of
lots of tarballs.

bye, Roman

2005-04-09 01:01:37

by Marcin Dalecki

[permalink] [raw]
Subject: Re: Kernel SCM saga..


On 2005-04-08, at 18:15, Matthias-Christian Ott wrote:

> Linus Torvalds wrote:
>>
> SQL Databases like SQLite aren't slow.
> But maybe a Berkeley Database v.4 is a better solution.

Yes it sucks less for this purpose. See subversion as reference.

2005-04-09 01:03:25

by Marcin Dalecki

[permalink] [raw]
Subject: Re: Kernel SCM saga..


On 2005-04-07, at 09:44, Jan Hudec wrote:
>
> I have looked at most systems currently available. I would suggest the
> following for a closer look:
>
> 1) GNU Arch/Bazaar. They use the same archive format, simple, have the
> concepts right. It may need some scripts or add ons. When Bazaar-NG
> is ready, it will be able to read the GNU Arch/Bazaar archives so
> switching should be easy.

Arch isn't a sound example of software design. Quite contrary to the
random notes posted by its author, the following issues struck me
when I evaluated it:

The application (tla) claims to have "intuitive" command names. However
I didn't see that as given. Most of them were difficult to remember
and appeared to be just infantile. I stopped looking further after I
saw:

tla my-id instead of: tla user-id or even tla set id ...

tla make-archive instead of tla init

tla my-default-archive [email protected]

No more "My Computer" please...

Repository addressing requires you to use informally defined, very
elaborate and typing-error-prone conventions:

mkdir ~/{archives}
tla make-archive [email protected]
~/{archives}/2005-VersionPatrol

You notice the requirement for two commands to accomplish a single task
already well denoted by the second command? There is more of the same
at quite a few places when you try to use it. You notice the triple
zero it didn't catch?

As an added bonus it relies on the applications accidentally named
patch and diff, as well as a few others, being installed on the host
in question in order to operate.

Better don't waste your time looking at Arch. Stick with patches you
maintain by hand combined with some scripts containing a list of apply
commands, and you should still be more productive than when using Arch.

2005-04-09 01:04:30

by Marcin Dalecki

[permalink] [raw]
Subject: Re: Kernel SCM saga..


On 2005-04-06, at 23:13, [email protected] wrote:

> Linus Torvalds wrote:
>> PS. Don't bother telling me about subversion. If you must, start
>> reading
>> up on "monotone". That seems to be the most viable alternative, but
>> don't
>> pester the developers so much that they don't get any work done. They
>> are
>> already aware of my problems ;)
>
> By the way, the Subversion developers have no argument with the claim
> that Subversion would not be the right choice for Linux kernel
> development. We've written an open letter entitled "Please Stop
> Bugging Linus Torvalds About Subversion" to explain why:
>
> http://subversion.tigris.org/subversion-linus.html

Thumbs up, "Subverters"! I just love you. I love your attitude toward
high engineering quality. And I actually appreciate very much what you
provide as software, both in function and in terms of quality of
implementation.

2005-04-09 01:07:11

by Marcin Dalecki

[permalink] [raw]
Subject: Re: Kernel SCM saga..


On 2005-04-08, at 19:14, Linus Torvalds wrote:
>
> You do that with an sql database, and I'll be impressed.

It's possible. But what will impress you is either the price tag the
DB comes with or the hardware it runs on :-)

2005-04-09 01:11:52

by Marcin Dalecki

[permalink] [raw]
Subject: Re: Kernel SCM saga..


On 2005-04-08, at 20:14, Linus Torvalds wrote:

>
>
> On Fri, 8 Apr 2005, Matthias-Christian Ott wrote:
>>
>> Ok, but if you want to search for information in such big text files
>> it's slow, because you do a linear search
>
> No I don't. I don't search for _anything_. I have my own
> content-addressable filesystem, and I guarantee you that it's faster
> than
> mysql, because it depends on the kernel doing the right thing (which it
> does).

Linus.... Sorry, but you mistake the frequently seen abuse of SQL
databases as DATA storage for what SQL databases are good at storing:
well defined RELATIONS. Sure a filesystem is for data. SQL is for
relations.

2005-04-09 01:13:30

by Chris Wedgwood

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Sat, Apr 09, 2005 at 03:00:44AM +0200, Marcin Dalecki wrote:

> Yes it sucks less for this purpose. See subversion as reference.

Whatever solution people come up with, ideally it should be tolerant
to minor amounts of corruption (so I can recover the rest of my data
if need be) and it should also have decent sanity checks to find
corruption as soon as reasonably possible.

I've been bitten by problems that subversion didn't catch but bk did.
In the subversion case, by the time I noticed, much data was lost and
none of the subversion tools were able to recover the rest of it.

In the bk case, the data-loss was almost immediately noticeable and
only affected a few files making recovery much easier.

2005-04-09 01:15:45

by Marcin Dalecki

[permalink] [raw]
Subject: Re: Kernel SCM saga..


On 2005-04-08, at 20:28, Jon Smirl wrote:

> On Apr 8, 2005 2:14 PM, Linus Torvalds <[email protected]> wrote:
>> How do you replicate your database incrementally? I've given you
>> enough
>> clues to do it for "git" in probably five lines of perl.
>
> Efficient database replication is achieved by copying the transaction
> logs and then replaying them. Most mid to high end databases support
> this. You only need to copy the parts of the logs that you don't
> already have.
>
Databases supporting replication are called high end. You forgot the
dance around the network this issue involves.

2005-04-09 01:22:40

by Marcin Dalecki

[permalink] [raw]
Subject: Re: Kernel SCM saga..


On 2005-04-09, at 03:09, Chris Wedgwood wrote:

> On Sat, Apr 09, 2005 at 03:00:44AM +0200, Marcin Dalecki wrote:
>
>> Yes it sucks less for this purpose. See subversion as reference.
>
> Whatever solution people come up with, ideally it should be tolerant
> to minor amounts of corruption (so I can recover the rest of my data
> if need be) and it should also have decent sanity checks to find
> corruption as soon as reasonably possible.

Yes, this is the reason subversion is moving toward an alternative
back-end based on a custom DB mapped closely to the file system.

2005-04-09 01:25:58

by Tupshin Harper

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Roman Zippel wrote:

>
>
>Please show me how you would do a binary search with arch.
>
>I don't really like the arch model, it's far too restrictive and it's
>jumping through hoops to get to an acceptable speed.
>What I expect from a SCM is that it maintains both a version index of the
>directory structure and a version index of the individual files. Arch
>makes it especially painful to extract this data quickly. For the common
>cases it throws disk space at the problem and does a lot of caching, but
>there are still enough problems (e.g. annotate), which require scanning of
>lots of tarballs.
>
>bye, Roman
>
>
I'm not going to defend or attack arch since I haven't used it enough. I
will say that darcs largely does suffer from the same problem that you
describe since its fundamental unit of storage is individual patches
(though it avoids the tarball issue). This is why David Roundy has
indicated his intention of eventually having a per-file cache:
http://kerneltrap.org/mailarchive/1/message/24317/flat

You could then make the argument that if you have a per-file
representation of the history, why do you also need/want a per-patch
representation as the canonical format, but that's been argued plenty on
both the darcs and arch mailing lists and probably isn't worth going
into here.

-Tupshin

2005-04-09 01:54:06

by David Lang

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Sat, 9 Apr 2005, Marcin Dalecki wrote:

> On 2005-04-08, at 20:28, Jon Smirl wrote:
>
>> On Apr 8, 2005 2:14 PM, Linus Torvalds <[email protected]> wrote:
>>> How do you replicate your database incrementally? I've given you enough
>>> clues to do it for "git" in probably five lines of perl.
>>
>> Efficient database replication is achieved by copying the transaction
>> logs and then replaying them. Most mid to high end databases support
>> this. You only need to copy the parts of the logs that you don't
>> already have.
>>
> Databases supporting replication are called high end. You forgot the
> dance around the network this issue involves.

And Postgres (which is Free in all senses of the word) is high end by this
definition.

I'm not saying that it's an efficient thing to use for this task, but
don't be fooled into thinking you need something at the price of Oracle
to do this job.

David Lang

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2005-04-09 02:26:57

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Fri, Apr 08, 2005 at 05:12:49PM -0700, Linus Torvalds wrote:
> really designed for something like an offline http grabber, in that you can
> just grab files purely by filename (and verify that you got them right by
> running sha1sum on the resulting local copy). So think "wget".

I'm not entirely convinced wget is going to be an efficient way to
synchronize and fetch your tree; its simplicity is great, though. It's a
tradeoff between optimizing and re-using existing tools (like webservers).
Perhaps that's why you were compressing the stuff too? It sounds better
not to compress the stuff on-disk, and to synchronize with an rsync-like
protocol (an rsync server would do) that handles the compression in
the network protocol itself, and in turn can apply compression to a
large blob (i.e. the diff between the trees), and not to the single tiny
files.

2005-04-09 02:32:58

by David Lang

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Sat, 9 Apr 2005, Andrea Arcangeli wrote:

> On Fri, Apr 08, 2005 at 05:12:49PM -0700, Linus Torvalds wrote:
>> really designed for something like an offline http grabber, in that you can
>> just grab files purely by filename (and verify that you got them right by
>> running sha1sum on the resulting local copy). So think "wget".
>
> I'm not entirely convinced wget is going to be an efficient way to
> synchronize and fetch your tree; its simplicity is great, though. It's a
> tradeoff between optimizing and re-using existing tools (like webservers).
> Perhaps that's why you were compressing the stuff too? It sounds better
> not to compress the stuff on-disk, and to synchronize with an rsync-like
> protocol (an rsync server would do) that handles the compression in
> the network protocol itself, and in turn can apply compression to a
> large blob (i.e. the diff between the trees), and not to the single tiny
> files.

note that many webservers will compress the data for you on the fly as
well, so there's even less need to have it pre-compressed

David Lang

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2005-04-09 02:54:23

by Petr Baudis

[permalink] [raw]
Subject: Re: Re: Kernel SCM saga..

Hello,

Dear diary, on Fri, Apr 08, 2005 at 05:50:21PM CEST, I got a letter
where Linus Torvalds <[email protected]> told me that...
>
>
> On Fri, 8 Apr 2005 [email protected] wrote:
> >
> > Here's a partial solution. It does depend on a modified version of
> > cat-file that behaves like cat. I found it easier to have cat-file
> > just dump the object indicated on stdout. Trivial patch for that is included.
>
> Your trivial patch is trivially incorrect, though. First off, some files
> may be binary (and definitely are - the "tree" type object contains
> pathnames, and in order to avoid having to worry about special characters
> they are NUL-terminated), and your modified "cat-file" breaks that.
>
> Secondly, it doesn't check or print the tag.

FWIW, I made a few small fixes (to prevent some trivial usage errors
from causing cache corruption) and added scripts gitcommit.sh, gitadd.sh
and gitlog.sh - heavily inspired by what already went through the mailing
list. Everything is available at http://pasky.or.cz/~pasky/dev/git/
(including .dircache, even though it isn't shown in the index); the
cumulative patch can be found below. The scripts aim to provide a
somewhat more high-level (obviously very interim) interface for git.

I'm now working on tree-diff.c which will (surprise!) produce a diff
of two trees (I'll finish it after I get some sleep, though), and then I
will probably do some dwimmy gitdiff.sh wrapper for tree-diff and
show-diff. At that point I might get my hands on some kind of pull that
is kinder to local changes.
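
The concept is simple enough to sketch, though (this is just the idea
in Python, not the actual tree-diff.c, which works on the real tree
objects):

def tree_diff(old, new):
    # old and new map path -> sha1, as unpacked from two tree objects;
    # equal sha1 means equal content, so those entries can be skipped
    for path in sorted(set(old) | set(new)):
        a, b = old.get(path), new.get(path)
        if a == b:
            continue
        if a is None:
            print("added   %s %s" % (b, path))
        elif b is None:
            print("removed %s %s" % (a, path))
        else:
            print("changed %s -> %s %s" % (a, b, path))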

Kind regards,
Petr Baudis

diff -ruN git-0.03/gitadd.sh git-devel-clean/gitadd.sh
--- git-0.03/gitadd.sh 1970-01-01 01:00:00.000000000 +0100
+++ git-devel-clean/gitadd.sh 2005-04-09 03:17:34.220577000 +0200
@@ -0,0 +1,13 @@
+#!/bin/sh
+#
+# Add new file to a GIT repository.
+# Copyright (c) Petr Baudis, 2005
+#
+# Takes a list of file names at the command line, and schedules them
+# for addition to the GIT repository at the next commit.
+#
+# FIXME: Those files are omitted from show-diff output!
+
+for file in "$@"; do
+ echo $file >>.dircache/add-queue
+done
diff -ruN git-0.03/gitcommit.sh git-devel-clean/gitcommit.sh
--- git-0.03/gitcommit.sh 1970-01-01 01:00:00.000000000 +0100
+++ git-devel-clean/gitcommit.sh 2005-04-09 03:17:34.220577000 +0200
@@ -0,0 +1,36 @@
+#!/bin/sh
+#
+# Commit into a GIT repository.
+# Copyright (c) Petr Baudis, 2005
+# Based on an example script fragment sent to LKML by Linus Torvalds.
+#
+# Ignores any parameters for now, expects changelog entry on stdin.
+#
+# FIXME: Gets it wrong for filenames containing spaces.
+
+
+if [ -r .dircache/add-queue ]; then
+ mv .dircache/add-queue .dircache/add-queue-progress
+ addedfiles=$(cat .dircache/add-queue-progress)
+else
+ addedfiles=
+fi
+changedfiles=$(show-diff -s | grep -v ': ok$' | cut -d : -f 1)
+commitfiles="$addedfiles $changedfiles"
+if [ ! "$commitfiles" ]; then
+ echo 'Nothing to commit.' >&2
+ exit
+fi
+update-cache $commitfiles
+rm -f .dircache/add-queue-progress
+
+
+oldhead=$(cat .dircache/HEAD)
+treeid=$(write-tree)
+newhead=$(commit-tree $treeid -p $oldhead)
+
+if [ "$newhead" ]; then
+ echo $newhead >.dircache/HEAD
+else
+ echo "Error during commit (oldhead $oldhead, treeid $treeid)" >&2
+fi
diff -ruN git-0.03/gitlog.sh git-devel-clean/gitlog.sh
--- git-0.03/gitlog.sh 1970-01-01 01:00:00.000000000 +0100
+++ git-devel-clean/gitlog.sh 2005-04-09 04:28:51.227791000 +0200
@@ -0,0 +1,61 @@
+#!/bin/sh
+####
+#### Call this script with an object and it will produce the change
+#### information for all the parents of that object
+####
+#### This script was originally written by Ross Vandegrift.
+# multiple parents test 1d0f4aec21e5b66c441213643426c770dc6dedc0
+# parents: ffa098b2e187b71b86a76d3cd5eb77d074a2503c
+# 6860e0d9197c7f52155466c225baf39b42d62f63
+
+# regex for parent declarations (sha1 hashes are 40 lowercase hex digits)
+PARENTS="^parent [a-f0-9]{40}$"
+
+TMPCL="/tmp/gitlog.$$"
+
+# takes an object and generates the object's parent(s)
+function unpack_parents () {
+ echo "me $1"
+ echo "me $1" >>$TMPCL
+ RENTS=""
+
+ TMPCM=$(mktemp)
+ cat-file commit $1 >$TMPCM
+ while read line; do
+ if echo "$line" | egrep -q "$PARENTS"; then
+ RENTS="$RENTS "$(echo $line | sed 's/parent //g')
+ fi
+ echo $line
+ done <$TMPCM
+ rm $TMPCM
+
+ echo -e "\n--------------------------\n"
+
+ # if the last object had no parents, return
+ if [ ! "$RENTS" ]; then
+ return;
+ fi
+
+ #useful for testing
+ #echo $RENTS
+ #read
+ for i in `echo $RENTS`; do
+ # break cycles
+ if grep -q "me $i" $TMPCL; then
+ echo "Already visited $i" >&2
+ continue
+ else
+ unpack_parents $i
+ fi
+ done
+}
+
+base=$1
+if [ ! "$base" ]; then
+ base=$(cat .dircache/HEAD)
+fi
+
+rm -f $TMPCL
+unpack_parents $base
+rm -f $TMPCL
+
diff -ruN git-0.03/read-cache.c git-devel-clean/read-cache.c
--- git-0.03/read-cache.c 2005-04-08 22:51:35.000000000 +0200
+++ git-devel-clean/read-cache.c 2005-04-09 03:53:44.049642000 +0200
@@ -264,11 +264,12 @@
size = 0; // avoid gcc warning
map = (void *)-1;
if (!fstat(fd, &st)) {
- map = NULL;
size = st.st_size;
errno = EINVAL;
if (size > sizeof(struct cache_header))
map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
+ else
+ return (!hdr->entries) ? 0 : error("inconsistent cache");
}
close(fd);
if (-1 == (int)(long)map)
diff -ruN git-0.03/show-diff.c git-devel-clean/show-diff.c
--- git-0.03/show-diff.c 2005-04-08 17:55:09.000000000 +0200
+++ git-devel-clean/show-diff.c 2005-04-09 03:53:44.063638000 +0200
@@ -49,9 +49,17 @@

int main(int argc, char **argv)
{
+ int silent = 0;
int entries = read_cache();
int i;

+ while (argc-- > 1) {
+ if (!strcmp(argv[1], "-s"))
+ silent = 1;
+ else if (!strcmp(argv[1], "-h") || !strcmp(argv[1], "--help"))
+ usage("show-diff [-s]");
+ }
+
if (entries < 0) {
perror("read_cache");
exit(1);
@@ -77,6 +85,9 @@
for (n = 0; n < 20; n++)
printf("%02x", ce->sha1[n]);
printf("\n");
+ if (silent)
+ continue;
+
new = read_sha1_file(ce->sha1, type, &size);
show_differences(ce, &st, new, size);
free(new);
diff -ruN git-0.03/update-cache.c git-devel-clean/update-cache.c
--- git-0.03/update-cache.c 2005-04-08 17:53:44.000000000 +0200
+++ git-devel-clean/update-cache.c 2005-04-09 03:53:44.069637000 +0200
@@ -231,6 +231,9 @@
return -1;
}

+ if (argc < 2)
+ usage("update-cache <file>*");
+
newfd = open(".dircache/index.lock", O_RDWR | O_CREAT | O_EXCL, 0600);
if (newfd < 0) {
perror("unable to create new cachefile");

2005-04-09 02:54:44

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Fri, Apr 08, 2005 at 07:38:30PM -0400, Daniel Phillips wrote:
> For the immediate future, all we need is something that can _losslessly_
> capture the new metadata that's being generated. That buys time to bring one
> of the promising open source candidates up to full speed.

Agreed.

2005-04-09 03:06:05

by Brian Gerst

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Andrea Arcangeli wrote:
> On Fri, Apr 08, 2005 at 05:12:49PM -0700, Linus Torvalds wrote:
>
>>really designed for something like an offline http grabber, in that you can
>>just grab files purely by filename (and verify that you got them right by
>>running sha1sum on the resulting local copy). So think "wget".
>
>
> I'm not entirely convinced wget is going to be an efficient way to
> synchronize and fetch your tree; its simplicity is great, though. It's a
> tradeoff between optimizing and re-using existing tools (like webservers).
> Perhaps that's why you were compressing the stuff too? It sounds better
> not to compress the stuff on-disk, and to synchronize with an rsync-like
> protocol (an rsync server would do) that handles the compression in
> the network protocol itself, and in turn can apply compression to a
> large blob (i.e. the diff between the trees), and not to the single tiny
> files.

It's my understanding that the files don't change. Only new ones are
created for each revision.

--
Brian Gerst

2005-04-09 03:15:18

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Fri, Apr 08, 2005 at 11:08:58PM -0400, Brian Gerst wrote:
> It's my understanding that the files don't change. Only new ones are
> created for each revision.

I said diff between the trees, not diff between files ;). When you fetch
the new changes with rsync, it'll compress better and in turn it'll be
faster (assuming we're network bound, and I am with 1mbit and a 2.5ghz
cpu) if it's rsync applying gzip to the big "combined diff between
trees", instead of us compressing every single small file on disk, which
then won't compress any further inside rsync.

2005-04-09 04:09:23

by Walter Landry

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus Torvalds wrote:
> Which is why I'd love to hear from people who have actually used
> various SCM's with the kernel. There's bound to be people who have
> already tried.

At the end of my Codecon talk, there is a performance comparison of a
number of different distributed SCM's with the kernel.

http://superbeast.ucsd.edu/~landry/ArX/codecon/codecon.html

I develop ArX (http://www.nongnu.org/arx). You may find it of
interest ;)

Cheers,
Walter Landry
[email protected]

2005-04-09 05:43:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Sat, 9 Apr 2005, Andrea Arcangeli wrote:
>
> I'm not entirely convinced wget is going to be an efficient way to
> synchronize and fetch your tree

I don't think it's efficient per se, but I think it's important that
people can just "pass the files along". Ie it's a huge benefit if any
everyday mirror script (whether rsync, wget, homebrew or whatever) will
just automatically do the right thing.

> Perhaps that's why you were compressing the stuff too? It sounds better
> not to compress the stuff on-disk

I much prefer to waste some CPU time to save disk cache. Especially since
the compression is "free" if you do it early on (ie it's done only once,
since the files are stable). Also, if the difference is a 1.5GB kernel
repository or a 3GB kernel repository, I know which one I'll pick ;)

Also, I don't want people editing repository files by hand. Sure, the
sha1 catches it, but still... I'd rather force the low-level ops to use
the proper helper routines. Which is why it's a raw zlib compressed blob,
not a gzipped file.
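
Conceptually, writing an object is then just this (a rough Python
sketch; the exact header layout and whether the sha1 is computed before
or after compression are details where this may differ from the real
code):

import hashlib
import os
import zlib

def write_object(objtype, content, objdir=".dircache/objects"):
    # prepend a "type length" header, compress the whole thing with
    # raw zlib (not gzip), and file it under its own sha1 name
    obj = (objtype.encode() + b" " +
           str(len(content)).encode() + b"\0" + content)
    name = hashlib.sha1(obj).hexdigest()
    path = os.path.join(objdir, name)
    if not os.path.exists(path):
        # objects are immutable: the same name always means the same
        # content, so an existing file never needs rewriting
        os.makedirs(objdir, exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(obj))
    return name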

Linus

2005-04-09 07:10:05

by Randy.Dunlap

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Sat, 9 Apr 2005 04:53:57 +0200 Petr Baudis wrote:

| Hello,
|
| Dear diary, on Fri, Apr 08, 2005 at 05:50:21PM CEST, I got a letter
| where Linus Torvalds <[email protected]> told me that...
| >
| >
| > On Fri, 8 Apr 2005 [email protected] wrote:
| > >
| > > Here's a partial solution. It does depend on a modified version of
| > > cat-file that behaves like cat. I found it easier to have cat-file
| > > just dump the object indicated on stdout. Trivial patch for that is included.
| >
| > Your trivial patch is trivially incorrect, though. First off, some files
| > may be binary (and definitely are - the "tree" type object contains
| > pathnames, and in order to avoid having to worry about special characters
| > they are NUL-terminated), and your modified "cat-file" breaks that.
| >
| > Secondly, it doesn't check or print the tag.
|
| FWIW, I made a few small fixes (to prevent some trivial usage errors
| from causing cache corruption) and added scripts gitcommit.sh, gitadd.sh
| and gitlog.sh - heavily inspired by what already went through the mailing
| list. Everything is available at http://pasky.or.cz/~pasky/dev/git/
| (including .dircache, even though it isn't shown in the index); the
| cumulative patch can be found below. The scripts aim to provide a
| somewhat more high-level (obviously very interim) interface for git.
|
| I'm now working on tree-diff.c which will (surprise!) produce a diff
| of two trees (I'll finish it after I get some sleep, though), and then I
| will probably do some dwimmy gitdiff.sh wrapper for tree-diff and
| show-diff. At that point I might get my hands on some kind of pull that
| is kinder to local changes.

Hi,

I'll look at your scripts this weekend. I've also been
working on some, but mine are a bit more experimental (cruder)
than yours are. Anyway, here they are (attached) -- also
available at http://developer.osdl.org/rddunlap/git/

gitin : checkin/commit
gitwhat sha1 : what is that sha1 file (type and contents if blob or commit)
gitlist (blob, commit, tree, or all) :
list all objects with type (commit, tree, blob, or all)

---
~Randy


Attachments:
gitin (742.00 B)
gitlist (580.00 B)
gitwhat (533.00 B)
Download all attachments

2005-04-09 07:21:07

by Willy Tarreau

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Fri, Apr 08, 2005 at 12:03:49PM -0700, Linus Torvalds wrote:

> And if you do actively malicious things in your own directory, you get
> what you deserve. It's actually _hard_ to try to fool git into believing a
> file hasn't changed: you need to not only replace it with the exact same
> file length and ctime/mtime, you need to reuse the same inode/dev numbers
> (again - I didn't worry about portability, and filesystems where those
> aren't stable are a "don't do that then") and keep the mode the same. Oh,
> and uid/gid, but that was much me being silly.

It would be even easier to touch the tree with a known date before
patching (eg:1/1/70). It would protect against any accidental date
change if for any reason your system time went backwards while
working on the tree.

Another trick I use when I build the 2.4-hf patches is to build a
list of filenames from the patches. It works only because I want
to keep all original patches and no change should appear outside
those patches. Using this + cp -al + diff -pruN makes the process
very fast. It would not work if I had to rebuild those patches from
hand-edited files of course.

Last but not least, it only takes 0.26 seconds on my dual athlon
1800 to find date/size changes between 2.6.11{,.7}, and 4.7s if the
tool includes the md5 sum in its checks:

$ time flx check --ignore-owner --ignore-mode --ignore-ldate --ignore-dir \
--ignore-dot --only-new --ignore-sum linux-2.6.11/. linux-2.6.11.7/. |wc -l
47

real 0m0.255s
user 0m0.094s
sys 0m0.162s

$ time flx check --ignore-owner --ignore-mode --ignore-ldate --ignore-dir \
--ignore-dot --only-new linux-2.6.11/. linux-2.6.11.7/. |wc -l
47

real 0m4.705s
user 0m3.398s
sys 0m1.310s

(This was with 'flx', a tool a friend developed for file-system integrity
checking, which we also use to build our packages). Anyway, what I wanted
to show is that once the trees are cached, even somewhat heavy operations
such as checksumming can be done occasionally (such as md5 for double
checking) without you waiting too long. And I don't think that a database
would provide all the comfort of a standard file-system (cp -al, rsync,
choice of tools, etc...).

Willy

2005-04-09 07:37:43

by Willy Tarreau

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Fri, Apr 08, 2005 at 11:56:09AM -0700, Chris Wedgwood wrote:
> On Fri, Apr 08, 2005 at 11:47:10AM -0700, Linus Torvalds wrote:
>
> > Don't use NFS for development. It sucks for BK too.
>
> Some times NFS is unavoidable.
>
> In the best case (see previous email wrt to only stat'ing the parent
> directories when you can) for a current kernel though you can get away
> with 894 stats --- over NFS that would probably be tolerable.
>
> After claiming such an optimization is probably not worth while I'm
> now thinking for network filesystems it might be.

I've just checked, it takes 5.7s to compare 2.4.29{,-hf3} over NFS (13300
files each) and 1.3s once the trees are cached locally. This is without
comparing file contents, just meta-data. And it takes 19.33s to compare
the file's md5 sums once the trees are cached. I don't know if there are
ways to avoid some NFS operations when everything is cached.

Anyway, the system does not seem very efficient with hard links; it
caches the files twice :-(

Willy

2005-04-09 07:47:31

by NeilBrown

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Saturday April 9, [email protected] wrote:
>
> I've just checked, it takes 5.7s to compare 2.4.29{,-hf3} over NFS (13300
> files each) and 1.3s once the trees are cached locally. This is without
> comparing file contents, just meta-data. And it takes 19.33s to compare
> the file's md5 sums once the trees are cached. I don't know if there are
> ways to avoid some NFS operations when everything is cached.
>
> Anyway, the system does not seem very efficient with hard links; it
> caches the files twice :-(

I suspect you'll be wanting to add a "no_subtree_check" export option
on your NFS server...

NeilBrown

2005-04-09 08:01:20

by Willy Tarreau

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Sat, Apr 09, 2005 at 05:47:08PM +1000, Neil Brown wrote:
> On Saturday April 9, [email protected] wrote:
> >
> > I've just checked, it takes 5.7s to compare 2.4.29{,-hf3} over NFS (13300
> > files each) and 1.3s once the trees are cached locally. This is without
> > comparing file contents, just meta-data. And it takes 19.33s to compare
> > the file's md5 sums once the trees are cached. I don't know if there are
> > ways to avoid some NFS operations when everything is cached.
> >
> > Anyway, the system does not seem very efficient with hard links; it
> > caches the files twice :-(
>
> I suspect you'll be wanting to add a "no_subtree_check" export option
> on your NFS server...

Thanks a lot, Neil! This is very valuable information. I didn't
understand such implications from the exports(5) man page, but it
makes a great difference. And the diff sped up from 5.7 to 3.9s
and from 19.3 to 15.3s.

Cheers,
Willy

2005-04-09 08:32:33

by Jan Hudec

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Sat, Apr 09, 2005 at 03:01:29 +0200, Marcin Dalecki wrote:
>
> On 2005-04-07, at 09:44, Jan Hudec wrote:
> >
> >I have looked at most systems currently available. I would suggest the
> >following for a closer look:
> >
> >1) GNU Arch/Bazaar. They use the same archive format, simple, have the
> > concepts right. It may need some scripts or add ons. When Bazaar-NG
> > is ready, it will be able to read the GNU Arch/Bazaar archives so
> > switching should be easy.
>
> Arch isn't a sound example of software design. Quite contrary to the

I actually _do_ agree with you. I like Arch, but its user interface
certainly is broken and some parts of it sure would need some redesign.

> random notes posted by its author, the following issues struck me
> when I evaluated it:
>
> The application (tla) claims to have "intuitive" command names. However
> I didn't see that as given. Most of them were difficult to remember
> and appeared to be just infantile. I stopped looking further after I
> saw:
>
> tla my-id instead of: tla user-id or even tla set id ...
>
> tla make-archive instead of tla init

In this case, tla init would be a lot *worse*, because there are two
different things to initialize -- the archive and the tree. But
init-archive would be a little better, for consistency.

> tla my-default-archive [email protected]

This one is kinda broken. Even in concept it is.

> No more "My Compuer" please...
>
> Repository addressing requires you to use informally defined, very
> elaborate and typing-error-prone conventions:
>
> mkdir ~/{archives}

*NO*. Using this name is STRONGLY recommended *AGAINST*. Tom once used
it in an example or in some of his archives and people started doing it,
but it's a complete bogosity and it is not required anywhere.

> tla make-archive [email protected]
> ~/{archives}/2005-VersionPatrol
>
> You notice the requirement for two commands to accomplish a single task
> already well denoted by the second command? There is more of the same
> at quite a few places when you try to use it. You notice the triple
> zero it didn't catch?

I sure do. But the folks writing Bazaar are gradually fixing these.
There are a lot of them and it's not that long since they started, so
they have not fixed all of them yet, but I think they eventually will.

> As an added bonus it relies on the applications accidentally named
> patch and diff, as well as a few others, being installed on the host
> in question in order to operate.

No. The build process actually checks that the diff and patch
applications are actually the GNU Diff and GNU Patch in a sufficiently
recent version. That was not always the case, but now it does.

> Better don't waste your time looking at Arch. Stick with patches you
> maintain by hand combined with some scripts containing a list of apply
> commands, and you should still be more productive than when using Arch.

I don't agree with you. Using Arch is more productive (eg. because it
does merges), but certainly one could do a lot better than Arch does.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>


Attachments:
(No filename) (3.13 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2005-04-09 09:34:37

by NeilBrown

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Saturday April 9, [email protected] wrote:
> On Sat, Apr 09, 2005 at 05:47:08PM +1000, Neil Brown wrote:
> > On Saturday April 9, [email protected] wrote:
> > >
> > > I've just checked, it takes 5.7s to compare 2.4.29{,-hf3} over NFS (13300
> > > files each) and 1.3s once the trees are cached locally. This is without
> > > comparing file contents, just meta-data. And it takes 19.33s to compare
> > > the file's md5 sums once the trees are cached. I don't know if there are
> > > ways to avoid some NFS operations when everything is cached.
> > >
> > > Anyway, the system does not seem very efficient with hard links; it
> > > caches the files twice :-(
> >
> > I suspect you'll be wanting to add a "no_subtree_check" export option
> > on your NFS server...
>
> Thanks a lot, Neil! This is very valuable information. I didn't
> understand such implications from the exports(5) man page, but it
> makes a great difference. And the diff sped up from 5.7 to 3.9s
> and from 19.3 to 15.3s.

No, that implication had never really occurred to me before either.
But when you said "caches the file twice" it suddenly made sense.
With subtree_check, the NFS file handle contains information about the
directory, and NFS uses the filehandle as the primary key to tell if
two things are the same or not.

Trond keeps prodding me to make no_subtree_check the default. Maybe it
is time that I actually did....

NeilBrown

2005-04-09 11:02:44

by Samium Gromoff

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Ok, this was literally screaming for a rebuttal! :-)

> Arch isn't a sound example of software design. Quite contrary to the
> random notes posted by its author, the following issues struck me
> when I evaluated it:
(Note that here you take a stab at the Arch design fundamentals, but
actually fail to substantiate it later)

> The application (tla) claims to have "intuitive" command names. However
> I didn't see that as given. Most of them were difficult to remember
> and appeared to be just infantile. I stopped looking further after I
> saw:
[ UI issues snipped, not really core design ]

Yes, some people perceive that there _are_ UI issues in Arch.
However, as strange as it may sound, some don't feel so.

> As an added bonus it relies on the applications accidentally named
> patch and diff, as well as a few others, being installed on the host
> in question in order to operate.

This is called modularity and code reuse.

And given that patch and diff are installed by default on all of the
relevant developer machines, I fail to see why that is in any way
derogatory.

(and the rest you speak about is tar and gzip)

> Better don't waste your time looking at Arch. Stick with patches you
> maintain by hand combined with some scripts containing a list of apply
> commands, and you should still be more productive than when using Arch.

Sure, you should've come up with something more substantial than that! :-)

Now to the real design issues...

Globally unique, meaningful, symbolic revision names -- the core of the
Arch namespace.

"Stone simple" on-disk format to store things -- a hierarchy
of directories with textual files and tarballs.

No smart server -- any sftp, ftp, webdav (or just http for read-only access)
server is exactly up to the task.

O(0) branching -- a branch is simply a tag, a continuation from some
point of development. A network-capable symlink, if you like. It is
actually made possible by the global Arch namespace.

Revision ancestry graph, of course. Enables smart merging.

Now, to the features:

Archives/revisions are trivially crypto-signed -- thanks to the "stone-simple"
on-disk format.

Trivial push/pull mirroring -- a mirror is exactly a read-only archive,
and can be turned into a full-blown archive by removal of a single
file.

Revision libraries as client-side operation speedup mechanism with partially
automated updates.

Cached revisions as server-side speedup.

Possibility for hardlinked checkouts for local archives. This requires that
your text editor is smart and deletes the original file when it writes
changes.

Various pre/post/whatever-commit hooks.

That much for starters... :-)

---
cheers,
Samium Gromoff

2005-04-09 11:29:58

by Samium Gromoff

[permalink] [raw]
Subject: Re: Kernel SCM saga..

It seems that Tom Lord, the primary architect behind GNU Arch,
has recently published an open letter to Linus Torvalds.

Because no open letter to Linus would be really open without an
accompanying reference post on lkml, here it is:

http://lists.seyza.com/pipermail/gnu-arch-dev/2005-April/001001.html

---
cheers,
Samium Gromoff

2005-04-09 15:18:11

by Paul Jackson

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus wrote:
> you need to reuse the same inode/dev numbers
> (again - I didn't worry about portability, and filesystems where those
> aren't stable are a "don't do that then")

On filesystems that don't have a stable inode number, I use the md5sum
of the full (relative to mount point) pathname as the inode number.

Since these same file systems (not surprisingly) lack hard links as
well, the pathname _is_ essentially the stable inode number.


Off-topic details ...

This is on my backup program, which does a full snapshot of my 90 Gb
system, including some FAT file systems, in 6 or 7 minutes, plus time
proportional to actual changes. I gave up on finding a backup
program I can tolerate, and wrote my own. It stores each md5sum-unique
blob exactly once, but uses the same sort of tricks you describe to
detect changes from examining just the stat information so as to avoid
reading every damn byte on the disk. It works with smb, fat, vfat,
ntfs, reiserfs, xfs, ext2/3, ... A single manifest file, in plain
ascii, one file per line, captures a full snapshot, disk-to-disk, every
few hours.

This comment from my backup source explains more:

# Unfortunately, fat, vfat, smb, and ncpfs (Netware) file systems
# do not have unique disk-based persistent inode numbers.
# The kernel constructs transient inode numbers for inodes
# in its cache. But after an umount and re-mount, the inode
# numbers are all different. So we would end up recalculating
# the md5sums of all files in any such file systems.
#
# To avoid this, we keep track of which directories are on such
# file systems, and for files in any such directory, instead
# of using the inode value from stat'ing a file, we use the
# md5sum of its path as a pseudo-inode number. This digest of
# a file's path has improved persistence over its transiently
# assigned inode number. Fields 5,6,7 (files total, free and
# avail) happen to be zero on file systems (fat, vfat, smb,
# ...) with no real inodes, so we use this fallback means
# of getting a persistent pseudo-inode if a statvfs() call on
# its directory has fields 5,6,7 summing to zero:
# sum(os.statvfs(dir)[5:8]) == 0
# We include that dir in the fat_directories set in this case.

fat_directories = sets.Set() # set of directory paths on FAT file systems

# The Python statvfs() on Linux is a tad expensive - the
# glibc statvfs(2) code does several system calls, including
# scanning /proc/mounts and stat'ing its entries. We need
# to know for each file whether it is on a "fat" file system
# (see above), but for efficiency we only statvfs at mount
# points, then propagate the file system type from there down.

mountpoints = [m.split()[1] for m in open("/proc/mounts")]
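
Put together, the fallback looks roughly like this (a condensed sketch
of the above, not the actual backup script):

import hashlib
import os

def on_fat(dirpath):
    # fat/vfat/smb-style filesystems report zero total/free/avail
    # inodes in statvfs, which is how we recognize them
    st = os.statvfs(dirpath)
    return st.f_files + st.f_ffree + st.f_favail == 0

def pseudo_inode(path):
    # md5 of the pathname stands in for the inode number where the
    # real one isn't persistent across mounts
    if on_fat(os.path.dirname(path) or "."):
        return int(hashlib.md5(path.encode()).hexdigest(), 16)
    return os.stat(path).st_ino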



--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-09 15:42:35

by Paul Jackson

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus wrote:
> then git will open exactly _one_
> file (no searching, no messing around), which contains absolutely nothing
> except for the compressed (and SHA1-signed) old contents of the file. It
> obviously _has_ to do that, because in order to know whether you've
> changed it, it needs to now compare it to the original.

I must be missing something here ...

If the stat shows a possible change, then you shouldn't have to open the
original version to determine if it really changed - just compute the
SHA1 of the new file, and see if that changed from the original SHA1.
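
In other words, something like this (a sketch; the set of cached stat
fields mirrors what was described earlier in the thread, and the dict
layout is mine):

import hashlib
import os

def changed(path, cached):
    st = os.stat(path)
    old = (cached["size"], cached["mtime"], cached["ctime"],
           cached["ino"], cached["dev"])
    now = (st.st_size, st.st_mtime, st.st_ctime, st.st_ino, st.st_dev)
    if now == old:
        return False  # stat matches: assume unchanged, open nothing
    with open(path, "rb") as f:
        # stat differs: recompute the sha1 and compare with the recorded
        # one, without ever unpacking the old contents
        return hashlib.sha1(f.read()).hexdigest() != cached["sha1"]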

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-09 15:44:05

by Paul Jackson

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Marcin wrote:
> But what will impress you is either the price tag the DB comes with
> or the hardware it runs on :-)

The payroll for the staff to care for and feed these
babies is often impressive as well.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-09 15:50:43

by Paul Jackson

[permalink] [raw]
Subject: Re: Kernel SCM saga..

> in order to avoid having to worry about special characters
> they are NUL-terminated)

Would this be a possible alternative - newline terminated (convert any
newlines embedded in filenames to the 3 chars '%0A', and leave it as an
exercise to the reader to de-convert them.)

Line formatted ASCII files are really nice - worth pissing on embedded
newlines in paths to obtain.
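
The conversion itself is trivial in each direction (a sketch; note that
a file literally named "foo%0Abar" collides after a round trip, which
is exactly the second-class service being proposed):

def escape(path):
    # newline is the only character touched
    return path.replace("\n", "%0A")

def unescape(path):
    return path.replace("%0A", "\n")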

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-09 16:14:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Sat, 9 Apr 2005, Paul Jackson wrote:
>
> I must be missing something here ...
>
> If the stat shows a possible change, then you shouldn't have to open the
> original version to determine if it really changed - just compute the
> SHA1 of the new file, and see if that changed from the original SHA1.

Yes. However, I've got two reasons for this:

(a) it may actually be cheaper to just unpack the compressed thing than
it is to compute the sha, _especially_ since it's very likely that
you have to do that anyway (ie if it turns out that they _are_
different, you need the unpacked data to then look at the
differences).

So when you come from your backup angle, you only care about "has it
changed", and you'll do a backup. In "git", you usually care about
the old contents too.

(b) while I depend on the fact that if the SHA of an object matches, the
objects are the same, I generally try to avoid the reverse
dependency. Why? Because if I end up changing the way I pack objects,
and still want to work with old objects, I may end up in the
situation that two identical objects could get different object
names.

I don't actually know how valid a point "(b)" is, and I don't think it's
likely, but imagine that SHA1 ends up being broken (*) and I decide that I
want to pack new objects with a new-and-improved-SHA256 or something. Such
a thing would obviously mean that you end up with lots of _duplicate_ data
(any new data that is repackaged with the new name will now cause a new
git object), but "duplicate" is better than "broken".

I don't actually guarantee that "git" could handle that right, but I've
been idly trying to avoid locking myself into the mindset that "file
equality has to mean name equality over the long run". So while the system
right now works on the 1:1 "name" <-> "content" mapping, it's possible
that it _could_ work with a more relaxed 1:n "content" -> "name" mapping.

But it's entirely possible that I'm being a git about this.

Linus

(*) yeah, yeah, I know about the current theoretical case, and I don't
care. Not only is it theoretical, the way my objects are packed you'd have
to not just generate the same SHA1 for it, it would have to _also_ still
be a valid zlib object _and_ get the header to match the "type + length"
of object part. IOW, the object validity checks are actually even stricter
than just "sha1 matches".

2005-04-09 16:21:27

by Paul Jackson

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus wrote:
> If you want to have spaces
> and newlines in your pathname, go wild.

So long as there is only one pathname in a record, you don't need
nul-terminators to allow spaces in the name. The rest of the record
is well known, so the pathname is just whatever is left after chomping
off the rest of the record.

It's only the support for embedded newlines that forces you to use
nul-terminators.

Not worth it - in my view. Rather, do just enough hackery that
such a pathname doesn't break you, even if it means not giving
full service to such names.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-09 16:23:09

by David Roundy

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, Apr 07, 2005 at 12:30:18PM +0200, Matthias Andree wrote:
> On Thu, 07 Apr 2005, Sergei Organov wrote:
> > darcs? <http://www.abridgegame.org/darcs/>
>
> Close. Some things:
>
> 1. It's rather slow and quite CPU consuming and certainly I/O consuming
> at times - I keep, to try it out, leafnode-2 in a DARCS repo, which
> has a mere 20,000 lines in 140 files, with 1,436 changes so far, on a
> RAID-1 with two 7200/min disk drives, with an Athlon XP 2500+ with
> 512 MB RAM. The repo has 1,700 files in 11.5 MB, the source itself
> 189 files in 1.8 MB.
>
> Example: darcs annotate nntpd.c takes 23 s. (2,660 lines, 60 kByte)
>
> The maintainer himself states that there's still optimization required.

Indeed, there's still a lot of optimization to be done. I've recently
made some improvements which will reduce the memory use (and speed
things up) for a few of the worst-performing commands. No improvement to
the initial record, but on the plus side, that's only done once. But I was
able to cut down the memory used checking out a kernel repository to 500m.
(Which, sadly enough, is a major improvement.)

You would do much better if you recorded the initial state one directory at
a time, since it's the size of the largest changeset that determines the
memory use on checkout, but that's ugly.

> Getting DARCS up to the task would probably require some polishing, and
> should probably be discussed with the DARCS maintainer before making
> this decision.
>
> Don't get me wrong, DARCS looks promising, but I'm not convinced it's
> ready for the linux kernel yet.

Indeed, I do believe that darcs has a way to go before it'll perform
acceptably on the kernel. On the other hand, tar seems to perform
unacceptably slow on the kernel, so I'm not sure how slow is too slow.
Definitely input from interested kernel developers on which commands are
too slow would be welcome.
--
David Roundy
http://www.darcs.net

2005-04-09 16:24:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Sat, 9 Apr 2005, Paul Jackson wrote:
>
> > in order to avoid having to worry about special characters
> > they are NUL-terminated)
>
> Would this be a possible alternative - newline terminated (convert any
> newlines embedded in filenames to the 3 chars '%0A', and leave it as an
> exercise to the reader to de-convert them.)

Sure, you could obviously do escaping (you need to remember to escape '%'
too when you do that ;).

However, whenever you do escaping, that means that you're already going to
have to use a tool to unpack the dang thing. So you didn't actually win
anything. I pretty much guarantee that my existing format is easier to
unpack than your escaped format.

ASCII isn't magical.

This is "fsck_tree()", which walks the unpacked tree representation and
checks that it looks sane and marks the sha1's it finds as being
needed (so that you can do reachability analysis in a second pass). It's
not exactly complicated:

static int fsck_tree(unsigned char *sha1, void *data, unsigned long size)
{
	while (size) {
		/* each entry is "<mode> <path>\0" followed by a raw 20-byte sha1 */
		int len = 1+strlen(data);
		unsigned char *file_sha1 = data + len;
		char *path = strchr(data, ' ');
		if (size < len + 20 || !path)
			return -1;
		data += len + 20;
		size -= len + 20;
		/* remember that this blob must exist, for the second pass */
		mark_needs_sha1(sha1, "blob", file_sha1);
	}
	return 0;
}

and there's one HUGE advantage to _not_ having escaping: sorting and
comparing.

If you escape things, you now have to decide how you sort filenames. Do
you sort them by the escaped representation, or by the "raw"
representation? Do you always have to escape or unescape the name in order
to sort it?

So I like ASCII as much as the next guy, but it's not a religion. If there
isn't any point to it, there isn't any point to it.

The biggest irritation I have with the "tree" format I chose is actually
not the name (which is trivial), it's the <sha1> part. Almost everything
else keeps the <sha1> in the ASCII hexadecimal representation, and I
should have done that here too. Why? Not because it's a <sha1> - hey, the
binary representation is certainly denser and equivalent - but because an
ASCII representation there would have allowed me to much more easily
change the key format if I ever wanted to. Now it's very SHA1-specific.

Which I guess is fine - I don't really see any reason to change, and if I
do change, I could always just re-generate the whole tree. But I think it
would have been cleaner to have _that_ part in ASCII.

Linus

2005-04-09 16:33:55

by Roman Zippel

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Hi,

On Fri, 8 Apr 2005, Linus Torvalds wrote:

> Also, I suspect that BKCVS actually bothers to get more details out of a
> BK tree than I cared about. People have pestered Larry about it, so BKCVS
> exports a lot of the nitty-gritty (per-file comments etc) that just
> doesn't actually _matter_, but people whine about. Me, I don't care. My
> sparse-conversion just took the important parts.

As soon as you want to synchronize and merge two trees, you will know why
this information does matter.
(/me looks closer at the sparse-conversion...)
It seems you exported the complete parent information. This is exactly
the "nitty-gritty" I was "whining" about, which is not available via
bkcvs or bkweb, and it's the most crucial information for making the bk
data useful outside of bk. Larry was previously very clear that he
considers this proprietary bk meta data and that anyone attempting to
export this information is in violation of the free bk licence, so you
indeed just took the important parts, and this is/was explicitly verboten
for normal bk users.

bye, Roman

2005-04-09 16:52:34

by Eric D. Mudama

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Apr 8, 2005 4:52 PM, Roman Zippel <[email protected]> wrote:
> The problem is you pay a price for this. There must be a reason developers
> were adding another GB of memory just to run BK.
> Preserving the complete merge history does indeed make repeated merges
> simpler, but it builds up complex meta data, which has to be managed
> forever. I doubt that this is really an advantage in the long term. I
> expect that we were better off serializing changesets in the main
> repository. For example bk does something like this:
>
> A1 -> A2 -> A3 -> BM
> \-> B1 -> B2 --^
>
> and instead of creating the merge changeset, one could merge them like
> this:
>
> A1 -> A2 -> A3 -> B1 -> B2
>
> This results in a simpler repository, which is more scalable and which
> is easier for users to work with (e.g. binary bug search).
> The disadvantage would be it will cause more minor conflicts, when changes
> are pulled back into the original tree, but which should be easily
> resolvable most of the time.

The kicker is that B1 was developed based on A1, so any test
results were based on B1 being a single changeset delta away from A1.
If the resulting 'BM' fails testing, and you've converted into the
linear model above where B2 has failed, you lose the ability to
isolate B1's changes and where they came from, to revalidate the
developer's results.

With bugs and fixes that can be validated in a few hours, this may not
be a problem, but when chasing a bug that takes days or weeks to
manifest, that a developer swears they fixed, one has to be able to
reproduce their exact test environment.

I believe that flattening the change graph makes history reproduction
impossible. Alternately, you are requiring each developer to test the
merge result at B1 + A1..3 before submission, but then every merge
demands an additional test period, and with sufficient velocity the
process might never close. This is the problem CVS has if you don't
create micro branches for every single modification.

--eric

2005-04-09 17:08:42

by Paul Jackson

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus wrote:
> (you need to remember to escape '%'
> too when you do that ;).

No - don't have to. Not if I don't mind giving fools that embed
newlines in paths second-class service.

In my case, if I create a file named "foo\nbar", then backup and restore
it, I end up with a restored file named "foo%0Abar". If I had backed up
another file named "foo%0Abar", and now restore it, it collides, and
last one to be restored wins. If I really need the "foo\nbar" file back
as originally named, I will have to dig it out by hand.

I dare say that Linux kernel source does not require first class support
for newlines embedded in pathnames.
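
A sketch of the policy being described (an assumption about the behavior
of Paul's backup tool, not his actual code): only the newline is escaped,
'%' is deliberately left alone, so the literal name "foo%0Abar" collides
with the escaped form of "foo\nbar" on restore:

#include <stdio.h>
#include <string.h>

/* Escape only '\n' as "%0A"; everything else - including '%' -
 * passes through untouched, which is where the collision comes from. */
static void escape_path(const char *in, char *out, size_t outlen)
{
	size_t o = 0;

	for (; *in && o + 4 < outlen; in++) {
		if (*in == '\n') {
			memcpy(out + o, "%0A", 3);
			o += 3;
		} else {
			out[o++] = *in;
		}
	}
	out[o] = '\0';
}

int main(void)
{
	char buf[64];

	escape_path("foo\nbar", buf, sizeof(buf));
	printf("%s\n", buf);	/* "foo%0Abar" - same as the literal name */
	return 0;
}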

> ASCII isn't magical.

No - but it's damn convenient. A lot of tools work on line-oriented
ASCII and don't work on anything else.

I guess Perl hackers won't care much, but those working with either
classic shell script tools or Python will find line-formatted ASCII more
convenient.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-09 17:14:13

by Roman Zippel

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Hi,

On Fri, 8 Apr 2005, Linus Torvalds wrote:

> Yes. Per-file history is expensive in git, because of the way it is
> indexed. Things are indexed by tree and by changeset, and there are no
> per-file indexes.
>
> You could create per-file _caches_ (*) on top of git if you wanted to make
> it behave more like a real SCM, but yes, it's all definitely optimized for
> the things that _I_ tend to care about, which is the whole-repository
> operations.

Per-file history is also expensive for another reason: I think hash-based
storage is simply not the best approach for an SCM. It lacks locality,
so the more it grows, the more it has to seek to collect all the data.
To reduce the space usage you could replace the parent file with a sha1
reference + delta to the new file. This is basically what monotone does,
and it may cause performance problems if you need to restore old versions
(e.g. if you want to annotate a file).

bye, Roman

2005-04-09 17:18:42

by Paul Jackson

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus wrote:
> In "git", you usually care about
> the old contents too.

True - in your case, you probably want the old contents
so might as well dig them out as soon as it becomes
convenient to have them.

I was objecting to your claim that you _had_ to dig out
the old contents to determine if a file changed.

You don't _have_ to ... but I agree that it's a good
time to do so.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-09 17:36:06

by Paul Jackson

[permalink] [raw]
Subject: Re: Kernel SCM saga..

> (b) while I depend on the fact that if the SHA of an object matches, the
> objects are the same, I generally try to avoid the reverse
> dependency.

It might be a valid point that you want to leave the door open to using
a different (than SHA1) digest. (So this means you're going to store it
as an ASCII string, right?)

But I don't see how that applies here. Any optimization that avoids
rereading old versions if the digests match will never trigger on the
day you change digests. No problem here - you're doomed to reread the old
version in any case.

Either you got your logic backwards, or I need another cup of coffee.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-09 17:41:06

by Roman Zippel

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Hi,

On Sat, 9 Apr 2005, Eric D. Mudama wrote:

> > For example bk does something like this:
> >
> > A1 -> A2 -> A3 -> BM
> > \-> B1 -> B2 --^
> >
> > and instead of creating the merge changeset, one could merge them like
> > this:
> >
> > A1 -> A2 -> A3 -> B1 -> B2
> >
> > This results in a simpler repository, which is more scalable and which
> > is easier for users to work with (e.g. binary bug search).
> > The disadvantage would be it will cause more minor conflicts, when changes
> > are pulled back into the original tree, but which should be easily
> > resolvable most of the time.
>
> The kicker comes that B1 was developed based on A1, so any test
> results were based on B1 being a single changeset delta away from A1.
> If the resulting 'BM' fails testing, and you've converted into the
> linear model above where B2 has failed, you lose the ability to
> isolate B1's changes and where they came from, to revalidate the
> developer's results.

What good does it do if you can revalidate the original B1? The important
point is that the end result works, and if it only fails in the merged
version you have a big problem. The serialized version gives you the
chance to test whether it fails in B1 or B2.

> I believe that flattening the change graph makes history reproduction
> impossible, or alternately, you are imposing on each developer to test
> the merge results at B1 + A1..3 before submission, but in doing so,
> the test time may require additional test periods etc and with
> sufficient velocity, might never close.

The merge result has to be tested either way, so I'm not exactly sure
what you're trying to say.

bye, Roman

2005-04-09 18:18:18

by Petr Baudis

[permalink] [raw]
Subject: [PATCH] Re: Kernel SCM saga..

Dear diary, on Sat, Apr 09, 2005 at 09:08:59AM CEST, I got a letter
where "Randy.Dunlap" <[email protected]> told me that...
> On Sat, 9 Apr 2005 04:53:57 +0200 Petr Baudis wrote:
..snip..
> | FWIW, I made a few small fixes (to prevent some trivial usage errors from
> | causing cache corruption) and added scripts gitcommit.sh, gitadd.sh and
> | gitlog.sh - heavily inspired by what already went through the mailing
> | list. Everything is available at http://pasky.or.cz/~pasky/dev/git/
> | (including .dircache, even though it isn't shown in the index), the
> | cumulative patch can be found below. The scripts aim to provide some
> | (obviously very interim) more high-level interface for git.
> |
> | I'm now working on tree-diff.c which will (surprise!) produce a diff
> | of two trees (I'll finish it after I get some sleep, though), and then I
> | will probably do some dwimmy gitdiff.sh wrapper for tree-diff and
> | show-diff. At that point I might get my hands on a pull that is kinder
> | to local changes.
>
> Hi,

Hi,

> I'll look at your scripts this weekend. I've also been
> working on some, but mine are a bit more experimental (cruder)
> than yours are. Anyway, here they are (attached) -- also
> available at http://developer.osdl.org/rddunlap/git/
>
> gitin : checkin/commit
> gitwhat sha1 : what is that sha1 file (type and contents if blob or commit)
> gitlist (blob, commit, tree, or all) :
> list all objects with type (commit, tree, blob, or all)

thanks - I had a look, but so far I borrowed only the prompt message
from your gitin. ;-) I'm not sure if gitwhat would be useful for me in
any way and gitlist doesn't appear too practical to me either.

In the meantime, I've made some progress too. I made ls-tree, which
will just convert the tree object to a human-readable (and script-
processable) form, and a wrapper gitls.sh, which will also try to guess
the tree ID. parent-id will just return the commit ID(s) of the previous
commit(s), handy if you want to diff against the previous commit
easily, etc. And finally, there is gitdiff.sh, which will produce a diff
of any two trees.

Everything is again available at http://pasky.or.cz/~pasky/dev/git/
and again including .dircache, even though it's invisible in the index.
The cumulative patch (against 0.03) is there as well as below, generated
by the

./gitdiff.sh 0af20307bb4c634722af0f9203dac7b3222c4a4f

command. The empty entries are changed modes (664 vs. 644); I still have
to think about how to denote them when the content didn't change, or I
might ignore them altogether...?

You can obviously fetch any arbitrary change by doing the appropriate
gitdiff.sh call. You can find the ids in the ChangeLog, which was
generated by the plain

./gitlog.sh

command. (That is for HEAD. 0af20307bb4c634722af0f9203dac7b3222c4a4f is
the last commit on Linus' branch; pass that to gitlog.sh to get his
ChangeLog. ;-)

Next, I will probably do some bk-style pull tool. Or perhaps first
a gitpatch.sh which will verify the sha1s and do the mode changes.

Linus, could you please have a look and tell me what you think
about it so far?

Thanks,

Petr Baudis

Index: Makefile
===================================================================
--- 6be98a9e92a3f131a3fdf0dc3a8576fba6421569/Makefile (mode:100664 sha1:270cd4f8a8bf10cd513b489c4aaf76c14d4504a7)
+++ 3f6cc0ad3e076e05281438b0de69a7d6a5522d17/Makefile (mode:100644 sha1:185ff422e68984e68da011509dec116f05fc6f8d)
@@ -1,7 +1,7 @@
CFLAGS=-g -O3 -Wall
CC=gcc

-PROG=update-cache show-diff init-db write-tree read-tree commit-tree cat-file fsck-cache
+PROG=update-cache show-diff init-db write-tree read-tree commit-tree cat-file fsck-cache ls-tree

all: $(PROG)

@@ -30,6 +30,9 @@
cat-file: cat-file.o read-cache.o
$(CC) $(CFLAGS) -o cat-file cat-file.o read-cache.o $(LIBS)

+ls-tree: ls-tree.o read-cache.o
+ $(CC) $(CFLAGS) -o ls-tree ls-tree.o read-cache.o $(LIBS)
+
fsck-cache: fsck-cache.o read-cache.o
$(CC) $(CFLAGS) -o fsck-cache fsck-cache.o read-cache.o $(LIBS)

Index: README
===================================================================
Index: cache.h
===================================================================
Index: cat-file.c
===================================================================
Index: commit-tree.c
===================================================================
Index: fsck-cache.c
===================================================================
Index: gitadd.sh
===================================================================
--- 6be98a9e92a3f131a3fdf0dc3a8576fba6421569/gitadd.sh
+++ 3f6cc0ad3e076e05281438b0de69a7d6a5522d17/gitadd.sh (mode:100755 sha1:d23be758c0c9fc1cf9756bcd3ee4d7266c60a2c9)
@@ -0,0 +1,13 @@
+#!/bin/sh
+#
+# Add new file to a GIT repository.
+# Copyright (c) Petr Baudis, 2005
+#
+# Takes a list of file names at the command line, and schedules them
+# for addition to the GIT repository at the next commit.
+#
+# FIXME: Those files are omitted from show-diff output!
+
+for file in "$@"; do
+ echo $file >>.dircache/add-queue
+done
Index: gitcommit.sh
===================================================================
--- 6be98a9e92a3f131a3fdf0dc3a8576fba6421569/gitcommit.sh
+++ 3f6cc0ad3e076e05281438b0de69a7d6a5522d17/gitcommit.sh (mode:100755 sha1:67a743c6cbc9dffaa6f571d3dc83ceec2bd0c039)
@@ -0,0 +1,38 @@
+#!/bin/sh
+#
+# Commit into a GIT repository.
+# Copyright (c) Petr Baudis, 2005
+# Based on an example script fragment sent to LKML by Linus Torvalds.
+#
+# Ignores any parameters for now, excepts changelog entry on stdin.
+#
+# FIXME: Gets it wrong for filenames containing spaces.
+
+
+if [ -r .dircache/add-queue ]; then
+ mv .dircache/add-queue .dircache/add-queue-progress
+ addedfiles=$(cat .dircache/add-queue-progress)
+else
+ addedfiles=
+fi
+changedfiles=$(show-diff -s | grep -v ': ok$' | cut -d : -f 1)
+commitfiles="$addedfiles $changedfiles"
+if [ ! "$commitfiles" ]; then
+ echo 'Nothing to commit.' >&2
+ exit
+fi
+update-cache $commitfiles
+rm -f .dircache/add-queue-progress
+
+
+oldhead=$(cat .dircache/HEAD)
+treeid=$(write-tree)
+
+echo "Enter commit message, terminated by ctrl-D on a separate line:" >&2
+newhead=$(commit-tree $treeid -p $oldhead)
+
+if [ "$newhead" ]; then
+ echo $newhead >.dircache/HEAD
+else
+ echo "Error during commit (oldhead $oldhead, treeid $treeid)" >&2
+fi
Index: gitdiff.sh
===================================================================
--- 6be98a9e92a3f131a3fdf0dc3a8576fba6421569/gitdiff.sh
+++ 3f6cc0ad3e076e05281438b0de69a7d6a5522d17/gitdiff.sh (mode:100755 sha1:17aec840c7c0e0b4e4e78fd94b754fe6bc2f2ff2)
@@ -0,0 +1,104 @@
+#!/bin/sh
+#
+# Make a diff between two GIT trees.
+# Copyright (c) Petr Baudis, 2005
+#
+# Takes two parameters identifying the two trees/commits to compare.
+# Empty string will be substitued to HEAD revision.
+#
+# Outputs a diff converting the first tree to the second one.
+
+
+TREE="^tree [A-z0-9]{40}$"
+
+tree1ls=$(mktemp -t gitdiff.XXXXXX)
+tree2ls=$(mktemp -t gitdiff.XXXXXX)
+diffdir=$(mktemp -d -t gitdiff.XXXXXX)
+
+function die () {
+ echo gitdiff: $@ >&2
+ rm -f "$tree1ls" "$tree2ls"
+ rm -rf "$diffdir"
+ exit
+}
+
+function normalize_id () {
+ # XXX: This is basically a copy of gitls.sh
+ id=$1
+ if [ ! "$id" ]; then
+ id=$(cat .dircache/HEAD)
+ fi
+ if [ $(cat-file -t "$id") = "commit" ]; then
+ id=$(cat-file commit $id | egrep "$TREE" | cut -d ' ' -f 2)
+ fi
+ if [ ! $(cat-file -t "$id") = "tree" ]; then
+ die "Invalid ID supplied: $id"
+ fi
+ echo $id
+}
+
+function mkdiff () {
+ loc=$1; treeid=$2; fname=$3; mode=$4; sha1=$5;
+
+ if [ x"$sha1" != x"!" ]; then
+ cat-file blob $sha1 >$loc
+ else
+ >$loc
+ fi
+
+ label="$treeid/$fname";
+
+ labelapp=""
+ [ x"$mode" != x"!" ] && labelapp="$labelapp mode:$mode"
+ [ x"$sha1" != x"!" ] && labelapp="$labelapp sha1:$sha1"
+ labelapp=$(echo "$labelapp" | sed 's/^ *//')
+
+ [ "$labelapp" ] && label="$label ($labelapp)"
+
+ echo $label
+}
+
+id1=$(normalize_id "$1")
+id2=$(normalize_id "$2")
+
+[ "$2" != "$1" ] || die "Cannot diff tree against itself."
+
+ls-tree "$id1" >$tree1ls
+[ -s "$tree1ls" ] || die "Error retrieving the first tree."
+ls-tree "$id2" >$tree2ls
+[ -s "$tree2ls" ] || die "Error retrieving the second tree."
+
+diffdir1="$diffdir/$id1"
+diffdir2="$diffdir/$id2"
+mkdir $diffdir1 $diffdir2
+
+join -e ! -a 1 -a 2 -j 4 -o 0,1.1,1.3,2.1,2.3 $tree1ls $tree2ls | {
+ while read line; do
+ name=$(echo $line | cut -d ' ' -f 1)
+ mode1=$(echo $line | cut -d ' ' -f 2)
+ sha1=$(echo $line | cut -d ' ' -f 3)
+ mode2=$(echo $line | cut -d ' ' -f 4)
+ sha2=$(echo $line | cut -d ' ' -f 5)
+
+ # XXX: The diff format is currently pretty ugly;
+ # ideally, we should print the sha1 and mode at the
+ # +++ and --- lines, but
+
+ if [ "$mode1" != "$mode2" ] || [ "$sha1" != "$sha2" ]; then
+ echo "Index: $name"
+ echo "==================================================================="
+
+ loc1="$diffdir1/$name"
+ loc2="$diffdir2/$name"
+ mkdir -p $(dirname $loc1) $(dirname $loc2)
+
+ label1=$(mkdiff "$loc1" $id1 "$name" $mode1 $sha1)
+ label2=$(mkdiff "$loc2" $id2 "$name" $mode2 $sha2)
+
+ diff -L "$label1" -L "$label2" -u "$loc1" "$loc2"
+ fi
+ done
+}
+
+rm -f "$tree1ls" "$tree2ls"
+rm -rf "$diffdir"
Index: gitlog.sh
===================================================================
--- 6be98a9e92a3f131a3fdf0dc3a8576fba6421569/gitlog.sh
+++ 3f6cc0ad3e076e05281438b0de69a7d6a5522d17/gitlog.sh (mode:100755 sha1:e7a4eed8c0526821d00b08094c73fabb72eff4df)
@@ -0,0 +1,61 @@
+#!/bin/sh
+####
+#### Call this script with an object and it will produce the change
+#### information for all the parents of that object
+####
+#### This script was originally written by Ross Vandegrift.
+# multiple parents test 1d0f4aec21e5b66c441213643426c770dc6dedc0
+# parents: ffa098b2e187b71b86a76d3cd5eb77d074a2503c
+# 6860e0d9197c7f52155466c225baf39b42d62f63
+
+# regex for parent declarations
+PARENTS="^parent [A-z0-9]{40}$"
+
+TMPCL="/tmp/gitlog.$$"
+
+# takes an object and generates the object's parent(s)
+function unpack_parents () {
+ echo "me $1"
+ echo "me $1" >>$TMPCL
+ RENTS=""
+
+ TMPCM=$(mktemp)
+ cat-file commit $1 >$TMPCM
+ while read line; do
+ if echo "$line" | egrep -q "$PARENTS"; then
+ RENTS="$RENTS "$(echo $line | sed 's/parent //g')
+ fi
+ echo $line
+ done <$TMPCM
+ rm $TMPCM
+
+ echo -e "\n--------------------------\n"
+
+ # if the last object had no parents, return
+ if [ ! "$RENTS" ]; then
+ return;
+ fi
+
+ #useful for testing
+ #echo $RENTS
+ #read
+ for i in `echo $RENTS`; do
+ # break cycles
+ if grep -q "me $i" $TMPCL; then
+ echo "Already visited $i" >&2
+ continue
+ else
+ unpack_parents $i
+ fi
+ done
+}
+
+base=$1
+if [ ! "$base" ]; then
+ base=$(cat .dircache/HEAD)
+fi
+
+rm -f $TMPCL
+unpack_parents $base
+rm -f $TMPCL
+
Index: gitls.sh
===================================================================
--- 6be98a9e92a3f131a3fdf0dc3a8576fba6421569/gitls.sh
+++ 3f6cc0ad3e076e05281438b0de69a7d6a5522d17/gitls.sh (mode:100755 sha1:4fe78b764ac0ab3cdb16631bbfdd65edb138e47b)
@@ -0,0 +1,22 @@
+#!/bin/sh
+#
+# List contents of a particular tree in a GIT repository.
+# Copyright (c) Petr Baudis, 2005
+#
+# Optionally takes commit or tree id as a parameter, defaulting to HEAD.
+
+TREE="^tree [A-z0-9]{40}$"
+
+id=$1
+if [ ! "$id" ]; then
+ id=$(cat .dircache/HEAD)
+fi
+if [ $(cat-file -t "$id") = "commit" ]; then
+ id=$(cat-file commit $id | egrep "$TREE" | cut -d ' ' -f 2)
+fi
+if [ ! $(cat-file -t "$id") = "tree" ]; then
+ echo "Invalid ID supplied: $id" >&2
+ exit
+fi
+
+ls-tree "$id"
Index: init-db.c
===================================================================
Index: ls-tree.c
===================================================================
--- 6be98a9e92a3f131a3fdf0dc3a8576fba6421569/ls-tree.c
+++ 3f6cc0ad3e076e05281438b0de69a7d6a5522d17/ls-tree.c (mode:100644 sha1:ed5b82cd7f41c3ea4140fa1ee4b80b786f190151)
@@ -0,0 +1,51 @@
+/*
+ * GIT - The information manager from hell
+ *
+ * Copyright (C) Linus Torvalds, 2005
+ */
+#include "cache.h"
+
+static int list(unsigned char *sha1)
+{
+ void *buffer;
+ unsigned long size;
+ char type[20];
+
+ buffer = read_sha1_file(sha1, type, &size);
+ if (!buffer)
+ usage("unable to read sha1 file");
+ if (strcmp(type, "tree"))
+ usage("expected a 'tree' node");
+ while (size) {
+ int len = strlen(buffer)+1;
+ unsigned char *sha1 = buffer + len;
+ char *path = strchr(buffer, ' ')+1;
+ unsigned int mode;
+
+ if (size < len + 20 || sscanf(buffer, "%o", &mode) != 1)
+ usage("corrupt 'tree' file");
+ buffer = sha1 + 20;
+ size -= len + 20;
+ /* XXX: We just assume the type is "blob" as it should be.
+ * It seems worthless to read each file just to get this
+ * and the file size. -- [email protected] */
+ printf("%03o\t%s\t%s\t%s\n", mode, "blob", sha1_to_hex(sha1), path);
+ }
+ return 0;
+}
+
+int main(int argc, char **argv)
+{
+ unsigned char sha1[20];
+
+ if (argc != 2)
+ usage("ls-tree <key>");
+ if (get_sha1_hex(argv[1], sha1) < 0)
+ usage("ls-tree <key>");
+ sha1_file_directory = getenv(DB_ENVIRONMENT);
+ if (!sha1_file_directory)
+ sha1_file_directory = DEFAULT_DB_ENVIRONMENT;
+ if (list(sha1) < 0)
+ usage("list failed");
+ return 0;
+}
Index: parent-id
===================================================================
--- 6be98a9e92a3f131a3fdf0dc3a8576fba6421569/parent-id
+++ 3f6cc0ad3e076e05281438b0de69a7d6a5522d17/parent-id (mode:100755 sha1:198c551b7367988b48aa7a69876e098d73c19e88)
@@ -0,0 +1,15 @@
+#!/bin/sh
+#
+# Get ID of parent commit to a given revision or HEAD.
+# Copyright (c) Petr Baudis, 2005
+#
+# Takes ID of the current commit, defaults to HEAD.
+
+PARENT="^parent [A-z0-9]{40}$"
+
+id=$1
+if [ ! "$id" ]; then
+ id=$(cat .dircache/HEAD)
+fi
+
+cat-file commit $id | egrep "$PARENT" | cut -d ' ' -f 2
Index: read-cache.c
===================================================================
--- 6be98a9e92a3f131a3fdf0dc3a8576fba6421569/read-cache.c (mode:100664 sha1:e51c9ee84874b5ff0f22b11dcd4fe1f905e72a5e)
+++ 3f6cc0ad3e076e05281438b0de69a7d6a5522d17/read-cache.c (mode:100644 sha1:3dbe6db46933683721ceafdcdd70da521a32269a)
@@ -264,11 +264,12 @@
size = 0; // avoid gcc warning
map = (void *)-1;
if (!fstat(fd, &st)) {
- map = NULL;
size = st.st_size;
errno = EINVAL;
if (size > sizeof(struct cache_header))
map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
+ else
+ return (!hdr->entries) ? 0 : error("inconsistent cache");
}
close(fd);
if (-1 == (int)(long)map)
Index: read-tree.c
===================================================================
Index: show-diff.c
===================================================================
--- 6be98a9e92a3f131a3fdf0dc3a8576fba6421569/show-diff.c (mode:100664 sha1:45f6e3140b3923497fdec808aec0e86ecf358b92)
+++ 3f6cc0ad3e076e05281438b0de69a7d6a5522d17/show-diff.c (mode:100644 sha1:9beda1382103df29914d965fc135def0e6e7e839)
@@ -49,9 +49,17 @@

int main(int argc, char **argv)
{
+ int silent = 0;
int entries = read_cache();
int i;

+ while (argc-- > 1) {
+ if (!strcmp(argv[1], "-s"))
+ silent = 1;
+ else if (!strcmp(argv[1], "-h") || !strcmp(argv[1], "--help"))
+ usage("show-diff [-s]");
+ }
+
if (entries < 0) {
perror("read_cache");
exit(1);
@@ -77,6 +85,9 @@
for (n = 0; n < 20; n++)
printf("%02x", ce->sha1[n]);
printf("\n");
+ if (silent)
+ continue;
+
new = read_sha1_file(ce->sha1, type, &size);
show_differences(ce, &st, new, size);
free(new);
Index: update-cache.c
===================================================================
--- 6be98a9e92a3f131a3fdf0dc3a8576fba6421569/update-cache.c (mode:100664 sha1:9dcee6f628d5accaa5219f72a2e790c082d9dd9a)
+++ 3f6cc0ad3e076e05281438b0de69a7d6a5522d17/update-cache.c (mode:100644 sha1:916430a05a9da088dae1ea82eb8d5392033f548a)
@@ -231,6 +231,9 @@
return -1;
}

+ if (argc < 2)
+ usage("update-cache <file>*");
+
newfd = open(".dircache/index.lock", O_RDWR | O_CREAT | O_EXCL, 0600);
if (newfd < 0) {
perror("unable to create new cachefile");
Index: write-tree.c
===================================================================

2005-04-09 18:45:50

by Marcin Dalecki

[permalink] [raw]
Subject: Re: Kernel SCM saga..


On 2005-04-09, at 17:42, Paul Jackson wrote:

> Marcin wrote:
>> But what will impress you are either the price tag the
>> DB comes with or
>> the hardware it runs on :-)
>
> The payroll for the staffing to care and feed for these
> babies is often impressive as well.

Please don't forget the bill from the electric plant behind it!

2005-04-09 18:56:12

by Ray Lee

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Sat, 2005-04-09 at 19:40 +0200, Roman Zippel wrote:
> On Sat, 9 Apr 2005, Eric D. Mudama wrote:
> > > For example bk does something like this:
> > >
> > > A1 -> A2 -> A3 -> BM
> > > \-> B1 -> B2 --^
> > >
> > > and instead of creating the merge changeset, one could merge them like
> > > this:
> > >
> > > A1 -> A2 -> A3 -> B1 -> B2

> > I believe that flattening the change graph makes history reproduction
> > impossible, or alternately, you are imposing on each developer to test
> > the merge results at B1 + A1..3 before submission, but in doing so,
> > the test time may require additional test periods etc and with
> > sufficient velocity, might never close.
>
> The merge result has to be tested either way, so I'm not exactly sure,
> what you're trying to say.

The kernel changes. A lot. And often.

With that in mind, if (for example) A2 and A3 are simple changes that
are quick to test and B1 is large, or complex, or requires hours (days,
weeks) of testing to validate, then a maintainer's decision can
legitimately be to rebase a tree (say, -mm) upon the B1 line of
development, and toss the A2 branch back to those developers with a
"Sorry it didn't work out, something here causes Unhappiness with B1,
can you track down the problem and try again?"

Ray

2005-04-09 22:12:58

by Florian Weimer

[permalink] [raw]
Subject: Re: Kernel SCM saga..

* David Lang:

>> Databases supporting replication are called high end. You forgot
>> the cats dance around the network this issue involves.
>
> And Postgres (which is Free in all senses of the word) is high end by this
> definition.

I'm not aware of *any* DBMS, commercial or not, which can perform
meaningful multi-master replication on tables which mainly consist of
text files as records. All you can get is single-master replication
(which is well-understood), or some rather scary stuff which involves
throwing away updates, or taking extrema or averages (even automatic
3-way merges aren't available).

2005-04-09 22:58:07

by David Miller

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Fri, 8 Apr 2005 22:45:18 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

> Also, I don't want people editing repository files by hand. Sure, the
> sha1 catches it, but still... I'd rather force the low-level ops to use
> the proper helper routines. Which is why it's a raw zlib compressed blob,
> not a gzipped file.

I understand the arguments for compression, but I hate it for one
simple reason: recovery is more difficult when you corrupt some
file in your repository.

It's happened to me more than once and I did lose data.

Without compression, I might be able to recover if something
causes a block of zeros to be written to the middle of some
repository file. With compression, you pretty much just lose.

2005-04-09 23:15:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel SCM saga..



On Sat, 9 Apr 2005, David S. Miller wrote:
>
> I understand the arguments for compression, but I hate it for one
> simple reason: recovery is more difficult when you corrupt some
> file in your repository.

Trust me, the way git does things, you'll have so much redundancy that
you'll have to really _work_ at losing data.

That's the good news.

The bad news is that this is obviously why it does eat a lot of disk.
Since it saves full-file commits, you're going to have a lot of
(compressed) full files around.

Linus

2005-04-09 23:34:02

by Tupshin Harper

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Roman Zippel wrote:

>It seems you exported the complete parent information and this is exactly
>the "nitty-gritty" I was "whining" about and which is not available via
>bkcvs or bkweb and it's the most crucial information to make the bk data
>useful outside of bk. Larry was previously very clear about this that he
>considers this proprietary bk meta data and anyone attempting to export
>this information is in violation with the free bk licence, so you indeed
>just took the important parts and this is/was explicitly verboten for
>normal bk users.
>
>
Yes, this is exactly the information that would be necessary to create a
general interop tool between bk and darcs|arch|monotone, and is the
fundamental objection I and others have had to open source projects
using BK. Is Bitmover willing to grant a special dispensation to allow a
lossless conversion of the linux history to another format?

-Tupshin

2005-04-10 00:15:07

by Chris Wedgwood

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Sat, Apr 09, 2005 at 04:13:51PM -0700, Linus Torvalds wrote:

> > I understand the arguments for compression, but I hate it for one
> > simple reason: recovery is more difficult when you corrupt some
> > file in your repository.

I've had this too. Magic binary blobs are horrible here for data loss,
which is why I'm not keen on subversion.

> Trust me, the way git does things, you'll have so much redundancy
> that you'll have to really _work_ at losing data.

It's not clear to me that compression should be *required* though.
Shouldn't we be able to turn this off in some cases?

> The bad news is that this is obviously why it does eat a lot of
> disk.

Disk is cheap, but sadly page-cache is not :-(

> Since it saves full-file commits, you're going to have a lot of
> (compressed) full files around.

How many is a lot? Are we talking 100k, 1m, 10m?

2005-04-10 00:23:36

by Paul Jackson

[permalink] [raw]
Subject: Re: Kernel SCM saga..

David wrote:
> recovery is more difficult when you corrupt some
> file in your repository.

Agreed. I too have recovered RCS and SCCS files by hand editing.


Linus wrote:
> I don't want people editing repository files by hand.

Tyrant !;)

From Wikipedia:

A tyrant is a usurper of rightful power.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 01:01:21

by Phillip Lougher

[permalink] [raw]
Subject: Re: Re: Kernel SCM saga..

On Apr 9, 2005 3:53 AM, Petr Baudis <[email protected]> wrote:

> FWIW, I made a few small fixes (to prevent some trivial usage errors from
> causing cache corruption) and added scripts gitcommit.sh, gitadd.sh and
> gitlog.sh - heavily inspired by what already went through the mailing
> list. Everything is available at http://pasky.or.cz/~pasky/dev/git/
> (including .dircache, even though it isn't shown in the index), the
> cumulative patch can be found below. The scripts aim to provide some
> (obviously very interim) more high-level interface for git.

I did a bit of playing about with the changelog generate script,
trying to produce a faster version. The attached version uses a
couple of improvements to be a lot faster (e.g. no recursion in the
common case of one parent).

FWIW it is 7x faster than makechlog.sh (4.342 secs vs 34.129 secs) and
28x faster than gitlog.sh (4.342 secs vs 2 mins 4 secs) on my
hardware. Your mileage may of course vary.

Regards

Phillip

--------------------------------------
#!/bin/sh

changelog() {
	local parents new_parent
	declare -a new_parent

	new_parent[0]=$1
	parents=1

	while [ $parents -gt 0 ]; do
		parent=${new_parent[$((parents-1))]}
		echo $parent >> $TMP
		cat-file commit $parent > $TMP_FILE

		echo me $parent
		cat $TMP_FILE
		echo -e "\n--------------------------\n"

		parents=0
		while read type text; do
			if [ $type = 'committer' ]; then
				break;
			elif [ $type = 'parent' ] &&
			     ! grep -q $text $TMP ; then
				new_parent[$parents]=$text
				parents=$((parents+1))
			fi
		done < $TMP_FILE

		i=0
		while [ $i -lt $((parents-1)) ]; do
			changelog ${new_parent[$i]}
			i=$((i+1))
		done
	done
}

TMP=`mktemp`
TMP_FILE=`mktemp`

base=$1
if [ ! "$base" ]; then
	base=$(cat .dircache/HEAD)
fi
changelog $base
rm -rf $TMP $TMP_FILE

2005-04-10 01:43:05

by Petr Baudis

[permalink] [raw]
Subject: Re: Re: Re: Kernel SCM saga..

Dear diary, on Sun, Apr 10, 2005 at 03:01:12AM CEST, I got a letter
where Phillip Lougher <[email protected]> told me that...
> On Apr 9, 2005 3:53 AM, Petr Baudis <[email protected]> wrote:
>
> > FWIW, I made a few small fixes (to prevent some trivial usage errors from
> > causing cache corruption) and added scripts gitcommit.sh, gitadd.sh and
> > gitlog.sh - heavily inspired by what already went through the mailing
> > list. Everything is available at http://pasky.or.cz/~pasky/dev/git/
> > (including .dircache, even though it isn't shown in the index), the
> > cumulative patch can be found below. The scripts aim to provide some
> > (obviously very interim) more high-level interface for git.
>
> I did a bit of playing about with the changelog generate script,
> trying to produce a faster version. The attached version uses a
> couple of improvements to be a lot faster (e.g. no recursion in the
> common case of one parent).
>
> FWIW it is 7x faster than makechlog.sh (4.342 secs vs 34.129 secs) and
> 28x faster than gitlog.sh (4.342 secs vs 2 mins 4 secs) on my
> hardware. You mileage may of course vary.

Wow, really impressive! Great work, I've merged it (if you don't object,
of course).

Wondering why I wasn't in the Cc list, BTW.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-10 01:57:59

by Phillip Lougher

[permalink] [raw]
Subject: Re: Re: Re: Kernel SCM saga..

On Apr 10, 2005 2:42 AM, Petr Baudis <[email protected]> wrote:
> Dear diary, on Sun, Apr 10, 2005 at 03:01:12AM CEST, I got a letter
> where Phillip Lougher <[email protected]> told me that...
> > On Apr 9, 2005 3:53 AM, Petr Baudis <[email protected]> wrote:
> >
> > > FWIW, I made a few small fixes (to prevent some trivial usage errors from
> > > causing cache corruption) and added scripts gitcommit.sh, gitadd.sh and
> > > gitlog.sh - heavily inspired by what already went through the mailing
> > > list. Everything is available at http://pasky.or.cz/~pasky/dev/git/
> > > (including .dircache, even though it isn't shown in the index), the
> > > cumulative patch can be found below. The scripts aim to provide some
> > > (obviously very interim) more high-level interface for git.
> >
> > I did a bit of playing about with the changelog generate script,
> > trying to produce a faster version. The attached version uses a
> > couple of improvements to be a lot faster (e.g. no recursion in the
> > common case of one parent).
> >
> > FWIW it is 7x faster than makechlog.sh (4.342 secs vs 34.129 secs) and
> > 28x faster than gitlog.sh (4.342 secs vs 2 mins 4 secs) on my
> > hardware. You mileage may of course vary.
>
> Wow, really impressive! Great work, I've merged it (if you don't object,
> of course).

Of course I don't object...

>
> Wondering why I wasn't in the Cc list, BTW.

Weird, it wasn't intentional. I read LKML in Gmail (which I don't use
for much else), and just clicked "reply", expecting to do the right
thing. Replying to this email has also left you off the CC list.
Looking at the email source I believe it's probably to do with the
following:

Mail-Followup-To: Linus Torvalds <[email protected]>,
[email protected],
Kernel Mailing List <[email protected]>>

I've CC'd you explicitly on this.

Phillip

2005-04-10 01:59:31

by Paul Jackson

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Chris wrote:
> How many is a lot? Are we talking 100k, 1m, 10m?

I pulled some numbers out of my bk tree for Linux.

I have 16817 source files.

They average 12.2 bitkeeper changes per file (counting the number of
changes visible from doing 'bk sccslog' on each of the 16817 files).

These 16817 files consume:

224 MBytes uncompressed and
95 MBytes compressed

(using zlib's minigzip, on a 4 KB page reiserfs.)

Since each change will get its own copy of the file, multiplying these
two sizes (224 and 95) by 12.2 changes per file means the disk cost
would be:

2.73 GByte uncompressed, or
1.16 GBytes compressed.

I was pleasantly surprised at the degree of compression, shrinking files
to 42% of their original size. Since the classic rule of thumb of
archiving before compressing wasn't being followed (nor should it be),
and we were compressing lots of little files, I expected we would save
fewer disk blocks than this.

Of course, since as Linus reminds us, it's disk buffers in memory,
not blocks on disk, that are precious, it's more like we will save
224 - 95 == 129 MBytes of RAM to hold one entire tree.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 03:41:56

by Paul Jackson

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus wrote:
> Almost everything
> else keeps the <sha1> in the ASCII hexadecimal representation, and I
> should have done that here too. Why? Not because it's a <sha1> - hey, the
> binary representation is certainly denser and equivalent

Since the size of <compressed> ASCII sha1's is only about 18% larger
than the size of the same number of binary sha1's <compressed or not>, I
don't see you gain much from the binary.

I cast my non-existent vote for making the sha1 ascii - while you still can ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 04:37:23

by Albert Cahalan

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Linus Torvalds writes:

> NOTE! I detest the centralized SCM model, but if push comes to shove,
> and we just _can't_ get a reasonable parallell merge thing going in
> the short timeframe (ie month or two), I'll use something like SVN
> on a trusted site with just a few committers, and at least try to
> distribute the merging out over a few people rather than making _me_
> be the throttle.
>
> The reason I don't really want to do that is once we start doing
> it that way, I suspect we'll have a _really_ hard time stopping.
> I think it's a broken model. So I'd much rather try to have some
> pain in the short run and get a better model running, but I just
> wanted to let people know that I'm pragmatic enough that I realize
> that we may not have much choice.

I think you at least instinctively know this, but...

Centralized SCM means you have to grant and revoke commit access,
which means that Linux gets the disease of ugly BSD politics.

Under both the old pre-BitKeeper patch system and under BitKeeper,
developer rank is fuzzy. Everyone knows that some developers are
more central than others, but it isn't fully public and well-defined.
You can change things day by day without having to demote anyone.
While Linux development isn't completely without jealousy and pride,
few have stormed off (mostly IDE developers AFAIK) and none have
forked things as severely as OpenBSD and DragonflyBSD.

You may rank developer X higher than developer Y, but they have
only a guess as to how things are. Perhaps developer X would be
a prideful jerk if he knew. Perhaps developer Y would quit in
resentment if he knew.

Whatever you do, please avoid the BSD-style politics.

(the MAINTAINERS file is bad enough; it has caused problems)


2005-04-10 08:48:31

by David Lang

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Sat, 9 Apr 2005, Linus Torvalds wrote:

>
> The biggest irritation I have with the "tree" format I chose is actually
> not the name (which is trivial), it's the <sha1> part. Almost everything
> else keeps the <sha1> in the ASCII hexadecimal representation, and I
> should have done that here too. Why? Not because it's a <sha1> - hey, the
> binary representation is certainly denser and equivalent - but because an
> ASCII representation there would have allowed me to much more easily
> change the key format if I ever wanted to. Now it's very SHA1-specific.
>
> Which I guess is fine - I don't really see any reason to change, and if I
> do change, I could always just re-generate the whole tree. But I think it
> would have been cleaner to have _that_ part in ASCII.
>

just wanted to point out that recent news shows that sha1 isn't as good as
it was thought to be (it's far easier to deliberately create collisions
than it should be)

this hasn't reached the point where you HAVE to quit using it (especially
since you have the other validity checks in place), but it's a good reason
to expect that you may want to change to something else in a few years.

it's a lot easier to change things now to make that move easier than once
this is being used extensively

David Lang

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2005-04-10 09:24:38

by Giuseppe Bilotta

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Sat, 9 Apr 2005 12:17:58 -0400, David Roundy wrote:

> I've recently made some improvements
> recently which will reduce the memory use

Does this include check for redundancy? ;)

--
Giuseppe "Oblomov" Bilotta

Hic manebimus optime

2005-04-10 09:40:15

by Junio C Hamano

[permalink] [raw]
Subject: Re: Kernel SCM saga..

>>>>> "DL" == David Lang <[email protected]> writes:

DL> just wanted to point out that recent news shows that sha1 isn't as
DL> good as it was thought to be (far easier to deliberately create
DL> collisions than it should be)

I suspect there is no need to do so...

Message-ID: <[email protected]>
From: Linus Torvalds <[email protected]>
Subject: Re: Kernel SCM saga..
Date: Sat, 9 Apr 2005 09:16:22 -0700 (PDT)

...

Linus

(*) yeah, yeah, I know about the current theoretical case, and I don't
care. Not only is it theoretical, the way my objects are packed you'd have
to not just generate the same SHA1 for it, it would have to _also_ still
be a valid zlib object _and_ get the header to match the "type + length"
of object part. IOW, the object validity checks are actually even stricter
than just "sha1 matches".
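
A sketch of the layered check described above, under the assumed object
layout of a zlib-deflated blob whose plaintext starts with a
"<type> <length>" header followed by a NUL (illustrative code, not git's):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Returns 0 if buf plausibly holds a valid object, -1 otherwise. */
static int check_object(const unsigned char *buf, unsigned long len)
{
	static unsigned char out[1 << 20];	/* generous, for a sketch */
	uLongf outlen = sizeof(out);
	char type[16];
	unsigned long size;
	const unsigned char *nul;

	/* 1. the bytes must inflate as a valid zlib stream */
	if (uncompress(out, &outlen, buf, len) != Z_OK)
		return -1;
	/* 2. the plaintext must start with a "<type> <length>\0" header */
	nul = memchr(out, '\0', outlen);
	if (!nul || sscanf((char *)out, "%15s %lu", type, &size) != 2)
		return -1;
	/* 3. the declared length must match the inflated payload */
	if (size != outlen - (unsigned long)(nul - out + 1))
		return -1;
	return 0;	/* only now does "sha1 matches" mean anything */
}

int main(void)
{
	/* build a tiny well-formed object: header, NUL, 6-byte body */
	static const char object[] = "blob 6\0hello\n";
	unsigned char zbuf[128];
	uLongf zlen = sizeof(zbuf);

	compress(zbuf, &zlen, (const Bytef *)object, sizeof(object) - 1);
	printf("valid: %s\n", check_object(zbuf, zlen) == 0 ? "yes" : "no");
	return 0;
}

A colliding SHA1 alone gets an attacker past none of the other layers.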

2005-04-10 11:34:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: Kernel SCM saga..


* David S. Miller <[email protected]> wrote:

> On Fri, 8 Apr 2005 22:45:18 -0700 (PDT)
> Linus Torvalds <[email protected]> wrote:
>
> > Also, I don't want people editing repostitory files by hand. Sure, the
> > sha1 catches it, but still... I'd rather force the low-level ops to use
> > the proper helper routines. Which is why it's a raw zlib compressed blob,
> > not a gzipped file.
>
> I understand the arguments for compression, but I hate it for one
> simple reason: recovery is more difficult when you corrupt some
> file in your repository.
>
> It's happened to me more than once and I did lose data.
>
> Without compression, I might be able to recover if something
> causes a block of zeros to be written to the middle of some
> repository file. With compression, you pretty much just lose.

that depends on how you compress. You are perfectly right that with
default zlib compression, where you start the compression stream and
stop it at the end of the file, recovery in case of damage is very hard
for the portion that comes _after_ the damaged section. You'd have to
reconstruct the compression state, which is akin to breaking a key.

But with zlib you can 'flush' the compression state every couple of
blocks and basically get the same recovery properties, at some very
minimal extra space cost (because when you flush out compression state
you get some extra padding bytes).

Flushing has another advantage as well: a small delta (even if it
increases/decreases the file size!) in the middle of a larger file will
still be compressed to the same output both before and after the change
area (modulo flush block size), which rsync can pick up just fine. (IIRC
that is one of the reasons why Debian, when compressing .deb's, does
zlib-flushes every couple of blocks, so that rsync/apt-get can pick up
partial .deb's as well.)

The zlib option is, I think, Z_PARTIAL_FLUSH; I'm using it in Tux to do
chunks of compression. The flushing cost is max 12 bytes or so, so if
it's done every 4K we cap the cost at 0.2%.

so flushing is both rsync-friendly and recovery-friendly.

(recovery isn't as simple as with plaintext, as you have to find the next
'block' and the block length will inevitably be variable. But it should
be pretty predictable, and tools might even exist.)
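
A minimal sketch of the scheme Ingo describes, using zlib's real
deflate()/Z_PARTIAL_FLUSH interface (the 4K chunk size and the plain
stdin-to-stdout framing are my assumptions; error handling trimmed):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Deflate stdin to stdout, flushing the compressor at every 4K
 * boundary so each chunk edge is a resynchronization point. */
int main(void)
{
	z_stream s;
	unsigned char ibuf[4096], obuf[8192];
	size_t n;

	memset(&s, 0, sizeof(s));
	if (deflateInit(&s, Z_DEFAULT_COMPRESSION) != Z_OK)
		return 1;

	while ((n = fread(ibuf, 1, sizeof(ibuf), stdin)) > 0) {
		s.next_in = ibuf;
		s.avail_in = n;
		do {
			s.next_out = obuf;
			s.avail_out = sizeof(obuf);
			/* the flush costs a few padding bytes, and buys
			 * rsync-friendliness plus per-chunk recovery */
			deflate(&s, Z_PARTIAL_FLUSH);
			fwrite(obuf, 1, sizeof(obuf) - s.avail_out, stdout);
		} while (s.avail_in > 0);
	}

	/* terminate the stream */
	s.next_out = obuf;
	s.avail_out = sizeof(obuf);
	deflate(&s, Z_FINISH);
	fwrite(obuf, 1, sizeof(obuf) - s.avail_out, stdout);
	return deflateEnd(&s) == Z_OK ? 0 : 1;
}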

Ingo

2005-04-10 12:04:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: Kernel SCM saga..


* Paul Jackson <[email protected]> wrote:

> These 16817 files consume:
>
> 224 MBytes uncompressed and
> 95 MBytes compressed
>
> (using zlib's minigzip, on a 4 KB page reiserfs.)

that's a 42.4% compressed size. Using a (much) more CPU-intense
compression method (bzip -9), the compressed size is down to 45 MBytes.
(a ratio of 20.2%)

using default 'gzip' i get 57 MB compressed.

> Since each change will get its own copy of the file, multiplying these
> two sizes (224 and 95) by 12.2 changes per file means the disk cost
> would be:
>
> 2.73 GByte uncompressed, or
> 1.16 GBytes compressed.

with bzip2 -9 it would be 551 MBytes. That might well be practical on
faster CPUs: a full tree (224 MBytes, 45 MBytes compressed) decompresses
in 24 seconds on a 3.4GHz P4 - single CPU. (And with dual core likely
becoming the standard, we might as well divide that by two.) With default
gzip it's 3.3 seconds though, and that still compresses it down to 57
MB.

Ingo

2005-04-10 13:56:34

by David Roundy

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Sun, Apr 10, 2005 at 11:24:07AM +0200, Giuseppe Bilotta wrote:
> On Sat, 9 Apr 2005 12:17:58 -0400, David Roundy wrote:
>
> > I've recently made some improvements recently which will reduce the
> > memory use
>
> Does this include check for redundancy? ;)

Yeah, the only catch is that if the redundancy checks fail, we now may
leave the repository in an inconsistent, but repairable, state. (Only a
cache of the pristine tree is affected.) The recent improvements mostly
came by increasing the laziness of a few operations, which meant we don't
need to store the entire parsed tree (or parsed patch) in memory for
certain operations.
--
David Roundy
http://www.darcs.net

2005-04-10 16:59:09

by Bill Davidsen

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Sun, 10 Apr 2005, Junio C Hamano wrote:

> >>>>> "DL" == David Lang <[email protected]> writes:
>
> DL> just wanted to point out that recent news shows that sha1 isn't as
> DL> good as it was thought to be (far easier to deliberately create
> DL> collisions than it should be)
>
> I suspect there is no need to do so...

It's possible to generate another object with the same hash, but:
- you can't just take your desired object and do magic to make it hash
right
- it almost certainly won't have the same length
- it's still non-trivial in terms of the computation needed

>
> Message-ID: <[email protected]>
> From: Linus Torvalds <[email protected]>
> Subject: Re: Kernel SCM saga..
> Date: Sat, 9 Apr 2005 09:16:22 -0700 (PDT)
>
> ...
>
> Linus
>
> (*) yeah, yeah, I know about the current theoretical case, and I don't
> care. Not only is it theoretical, the way my objects are packed you'd have
> to not just generate the same SHA1 for it, it would have to _also_ still
> be a valid zlib object _and_ get the header to match the "type + length"
> of object part. IOW, the object validity checks are actually even stricter
> than just "sha1 matches".
>

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2005-04-10 17:24:50

by Paul Komkoff

[permalink] [raw]
Subject: Code snippet to reconstruct ancestry graph from bk repo

Replying to Roman Zippel:
> the "nitty-gritty" I was "whining" about and which is not available via
> bkcvs or bkweb and it's the most crucial information to make the bk data
> useful outside of bk. Larry was previously very clear about this that he
> considers this proprietary bk meta data and anyone attempting to export
> this information is in violation with the free bk licence, so you indeed
> just took the important parts and this is/was explicitly verboten for
> normal bk users.

(borrowed from Tommi Virtanen)

Code snippet to reconstruct ancestry graph from bk repo:
bk changes -end':I: $if(:PARENT:){:PARENT:$if(:MPARENT:){ :MPARENT:}} $unless(:PARENT:){-}' |tac

format is:
newrev parent1 [parent2]
parent2 present if merge occurs.
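
For illustration, a small hypothetical C filter for the output above - it
only assumes the stated format, where a root revision carries "-" in place
of a parent (per the $unless(:PARENT:){-} clause):

#include <stdio.h>
#include <string.h>

/* Read "newrev parent1 [parent2]" lines and describe each edge;
 * a real converter would build a DAG here instead of printing. */
int main(void)
{
	char line[512], rev[128], p1[128], p2[128];

	while (fgets(line, sizeof(line), stdin)) {
		int n = sscanf(line, "%127s %127s %127s", rev, p1, p2);

		if (n < 2)
			continue;		/* malformed line */
		if (!strcmp(p1, "-"))
			printf("root:  %s\n", rev);
		else if (n == 3)
			printf("merge: %s <- %s + %s\n", rev, p1, p2);
		else
			printf("cset:  %s <- %s\n", rev, p1);
	}
	return 0;
}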

--
Paul P 'Stingray' Komkoff Jr // http://stingr.net/key <- my pgp key
This message represents the official view of the voices in my head

2005-04-10 17:41:39

by Paul Jackson

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Ingo wrote:
> With default gzip it's 3.3 seconds though,
> and that still compresses it down to 57 MB.

Interesting. I'm surprised how much a bunch of separate, modest sized
files can be compressed.

I'm unclear what matters most here.

Space on disk certainly isn't much of an issue. Even with Andrew Morton
on our side, we still can't grow the kernel as fast as the disk drive
manufacturers can grow disk sizes.

Main memory size of the compressed history matters to Linus and his top
20 lieutenants doing full kernel source patching as a primary mission if
they can't fit the source _history_ in main memory. But those people
are running 1 GByte or more of RAM - so whether it is 95, 57 or 45
MBytes, it fits fine. The rest of us are mostly concerned with whether
a kernel build fits in memory.

Looking at an arch i386 kernel build tree I have at hand, I see the
following disk usage:

102 MBytes - BitKeeper/*
287 MBytes - */SCCS/* (outside of already counted BitKeeper/*)
232 MBytes - checked out source files
94 MBytes - ELF and other build byproducts
---
715 MBytes - Total

Converting from bk to git, I guess this becomes:

97 MBytes - git (zlib)
232 MBytes - checked out source files
94 MBytes - ELF and other build byproducts
---
423 MBytes - Total

Size matters when it's a two-to-one difference, but when we are down to a
10% to 15% difference in the total, it's presentation that matters. The
above numbers tell me that this is not a pure size issue for local disk
or memory usage.

What does matter that I can see:

1) Linus explicitly stated he wanted "a raw zlib compressed blob,
not a gzipped file", to encourage everyone to use the git tools to
access this data. He did not "want people editing repository files
by hand." I'm not sure what he gains here - it did annoy me for a
couple hours before I decided fixing my supper was more important.

2) The time to compress will be noticed by users as a delay when
checking in changes (I'm guessing zlib compresses relatively faster).

3) The time to copy compressed data over the internet will be
noticed by users when upgrading kernel versions (gzip can
compress smaller).

4) Decompress times are smaller so don't matter as much.

5) Zlib has a nice library, and is patent free. I don't know about gzip.

6) As you note, zlib has rsync-friendly, recovery-friendly Z_PARTIAL_FLUSH.
I don't know about gzip.

My guess is that Linus finds (2) and (3) to balance each other, and that
(1) decides the point, in favor of zlib. Well, that or a simpler
hypothesis, that he found the nice library (5) convenient, and (1)
sealed the deal, with the other tradeoffs passing through his
subconscious faster than he bothered to verbalize them.

You (Ingo) seem in your second message to be encouraging further
consideration of gzip, for its improved compression.

How will that matter to us, day to day?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 17:46:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: Kernel SCM saga..


* Paul Jackson <[email protected]> wrote:

> Ingo wrote:
> > With default gzip it's 3.3 seconds though,
> > and that still compresses it down to 57 MB.
>
> Interesting. I'm surprised how much a bunch of separate, modest sized
> files can be compressed.

Sorry, what I measured was in essence the tarball, i.e. not the
compression of every file separately. I should have been clear about
that ...

Ingo

2005-04-10 17:51:42

by Paul Jackson

[permalink] [raw]
Subject: Re: Kernel SCM saga..

> It's possible to generate another object with the same hash, but:

Yeah - the real check is that the modified object has to
compile and do something useful for someone (the cracker
if no one else).

Just getting a random bucket of bits substituted for a
real kernel source file isn't going to get me into the
cracker hall of fame, only into their odd-news of the
day.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 17:55:50

by Matthias Andree

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Andrea Arcangeli wrote on 2005-04-09:

> On Fri, Apr 08, 2005 at 05:12:49PM -0700, Linus Torvalds wrote:
> > really designed for something like a offline http grabber, in that you can
> > just grab files purely by filename (and verify that you got them right by
> > running sha1sum on the resulting local copy). So think "wget".
>
> I'm not entirely convinced wget is going to be an efficient way to
> synchronize and fetch your tree, its simplicity is great though. It's a

wget is probably a VERY UNWISE choice:

<http://www.derkeiler.com/Mailing-Lists/securityfocus/bugtraq/2004-12/0106.html>

--
Matthias Andree

2005-04-10 17:57:43

by Paul Jackson

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Ingo wrote:
> not the compression of every file separately.

ok

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 18:22:43

by Roman Zippel

[permalink] [raw]
Subject: Re: Code snippet to reconstruct ancestry graph from bk repo

Hi,

On Sun, 10 Apr 2005, Paul P Komkoff Jr wrote:

> (borrowed from Tommi Virtanen)
>
> Code snippet to reconstruct ancestry graph from bk repo:
> bk changes -end':I: $if(:PARENT:){:PARENT:$if(:MPARENT:){ :MPARENT:}} $unless(:PARENT:){-}' |tac
>
> format is:
> newrev parent1 [parent2]
> parent2 present if merge occurs.

I know that this is possible and Larry's response would have been
something like this:
http://www.ussg.iu.edu/hypermail/linux/kernel/0502.1/0248.html

bye, Roman

2005-04-10 22:33:37

by Troy Benjegerdes

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Thu, Apr 07, 2005 at 02:29:24PM -0400, Daniel Phillips wrote:
> On Thursday 07 April 2005 14:13, Dmitry Yusupov wrote:
> > On Thu, 2005-04-07 at 13:54 -0400, Daniel Phillips wrote:
> > > Three years ago, there was no fully working open source distributed scm
> > > code base to use as a starting point, so extending BK would have been the
> > > only easy alternative. But since then the situation has changed. There
> > > are now several working code bases to provide a good starting point:
> > > Monotone, Arch, SVK, Bazaar-ng and others.
> >
> > Right. For example, SVK is pretty mature project and very close to 1.0
> > release now. And it supports all kind of merges including Cherry-Picking
> > Mergeback:
> >
> > http://svk.elixus.org/?MergeFeatures
>
> So for an interim way to get the patch flow back online, SVK is ready to try
> _now_, and we only need a way to import the version graph? (true/false)

Well, I followed some of the instructions to mirror the kernel tree on
svn.clkao.org/linux/cvs, and although it took around 12 hours to import
28232 versions, I seem to have a mirror of it on my own subversion
server now. I think the svn.clkao.org mirror was taken from bkcvs... the
last log message I see is "Rev 28232 - torvalds - 2005-04-04 09:08:33"

I have no idea what's missing. What is everyone's favorite web frontend
to subversion? I've got websvn (debian package) on there now, and it's a
bit sluggish, but it seems to work.

I hope to have time this week or next to actually make this machine
publicly accessible.

2005-04-11 00:00:53

by Christian Parpart

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Monday 11 April 2005 12:33 am, you wrote:
[......]
> Well, I followed some of the instructions to mirror the kernel tree on
> svn.clkao.org/linux/cvs, and although it took around 12 hours to import
> 28232 versions, I seem to have a mirror of it on my own subversion
> server now. I think the svn.clkao.org mirror was taken from bkcvs... the
> last log message I see is "Rev 28232 - torvalds - 2005-04-04 09:08:33"

I'd love to see svk as a real choice for you guys, but I don't mind as long
as I get a door open using svn/svk ;);)

> I have no idea what's missing. What is everyone's favorite web frontend
> to subversion?

Check out ViewCVS at: http://viewcvs.sourceforge.net/
This seems widely used (not just by me ^o^).

Regards,
Christian Parpart.

--
Netiquette: http://www.ietf.org/rfc/rfc1855.txt
01:55:08 up 18 days, 15:01, 2 users, load average: 0.27, 0.39, 0.36



2005-04-11 02:26:54

by Miles Bader

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Marcin Dalecki <[email protected]> writes:
> Better don't waste your time looking at Arch. Stick with patches
> you maintain by hand combined with some scripts containing a list of
> apply commands and you should still be more productive than when using
> Arch.

Arch has its problems, but please lay off the uninformed flamebait (the
"issues" you complain about are so utterly minor as to be laughable).

-Miles
--
I am a virus. Join in and copy me into your .signature.

2005-04-11 02:56:56

by Marcin Dalecki

[permalink] [raw]
Subject: Re: Kernel SCM saga..


On 2005-04-11, at 04:26, Miles Bader wrote:

> Marcin Dalecki <[email protected]> writes:
>> Better don't waste your time looking at Arch. Stick with patches
>> you maintain by hand combined with some scripts containing a list of
>> apply commands and you should still be more productive than when using
>> Arch.
>
> Arch has its problems, but please lay off the uninformed flamebait (the
> "issues" you complain about are so utterly minor as to be laughable).

I wish you a lot of laughter after replying to an already three-day-old
message, which was my final word on Arch.

2005-04-11 06:36:28

by Jan Hudec

[permalink] [raw]
Subject: Re: Kernel SCM saga..

On Mon, Apr 11, 2005 at 04:56:06 +0200, Marcin Dalecki wrote:
>
> On 2005-04-11, at 04:26, Miles Bader wrote:
>
> >Marcin Dalecki <[email protected]> writes:
> >>Better not to waste your time looking at Arch. Stick with patches
> >>you maintain by hand, combined with some scripts containing a list of
> >>apply commands, and you should still be more productive than when using
> >>Arch.
> >
> >Arch has its problems, but please lay off the uninformed flamebait (the
> >"issues" you complain about are so utterly minor as to be laughable).
>
> I wish you a lot of laughter after replying to an already three-day-old
> message, which was my final word on Arch.

Marcin Dalecki <[email protected]> complained:
> Arch isn't a sound example of software design. Quite contrary to the
> random notes posted by it's author the following issues did strike me
> the time I did evaluate it:
> [...]

I didn't comment on this the first time, but I see I should have. *NONE* of
the issues you complained about were issues of *DESIGN*. They were all
issues of *ENGINEERING*. *ENGINEERING* issues can be fixed. One of the
issues does not even exist any longer (the diff/patch one -- it now
checks that they are the right ones -- and in all other respects it is
*exactly* the same as depending on a library).

But what really matters here is the concept. Arch has a simple concept
that works well. Others have different concepts that work well, or
almost as well, too (Darcs, Monotone).

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>


2005-04-12 07:21:26

by Kedar Sovani

[permalink] [raw]
Subject: Re: Kernel SCM saga.. (bk license?)

I was wondering if working on git is, in any way, in violation of the
BitKeeper license, which states that you cannot work on any other SCM
(SCM-like?) tool for "x" amount of time after using BitKeeper?


Kedar.

On Apr 8, 2005 10:12 AM, Linus Torvalds <[email protected]> wrote:
>
>
> On Thu, 7 Apr 2005, Chris Wedgwood wrote:
> >
> > I'm playing with monotone right now. Superficially it looks like it
> > has tons of gee-whiz neato stuff... however, it's *agonizingly* slow.
> > I mean glacial. A heavily sedated sloth with no legs is probably
> > faster.
>
> Yes. The silly thing is, at least in my local tests it doesn't actually
> seem to be _doing_ anything while it's slow (there are no system calls
> except for a few memory allocations and de-allocations). It seems to have
> some exponential function on the number of pathnames involved etc.
>
> I'm hoping they can fix it, though. The basic notions do not sound wrong.
>
> In the meantime (and because monotone really _is_ that slow), here's a
> quick challenge for you, and any crazy hacker out there: if you want to
> play with something _really_ nasty (but also very _very_ fast), take a
> look at kernel.org:/pub/linux/kernel/people/torvalds/.
>
> First one to send me the changelog tree of sparse-git (and a tool to
> commit and push/pull further changes) gets a gold star, and an honorable
> mention. I've put a hell of a lot of clues in there (*).
>
> I've worked on it (and little else) for the last two days. Time for
> somebody else to tell me I'm crazy.
>
> Linus
>
> (*) It should be easier than it sounds. The database is designed so that
> you can do the equivalent of a nonmerging (ie pure superset) push/pull
> with just plain rsync, so replication really should be that easy (if
> somewhat bandwidth-intensive due to the whole-file format).
>
> Never mind merging. It's not an SCM, it's a distribution and archival
> mechanism. I bet you could make a reasonable SCM on top of it, though.
> Another way of looking at it is to say that it's really a content-
> addressable filesystem, used to track directory trees.
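
The content-addressable layout that footnote describes is simple enough
to sketch. Below is a minimal illustration in Python, not the actual
sparse-git code: each object is stored under the hex SHA-1 of its
contents, so identical content collapses to a single file, and rsync'ing
one object directory over another can only ever add objects (the "pure
superset" pull mentioned above). The names put/get and the flat
"objects" directory are my own simplifications.

    import hashlib
    import os
    import zlib

    OBJECTS_DIR = "objects"

    def put(data: bytes) -> str:
        """Store data under its SHA-1 name and return that name."""
        name = hashlib.sha1(data).hexdigest()
        path = os.path.join(OBJECTS_DIR, name)
        if not os.path.exists(path):
            os.makedirs(OBJECTS_DIR, exist_ok=True)
            with open(path, "wb") as f:
                f.write(zlib.compress(data))
        return name

    def get(name: str) -> bytes:
        """Fetch an object back by its SHA-1 name."""
        with open(os.path.join(OBJECTS_DIR, name), "rb") as f:
            return zlib.decompress(f.read())

Whether the hash is computed over the raw or the compressed bytes is a
detail this sketch glosses over; the property that matters is that an
object's name is derived from its content.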

2005-04-12 09:36:26

by Catalin Marinas

[permalink] [raw]
Subject: Re: Kernel SCM saga.. (bk license?)

Kedar Sovani <[email protected]> wrote:
> I was wondering if working on git is, in any way, in violation of the
> BitKeeper license, which states that you cannot work on any other SCM
> (SCM-like?) tool for "x" amount of time after using BitKeeper?

That's valid only for the new BK license, which probably wasn't
accepted by Linus.

--
Catalin

2005-04-12 23:26:31

by Pavel Machek

[permalink] [raw]
Subject: Re: Kernel SCM saga..

Hi!

> > It's possible to generate another object with the same hash, but:
>
> Yeah - the real check is that the modified object has to
> compile and do something useful for someone (the cracker
> if no one else).
>
> Just getting a random bucket of bits substituted for a
> real kernel source file isn't going to get me into the
> cracker hall of fame, only into their odd-news of the
> day.

I actually have two different files with the same md5 sum in my local CVS
repository. It would be very wrong if CVS did not do the right thing
with those files.

Yes, I was playing with md5; see "md5 to be considered harmful
today". And I wanted old versions of my "exploits" to be archived.

Pavel
--
Boycott Kodak -- for their patent abuse against Java.

2005-04-13 04:14:09

by Ricky Beam

[permalink] [raw]
Subject: Re: Kernel SCM saga.. (bk license?)

On Tue, 12 Apr 2005, Kedar Sovani wrote:
>I was wondering if working on git is, in any way, in violation of the
>BitKeeper license, which states that you cannot work on any other SCM
>(SCM-like?) tool for "x" amount of time after using BitKeeper?

Technically, yes, it is. However, as BitMover has given the community
little other choice, I don't see how they could hold anyone to it. They'd
have a hard time making that one-year clause stick given their abandonment
of the free product and refusal to grant licenses to OSDL employees.

Plus, there's nothing in the BKL specifically granting BitMover the
right to revoke the license, and thus the use of BK/Free, at their whim.
They have every right to stop developing, supporting, and distributing
BK/Free, but rescinding all BK/Free licenses just for spite does not
appear to be within their legal rights.

(Sorry Larry, but that's what you're doing. Tridge was working on taking
your toys apart -- he does that, what can I say. He explicitly lied,
saying he would stop, but of course didn't. And then you got all pissed
at OSDL for not smiting him when, technically, they can't -- an employer
is not responsible for the actions of their employees on their own time,
on their own property, unrelated to their employ. Sorry, but I know that
one by heart :-))

--Ricky