2005-05-12 09:44:43

by Matt Mackall

Subject: Mercurial 0.4e vs git network pull

Now that I'm back from vacation, there's a new Mercurial release as
well as snapshots at:

http://selenic.com/mercurial/

A combined self-hosting repository / web interface can be found at:

http://selenic.com/hg/

And there's now a mailing list at:

http://selenic.com/mailman/listinfo/mercurial

The big news is that Mercurial now has a very fast network protocol.
This benchmark is pulling and merging 819 changesets (again, taken
from 2.6.12-rc2-mm3) from one repo to another over DSL using
Mercurial's new delta protocol:

$ time hg merge hg://selenic.com/linux-hg/
retrieving changegroup
merging changesets
merging manifests
merging files

real 0m10.276s
user 0m3.299s
sys 0m0.689s

For comparison, rsyncing the same set of changes between git repos from
the same server:

$ time rsync -a rsync://10.0.0.12:2000/git/lgb/.git .
sent 171508 bytes received 31225542 bytes 312408.46 bytes/sec

real 1m40.470s
user 0m0.655s
sys 0m1.896s

The original broken-out.tar.bz2: 2.3M
The same, uncompressed: 15M
The same, rsynced with git: 30M
The same, pulled with hg (zlib): 2.5M <- what I used above
The same, pulled with hg (bz2): 2.1M

The server in question is a relatively busy 1GHz Athlon. The server
side of the hg protocol is stateless and is serviced by a simple CGI
script run under Apache.

Mercurial is more than 10 times as bandwidth efficient and
considerably more I/O efficient. On the server side, rsync uses about
twice as much CPU time as the Mercurial server and has about 10 times
the I/O and pagecache footprint as well.

Mercurial is also much smarter than rsync at determining what
outstanding changesets exist. Here's an empty pull as a demonstration:

$ time hg merge hg://selenic.com/linux-hg/
retrieving changegroup

real 0m0.363s
user 0m0.083s
sys 0m0.007s

That's a single http request and a one line response.

And now with rsync:

$ time rsync -av rsync://10.0.0.12:2000/git/lgb/.git .
receiving file list ... done

sent 76 bytes received 1280245 bytes 2560642.00 bytes/sec
total size is 85993841 speedup is 67.17

real 0m0.539s
user 0m0.185s
sys 0m0.148s

Mercurial's communication here scales as O(min(changed branches, log new
changesets)), which is less than O(new changesets), while rsync scales
as O(total number of file revisions) (ouch!). The above transfer size
for an empty pull will grow from 1.2M to >12M once git holds history
comparable to what's in BK.
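The log factor comes from a binary-search-style negotiation over history. A toy sketch of why that scaling holds (illustration only, assuming linear history; this is not Mercurial's actual wire protocol, and server_query is a stand-in for one request/response round trip):

```python
# Toy illustration of O(log new changesets) discovery over linear
# history.  server_query(i) simulates one round trip asking the server
# for the ID of its i-th revision; not Mercurial's real protocol.

def find_common(client_revs, server_query, server_len):
    """Binary-search for the newest server revision the client has."""
    have = set(client_revs)
    trips = 0
    lo, hi = -1, server_len        # rev[lo] is shared, rev[hi] missing
    while hi - lo > 1:
        mid = (lo + hi) // 2
        trips += 1                 # one request/response
        if server_query(mid) in have:
            lo = mid
        else:
            hi = mid
    return lo, trips

# Server has 819 changesets; client already has the first 19.
server = ["rev%d" % i for i in range(819)]
client = server[:19]
common, trips = find_common(client, lambda i: server[i], len(server))
print(common, trips)               # last shared index 18, in 10 round trips
```

Everything past the common point is then pulled in a single bulk request, which roughly matches the handful of small round trips plus one big transfer described in the thread.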

--
Mathematics is the supreme nostalgia of our time.


2005-05-12 18:24:39

by Petr Baudis

Subject: Re: Mercurial 0.4e vs git network pull

Dear diary, on Thu, May 12, 2005 at 11:44:06AM CEST, I got a letter
where Matt Mackall <[email protected]> told me that...
> Mercurial is more than 10 times as bandwidth efficient and
> considerably more I/O efficient. On the server side, rsync uses about
> twice as much CPU time as the Mercurial server and has about 10 times
> the I/O and pagecache footprint as well.
>
> Mercurial is also much smarter than rsync at determining what
> outstanding changesets exist. Here's an empty pull as a demonstration:
>
> $ time hg merge hg://selenic.com/linux-hg/
> retrieving changegroup
>
> real 0m0.363s
> user 0m0.083s
> sys 0m0.007s
>
> That's a single http request and a one line response.

So, what about comparing it with something comparable, say git pull over
HTTP? :-)

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

2005-05-12 20:11:32

by Matt Mackall

Subject: Re: Mercurial 0.4e vs git network pull

On Thu, May 12, 2005 at 08:23:41PM +0200, Petr Baudis wrote:
> Dear diary, on Thu, May 12, 2005 at 11:44:06AM CEST, I got a letter
> where Matt Mackall <[email protected]> told me that...
> > Mercurial is more than 10 times as bandwidth efficient and
> > considerably more I/O efficient. On the server side, rsync uses about
> > twice as much CPU time as the Mercurial server and has about 10 times
> > the I/O and pagecache footprint as well.
> >
> > Mercurial is also much smarter than rsync at determining what
> > outstanding changesets exist. Here's an empty pull as a demonstration:
> >
> > $ time hg merge hg://selenic.com/linux-hg/
> > retrieving changegroup
> >
> > real 0m0.363s
> > user 0m0.083s
> > sys 0m0.007s
> >
> > That's a single http request and a one line response.
>
> So, what about comparing it with something comparable, say git pull over
> HTTP? :-)

..because I get a headache every time I try to figure out how to use git? :-P

Seriously, have a pointer to how this works?

--
Mathematics is the supreme nostalgia of our time.

2005-05-12 20:14:38

by Petr Baudis

Subject: Re: Mercurial 0.4e vs git network pull

Dear diary, on Thu, May 12, 2005 at 10:11:16PM CEST, I got a letter
where Matt Mackall <[email protected]> told me that...
> On Thu, May 12, 2005 at 08:23:41PM +0200, Petr Baudis wrote:
> > Dear diary, on Thu, May 12, 2005 at 11:44:06AM CEST, I got a letter
> > where Matt Mackall <[email protected]> told me that...
> > > Mercurial is more than 10 times as bandwidth efficient and
> > > considerably more I/O efficient. On the server side, rsync uses about
> > > twice as much CPU time as the Mercurial server and has about 10 times
> > > the I/O and pagecache footprint as well.
> > >
> > > Mercurial is also much smarter than rsync at determining what
> > > outstanding changesets exist. Here's an empty pull as a demonstration:
> > >
> > > $ time hg merge hg://selenic.com/linux-hg/
> > > retrieving changegroup
> > >
> > > real 0m0.363s
> > > user 0m0.083s
> > > sys 0m0.007s
> > >
> > > That's a single http request and a one line response.
> >
> > So, what about comparing it with something comparable, say git pull over
> > HTTP? :-)
>
> ..because I get a headache every time I try to figure out how to use git? :-P
>
> Seriously, have a pointer to how this works?

Either you use cogito and just pass cg-clone an HTTP URL (to the git
repository as in the case of rsync -
http://www.kernel.org/pub/scm/cogito/cogito.git should work), or you
invoke git-http-pull directly (passing it desired commit ID of the
remote HEAD you want to fetch, and the URL; see
Documentation/git-http-pull.txt).

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

2005-05-12 20:58:10

by Matt Mackall

Subject: Re: Mercurial 0.4e vs git network pull

On Thu, May 12, 2005 at 10:14:06PM +0200, Petr Baudis wrote:
> Dear diary, on Thu, May 12, 2005 at 10:11:16PM CEST, I got a letter
> where Matt Mackall <[email protected]> told me that...
> > On Thu, May 12, 2005 at 08:23:41PM +0200, Petr Baudis wrote:
> > > Dear diary, on Thu, May 12, 2005 at 11:44:06AM CEST, I got a letter
> > > where Matt Mackall <[email protected]> told me that...
> > > > Mercurial is more than 10 times as bandwidth efficient and
> > > > considerably more I/O efficient. On the server side, rsync uses about
> > > > twice as much CPU time as the Mercurial server and has about 10 times
> > > > the I/O and pagecache footprint as well.
> > > >
> > > > Mercurial is also much smarter than rsync at determining what
> > > > outstanding changesets exist. Here's an empty pull as a demonstration:
> > > >
> > > > $ time hg merge hg://selenic.com/linux-hg/
> > > > retrieving changegroup
> > > >
> > > > real 0m0.363s
> > > > user 0m0.083s
> > > > sys 0m0.007s
> > > >
> > > > That's a single http request and a one line response.
> > >
> > > So, what about comparing it with something comparable, say git pull over
> > > HTTP? :-)
> >
> > ..because I get a headache every time I try to figure out how to use git? :-P
> >
> > Seriously, have a pointer to how this works?
>
> Either you use cogito and just pass cg-clone an HTTP URL (to the git
> repository as in the case of rsync -
> http://www.kernel.org/pub/scm/cogito/cogito.git should work), or you
> invoke git-http-pull directly (passing it desired commit ID of the
> remote HEAD you want to fetch, and the URL; see
> Documentation/git-http-pull.txt).

Does this need an HTTP request (and round trip) per object? It appears
to. That's 2200 requests/round trips for my 800 patch benchmark.

How does git find the outstanding changesets?

--
Mathematics is the supreme nostalgia of our time.

2005-05-12 21:27:12

by Daniel Barkalow

Subject: Re: Mercurial 0.4e vs git network pull

On Thu, 12 May 2005, Matt Mackall wrote:

> Does this need an HTTP request (and round trip) per object? It appears
> to. That's 2200 requests/round trips for my 800 patch benchmark.

It requires a request per object, but it should be possible (with
somewhat more complicated code) to overlap them such that it doesn't
require a serial round trip for each. Since the server is sending static
files, the overhead for each should be minimal.

> How does git find the outstanding changesets?

In the present mainline, you first have to find the head commit you
want. I have a patch which does this for you over the same
connection. Starting from that point, it tracks reachability on the
receiving end, and requests anything it doesn't have.

For the case of having nothing to do, it should be a single one-line
request/response for a static file (after which the local end determines
that it has everything it needs without talking to the server).

-Daniel
*This .sig left intentionally blank*

2005-05-12 22:30:38

by Matt Mackall

Subject: Re: Mercurial 0.4e vs git network pull

On Thu, May 12, 2005 at 05:24:27PM -0400, Daniel Barkalow wrote:
> On Thu, 12 May 2005, Matt Mackall wrote:
>
> > Does this need an HTTP request (and round trip) per object? It appears
> > to. That's 2200 requests/round trips for my 800 patch benchmark.
>
> It requires a request per object, but it should be possible (with
> somewhat more complicated code) to overlap them such that it doesn't
> require a serial round trip for each. Since the server is sending static
> files, the overhead for each should be minimal.

It's not minimal. The size of an HTTP request is often not much
different than the size of a compressed file delta. Here's one of the
indexes from a file in an hg repo:

rev  offset  length  base  linkrev  p1            p2            nodeid
  0       0    2307     0        0  0000000000..  0000000000..  b6444347c6..
  1    2307      77     0        5  b6444347c6..  0000000000..  06763db6de..
  2    2384     225     0       11  06763db6de..  0000000000..  acc8e2b2f0..
  3    2609      40     0       16  acc8e2b2f0..  0000000000..  461b079d98..
  4    2649     261     0       17  461b079d98..  0000000000..  8507ba44cc..
  5    2910     486     0       18  8507ba44cc..  0000000000..  b68523252b..
  6    3396      98     0       21  b68523252b..  0000000000..  b3f2586243..
  7    3494     238     0       22  b3f2586243..  0000000000..  d73d0f8ee9..
  8    3732      39     0       23  d73d0f8ee9..  0000000000..  caaf506196..
  9    3771     266     0       24  caaf506196..  0000000000..  54485fc96f..
 10    4037      81     0       29  54485fc96f..  0000000000..  b9eae7b990..
 11    4118     310     0       31  b9eae7b990..  0000000000..  a9926b092a..
 12    4428     545     0       33  a9926b092a..  0000000000..  f26c600172..
 13    4973     419     0       34  f26c600172..  0000000000..  ec4ab0acb7..
 14    5392     136     0       38  ec4ab0acb7..  0000000000..  eb5f3f76c8..
 15    5528     161     0       39  eb5f3f76c8..  0000000000..  4fc5f3a3ae..
 16    5689     258     0       46  4fc5f3a3ae..  0000000000..  3ad83891fb..
 17    5947     171     0       49  3ad83891fb..  0000000000..  3983ac6cd2..
 18    6118     195     0       50  3983ac6cd2..  0000000000..  f138865e04..
 19    6313      79     0       52  f138865e04..  0000000000..  3566c1f449..
 20    6392      85     0       53  3566c1f449..  0000000000..  0694a4e3eb..
 21    6477      91     0       54  0694a4e3eb..  0000000000..  5f98ae7426..
 22    6568     208     0       56  5f98ae7426..  0000000000..  dae5cb80db..
 23    6776     286     0       62  dae5cb80db..  0000000000..  90ff243869..

All the junk that gets bundled in an http request/response will be
similar in size to the stuff in the third column.

Relative to the 10-20x overhead of not sending deltas, yes, it's only 10%.
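For a rough sense of scale (the header fields below are an illustrative guess at a 2005-era Apache response, not a capture from any real server), compare a typical per-object response header against the 'length' column above:

```python
# An illustrative, made-up Apache-style response header for one object.
response_header = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Thu, 12 May 2005 20:58:10 GMT\r\n"
    "Server: Apache/2.0.54 (Unix)\r\n"
    "Last-Modified: Thu, 12 May 2005 18:00:00 GMT\r\n"
    "ETag: \"1c-3e94bc-4c23\"\r\n"
    "Accept-Ranges: bytes\r\n"
    "Content-Length: 208\r\n"
    "Content-Type: text/plain\r\n"
    "\r\n"
)
# 'length' values from the first few index rows above.
delta_sizes = [77, 225, 40, 261, 486, 98, 238, 39]

print(len(response_header))                  # over 200 bytes of header alone
print(sum(delta_sizes) // len(delta_sizes))  # average delta: 183 bytes
```

So per-object HTTP headers really are the same order of magnitude as the compressed deltas themselves.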

> > How does git find the outstanding changesets?
>
> In the present mainline, you first have to find the head commit you
> want. I have a patch which does this for you over the same
> connection. Starting from that point, it tracks reachability on the
> receiving end, and requests anything it doesn't have.

Does it do this recursively? Eg, if the server has 800 new linear
commits, does the client have to do 800 round trips following parent
pointers to find all the new changesets? In this case, Mercurial does
about 6 round trips, totalling less than 1K, plus one request
that pulls everything.

--
Mathematics is the supreme nostalgia of our time.

2005-05-13 00:34:56

by Daniel Barkalow

Subject: Re: Mercurial 0.4e vs git network pull

On Thu, 12 May 2005, Matt Mackall wrote:

> On Thu, May 12, 2005 at 05:24:27PM -0400, Daniel Barkalow wrote:
> > On Thu, 12 May 2005, Matt Mackall wrote:
> >
> > > Does this need an HTTP request (and round trip) per object? It appears
> > > to. That's 2200 requests/round trips for my 800 patch benchmark.
> >
> > It requires a request per object, but it should be possible (with
> > somewhat more complicated code) to overlap them such that it doesn't
> > require a serial round trip for each. Since the server is sending static
> > files, the overhead for each should be minimal.
>
> It's not minimal. The size of an HTTP request is often not much
> different than the size of a compressed file delta.

I was thinking of server-side processing overhead, not bandwidth. It's
true that the bandwidth could be noticeable for these small files.

> All the junk that gets bundled in an http request/response will be
> similar in size to the stuff in the third column.

kernel.org seems to send 283-byte responses, to be completely
precise. This could be cut down substantially if Apache were tweaked a bit
to skip all the optional headers which are useless or wrong in this
context. (E.g., that includes sending a content-type of "text/plain" for
the binary data)

> Does it do this recursively? Eg, if the server has 800 new linear
> commits, does the client have to do 800 round trips following parent
> pointers to find all the new changesets?

Yes, although that also includes pulling the commits, and may be
interleaved with pulling the trees and objects to cover the
latency. (I.e., one round trip gets the new head hash; the second gets
that commit; on the third the tree and the parent(s) can be requested at
once; on the fourth the contents of the tree and the grandparents, at
which point the bandwidth will probably be the limiting factor for the
rest of the operation.)

> In this case, Mercurial does about 6 round trips, totalling less than
> 1K, plus one request that pulls everything.

I must be misunderstanding your numbers, because 6 HTTP responses is more
than 1K, ignoring any actual content from the server, and 1K for 800
commits is less than 2 bytes per commit.

I'm also worried about testing on 800 linear commits, since the projects
under consideration tend to have very non-linear histories.

-Daniel
*This .sig left intentionally blank*

2005-05-13 01:18:16

by Matt Mackall

Subject: Re: Mercurial 0.4e vs git network pull

On Thu, May 12, 2005 at 08:33:56PM -0400, Daniel Barkalow wrote:
> On Thu, 12 May 2005, Matt Mackall wrote:
>
> > On Thu, May 12, 2005 at 05:24:27PM -0400, Daniel Barkalow wrote:
> > > On Thu, 12 May 2005, Matt Mackall wrote:
> > >
> > > > Does this need an HTTP request (and round trip) per object? It appears
> > > > to. That's 2200 requests/round trips for my 800 patch benchmark.
> > >
> > > It requires a request per object, but it should be possible (with
> > > somewhat more complicated code) to overlap them such that it doesn't
> > > require a serial round trip for each. Since the server is sending static
> > > files, the overhead for each should be minimal.
> >
> > It's not minimal. The size of an HTTP request is often not much
> > different than the size of a compressed file delta.
>
> I was thinking of server-side processing overhead, not bandwidth. It's
> true that the bandwidth could be noticeable for these small files.
>
> > All the junk that gets bundled in an http request/response will be
> > similar in size to the stuff in the third column.
>
> kernel.org seems to send 283-byte responses, to be completely
> precise. This could be cut down substantially if Apache were tweaked a bit
> to skip all the optional headers which are useless or wrong in this
> context. (E.g., that includes sending a content-type of "text/plain" for
> the binary data)
>
> > Does it do this recursively? Eg, if the server has 800 new linear
> > commits, does the client have to do 800 round trips following parent
> > pointers to find all the new changesets?
>
> Yes, although that also includes pulling the commits, and may be
> interleaved with pulling the trees and objects to cover the
> latency. (I.e., one round trip gets the new head hash; the second gets
> that commit; on the third the tree and the parent(s) can be requested at
> once; on the fourth the contents of the tree and the grandparents, at
> which point the bandwidth will probably be the limiting factor for the
> rest of the operation.)

What if a changeset is smaller than the bandwidth-delay product of
your link? As an extreme example, Mercurial is currently at a point
where its -entire repo- changegroup (set of all changesets) can be in
flight on the wire on a typical link.

> > In this case, Mercurial does about 6 round trips, totalling less than
> > 1K, plus one request that pulls everything.
>
> I must be misunderstanding your numbers, because 6 HTTP responses is more
> than 1K, ignoring any actual content from the server, and 1K for 800
> commits is less than 2 bytes per commit.

1k of application-level data, sorry. And my whole point is that I
don't send those 800 commit identifiers (which are 40 bytes each as
hex). I send about 30 or so. It's basically a negotiation to find the
earliest commits not known to the client with a minimum of round trips
and data exchange.

> I'm also worried about testing on 800 linear commits, since the projects
> under consideration tend to have very non-linear histories.

Not true at all. Dumps from Andrew to Linus via patch bombs will
result in runs of hundreds of linear commits on a regular basis.
Linear patch series are the preferred way to make changes and series
of 30 or 40 small patches are not at all uncommon.

--
Mathematics is the supreme nostalgia of our time.

2005-05-13 02:23:54

by Daniel Barkalow

Subject: Re: Mercurial 0.4e vs git network pull

On Thu, 12 May 2005, Matt Mackall wrote:

> On Thu, May 12, 2005 at 08:33:56PM -0400, Daniel Barkalow wrote:
>
> > Yes, although that also includes pulling the commits, and may be
> > interleaved with pulling the trees and objects to cover the
> > latency. (I.e., one round trip gets the new head hash; the second gets
> > that commit; on the third the tree and the parent(s) can be requested at
> > once; on the fourth the contents of the tree and the grandparents, at
> > which point the bandwidth will probably be the limiting factor for the
> > rest of the operation.)
>
> What if a changeset is smaller than the bandwidth-delay product of
> your link? As an extreme example, Mercurial is currently at a point
> where its -entire repo- changegroup (set of all changesets) can be in
> flight on the wire on a typical link.

If this is common for the repository in question, then it will be forced
to wait for the parent to come in, true. If you have a number of merges,
however, you start using more total bandwidth relative to latency while
tracking them in parallel.

> > I must be misunderstanding your numbers, because 6 HTTP responses is more
> > than 1K, ignoring any actual content from the server, and 1K for 800
> > commits is less than 2 bytes per commit.
>
> 1k of application-level data, sorry. And my whole point is that I
> don't send those 800 commit identifiers (which are 40 bytes each as
> hex). I send about 30 or so. It's basically a negotiation to find the
> earliest commits not known to the client with a minimum of round trips
> and data exchange.

Does this rely on the history being entirely linear? I suppose that
requesting a rev-list from the server (which could have it as a static
file generated when a new head was pushed) could jumpstart the
process. The client could request all of the commits it doesn't have in
rapid succession, and then request trees as the commits started coming
in. Of course, this would get inefficient if you were, for example,
pulling a merge with a branch with a long history, since you'd get a ton
of old mainline (which you already have) interleaved with occasional new
things.

> > I'm also worried about testing on 800 linear commits, since the projects
> > under consideration tend to have very non-linear histories.
>
> Not true at all. Dumps from Andrew to Linus via patch bombs will
> result in runs of hundreds of linear commits on a regular basis.
> Linear patch series are the preferred way to make changes and series
> of 30 or 40 small patches are not at all uncommon.

It has sounded like Andrew had some interest in using git, and a number of
other developers are using it already. If this becomes still more common,
it may be the case that, instead of sending patch bombs, Andrew will point
Linus at authors' original series, in which case the mainline would be
merges of a hundred linear series of various lengths. I had the
impression, although I never looked carefully, that this was happening on
a smaller scale with BK, where work by BK users got included using BK,
rather than as patches applied out of a bomb.

It certainly makes sense as a design goal to be able to support everything
happening within the system, rather than getting exported and reimported.

-Daniel
*This .sig left intentionally blank*

2005-05-13 02:44:59

by Matt Mackall

Subject: Re: Mercurial 0.4e vs git network pull

On Thu, May 12, 2005 at 10:23:01PM -0400, Daniel Barkalow wrote:
> On Thu, 12 May 2005, Matt Mackall wrote:
>
> > On Thu, May 12, 2005 at 08:33:56PM -0400, Daniel Barkalow wrote:
> >
> > > Yes, although that also includes pulling the commits, and may be
> > > interleaved with pulling the trees and objects to cover the
> > > latency. (I.e., one round trip gets the new head hash; the second gets
> > > that commit; on the third the tree and the parent(s) can be requested at
> > > once; on the fourth the contents of the tree and the grandparents, at
> > > which point the bandwidth will probably be the limiting factor for the
> > > rest of the operation.)
> >
> > What if a changeset is smaller than the bandwidth-delay product of
> > your link? As an extreme example, Mercurial is currently at a point
> > where its -entire repo- changegroup (set of all changesets) can be in
> > flight on the wire on a typical link.
>
> If this is common for the repository in question, then it will be forced
> to wait for the parent to come in, true. If you have a number of merges,
> however, you start using more total bandwidth relative to latency while
> tracking them in parallel.

No, you're missing my point. If you can request all the files in a
changeset in less than a round-trip time, you have a pipeline stall.
Let's say a changeset is 10k and round trip time is 100ms. That means
you'll stall on any pipe with more than 100k/s. You won't know what
changeset to request next as it'll still be in flight.
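Using Matt's example figures, the stall threshold works out directly:

```python
# Matt's example figures: a 10k changeset and a 100 ms round trip.
# Above this bandwidth, the pipe drains before the next parent pointer
# arrives, so a fetch-as-you-discover client stalls on every changeset.
changeset_bytes = 10_000
rtt_seconds = 0.100

stall_bandwidth = changeset_bytes / rtt_seconds
print(stall_bandwidth)   # 100000.0 bytes/s, i.e. the "100k/s" above
```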

> > > I must be misunderstanding your numbers, because 6 HTTP responses is more
> > > than 1K, ignoring any actual content from the server, and 1K for 800
> > > commits is less than 2 bytes per commit.
> >
> > 1k of application-level data, sorry. And my whole point is that I
> > don't send those 800 commit identifiers (which are 40 bytes each as
> > hex). I send about 30 or so. It's basically a negotiation to find the
> > earliest commits not known to the client with a minimum of round trips
> > and data exchange.
>
> Does this rely on the history being entirely linear? I suppose that
> requesting a rev-list from the server (which could have it as a static
> file generated when a new head was pushed) could jumpstart the
> process. The client could request all of the commits it doesn't have in
> rapid succession, and then request trees as the commits started coming
> in. Of course, this would get inefficient if you were, for example,
> pulling a merge with a branch with a long history, since you'd get a ton
> of old mainline (which you already have) interleaved with occasional new
> things.

I don't depend on history being linear (I'm not reinventing CVS here)
and I don't grab a list of all revisions (the point is to be
scalable). In fact, I do something fairly clever, and something I
don't think will work with git, because, yet again, it lacks the
metadata.

> > > I'm also worried about testing on 800 linear commits, since the projects
> > > under consideration tend to have very non-linear histories.
> >
> > Not true at all. Dumps from Andrew to Linus via patch bombs will
> > result in runs of hundreds of linear commits on a regular basis.
> > Linear patch series are the preferred way to make changes and series
> > of 30 or 40 small patches are not at all uncommon.
>
> It has sounded like Andrew had some interest in using git, and a number of
> other developers are using it already. If this becomes still more common,
> it may be the case that, instead of sending patch bombs, Andrew will point
> Linus at authors' original series, in which case the mainline would be
> merges of a hundred linear series of various lengths. I had the
> impression, although I never looked carefully, that this was happening on
> a smaller scale with BK, where work by BK users got included using BK,
> rather than as patches applied out of a bomb.

Andrew already uses git, in a manner much like he used BK. He does a
pull from a repo, generates a patch of that repo vs mainline, and puts
that in -mm. And never passes that stuff on to Linus.

--
Mathematics is the supreme nostalgia of our time.

2005-05-13 05:44:37

by Petr Baudis

Subject: Re: Mercurial 0.4e vs git network pull

Dear diary, on Thu, May 12, 2005 at 11:24:27PM CEST, I got a letter
where Daniel Barkalow <[email protected]> told me that...
> In the present mainline, you first have to find the head commit you
> want. I have a patch which does this for you over the same
> connection. Starting from that point, it tracks reachability on the
> receiving end, and requests anything it doesn't have.

Could we get the patch, please? :-)

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

2005-05-15 00:40:25

by Christian Kujau

Subject: Re: Mercurial 0.4e vs git network pull

Petr Baudis wrote:
> remote HEAD you want to fetch, and the URL; see
> Documentation/git-http-pull.txt).

where did you get this file from?

% ls Documentation/git-http-pull.txt
ls: Documentation/git-http-pull.txt: No such file or directory

% find . -iname "*git*" <-- returns nothing...

thanks,
Christian.
--
BOFH excuse #28:

CPU radiator broken

2005-05-15 06:24:53

by Ingo Molnar

Subject: Re: Mercurial 0.4e vs git network pull


* Petr Baudis <[email protected]> wrote:

> > Mercurial is also much smarter than rsync at determining what
> > outstanding changesets exist. Here's an empty pull as a demonstration:
> >
> > $ time hg merge hg://selenic.com/linux-hg/
> > retrieving changegroup
> >
> > real 0m0.363s
> > user 0m0.083s
> > sys 0m0.007s
> >
> > That's a single http request and a one line response.
>
> So, what about comparing it with something comparable, say git pull
> over HTTP? :-)

Matt, did you get around to do such a comparison?

Ingo

2005-05-15 08:50:25

by Petr Baudis

Subject: Re: Mercurial 0.4e vs git network pull

Dear diary, on Sun, May 15, 2005 at 02:40:14AM CEST, I got a letter
where Christian Kujau <[email protected]> told me that...
> Petr Baudis wrote:
> > remote HEAD you want to fetch, and the URL; see
> > Documentation/git-http-pull.txt).
>
> where did you get this file from?
>
> % ls Documentation/git-http-pull.txt
> ls: Documentation/git-http-pull.txt: No such file or directory
>
> % find . -iname "*git*" <-- returns nothing...

It's in the git-pb and cogito trees. Linus is on holiday. :-)

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

2005-05-15 08:54:10

by Petr Baudis

Subject: Re: Mercurial 0.4e vs git network pull

Dear diary, on Thu, May 12, 2005 at 10:57:35PM CEST, I got a letter
where Matt Mackall <[email protected]> told me that...
> Does this need an HTTP request (and round trip) per object? It appears
> to. That's 2200 requests/round trips for my 800 patch benchmark.

Yes it does. On the other side, it needs no server-side CGI. But I guess
it should be pretty easy to write some kind of server-side CGI streamer,
and it would then easily take just a single HTTP request (telling the
server the commit ID and receiving back all the objects).

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

2005-05-15 12:33:20

by Adam J. Richter

Subject: Re: Mercurial 0.4e vs git network pull

On Sun, 15 May 2005 10:54:05 +0200, Petr Baudis wrote:
>Dear diary, on Thu, May 12, 2005 at 10:57:35PM CEST, I got a letter
>where Matt Mackall <[email protected]> told me that...
>> Does this need an HTTP request (and round trip) per object? It appears
>> to. That's 2200 requests/round trips for my 800 patch benchmark.

>Yes it does. On the other side, it needs no server-side CGI. But I guess
>it should be pretty easy to write some kind of server-side CGI streamer,
>and it would then easily take just a single HTTP request (telling the
>server the commit ID and receiving back all the objects).

I don't understand what was wrong with Jeff Garzik's previous
suggestion of using http/1.1 pipelining to coalesce the round trips.
If you're worried about queuing too many http/1.1 requests, the client
could adopt a policy of not having more than a certain number of
requests outstanding or perhaps even making a new http connection
after a certain number of requests to avoid starving other clients
when the number of clients doing one of these transfers exceeds the
number of threads that the http server uses.

Being able to do without a server-side CGI script might
encourage deployment a bit more, both for security reasons and
for ease of deployment.

In any case, using httpd or ftp makes it easier to deploy
servers in cases where it might be harder to modify firewall rules,
so I am glad to see that, even if it is through a CGI script.
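A minimal sketch of the bounded-outstanding-requests policy (Python; fetch_object is a hypothetical stand-in for an HTTP GET of one git object, not any real git or cogito API):

```python
# Cap in-flight requests at MAX_OUTSTANDING, per the suggestion above.
# The thread pool's worker limit is what enforces the cap here.
from concurrent.futures import ThreadPoolExecutor

MAX_OUTSTANDING = 8

def fetch_object(obj_id):
    # A real client would GET the object over a persistent HTTP/1.1
    # connection here; this placeholder just echoes the ID.
    return "data-for-%s" % obj_id

object_ids = ["obj%d" % i for i in range(100)]
with ThreadPoolExecutor(max_workers=MAX_OUTSTANDING) as pool:
    results = list(pool.map(fetch_object, object_ids))
print(len(results))   # 100 objects fetched, never more than 8 pending
```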

--
Adam J. Richter
[email protected]
Yggdrasil

2005-05-15 12:41:19

by Petr Baudis

Subject: Re: Mercurial 0.4e vs git network pull

Dear diary, on Sun, May 15, 2005 at 01:22:19PM CEST, I got a letter
where "Adam J. Richter" <[email protected]> told me that...
> On Sun, 15 May 2005 10:54:05 +0200, Petr Baudis wrote:
> >Dear diary, on Thu, May 12, 2005 at 10:57:35PM CEST, I got a letter
> >where Matt Mackall <[email protected]> told me that...
> >> Does this need an HTTP request (and round trip) per object? It appears
> >> to. That's 2200 requests/round trips for my 800 patch benchmark.
>
> >Yes it does. On the other side, it needs no server-side CGI. But I guess
> >it should be pretty easy to write some kind of server-side CGI streamer,
> >and it would then easily take just a single HTTP request (telling the
> >server the commit ID and receiving back all the objects).
>
> I don't understand what was wrong with Jeff Garzik's previous
> suggestion of using http/1.1 pipelining to coalesce the round trips.
> If you're worried about queuing too many http/1.1 requests, the client
> could adopt a policy of not having more than a certain number of
> requests outstanding or perhaps even making a new http connection
> after a certain number of requests to avoid starving other clients
> when the number of clients doing one of these transfers exceeds the
> number of threads that the http server uses.

The problem is that to fetch a revision tree, you have to

send request for commit A
receive commit A
look at commit A for list of its parents
send request for the parents
receive the parents
look inside for list of its parents
...

(and same for the trees).
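
The serial dependency above can be shown with a toy walk, where a dict stands in for the server and each loop iteration counts as one round trip (purely illustrative, not git's actual on-the-wire format):

```python
# Toy illustration of the round-trip dependency: the client only
# learns a commit's parents by fetching the commit itself, so a
# linear history of N commits forces N sequential round trips.

def fetch_history(repo, head):
    """Walk from head to the roots, counting sequential round trips."""
    have, frontier, round_trips = set(), [head], 0
    while frontier:
        round_trips += 1                 # one request/response cycle
        next_frontier = []
        for commit in frontier:
            if commit in have:
                continue
            have.add(commit)
            next_frontier.extend(repo[commit])  # parents revealed only now
        frontier = next_frontier
    return have, round_trips

# A linear five-commit history: e -> d -> c -> b -> a
repo = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "e": ["d"]}
```

Pipelining cannot help here: the request for `d` cannot be issued until `e` has been received and parsed.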

> Being able to do without a server side CGI script might
> encourage deployment a bit more, both for security reasons and
> effort of deployment.

You could still use it without the server side CGI script as it is now,
just without the speedups.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

2005-05-15 13:03:52

by Adam J. Richter

[permalink] [raw]
Subject: Re: Mercurial 0.4e vs git network pull

On Sun, 15 May 2005 14:40:42 +0200, Petr Baudis wrote:
>Dear diary, on Sun, May 15, 2005 at 01:22:19PM CEST, I got a letter
>where "Adam J. Richter" <[email protected]> told me that...
[...]
>> I don't understand what was wrong with Jeff Garzik's previous
>> suggestion of using http/1.1 pipelining to coalesce the round trips.
>> If you're worried about queuing too many http/1.1 requests, the client
>> could adopt a policy of not having more than a certain number of
>> requests outstanding or perhaps even making a new http connection
>> after a certain number of requests to avoid starving other clients
>> when the number of clients doing one of these transfers exceeds the
>> number of threads that the http server uses.

>The problem is that to fetch a revision tree, you have to

> send request for commit A
> receive commit A
> look at commit A for list of its parents
> send request for the parents
> receive the parents
> look inside for list of its parents
> ...

>(and same for the trees).

Don't you usually have a list of many files for which you
want to retrieve this information? I'd imagine that would usually
suffice to fill the pipeline.

__ ______________
Adam J. Richter \ /
[email protected] | g g d r a s i l

2005-05-15 14:23:29

by Petr Baudis

[permalink] [raw]
Subject: Re: Mercurial 0.4e vs git network pull

Dear diary, on Sun, May 15, 2005 at 01:52:50PM CEST, I got a letter
where "Adam J. Richter" <[email protected]> told me that...
> On Sun, 15 May 2005 14:40:42 +0200, Petr Baudis wrote:
> >Dear diary, on Sun, May 15, 2005 at 01:22:19PM CEST, I got a letter
> >where "Adam J. Richter" <[email protected]> told me that...
> [...]
> >> I don't understand what was wrong with Jeff Garzik's previous
> >> suggestion of using http/1.1 pipelining to coalesce the round trips.
> >> If you're worried about queuing too many http/1.1 requests, the client
> >> could adopt a policy of not having more than a certain number of
> >> requests outstanding or perhaps even making a new http connection
> >> after a certain number of requests to avoid starving other clients
> >> when the number of clients doing one of these transfers exceeds the
> >> number of threads that the http server uses.
>
> >The problem is that to fetch a revision tree, you have to
>
> > send request for commit A
> > receive commit A
> > look at commit A for list of its parents
> > send request for the parents
> > receive the parents
> > look inside for list of its parents
> > ...
>
> >(and same for the trees).
>
> Don't you usually have a list of many files for which you
> want to retrieve this information? I'd imagine that would usually
> suffice to fill the pipeline.

That might be true for the trees, but not for the commit lists. Most
commits have a single parent; merges have two, and merges with more
than two parents are extremely rare.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

2005-05-15 15:12:13

by Christian Kujau

[permalink] [raw]
Subject: Re: Mercurial 0.4e vs git network pull

Petr Baudis wrote:
>>>remote HEAD you want to fetch, and the URL; see
>>>Documentation/git-http-pull.txt).
>>
[..]
>
> It's in the git-pb and cogito trees. Linus is on holiday. :-)
>

ah, thanks. (that's why "cg-update" returns so quickly ;-))

--
BOFH excuse #266:

All of the packets are empty.

2005-05-15 17:45:09

by Matt Mackall

[permalink] [raw]
Subject: Re: Mercurial 0.4e vs git network pull

On Sun, May 15, 2005 at 04:22:19AM -0700, Adam J. Richter wrote:
> On Sun, 15 May 2005 10:54:05 +0200, Petr Baudis wrote:
> >Dear diary, on Thu, May 12, 2005 at 10:57:35PM CEST, I got a letter
> >where Matt Mackall <[email protected]> told me that...
> >> Does this need an HTTP request (and round trip) per object? It appears
> >> to. That's 2200 requests/round trips for my 800 patch benchmark.
>
> >Yes it does. On the other side, it needs no server-side CGI. But I guess
> >it should be pretty easy to write some kind of server-side CGI streamer,
> >and it would then easily take just a single HTTP request (telling the
> >server the commit ID and receiving back all the objects).
>
> I don't understand what was wrong with Jeff Garzik's previous
> suggestion of using http/1.1 pipelining to coalesce the round trips.

You can't do pipelining if you can't look ahead far enough to fill the pipe.

--
Mathematics is the supreme nostalgia of our time.

2005-05-15 18:23:56

by Jeff Garzik

[permalink] [raw]
Subject: Re: Mercurial 0.4e vs git network pull

Matt Mackall wrote:
> On Sun, May 15, 2005 at 04:22:19AM -0700, Adam J. Richter wrote:
>
>>On Sun, 15 May 2005 10:54:05 +0200, Petr Baudis wrote:
>>
>>>Dear diary, on Thu, May 12, 2005 at 10:57:35PM CEST, I got a letter
>>>where Matt Mackall <[email protected]> told me that...
>>>
>>>>Does this need an HTTP request (and round trip) per object? It appears
>>>>to. That's 2200 requests/round trips for my 800 patch benchmark.
>>
>>>Yes it does. On the other side, it needs no server-side CGI. But I guess
>>>it should be pretty easy to write some kind of server-side CGI streamer,
>>>and it would then easily take just a single HTTP request (telling the
>>>server the commit ID and receiving back all the objects).
>>
>> I don't understand what was wrong with Jeff Garzik's previous
>>suggestion of using http/1.1 pipelining to coalesce the round trips.
>
>
> You can't do pipelining if you can't look ahead far enough to fill the pipe.

Even if you cannot fill a pipeline, HTTP/1.1 is sufficiently useful
simply by removing the per-request connection overhead.

Jeff


2005-05-16 01:12:26

by Matt Mackall

[permalink] [raw]
Subject: Re: Mercurial 0.4e vs git network pull

On Sun, May 15, 2005 at 02:23:29PM -0400, Jeff Garzik wrote:
> Matt Mackall wrote:
> >On Sun, May 15, 2005 at 04:22:19AM -0700, Adam J. Richter wrote:
> >
> >>On Sun, 15 May 2005 10:54:05 +0200, Petr Baudis wrote:
> >>
> >>>Dear diary, on Thu, May 12, 2005 at 10:57:35PM CEST, I got a letter
> >>>where Matt Mackall <[email protected]> told me that...
> >>>
> >>>>Does this need an HTTP request (and round trip) per object? It appears
> >>>>to. That's 2200 requests/round trips for my 800 patch benchmark.
> >>
> >>>Yes it does. On the other side, it needs no server-side CGI. But I guess
> >>>it should be pretty easy to write some kind of server-side CGI streamer,
> >>>and it would then easily take just a single HTTP request (telling the
> >>>server the commit ID and receiving back all the objects).
> >>
> >> I don't understand what was wrong with Jeff Garzik's previous
> >>suggestion of using http/1.1 pipelining to coalesce the round trips.
> >
> >
> >You can't do pipelining if you can't look ahead far enough to fill the
> >pipe.
>
> Even if you cannot fill a pipeline, HTTP/1.1 is sufficiently useful
> simply by removing the per-request connection overhead.

Sure. It cuts round trips by a factor of 2. But that's just about all
it does.

Mercurial already does:
- approximately O(log(new changesets)) requests/data to find new changesets
- one request to get an entire changegroup (set of all new
changesets), which comes back all nicely pipelined and sorted by file
- delta transfer

In "dumb http" mode, ie what's been there since about day three, it
can do:
- one request (size proportional to total number of changesets) to
find new changesets
- approximately two requests per changed file to pull all deltas
(vs request per file revision)
- delta transfer
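
The logarithmic discovery step can be sketched under a simplifying assumption: both sides hold a linear changelog sharing a common prefix. Real Mercurial negotiates over a DAG, so this only shows why the number of server queries grows logarithmically rather than linearly:

```python
# Hedged sketch of O(log n) discovery: binary-search the length of
# the shared history prefix. remote_has(node) stands in for one
# request to the server; we count calls to show the query cost.

def find_common_prefix(local, remote_has):
    """Return (shared prefix length, number of server queries)."""
    queries = 0
    lo, hi = 0, len(local)          # invariant: prefix length in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2
        queries += 1
        if remote_has(local[mid - 1]):
            lo = mid                # server has local[:mid]
        else:
            hi = mid - 1
    return lo, queries
```

With 1000 local changesets this needs about ten queries instead of one per changeset; everything after the common prefix is then fetched as a single changegroup.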

--
Mathematics is the supreme nostalgia of our time.

2005-05-16 09:31:36

by Matthias Urlichs

[permalink] [raw]
Subject: Re: Mercurial 0.4e vs git network pull

Hi, Adam J. Richter wrote:

> Being able to do without a server side CGI script might
> encourage deployment a bit more, both for security reasons and
> effort of deployment.

A simple server-side CGI would be a "send me all changeset SHA-1s,
starting at HEAD until you reach FOO" operation (FOO being the SHA1 of
the previous head you've pulled before). This operation is simple enough
that people should have no problem installing such a CGI.

You could then stream-pull the actual contents over HTTP/1.1 without
further CGI interaction.
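
The core of such a CGI could be very small indeed. A sketch, assuming a linear history for simplicity (the list layout and names are made up, not git's actual storage):

```python
# Sketch of the proposed operation: given the server's history and
# the client's previously pulled head FOO, list every changeset
# SHA-1 from HEAD back until FOO is reached, newest first. The
# client then fetches those objects over plain HTTP/1.1.

def changesets_since(history, foo):
    """history is oldest-to-newest SHA-1s; return what the client lacks."""
    out = []
    for sha in reversed(history):
        if sha == foo:
            break
        out.append(sha)
    return out
```

A CGI wrapper would just read FOO from the query string and print the list, one SHA-1 per line.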

--
Matthias Urlichs | {M:U} IT Design @ m-u-it.de | [email protected]


2005-05-16 22:28:51

by Tristan Wibberley

[permalink] [raw]
Subject: Re: Mercurial 0.4e vs git network pull

On Sun, 2005-05-15 at 14:40 +0200, Petr Baudis wrote:
> Dear diary, on Sun, May 15, 2005 at 01:22:19PM CEST, I got a letter
> where "Adam J. Richter" <[email protected]> told me that...
> >
> > I don't understand what was wrong with Jeff Garzik's previous
> > suggestion of using http/1.1 pipelining to coalesce the round trips.
> > If you're worried about queuing too many http/1.1 requests, the client
> > could adopt a policy of not having more than a certain number of
> > requests outstanding or perhaps even making a new http connection
> > after a certain number of requests to avoid starving other clients
> > when the number of clients doing one of these transfers exceeds the
> > number of threads that the http server uses.
>
> The problem is that to fetch a revision tree, you have to
>
> send request for commit A
> receive commit A
> look at commit A for list of its parents
> send request for the parents
> receive the parents
> look inside for list of its parents
> ...

What about IMAP? You could ask for just the parents for several messages
(via a message header), then start asking for message bodies (with the
juicy stuff in). You could also ask for a list of the new commits then
ask for each of the bodies (several at a time). Not as good as a "Just
give me all new data", but an *awful* lot more efficient than HTTP. And
very flexible. You just need to map changesets to IMAP messages (if such
a mapping can actually make sense :)

Prolly a bit more work though.
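
A speculative sketch of the IMAP idea, for what it's worth: parent lists could travel in a custom header, fetched for many messages in a single FETCH. Everything here, the mailbox layout, the X-Parents header, the commit-to-message mapping, is an assumption, not anything any SCM actually does:

```python
import imaplib

def message_set(uids):
    """Render a list of UIDs as an IMAP message-set string."""
    return ",".join(str(u) for u in uids)

def fetch_parent_headers(conn, uids):
    """One round trip for the parent lists of all listed commits.

    conn is an imaplib.IMAP4 connection; BODY.PEEK avoids marking
    the messages as seen.
    """
    return conn.uid("FETCH", message_set(uids),
                    "(BODY.PEEK[HEADER.FIELDS (X-Parents)])")
```

The batching is the point: one FETCH covers an arbitrary set of messages, so the parent discovery that costs HTTP one round trip per object costs IMAP one per batch.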

--
Tristan Wibberley

The opinions expressed in this message are my own opinions and not those
of my employer.