2008-02-23 01:01:24

by Chase Venters

[permalink] [raw]
Subject: Question about your git habits

I've been making myself more familiar with git lately and I'm curious what
habits others have adopted. (I know there are a few documents in circulation
that deal with using git to work on the kernel but I don't think this has
been specifically covered).

My question is: If you're working on multiple things at once, do you tend to
clone the entire repository repeatedly into a series of separate working
directories and do your work there, then pull that work (possibly comprising
a series of "temporary" commits) back into a separate local master
repository with --squash, either into "master" or into a branch containing
the new feature?

Or perhaps you create a temporary topical branch for each thing you are
working on, and commit arbitrary changes then checkout another branch when
you need to change gears, finally --squashing the intermediate commits when a
particular piece of work is done?

I'm using git to manage my project and I'm trying to determine the most
optimal workflow I can. I figure that I'm going to have an "official" master
repository for the project, and I want to keep the revision history clean in
that repository (ie, no messy intermediate commits that don't compile or only
implement a feature half way).

On older projects I was using a centralized revision control system like
*cough* Subversion *cough* and I'd create separate branches which I'd check
out into their own working trees.

It seems to me that having multiple working trees (effectively, cloning
the "master" repository every time I need to make anything but a trivial
change) would be most effective under git as well, since it doesn't require
creating messy, intermediate commits in the first place (but allows for them
if they are used). But I wonder how that approach would scale with a project
whose git repo weighed hundreds of megs or more. (With a centralized rcs, of
course, you don't have to lug around a copy of the whole project history in
each working tree.)

Insight appreciated, and I apologize if I've failed to RTFM somewhere.

Thanks,
Chase


2008-02-23 01:36:20

by J.C. Pizarro

[permalink] [raw]
Subject: Re: Question about your git habits

2008/2/23, Chase Venters <[email protected]> wrote:
>
> ... blablabla
>
> My question is: If you're working on multiple things at once, do you tend to
> clone the entire repository repeatedly into a series of separate working
> directories and do your work there, then pull that work (possibly comprising
> a series of "temporary" commits) back into a separate local master
> respository with --squash, either into "master" or into a branch containing
> the new feature?
>
> ... blablabla
>
> I'm using git to manage my project and I'm trying to determine the most
> optimal workflow I can. I figure that I'm going to have an "official" master
> repository for the project, and I want to keep the revision history clean in
> that repository (ie, no messy intermediate commits that don't compile or only
> implement a feature half way).

I recommend using these complementary tools:

1. google: gitk screenshots ( e.g. http://lwn.net/Articles/140350/ )

2. google: "git-gui" screenshots
( e.g. http://www.spearce.org/2007/01/git-gui-screenshots.html )

3. google: gitweb color meld

;)

2008-02-23 01:37:25

by Jan Engelhardt

[permalink] [raw]
Subject: Re: Question about your git habits


On Feb 22 2008 18:37, Chase Venters wrote:
>
>I've been making myself more familiar with git lately and I'm curious what
>habits others have adopted. (I know there are a few documents in circulation
>that deal with using git to work on the kernel but I don't think this has
>been specifically covered).
>
>My question is: If you're working on multiple things at once,

Impossible; Humans only have one core with only seven registers --
according to CodingStyle chapter 6 paragraph 4.

>do you tend to clone the entire repository repeatedly into a series
>of separate working directories

Too time consuming on consumer drives with projects the size of Linux.

>and do your work there, then pull
>that work (possibly comprising a series of "temporary" commits) back
>into a separate local master respository with --squash, either into
>"master" or into a branch containing the new feature?

No, just commit the current unfinished work to a new branch and deal
with it later (cherry-pick, rebase, reset --soft, commit --amend -i,
you name it). Or if all else fails, use git-stash.

You do not have to push these temporary branches at all, so it is
much nicer than svn. (Once all the work is done and cleanly in
master, you can kill off all branches without having a record
of their previous existence.)
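A minimal sketch of that "park it on a branch" flow, with invented repo and file names:

```shell
# Park unfinished work on a throwaway branch so master stays clean.
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q demo; cd demo
git config user.email you@example.com
git config user.name You
echo base > file.txt; git add file.txt; git commit -qm base
git branch -M master

git checkout -qb wip-widget            # unfinished work goes here
echo "half-done widget" >> file.txt
git commit -qam "WIP: widget, does not even compile yet"

git checkout -q master                 # change gears; master is untouched
# Later: cherry-pick / rebase -i / reset --soft on wip-widget to shape
# the WIP commits into something presentable, then delete the branch.
git log --oneline master
```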

>Or perhaps you create a temporary topical branch for each thing you
>are working on, and commit arbitrary changes then checkout another
>branch when you need to change gears, finally --squashing the
>intermediate commits when a particular piece of work is done?

If I don't collect arbitrary changes, I don't need squashing
(see reset --soft/amend above).

2008-02-23 01:44:58

by Al Viro

[permalink] [raw]
Subject: Re: Question about your git habits

On Sat, Feb 23, 2008 at 02:37:00AM +0100, Jan Engelhardt wrote:

> >do you tend to clone the entire repository repeatedly into a series
> >of separate working directories
>
> Too time consuming on consumer drives with projects the size of Linux.

git clone -l -s

is not particularly slow...
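For the curious: -l hardlinks files where it can, and -s shares the object store through .git/objects/info/alternates instead of copying it, so the clone carries almost no object data of its own. A toy sketch (paths invented):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q origin; cd origin
git config user.email you@example.com
git config user.name You
echo hi > f.txt; git add f.txt; git commit -qm init
cd "$tmp"

# -l: hardlink instead of copying; -s: share objects via alternates.
git clone -q -l -s origin copy
cat copy/.git/objects/info/alternates   # points back at origin's objects
```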

2008-02-23 01:57:04

by Junio C Hamano

[permalink] [raw]
Subject: Re: Question about your git habits

Al Viro <[email protected]> writes:

> On Sat, Feb 23, 2008 at 02:37:00AM +0100, Jan Engelhardt wrote:
>
>> >do you tend to clone the entire repository repeatedly into a series
>> >of separate working directories
>>
>> Too time consuming on consumer drives with projects the size of Linux.
>
> git clone -l -s
>
> is not particulary slow...

How big is a checkout of a single revision of kernel these days,
compared to a well-packed history since v2.6.12-rc2?

The cost of writing out the work tree files isn't ignorable and
probably more than writing out the repository data (which -s
saves for you).

2008-02-23 02:09:37

by Al Viro

[permalink] [raw]
Subject: Re: Question about your git habits

On Fri, Feb 22, 2008 at 05:51:04PM -0800, Junio C Hamano wrote:
> Al Viro <[email protected]> writes:
>
> > On Sat, Feb 23, 2008 at 02:37:00AM +0100, Jan Engelhardt wrote:
> >
> >> >do you tend to clone the entire repository repeatedly into a series
> >> >of separate working directories
> >>
> >> Too time consuming on consumer drives with projects the size of Linux.
> >
> > git clone -l -s
> >
> > is not particulary slow...
>
> How big is a checkout of a single revision of kernel these days,
> compared to a well-packed history since v2.6.12-rc2?
>
> The cost of writing out the work tree files isn't ignorable and
> probably more than writing out the repository data (which -s
> saves for you).

Depends... I'm using ext2 for that and noatime everywhere, so that might
change the picture, but IME it's fast enough... As for the size, it gets
to ~320MB on disk, which is comparable to the pack size (~240-odd MB).

2008-02-23 02:23:59

by J.C. Pizarro

[permalink] [raw]
Subject: Re: Question about your git habits

On 2008/2/23, Al Viro <[email protected]> wrote:
> On Fri, Feb 22, 2008 at 05:51:04PM -0800, Junio C Hamano wrote:
> > Al Viro <[email protected]> writes:
> >
> > > On Sat, Feb 23, 2008 at 02:37:00AM +0100, Jan Engelhardt wrote:
> > >
> > >> >do you tend to clone the entire repository repeatedly into a series
> > >> >of separate working directories
> > >>
> > >> Too time consuming on consumer drives with projects the size of Linux.
> > >
> > > git clone -l -s
> > >
> > > is not particulary slow...
> >
> > How big is a checkout of a single revision of kernel these days,
> > compared to a well-packed history since v2.6.12-rc2?
> >
> > The cost of writing out the work tree files isn't ignorable and
> > probably more than writing out the repository data (which -s
> > saves for you).
>
>
> Depends... I'm using ext2 for that and noatime everywhere, so that might
> change the picture, but IME it's fast enough... As for the size, it gets
> to ~320Mb on disk, which is comparable to the pack size (~240-odd Mb).

Yesterday I git cloned git://foo.com/bar.git (777 MiB).
Today I git cloned git://foo.com/bar.git (779 MiB).

The two repos are different binaries, and I used 777 MiB + 779 MiB = 1556 MiB
of bandwidth in two days. That's a lot!

Why don't we implement a "binary delta between the old git repo and the
recent git repo", with a "SHA1 verifier of the built git repo"?

Suppose the size cost of this binary delta is e.g. around 52 MiB instead of
2 MiB, due to numerous mismatches between binary parts; then the bandwidth
over the two days would be 777 MiB + 52 MiB = 829 MiB instead of 1556 MiB.

Unfortunately, this "binary delta of repos" is not implemented yet :|

2008-02-23 04:11:02

by Daniel Barkalow

[permalink] [raw]
Subject: Re: Question about your git habits

On Fri, 22 Feb 2008, Chase Venters wrote:

> I've been making myself more familiar with git lately and I'm curious what
> habits others have adopted. (I know there are a few documents in circulation
> that deal with using git to work on the kernel but I don't think this has
> been specifically covered).
>
> My question is: If you're working on multiple things at once, do you tend to
> clone the entire repository repeatedly into a series of separate working
> directories and do your work there, then pull that work (possibly comprising
> a series of "temporary" commits) back into a separate local master
> respository with --squash, either into "master" or into a branch containing
> the new feature?
>
> Or perhaps you create a temporary topical branch for each thing you are
> working on, and commit arbitrary changes then checkout another branch when
> you need to change gears, finally --squashing the intermediate commits when a
> particular piece of work is done?

I find that the sequence of changes I make is pretty much unrelated to the
sequence of changes that end up in the project's history, because my
changes as I make them involve writing a lot of stubs (so I can build) and
then filling them out. It's beneficial to have version control on this so
that, if I screw up filling out a stub, I can get back to where I was.

Having made a complete series, I then generate a new series of commits,
each of which does one thing, without any bugs that I've resolved, such
that the net result is the end of the messy history, except with any
debugging or useless stuff skipped. It's this series that gets merged into
the project history, and I discard the other history.

The real trick is that the early patches in a lot of series often refactor
existing code in ways that are generally good and necessary for your
eventual outcome, but which you'd never think of until you've written more
of the series. Generating a new commit sequence is necessary to end up
with a history where it looks from the start like you know where you're
going and have everything done that needs to be done when you get to the
point of needing it. Furthermore, you want to be able to test these
commits in isolation, without the distraction of the changes that actually
prompted them, which means that you want to have your working tree is a
state that you never actually had it in as you were developing the end
result.

This means that you'll usually want to rewrite commits for any series that
isn't a single obvious patch, so it's not a big deal to commit any time
you want to work on some different branch.
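One way to sketch that "generate a new series" step: keep the messy history on a scratch branch, then rebuild the final tree as clean commits on a fresh branch (all names invented; git rebase -i is the usual interactive route for finer-grained splitting):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q demo; cd demo
git config user.email you@example.com
git config user.name You
echo v0 > a.txt; git add a.txt; git commit -qm base
git branch -M master

git checkout -qb messy                  # stub-filled development history
echo stub > b.txt; git add b.txt; git commit -qm "WIP: stub"
echo real > b.txt; git commit -qam "WIP: fill in stub, fix build"

git checkout -qb clean master           # rebuild as one logical commit
git checkout messy -- .                 # take only the *final* tree
git commit -qm "Add b feature as a single clean change"
git log --oneline clean
```

The messy branch can then be deleted, leaving no record of the intermediate states.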

-Daniel
*This .sig left intentionally blank*

2008-02-23 04:37:41

by Rene Herman

[permalink] [raw]
Subject: Re: Question about your git habits

On 23-02-08 01:37, Chase Venters wrote:

> Or perhaps you create a temporary topical branch for each thing you are
> working on, and commit arbitrary changes then checkout another branch
> when you need to change gears, finally --squashing the intermediate
> commits when a particular piece of work is done?

No very specific advice to give, but this is what I do, and I then pull all
(compilable) topic branches into a "local" branch for compilation. I just
wanted to remark that a definite downside is that switching branches a lot
also touches the tree a lot, and hence tends to trigger quite unwelcome
amounts of recompiles. Using ccache would probably be effective in this
situation but I keep neglecting to check it out...

Rene

2008-02-23 05:04:17

by Jeff Garzik

[permalink] [raw]
Subject: Re: Question about your git habits

Daniel Barkalow wrote:
> I find that the sequence of changes I make is pretty much unrelated to the
> sequence of changes that end up in the project's history, because my
> changes as I make them involve writing a lot of stubs (so I can build) and
> then filling them out. It's beneficial to have version control on this so
> that, if I screw up filling out a stub, I can get back to where I was.
>
> Having made a complete series, I then generate a new series of commits,
> each of which does one thing, without any bugs that I've resolved, such
> that the net result is the end of the messy history, except with any
> debugging or useless stuff skipped. It's this series that gets merged into
> the project history, and I discard the other history.
>
> The real trick is that the early patches in a lot of series often refactor
> existing code in ways that are generally good and necessary for your
> eventual outcome, but which you'd never think of until you've written more
> of the series.

That summarizes well how I do original development, too. Whether it's a
branch of an existing repo, or a newly cloned repo, when working on new
code I will do a first pass, committing as I go to provide useful
checkpoints.

Once I reach a satisfactory state, I'll refactor the patches so that
they make sense for upstream submission.

Jeff

2008-02-23 08:45:51

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: Question about your git habits

On Sat, Feb 23, 2008 at 03:23:49AM +0100, J.C. Pizarro wrote:
> On 2008/2/23, Al Viro <[email protected]> wrote:
> > On Fri, Feb 22, 2008 at 05:51:04PM -0800, Junio C Hamano wrote:
> > > Al Viro <[email protected]> writes:
> > >
> > > > On Sat, Feb 23, 2008 at 02:37:00AM +0100, Jan Engelhardt wrote:
> > > >
> > > >> >do you tend to clone the entire repository repeatedly into a series
> > > >> >of separate working directories
> > > >>
> > > >> Too time consuming on consumer drives with projects the size of Linux.
> > > >
> > > > git clone -l -s
> > > >
> > > > is not particulary slow...
> > >
> > > How big is a checkout of a single revision of kernel these days,
> > > compared to a well-packed history since v2.6.12-rc2?
> > >
> > > The cost of writing out the work tree files isn't ignorable and
> > > probably more than writing out the repository data (which -s
> > > saves for you).
> >
> >
> > Depends... I'm using ext2 for that and noatime everywhere, so that might
> > change the picture, but IME it's fast enough... As for the size, it gets
> > to ~320Mb on disk, which is comparable to the pack size (~240-odd Mb).
>
> Yesterday, i had git cloned git://foo.com/bar.git ( 777 MiB )
> Today, i've git cloned git://foo.com/bar.git ( 779 MiB )
>
> Both repos are different binaries , and i used 777 MiB + 779 MiB = 1556 MiB
> of bandwidth in two days. It's much!
>
> Why don't we implement "binary delta between old git repo and recent git repo"
> with "SHA1 built git repo verifier"?
>
> Suppose the size cost of this binary delta is e.g. around 52 MiB instead of
> 2 MiB due to numerous mismatching of binary parts, then the bandwidth
> in two days will be 777 MiB + 52 MiB = 829 MiB instead of 1556 MiB.
>
> Unfortunately, this "binary delta of repos" is not implemented yet :|

See git-pull .

2008-02-23 08:56:59

by Willy Tarreau

[permalink] [raw]
Subject: Re: Question about your git habits

On Fri, Feb 22, 2008 at 06:37:14PM -0600, Chase Venters wrote:
> It seems to me that having multiple working trees (effectively, cloning
> the "master" repository every time I need to make anything but a trivial
> change) would be most effective under git as well as it doesn't require
> creating messy, intermediate commits in the first place (but allows for them
> if they are used). But I wonder how that approach would scale with a project
> whose git repo weighed hundreds of megs or more. (With a centralized rcs, of
> course, you don't have to lug around a copy of the whole project history in
> each working tree.)

Take a look at git-new-workdir in git's contrib directory. I'm using it a
lot now. It makes it possible to set up as many workdirs as you want, sharing
the same repo. It's very dangerous if you're not rigorous, but it saves a lot
of time when you work on several branches at a time, which is even more true
for a project's documentation. The real thing to care about is not to have
the same branch checked out at several places.
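For reference, git-new-workdir's trick is roughly this: give each extra workdir its own HEAD and index, while symlinking everything else back to the shared repository. A hand-rolled sketch of the same idea (all paths invented; the real contrib script handles more cases):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q main; cd main
git config user.email you@example.com
git config user.name You
echo hi > f.txt; git add f.txt; git commit -qm init
git branch -M master
git branch topic                      # branch for the second workdir
cd "$tmp"

mkdir -p second/.git
for item in objects refs config info hooks logs packed-refs; do
  ln -s "$tmp/main/.git/$item" "second/.git/$item"
done
echo "ref: refs/heads/topic" > second/.git/HEAD   # private HEAD...
cd second
git reset -q --hard                   # ...and a private index/worktree
git log --oneline -1
```

This also shows why the "same branch in two places" warning matters: the workdirs share refs, so checking one branch out twice lets the two indexes silently diverge from each other.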

Regards,
Willy

2008-02-23 09:10:29

by Sam Ravnborg

[permalink] [raw]
Subject: Re: Question about your git habits

On Fri, Feb 22, 2008 at 06:37:14PM -0600, Chase Venters wrote:
> I've been making myself more familiar with git lately and I'm curious what
> habits others have adopted. (I know there are a few documents in circulation
> that deal with using git to work on the kernel but I don't think this has
> been specifically covered).
>
> My question is: If you're working on multiple things at once, do you tend to
> clone the entire repository repeatedly into a series of separate working
> directories and do your work there, then pull that work (possibly comprising
> a series of "temporary" commits) back into a separate local master
> respository with --squash, either into "master" or into a branch containing
> the new feature?

The simple (for me) workflow I use is to create a clone of the
kernel for each 'topic' I work on.
So at the same time I may have one or maybe up to five clones of the
kernel.

When I want to combine things I use git format-patch and git am.
Often there is some amount of editing done before combining stuff,
especially for larger changes, where the first patches in the series are
often preparatory work that was identified in random order when I did
the initial work.
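That combine step can be sketched like this (clone names invented):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q upstream; cd upstream
git config user.email you@example.com
git config user.name You
echo base > f.txt; git add f.txt; git commit -qm base
cd "$tmp"

git clone -q upstream topic-clone      # one clone per topic
cd topic-clone
git config user.email you@example.com
git config user.name You
echo change >> f.txt; git commit -qam "topic: tweak f"

# Export the topic as mail-formatted patches, apply them elsewhere.
# (This is also the point where patches can be hand-edited/reordered.)
git format-patch -1 -o "$tmp/patches" >/dev/null
cd "$tmp/upstream"
git am -q "$tmp/patches"/*.patch
git log --oneline -1
```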

Sam

2008-02-23 09:46:16

by Mike Hommey

[permalink] [raw]
Subject: Re: Question about your git habits

On Fri, Feb 22, 2008 at 11:10:48PM -0500, Daniel Barkalow wrote:
> I find that the sequence of changes I make is pretty much unrelated to the
> sequence of changes that end up in the project's history, because my
> changes as I make them involve writing a lot of stubs (so I can build) and
> then filling them out. It's beneficial to have version control on this so
> that, if I screw up filling out a stub, I can get back to where I was.
>
> Having made a complete series, I then generate a new series of commits,
> each of which does one thing, without any bugs that I've resolved, such
> that the net result is the end of the messy history, except with any
> debugging or useless stuff skipped. It's this series that gets merged into
> the project history, and I discard the other history.
>
> The real trick is that the early patches in a lot of series often refactor
> existing code in ways that are generally good and necessary for your
> eventual outcome, but which you'd never think of until you've written more
> of the series. Generating a new commit sequence is necessary to end up
> with a history where it looks from the start like you know where you're
> going and have everything done that needs to be done when you get to the
> point of needing it. Furthermore, you want to be able to test these
> commits in isolation, without the distraction of the changes that actually
> prompted them, which means that you want to have your working tree is a
> state that you never actually had it in as you were developing the end
> result.
>
> This means that you'll usually want to rewrite commits for any series that
> isn't a single obvious patch, so it's not a big deal to commit any time
> you want to work on some different branch.

I do that so much that I have this alias:
reorder = !sh -c 'git rebase -i --onto $0 $0 $1'

... and actually pass it only one argument most of the time.
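Unpacked, "git reorder <base>" expands to "git rebase -i --onto <base> <base>", i.e. interactively replay everything after <base> back onto it; with only one argument, $1 is empty and harmless. A sketch, with the editor stubbed out via GIT_SEQUENCE_EDITOR so the demo runs non-interactively (a real run would reorder/squash lines in the todo list):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q demo; cd demo
git config user.email you@example.com
git config user.name You
git config alias.reorder "!sh -c 'git rebase -i --onto \$0 \$0 \$1'"
echo a >  f.txt; git add f.txt; git commit -qm one
echo b >> f.txt; git commit -qam two
echo c >> f.txt; git commit -qam three

# Replay the last two commits; 'true' accepts the todo list unchanged.
GIT_SEQUENCE_EDITOR=true git reorder HEAD~2
git log --oneline
```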

Mike

2008-02-23 13:08:55

by J.C. Pizarro

[permalink] [raw]
Subject: Re: Question about your git habits

On 2008/2/23, Charles Bailey <[email protected]> wrote:
> On Sat, Feb 23, 2008 at 03:47:07AM +0100, J.C. Pizarro wrote:
> >
> > Yesterday, i had git cloned git://foo.com/bar.git ( 777 MiB )
> > Today, i've git cloned git://foo.com/bar.git ( 779 MiB )
> >
> > Both repos are different binaries , and i used 777 MiB + 779 MiB = 1556 MiB
> > of bandwidth in two days. It's much!
> >
> > Why don't we implement "binary delta between old git repo and recent git repo"
> > with "SHA1 built git repo verifier"?
> >
> > Suppose the size cost of this binary delta is e.g. around 52 MiB instead of
> > 2 MiB due to numerous mismatching of binary parts, then the bandwidth
> > in two days will be 777 MiB + 52 MiB = 829 MiB instead of 1556 MiB.
> >
> > Unfortunately, this "binary delta of repos" is not implemented yet :|
>
>
> It sounds like what concerns you is the bandwidth to git://foo.com. If
> you are cloning the first repository to somewhere where the first
> clone is accessible and bandwidth between the clones is not an issue,
> then you should be able to use the --reference parameter to git clone
> to just fetch the missing ~2 MiB from foo.com.
>
> A "binary delta of repos" should just be an 'incremental' pack file
> and the git protocol should support generating an appropriate one. I'm
> not quite sure what "not implemented yet" feature you are looking for.

But if the repos are aggressively repacked, then the bit-to-bit differences
are not ~2 MiB.

2008-02-23 13:37:18

by J.C. Pizarro

[permalink] [raw]
Subject: Re: Question about your git habits

On 2008/2/23, Charles Bailey <[email protected]> wrote:
> On Sat, Feb 23, 2008 at 02:08:35PM +0100, J.C. Pizarro wrote:
> >
> > But if the repos are aggressively repacked then the bit to bit differences
> > are not ~2 MiB.
>
>
> It shouldn't matter how aggressively the repositories are packed or what
> the binary differences are between the pack files are. git clone
> should (with the --reference option) generate a new pack for you with
> only the missing objects. If these objects are ~52 MiB then a lot has
> been committed to the repository, but you're not going to be able to
> get around a big download any other way.

You're wrong; nothing on the order of ~52 MiB has to be committed to the
repository.

I'm not saying "commit"; I'm saying:

"Assume A and B are binary git repos and delta_B-A is another binary file.
I request that B' = A + delta_B-A be built, where SHA1(B') = SHA1(B) is
verified to avoid corruption."

Assume B is the more highly repacked version of "A + the minor commits of
the day", as if B had spent 24 more hours optimizing the minimum spanning
tree. Wow!!!

2008-02-23 13:47:55

by Charles Bailey

[permalink] [raw]
Subject: Re: Question about your git habits

On Sat, Feb 23, 2008 at 02:08:35PM +0100, J.C. Pizarro wrote:
>
> But if the repos are aggressively repacked then the bit to bit differences
> are not ~2 MiB.

It shouldn't matter how aggressively the repositories are packed or what
the binary differences between the pack files are. git clone
should (with the --reference option) generate a new pack for you with
only the missing objects. If these objects are ~52 MiB then a lot has
been committed to the repository, but you're not going to be able to
get around a big download any other way.
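A small demonstration of the --reference behaviour described above (local paths stand in for git://foo.com/bar.git; over the network only the missing objects would travel):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q upstream; cd upstream
git config user.email you@example.com
git config user.name You
echo v1 > f.txt; git add f.txt; git commit -qm v1
cd "$tmp"

git clone -q upstream yesterday        # day one: the full clone

cd upstream
echo v2 >> f.txt; git commit -qam v2   # upstream grows a little
cd "$tmp"

# Day two: borrow yesterday's objects; only the new ones are needed.
git clone -q --reference "$tmp/yesterday" upstream today
git -C today log --oneline
```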

2008-02-23 14:02:19

by Charles Bailey

[permalink] [raw]
Subject: Re: Question about your git habits

On Sat, Feb 23, 2008 at 02:36:59PM +0100, J.C. Pizarro wrote:
> On 2008/2/23, Charles Bailey <[email protected]> wrote:
> >
> > It shouldn't matter how aggressively the repositories are packed or what
> > the binary differences are between the pack files are. git clone
> > should (with the --reference option) generate a new pack for you with
> > only the missing objects. If these objects are ~52 MiB then a lot has
> > been committed to the repository, but you're not going to be able to
> > get around a big download any other way.
>
> You're wrong, nothing has to be commited ~52 MiB to the repository.
>
> I'm not saying "commit", i'm saying
>
> "Assume A & B binary git repos and delta_B-A another binary file, i
> request built
> B' = A + delta_B-A where is verified SHA1(B') = SHA1(B) for avoiding
> corrupting".
>
> Assume B is the higher repacked version of "A + minor commits of the day"
> as if B was optimizing 24 hours more the minimum spanning tree. Wow!!!
>

I'm not sure that I understand where you are going with this.
Originally, you stated that if you clone a 775 MiB repository on day
one, and then you clone it again on day two when it was 777 MiB, then
you currently have to download 775 + 777 MiB of data, whereas you
could download a 52 MiB binary diff. I have no idea where that value
of 52 MiB comes from, and I've no idea how many objects were committed
between day one and day two. If we're going to talk about details,
then you need to provide more details about your scenario.

Having said that, here is my original point in some more detail. git
repositories are not binary blobs, they are object databases. Better
than this, they are databases of immutable objects. This means that to
get the difference between one database and another, you only need to
add the objects that are missing from the other database. If the two
databases are actually a database and the same database at short time
interval later, then almost all the objects are going to be common and
the difference will be a small set of objects. Using git:// this set
of objects can be efficiently transferred as a pack file. You may have
a corner case scenario where the following isn't true, but in my
experience an incremental pack file will be a more compact
representation of this difference than a binary difference of two
aggressively repacked git repositories as generated by a generic
binary difference engine.

I'm sorry if I've misunderstood your last point. If I have, perhaps you
could expand on the exact issue you are having, as I'm not sure that
I've really answered your last message.

2008-02-23 17:11:19

by J.C. Pizarro

[permalink] [raw]
Subject: Re: Question about your git habits

On 2008/2/23, Charles Bailey <[email protected]> wrote:
> On Sat, Feb 23, 2008 at 02:36:59PM +0100, J.C. Pizarro wrote:
> > On 2008/2/23, Charles Bailey <[email protected]> wrote:
> > >
>
> > > It shouldn't matter how aggressively the repositories are packed or what
> > > the binary differences are between the pack files are. git clone
> > > should (with the --reference option) generate a new pack for you with
> > > only the missing objects. If these objects are ~52 MiB then a lot has
> > > been committed to the repository, but you're not going to be able to
> > > get around a big download any other way.
> >
> > You're wrong, nothing has to be commited ~52 MiB to the repository.
> >
> > I'm not saying "commit", i'm saying
> >
> > "Assume A & B binary git repos and delta_B-A another binary file, i
> > request built
> > B' = A + delta_B-A where is verified SHA1(B') = SHA1(B) for avoiding
> > corrupting".
> >
> > Assume B is the higher repacked version of "A + minor commits of the day"
> > as if B was optimizing 24 hours more the minimum spanning tree. Wow!!!
> >
>
>
> I'm not sure that I understand where you are going with this.
> Originally, you stated that if you clone a 775 MiB repository on day
> one, and then you clone it again on day two when it was 777 MiB, then
> you currently have to download 775 + 777 MiB of data, whereas you
> could download a 52 MiB binary diff. I have no idea where that value
> of 52 MiB comes from, and I've no idea how many objects were committed
> between day one and day two. If we're going to talk about details,
> then you need to provide more details about your scenario.

I didn't say that the "A & B binary git repos" are binary files; I said
that delta_B-A is a binary file.

I said ~15 hours ago: "Suppose the size cost of this binary delta is e.g.
around 52 MiB instead of 2 MiB due to numerous mismatching of binary
parts ..."

A binary delta is different from the textual delta (between lines of text)
used in the git scheme (the commits or changesets use textual deltas).
A textual delta can be compressed, resulting in a smaller binary object.
Collecting the binary objects, plus some more data, gives the git
repository. You can't apply a textual delta to a git repository, only a
binary delta. You could apply a binary delta between two git-repacked
repositories if there were a program that generates a binary delta between
two directories, but that is not implemented yet.
The SHA1 verifier is useful to avoid corruption of the generated repository
(if it's corrupted, it has to be cloned again, delta or whole, until
non-corrupted).
An identical SHA1 for two directories can be implemented as the SHA1 of the
sorted SHA1s of their contents, filenames and properties. Anything altered,
added or eliminated among them implies a different SHA1.

Don't you understand what I'm saying? I will give you a practical example.
1. zip -r -8 foo1.zip foo1   # in foo1 there are tons of information, as in
   a git repo
2. mv foo1 foo2 ; cp bar.txt foo2/
3. zip -r -9 foo2.zip foo2   # a little more optimized still (= higher
   repacked)
4. Apply a binary delta between foo1.zip and foo2.zip with a supposed
   delta program and you get delta_foo1_foo2.bin. size(delta_foo1_foo2.bin)
   is not nearly ~( size(foo2.zip) - size(foo1.zip) ).
5. Apply a hexadecimal diff and you will understand why it gives the
   exemplar ~52 MiB instead of the ~2 MiB that I mentioned.
6. You will notice some identical parts in both foo1.zip and foo2.zip.
   Identical parts are good for smaller binary deltas. It's possible to get
   still smaller binary deltas when the identical parts are at random
   offsets or random locations, depending on how advanced the delta program
   is.
7. Same as above, but instead of two files, apply the binary delta between
   two directories.

> Having said that, here is my original point in some more detail. git
> repositories are not binary blobs, they are object databases. Better
> than this, they are databases of immutable objects. This means that to
> get the difference between one database and another, you only need to
> add the objects that are missing from the other database.

"Databases of immutable objects" <-- here you're wrong, because you are
confusing things. There are mutable objects, such as the better deltas of
the minimum spanning tree.

The missing objects are not only the missing sources you're thinking of;
they can be anything (blob, tree, commit, tag, etc.). The deltas of the
minimum spanning tree are also objects of the database, which can be
erased or added when the spanning tree is altered (because the altered
spanning tree is smaller than the previous one) for a better repack.
The best repack is still an NP problem, and solving this bigger NP problem
each day means computing 24/365 (eternal computing).

The git database is the top-level ".git/" directory, but it holds repacked
binary information and always has some size, normally measured in the MiBs
I was mentioning above.

> If the two
> databases are actually a database and the same database at short time
> interval later, then almost all the objects are going to be common and
> the difference will be a small set of objects. Using git:// this set
> of objects can be efficiently transfered as a pack file.

You're saying repacked(A) + new objects, with the bandwidth cost of the
new objects; but I'm saying rerepacked(A + new objects), with the
bandwidth cost of a binary delta, where the delta is repacked(A) -
rerepacked(A + new objects), and rerepacked(X) means spending more time
repacking X again.
> You may have
> a corner case scenario where the following isn't true, but in my
> experience an incremental pack file will be a more compact
> representation of this difference than a binary difference of two
> aggressively repacked git repositories as generated by a generic
> binary difference engine.

Yes, it's simpler and more compact, but the eternal 24/365 repacking can
make it e.g. 30% smaller after a few weeks, where the incremental pack
has gained nothing.

It would be a good idea for the weekly user to pick the binary delta and
the daily developer to pick the incremental pack. Put both modes to work
in the git server.

> I'm sorry if I've misunderstood your last point. Perhaps you could
> expand in the exact issue that are having if I have, as I'm not sure
> that I've really answered your last message.

The misunderstanding can disappear ;)

2008-02-23 18:19:35

by J.C. Pizarro

[permalink] [raw]
Subject: Re: Question about your git habits

Google's Gmail made a mess of my last message: it wrapped my message of
X lines into a mangled (X+o) lines, breaking the original line layout of
the message.

I don't see Google's motives for mangling the original lines of the
messages I had sent.